BUG: DataFrame().to_parquet() does not write Parquet compliant data for nested arrays

Reproducible Example

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

df = pd.DataFrame({'int_array_col': [[[1, 2, 3]], [[4, 5, 6]]]})
df.to_parquet('/tmp/test', engine='pyarrow')

pandas_parquet_table = pq.read_table('/tmp/test')

pyarrow_table = pa.Table.from_pandas(df)
writer = pa.BufferOutputStream()
pq.write_table(
    pyarrow_table,
    writer,
    use_compliant_nested_type=True,
)
reader = pa.BufferReader(writer.getvalue())
parquet_table = pq.read_table(reader)

print("Pandas:", pandas_parquet_table.schema.types)
print("Non-compliant Parquet:", pyarrow_table.schema.types)
print("Compliant Parquet:", parquet_table.schema.types)
assert pandas_parquet_table.schema.types == pyarrow_table.schema.types
assert pandas_parquet_table.schema.types == parquet_table.schema.types
```

```python-traceback
Pandas: [ListType(list<item: list<item: int64>>)]
Non-compliant Parquet: [ListType(list<item: list<item: int64>>)]
Compliant Parquet: [ListType(list<element: list<element: int64>>)]
Traceback (most recent call last):
  File "/Users/judahrand/test_dir/pandas_parquet.py", line 25, in <module>
    assert pandas_parquet_table.schema.types == parquet_table.schema.types
AssertionError
```

Issue Description

This method currently does not write compliant Parquet logical types for nested arrays, as defined in the Parquet specification. This can cause problems when trying to use Parquet as an intermediate format, for example when loading data into BigQuery, which expects compliant data.

This was an issue in PyArrow itself; however, it was fixed in ARROW-11497 by adding the `use_compliant_nested_type` flag to `pq.write_table`. I believe pandas should set this flag if the `.to_parquet()` method is to claim that it actually outputs Parquet.

Expected Behavior

Output compliant Parquet.