DOC: Update parquet metadata format description around index levels by cpcloud · Pull Request #18201 · pandas-dev/pandas

@cpcloud Another illustration of the possible difficulties with mapping the pandas metadata to the actual field names in the schema: this mapping is currently possible in pyarrow because it assumes that the index columns are the last ones in the schema.
However, this assumption breaks with, e.g., files written by fastparquet, because it puts the index levels at the beginning of the schema rather than at the end:

In [79]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3], 'c':pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')}, index=pd.Index(['A', 'B', 'C'], name='index'))

In [80]: df
Out[80]: 
       a    b                         c
index                                  
A      1  0.1 2017-01-01 00:00:00+01:00
B      2  0.2 2017-01-02 00:00:00+01:00
C      3  0.3 2017-01-03 00:00:00+01:00

In [81]: fastparquet.write("__test_fastparquet-dev.parquet", df)
/home/joris/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/fastparquet/writer.py:136: UserWarning: Coercing datetimes to UTC
  warnings.warn('Coercing datetimes to UTC')

In [82]: pq.read_table("__test_fastparquet-dev.parquet")
Out[82]: 
pyarrow.Table
index: string
a: int64
b: double
c: timestamp[us]
metadata
--------
{b'pandas': b'{"columns": [{"metadata": null, "name": "index", "numpy_type": "'
            b'object", "pandas_type": "unicode"}, {"metadata": null, "name": "'
            b'a", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata":'
            b' null, "name": "b", "numpy_type": "float64", "pandas_type": "flo'
            b'at64"}, {"metadata": {"timezone": "Europe/Brussels"}, "name": "c'
            b'", "numpy_type": "datetime64[ns, Europe/Brussels]", "pandas_type'
            b'": "datetimetz"}], "index_columns": ["index"], "pandas_version":'
            b' "0.22.0.dev0+260.g5da3759"}'}

In [83]: pq.read_table("__test_fastparquet-dev.parquet").to_pandas()
...
ArrowNotImplementedError: No cast implemented from string to timestamp[ns, tz=Europe/Brussels]

Of course, this could (and maybe should) be fixed in fastparquet (they still have to update for the index name change as well: dask/fastparquet#251). And once that is fixed, this will again not be an insurmountable problem. However, until now this assumption was not written down in the pandas metadata specification (which you are updating now), so they are not to blame. And to me it shows the brittleness of this assumption.
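To make the contrast concrete, here is a minimal sketch of splitting the schema fields by name (via "index_columns") instead of by position. It uses only the standard library and the field order from the fastparquet file above. The helper name `split_index_and_data_fields` and the trimmed-down metadata are made up for illustration; this is not how pyarrow is actually implemented.

```python
import json


def split_index_and_data_fields(field_names, pandas_metadata_json):
    """Split schema field names into (index, data) by name, not by position.

    Hypothetical helper for illustration only; not part of pyarrow or pandas.
    """
    meta = json.loads(pandas_metadata_json)
    index_names = list(meta["index_columns"])
    # Fail loudly if the metadata names an index level the schema does not have.
    missing = [name for name in index_names if name not in field_names]
    if missing:
        raise KeyError(f"index columns not found in schema: {missing}")
    data_names = [name for name in field_names if name not in index_names]
    return index_names, data_names


# Field order as written by fastparquet in the example: index level first.
field_names = ["index", "a", "b", "c"]
# Trimmed-down stand-in for the b'pandas' metadata value shown above.
metadata = b'{"index_columns": ["index"], "pandas_version": "0.22.0.dev0"}'

index_fields, data_fields = split_index_and_data_fields(field_names, metadata)

# A purely positional rule ("the last n fields are the index levels")
# would pick a data column for this file instead.
n_index = len(json.loads(metadata)["index_columns"])
positional_guess = field_names[-n_index:]

print(index_fields, data_fields, positional_guess)
```

With a name-based lookup the field order in the file no longer matters, as long as the names in "index_columns" actually match the field names in the schema, which is the part the spec would need to guarantee.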