ENH: support non default indexes in writing to Parquet by dhirschfeld · Pull Request #18629 · pandas-dev/pandas

Not sure what to do about that, but since fastparquet throws a sensible error I'd be inclined to just leave it; a future update will likely add the missing functionality.

It's a bit unfortunate that it only comes up when reading the file back in, so either we are OK with that, or we need to raise on writing MIs with fastparquet.

If I name the index levels you can actually pull them out by name:

That will not always work, e.g. if the levels have no names. pyarrow master has also changed this: the index levels will never be stored under their level names but under a __index_level_n__ pattern, so the workaround you show will no longer work with the new release.

The thing we can do is the following:

In [85]: import pyarrow.parquet as pq

In [86]: f = pq.ParquetFile("__test_mi.parquet")

In [87]: f.metadata.metadata[b'pandas']
Out[87]: b'{"columns": [{"name": "A", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "B", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "C", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "__index_level_0__", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "__index_level_1__", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}], "index_columns": ["__index_level_0__", "__index_level_1__"], "pandas_version": "0.22.0.dev0+288.g6b6cfb8"}'

In [88]: import json

In [89]: pandas_metadata = json.loads(f.metadata.metadata[b'pandas'].decode('utf8'))

In [90]: pandas_metadata['index_columns']
Out[90]: ['__index_level_0__', '__index_level_1__']

and then add those to the columns list.
And something similar for fastparquet:

In [98]: import fastparquet

In [99]: f = fastparquet.ParquetFile("__test_mi.parquet")

In [100]: json.loads(f.key_value_metadata['pandas'])['index_columns']
Out[100]: ['__index_level_0__', '__index_level_1__']
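
To make the "add those to the columns list" step concrete, here is a minimal sketch. It assumes the raw pandas metadata has already been pulled out of the file as shown above (bytes from pyarrow, str from fastparquet). The helper name columns_with_index is hypothetical and not part of pandas, pyarrow or fastparquet.

```python
import json


def columns_with_index(columns, pandas_metadata_raw):
    """Return `columns` extended with the index columns listed in the metadata.

    Hypothetical helper for illustration only.
    `pandas_metadata_raw` is the value stored under the 'pandas' key of the
    Parquet key/value metadata, either as bytes or as str.
    """
    if isinstance(pandas_metadata_raw, bytes):
        pandas_metadata_raw = pandas_metadata_raw.decode("utf8")

    pandas_metadata = json.loads(pandas_metadata_raw)
    # Assumption: a missing 'index_columns' key means there is nothing to add.
    index_columns = pandas_metadata.get("index_columns", [])

    result = list(columns)  # copy, so the caller's list is not mutated
    for name in index_columns:
        if name not in result:
            result.append(name)
    return result


# Example with a trimmed-down version of the metadata shown above.
raw = b'{"index_columns": ["__index_level_0__", "__index_level_1__"]}'
selected = columns_with_index(["A", "B"], raw)
```

The same helper would serve both engines, since only the way the raw metadata is obtained differs between them.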