BUG: Parquet does not support saving fp16 GH#44846 by Anirudhsekar96 · Pull Request #44847 · pandas-dev/pandas
Conversation
- closes BUG: Parquet format does not support saving float16 columns #44846
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- Adds a validation check for float16 columns in the to_parquet function, raising a ValueError
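The proposed check can be sketched roughly as below. This is an illustrative standalone helper, not the actual PR diff; the function name `check_no_float16` is an assumption for the example:

```python
import numpy as np
import pandas as pd

def check_no_float16(df: pd.DataFrame) -> None:
    # Hypothetical helper mirroring the PR's validation: scan the dtypes and
    # raise before the parquet engine ever opens a file handle.
    bad = [col for col, dtype in df.dtypes.items() if dtype == np.float16]
    if bad:
        raise ValueError(f"Parquet does not support float16 columns: {bad}")
```

Raising before handing off to the engine means no (empty) output file is created when the dtype is unsupported.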
Hello @Anirudhsekar96! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻
Comment last updated at 2021-12-14 04:01:15 UTC
would need a test and release note
Is this explicit check needed? It seems pyarrow itself already gives a decent error message?
```python
In [4]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = np.arange(2, 10, dtype=np.float16)
   ...: df = pd.DataFrame(data=data, columns=['fp16'])
   ...: df.to_parquet('./fp16.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: halffloat
```
(adding an explicit check also means we wouldn't automatically get the improvement if this is fixed on pyarrow's side)
In the given example, PyArrow creates a file called 'fp16.parquet' before throwing the error resulting in an empty file. An explicit check before passing to PyArrow would prevent the creation of the empty file.
Additionally, it seems that there is little interest in adding support for float16 in PyArrow at the moment (see ref: apache/arrow#2691). Parquet format also does not seem to have adding extensions for supporting float16 in the road map (ref: https://issues.apache.org/jira/browse/ARROW-7242).
Another option would be to coerce all float16 columns to float32 and add a warning indicating the conversion. Currently fastparquet handles fp16 using this conversion method.
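The coercion alternative described above could look something like this. The helper name `upcast_float16` and the warning message are assumptions for illustration, not pandas or fastparquet API:

```python
import warnings

import numpy as np
import pandas as pd

def upcast_float16(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical sketch of the coercion option: upcast float16 columns to
    # float32 and warn, roughly what fastparquet does for half-precision data.
    f16 = [col for col, dtype in df.dtypes.items() if dtype == np.float16]
    if not f16:
        return df
    warnings.warn(f"float16 columns {f16} were upcast to float32 for Parquet")
    return df.astype({col: np.float32 for col in f16})
```

The trade-off is lossless in range but changes the on-disk schema, so round-tripping would read back float32, not float16.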
i am ok with this, a partial write is not a great end result here.
could you add a note in I/O section for 1.4.0
… name to test_unsupported_fp16
> In the given example, PyArrow creates a file called 'fp16.parquet' before throwing the error resulting in an empty file. An explicit check before passing to PyArrow would prevent the creation of the empty file.
That might actually be a bug in the pandas code, as I can't reproduce it with just pyarrow (I do reproduce it with the pandas version above):
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': np.array([0, 1, 2], dtype="float16")})
pq.write_table(table, "test_halffloat.parquet")
```
So this "creates empty file" issue also happens for other data types that aren't supported by Parquet (such as timedelta64), or for any other reason that pyarrow raises an error while writing. I think that is something we should fix in general instead of doing a special-case check for float16.
good point @jorisvandenbossche
ok let's close this issue and open one for the partial-file writes
Opened a new bug report for the broader issue: