PERF: using use_nullable_dtypes=True in read_parquet slows performance on large dataframes · Issue #47345 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Generate data
The effect becomes visible for large n, i.e. n > 1000000.
import numpy as np
import pandas as pd

n = 10000000
df = pd.DataFrame(
    {
        "a": np.random.choice([0, 1, None], n),
        "b": np.random.choice([0, 10, 90129, None], n),
        "c": np.random.choice([True, False, None], n),
        "d": np.random.choice(["a", "b", None], n),
    },
    dtype=object,
)
# Cast "a" explicitly so it becomes Int8 rather than the inferred Int64
df["a"] = df["a"].astype("Int8")
df = df.convert_dtypes()
To check that the types are set correctly:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -----
 0   a       Int8
 1   b       Int64
 2   c       boolean
 3   d       string
dtypes: Int64(1), Int8(1), boolean(1), string(1)
memory usage: 200.3 MB
Write it out with pyarrow:
import pyarrow as pa
from pyarrow import parquet as pq

table = pa.Table.from_pandas(df)
# Remove metadata to prevent automatic use of nullable types
table = table.replace_schema_metadata()
pq.write_table(table, "test.parquet")
Speed test
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)
997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)
632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that use_nullable_dtypes=False returns a dataframe with the wrong types, i.e. [float64, float64, object, object].
I would expect the version with use_nullable_dtypes=True to run faster, since it doesn't have to cast any types. The opposite is the case.
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-9-amd64
Version : #1 SMP Debian 5.10.70-1 (2021-09-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Prior Performance
No response