PERF: using `use_nullable_dtypes=True` in `read_parquet` slows performance on large dataframes (original) (raw)

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Generate data

The effect becomes visible for large n, ie. >1000000.

import pandas as pd

n = 10000000 df = pd.DataFrame( { "a": np.random.choice([0, 1, None], n), "b": np.random.choice([0, 10, 90129, None], n), "c": np.random.choice([True, False, None], n), "d": np.random.choice(["a", "b", None], n), }, dtype=object, ) df["a"] = df["a"].astype("Int8") df = df.convert_dtypes()

To check that types set correctly

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype  
---  ------  -----  
 0   a       Int8   
 1   b       Int64  
 2   c       boolean
 3   d       string 
dtypes: Int64(1), Int8(1), boolean(1), string(1)
memory usage: 200.3 MB

Write it out with `pyarrow`:

import pyarrow as pa from pyarrow import parquet as pq

table = pa.Table.from_pandas(df)

Remove metadata to prevent automatic use of nullable types

table = table.replace_schema_metadata() pq.write_table(table, "test.parquet")

Speed test

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)

997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)

632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that the use_nullable_dtypes=False returns a dataframe with wrong types, ie. [float64, float64, object, object].
I would expect the version with True to run faster - after all it doesn't have to cast any types. The opposite is the case.

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-9-amd64
Version : #1 SMP Debian 5.10.70-1 (2021-09-30)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 20.3.4
setuptools : 44.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : 0.8.1
fsspec : 2022.5.0
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

PERF: using use_nullable_dtypes=True in read_parquet slows performance on large dataframes (original) (raw)