BUG: regression in read_parquet that raises a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
Issue #55606 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.read_parquet(all_tx_info)
The file I am using is 71 GB, so I cannot share it easily. If the error is clear from the stack trace, that is fine. If it is not, I can spend time trying to reproduce the issue with a synthetic DataFrame, along the lines of the sketch below.
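For reference, here is a rough sketch of what I would try for a synthetic repro. The file name, row count, and string length are made up; they are only chosen so that the string column holds more than 2 GiB of character data once loaded into Arrow (running this needs a few GiB of free memory and disk):
import numpy as np
import pandas as pd

# Hypothetical repro: ~40M rows of 64-byte strings give > 2 GiB of string data,
# more than a single 32-bit-offset Arrow string array can hold.
n_rows = 40_000_000
values = np.full(n_rows, "a" * 64, dtype=object)
df = pd.DataFrame({"tx_hash": pd.array(values, dtype="string")})
df.to_parquet("synthetic_tx_info.parquet")

# Expected to raise ArrowInvalid ("offset overflow while concatenating arrays")
# on pandas 2.1.1, and to work on pandas 2.0.3.
pd.read_parquet("synthetic_tx_info.parquet")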
Issue Description
I have the following parquet stored on disk:
>>> tx.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 904346808 entries, 0 to 804503
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 tx_hash string
1 block_height Int64
2 block_date datetime64[ns]
3 tx_version Int64
4 is_coinbase boolean
5 is_segwit boolean
6 total_tx_in_value Int64
7 total_tx_out_value Int64
8 tx_fee Int64
9 nb_in Int64
10 nb_out Int64
dtypes: Int64(7), boolean(2), datetime64[ns](1), string(1)
memory usage: 171.8 GB
With pandas 2.0.3, when I load it using read_parquet, it works without any issue.
However, with pandas 2.1.1, I get the following exception:
pd.read_parquet(all_tx_info)
Traceback (most recent call last):
File "/home/alegout/bc_tools.py", line 694, in load_all_tx_info_to_dataframe
return pd.read_parquet(all_tx_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 670, in read_parquet
return impl.read(
^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 279, in read
result = pa_table.to_pandas(**to_pandas_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 3990, in pyarrow.lib.Table._to_pandas
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 780, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 229, in __from_arrow__
arr = pyarrow.concat_arrays(chunks).to_numpy(zero_copy_only=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 3039, in pyarrow.lib.concat_arrays
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
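The last frames suggest what changed: in pandas 2.1, StringDtype.__from_arrow__ concatenates all Arrow chunks into a single array with pyarrow.concat_arrays before converting to NumPy, and a plain pa.string() array uses 32-bit offsets, so it cannot hold more than about 2 GiB of character data. The same error can be triggered with pyarrow alone (the numbers below are arbitrary, just large enough to cross the 2 GiB limit):
import pyarrow as pa

# Two chunks of ~1.2 GB of string data each; concatenating them into one
# 32-bit-offset string array exceeds INT32_MAX bytes of character data.
chunk = pa.array(["x" * 1024] * 1_200_000, type=pa.string())
pa.concat_arrays([chunk, chunk])
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays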
Expected Behavior
The DataFrame loads from the Parquet file without an exception, as it does with pandas 2.0.3.
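As a side note, keeping the strings Arrow-backed may sidestep the concat_arrays step, since ArrowDtype columns keep the ChunkedArray as-is. This is only a guess at a workaround, not a fix (all_tx_info is my path variable from above):
import pandas as pd
import pyarrow.parquet as pq

# Possible workaround (untested on the 71 GB file): keep strings as
# Arrow-backed columns instead of converting them to StringDtype.
table = pq.read_table(all_tx_info)
df = table.to_pandas(types_mapper=pd.ArrowDtype)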
Installed Versions
Here are the installed versions for the environment that works with pandas 2.0.3:
INSTALLED VERSIONS
------------------
commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
And here is the environment with pandas 2.1.1, which does not work:
INSTALLED VERSIONS
------------------
commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.1
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 3.0.0
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : 0.10.0
bs4 : 4.12.2
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : None
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None