BUG: regression in read_parquet that raises a pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
Issue #55606 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.read_parquet(all_tx_info)
The file I am using is 71 GB, so I cannot share it easily. If the error is clear from the stack trace, that is fine. If it is not, I can spend time trying to reproduce the issue with a synthetic DataFrame, along the lines of the sketch below.
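For reference, here is a rough sketch of what I would try for a synthetic repro. The file name, row count, and string length are made up; they are only chosen so that the string column holds more than 2 GiB of character data once loaded into Arrow (running this needs a few GiB of free memory and disk):
import numpy as np
import pandas as pd

# Hypothetical repro: ~40M rows of 64-byte strings give > 2 GiB of string data,
# more than a single 32-bit-offset Arrow string array can hold.
n_rows = 40_000_000
values = np.full(n_rows, "a" * 64, dtype=object)
df = pd.DataFrame({"tx_hash": pd.array(values, dtype="string")})
df.to_parquet("synthetic_tx_info.parquet")

# Expected to raise ArrowInvalid ("offset overflow while concatenating arrays")
# on pandas 2.1.1, and to work on pandas 2.0.3.
pd.read_parquet("synthetic_tx_info.parquet")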
Issue Description
I have the following parquet stored on disk:
>>> tx.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Index: 904346808 entries, 0 to 804503
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 tx_hash string
1 block_height Int64
2 block_date datetime64[ns]
3 tx_version Int64
4 is_coinbase boolean
5 is_segwit boolean
6 total_tx_in_value Int64
7 total_tx_out_value Int64
8 tx_fee Int64
9 nb_in Int64
10 nb_out Int64
dtypes: Int64(7), boolean(2), datetime64[ns](1), string(1)
memory usage: 171.8 GB
With pandas 2.0.3, when I load it using read_parquet, it works without any issue.
However, with pandas 2.1.1, I get the following exception:
pd.read_parquet(all_tx_info)
Traceback (most recent call last):
File "/home/alegout/bc_tools.py", line 694, in load_all_tx_info_to_dataframe
return pd.read_parquet(all_tx_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 670, in read_parquet
return impl.read(
^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/io/parquet.py", line 279, in read
result = pa_table.to_pandas(**to_pandas_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 3990, in pyarrow.lib.Table._to_pandas
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 1171, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py", line 780, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/user/alegout/home/miniconda3/envs/py3.11/lib/python3.11/site-packages/pandas/core/arrays/string_.py", line 229, in __from_arrow__
arr = pyarrow.concat_arrays(chunks).to_numpy(zero_copy_only=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/array.pxi", line 3039, in pyarrow.lib.concat_arrays
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
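The last frames suggest what changed: in pandas 2.1, StringDtype.__from_arrow__ concatenates all Arrow chunks into a single array with pyarrow.concat_arrays before converting to NumPy, and a plain pa.string() array uses 32-bit offsets, so it cannot hold more than about 2 GiB of character data. The same error can be triggered with pyarrow alone (the numbers below are arbitrary, just large enough to cross the 2 GiB limit):
import pyarrow as pa

# Two chunks of ~1.2 GB of string data each; concatenating them into one
# 32-bit-offset string array exceeds INT32_MAX bytes of character data.
chunk = pa.array(["x" * 1024] * 1_200_000, type=pa.string())
pa.concat_arrays([chunk, chunk])
# pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays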
Expected Behavior
The DataFrame loads from the Parquet file without an exception, as it does with pandas 2.0.3.
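As a side note, keeping the strings Arrow-backed may sidestep the concat_arrays step, since ArrowDtype columns keep the ChunkedArray as-is. This is only a guess at a workaround, not a fix (all_tx_info is my path variable from above):
import pandas as pd
import pyarrow.parquet as pq

# Possible workaround (untested on the 71 GB file): keep strings as
# Arrow-backed columns instead of converting them to StringDtype.
table = pq.read_table(all_tx_info)
df = table.to_pandas(types_mapper=pd.ArrowDtype)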
Installed Versions
Here are the installed versions for the environment that works with pandas 2.0.3:
INSTALLED VERSIONS
------------------
commit : 0f43794
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
And here is the environment with pandas 2.1.1, which does not work:
INSTALLED VERSIONS
------------------
commit : e86ed37
python : 3.11.5.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.6-200.fc38.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Fri Oct 6 19:02:35 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.1
numpy : 1.24.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : 3.0.0
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : 0.10.0
bs4 : 4.12.2
bottleneck : 1.3.5
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.9.2
gcsfs : None
matplotlib : 3.7.2
numba : 0.58.0
numexpr : 2.8.7
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.21
tables : None
tabulate : 0.8.10
xarray : 2023.6.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.2.0
pyqt5 : None