BUG: pd.to_numeric(.., dtype_backend='pyarrow') crashes for string[pyarrow] · Issue #52588 · pandas-dev/pandas

Pandas version checks

Reproducible Example

import pandas as pd
pd.DataFrame({'x': ['1', '2', 'x']}).to_csv('test.csv')
df = pd.read_csv('test.csv', engine='pyarrow', dtype_backend='pyarrow')
# this works
pd.to_numeric(df['x'], errors='coerce')
# this works
pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
# this crashes
pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

Issue Description

The call to to_numeric crashes with the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[385], line 9
      7 pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
      8 # this crashes
----> 9 pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')

File ~/miniconda3/envs/mostly-data/lib/python3.9/site-packages/pandas/core/tools/numeric.py:279, in to_numeric(arg, errors, downcast, dtype_backend)
    277 assert isinstance(mask, np.ndarray)
    278 data = np.zeros(mask.shape, dtype=values.dtype)
--> 279 data[~mask] = values
    281 from pandas.core.arrays import (
    282     ArrowExtensionArray,
    283     BooleanArray,
    284     FloatingArray,
    285     IntegerArray,
    286 )
    288 klass: type[IntegerArray] | type[BooleanArray] | type[FloatingArray]

ValueError: NumPy boolean array indexing assignment cannot assign 2 input values to the 3 output values where the mask is true
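
The ValueError above is a shape mismatch in a NumPy boolean-mask assignment: three positions are selected, but only two values are supplied. The following is a minimal NumPy-only sketch of that mismatch, not the pandas implementation. The contents of `mask` and `values` are assumptions chosen to match the counts in the error message (a 3-row column where only '1' and '2' parse as numbers), and `consistent_mask` is a hypothetical name used for illustration.

```python
import numpy as np

# Assumed stand-ins for the arrays named in the traceback (not taken from pandas):
# `mask` flags missing entries in the 3-row column (none are missing), while
# `values` holds only the 2 entries that parsed as numbers.
mask = np.array([False, False, False])
values = np.array([1, 2], dtype=np.int64)

data = np.zeros(mask.shape, dtype=values.dtype)

try:
    data[~mask] = values  # 3 selected slots, but only 2 values to assign
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# If the mask also flagged the unparseable entry, the shapes would agree.
consistent_mask = np.array([False, False, True])
fixed = np.zeros(consistent_mask.shape, dtype=values.dtype)
fixed[~consistent_mask] = values
print(fixed)
```

In this sketch the assignment only succeeds when the number of unmasked positions equals the number of values, which is the invariant that line 279 in the traceback appears to rely on.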

Expected Behavior

No crash, and same output as for pd.to_numeric(df['x'], errors='coerce')

Installed Versions

INSTALLED VERSIONS
------------------
commit : 478d340
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.3.0
Version : Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.1
bottleneck : None
brotli : None
fastparquet : 0.8.3
fsspec : 2023.3.0
gcsfs : 2023.3.0
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None