BUG: pd.to_numeric(.., dtype_backend='pyarrow') crashes for string[pyarrow] · Issue #52588 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame({'x': ['1', '2', 'x']}).to_csv('test.csv')
df = pd.read_csv('test.csv', engine='pyarrow', dtype_backend='pyarrow')
# this works
pd.to_numeric(df['x'], errors='coerce')
# this works
pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
# this crashes
pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')
Issue Description
The last call to pd.to_numeric, on the string[pyarrow] column with dtype_backend='pyarrow', crashes with the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[385], line 9
7 pd.to_numeric(df['x'].astype('str'), errors='coerce', dtype_backend='pyarrow')
8 # this crashes
----> 9 pd.to_numeric(df['x'], errors='coerce', dtype_backend='pyarrow')
File ~/miniconda3/envs/mostly-data/lib/python3.9/site-packages/pandas/core/tools/numeric.py:279, in to_numeric(arg, errors, downcast, dtype_backend)
277 assert isinstance(mask, np.ndarray)
278 data = np.zeros(mask.shape, dtype=values.dtype)
--> 279 data[~mask] = values
281 from pandas.core.arrays import (
282 ArrowExtensionArray,
283 BooleanArray,
284 FloatingArray,
285 IntegerArray,
286 )
288 klass: type[IntegerArray] | type[BooleanArray] | type[FloatingArray]
ValueError: NumPy boolean array indexing assignment cannot assign 2 input values to the 3 output values where the mask is true
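The failing assignment can be reproduced with NumPy alone. The sketch below is only an illustration: the arrays are made up to match the sizes in the error message (a 3-element mask with nothing masked and 2 converted values), and it assumes nothing about how to_numeric arrives at those sizes beyond the two lines shown in the traceback.

```python
import numpy as np

# Hypothetical stand-ins for the variables in numeric.py lines 278-279:
# a mask that flags none of the 3 rows, and only 2 converted values.
mask = np.array([False, False, False])
values = np.array([1.0, 2.0])

data = np.zeros(mask.shape, dtype=values.dtype)

try:
    # ~mask selects all 3 slots, but only 2 values are supplied.
    data[~mask] = values
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the mask flagged the row holding 'x' (for example np.array([False, False, True])), the same assignment would select 2 slots and succeed.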
Expected Behavior
No crash, and same output as for pd.to_numeric(df['x'], errors='coerce')
Installed Versions
INSTALLED VERSIONS
------------------
commit : 478d340
python : 3.9.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.3.0
Version : Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.1
bottleneck : None
brotli : None
fastparquet : 0.8.3
fsspec : 2023.3.0
gcsfs : 2023.3.0
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None