BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype and index_col are provided and the file has >1M rows · Issue #37094 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
```python
import pandas as pd
import numpy as np

ROWS = 1000001  # <--------- with 1000000, it works

with open('out.dat', 'w') as fd:
    for i in range(ROWS):
        fd.write('%d\n' % i)

df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
```
Problem description
When `ROWS = 1000001`, I get the following traceback:
```
Traceback (most recent call last):
  File "try.py", line 10, in <module>
    df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 2231, in read
    index, names = self._make_index(data, alldata, names)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1677, in _make_index
    index = self._agg_index(index)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1770, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1871, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/core/algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
```
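My guess at the underlying failure mode (an assumption, not confirmed from the pandas source): the NA sentinels passed to `algorithms.isin` are strings, so converting them to a NumPy array yields `dtype=object`, and `np.isnan` cannot operate on object arrays. That lower-level failure can be reproduced on its own:

```python
import numpy as np

# Hypothetical stand-in for the na_values sentinels that read_csv uses;
# as an array of strings it becomes dtype=object.
na_values = np.array(['', 'NA', 'NaN'], dtype=object)

# np.isnan rejects object-dtype input with the same TypeError seen in
# the traceback above.
try:
    np.isnan(na_values)
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```

This would also explain why the row count matters: if the 1M threshold selects a different (NumPy-based) fast path inside `isin`, only large inputs ever reach the `np.isnan` call.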
Expected Output
With pandas 1.1.2, or with `ROWS = 1000000` on 1.1.3, the same code runs without error.
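A possible workaround until this is fixed, assuming the regression is limited to the `index_col` code path (a sketch, not tested against the full 1M-row file): read `a` as an ordinary column and promote it to the index afterwards.

```python
import numpy as np
import pandas as pd

# Small stand-in for the 1M-row repro file above.
with open('out.dat', 'w') as fd:
    for i in range(5):
        fd.write('%d\n' % i)

# Skip index_col in read_csv, then set the index in a second step.
df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64})
df = df.set_index('a')
print(df.index.dtype)  # float64
```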
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : db08276
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.38.3.el7.x86_64
Version : #1 SMP Mon Nov 11 12:01:33 EST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None