BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype and index_col are provided and the file has >1M rows · Issue #37094 · pandas-dev/pandas


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

ROWS = 1000001  # <--------- with 1000000, it works

with open('out.dat', 'w') as fd:
    for i in range(ROWS):
        fd.write('%d\n' % i)

df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])

Problem description

When ROWS = 1000001, I get the following traceback:

Traceback (most recent call last):
  File "try.py", line 10, in <module>
    df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 2231, in read
    index, names = self._make_index(data, alldata, names)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1677, in _make_index
    index = self._agg_index(index)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1770, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1871, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/core/algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
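
For context (this illustration is mine, not part of the original report): the error comes from np.isnan being applied to an array that is not a plain float array. np.isnan is not defined for object-dtype input, so the same TypeError can be reproduced in isolation:

import numpy as np

# Hypothetical minimal illustration (assumption): np.isnan raises the same
# TypeError when given an object-dtype array, e.g. one holding NA-marker strings.
values = np.array(['', 'NA', np.nan], dtype=object)
try:
    np.isnan(values)
except TypeError as exc:
    print(exc)  # ufunc 'isnan' not supported for the input types, ...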

Expected Output

With pandas 1.1.2, or with ROWS = 1000000, the same code runs without error.
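
As a possible workaround until this is fixed (my suggestion, not from the original report), skipping index_col in read_csv and setting the index afterwards avoids the code path that fails:

import numpy as np
import pandas as pd

# Workaround sketch (assumption): parse the column as a regular column first,
# so the float64 dtype is applied during parsing, then promote it to the index.
df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64})
df = df.set_index('a')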

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : db08276
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.38.3.el7.x86_64
Version : #1 SMP Mon Nov 11 12:01:33 EST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None