read_csv fails with TypeError: object cannot be converted to an IntegerDtype, yet succeeds when reading chunks · Issue #25472 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
Download this file upload.txt
# Your code here
import pandas as pd
from enum import Enum, IntEnum, auto
import argparse

# I attached the file in the github issue
filename = "upload.txt"

# this field is coded on 64 bits so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:
    print("READ CHUNK BY CHUNK")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        dtype={column: 'UInt64' },
        usecols=[column],
        chunksize=1
    )
    for chunk in res:
        # print("chunk %d" % i)
        print(chunk)

    fd.seek(0)  # rewind
    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        usecols=[column],
        dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
    )
    print(res)
If I read in chunks, read_csv succeeds; if I try to read the whole column at once, I get:
Traceback (most recent call last):
  File "test2.py", line 34, in <module>
    dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
    return cls._from_sequence(scalars, dtype, copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
    values.dtype))
TypeError: object cannot be converted to an IntegerDtype
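
The last frame shows that the values reaching coerce_to_array had object dtype. One possible explanation (an assumption, not confirmed anywhere in this report) is that the column mixes empty fields with 64-bit keys: a float64 array can hold NaN but cannot hold every 64-bit integer exactly, so no single NumPy numeric dtype fits both kinds of value. The sketch below only illustrates the float64 precision limit using the standard library; the key value is hypothetical and not taken from upload.txt.

```python
# Hypothetical 64-bit key, chosen for illustration; not taken from upload.txt.
key = 2**64 - 1

# float64 has a 53-bit significand, so most integers above 2**53
# cannot be stored exactly and get rounded to a neighbouring value.
as_float = float(key)
round_trip = int(as_float)

print("original  :", key)
print("via float :", round_trip)
print("exact?    :", round_trip == key)
```

If this is what happens here, a chunk of one row would never contain a missing value and a large key at the same time, which would be consistent with the chunked read succeeding.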
Expected Output
I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
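
A possible workaround, sketched here under assumptions and not tested against upload.txt: read the column as text so the parser does no numeric conversion, convert each field with Python's arbitrary-precision int, then build the 'UInt64' array from those exact integers. The two-column sample, the frame.number column, and the key value are invented for illustration.

```python
import io

import pandas as pd

column = "tcp.options.mptcp.sendkey"

# Invented sample standing in for upload.txt (not the real file):
# one row carries a 64-bit key, the others leave the field empty.
sample = io.StringIO(
    "frame.number|tcp.options.mptcp.sendkey\n"
    "1|\n"
    "2|17094326408886431489\n"
    "3|\n"
)

# Step 1: read the column as strings; empty fields still come back as NaN.
raw = pd.read_csv(
    sample,
    comment='#',
    sep='|',
    usecols=[column],
    dtype={column: str},
)

# Step 2: convert with Python int (exact at any size), keeping gaps as None.
values = [None if pd.isna(v) else int(v) for v in raw[column]]

# Step 3: build the nullable unsigned 64-bit array from the exact integers.
keys = pd.array(values, dtype="UInt64")
print(keys)
```

This avoids passing dtype='UInt64' to read_csv at all, at the cost of a Python-level loop over the column, so it may be slow on large captures.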
Output of pd.show_versions()
I am using v0.23.4 with a patch from master to fix some other bug.
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None