BUG: read_csv throws TypeError with iterator, nrows · Issue #59079 · pandas-dev/pandas
### Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
### Reproducible Example
from io import BytesIO
import pandas as pd

csv = b'a,b\n1,2\n3,4'
with BytesIO(csv) as f:
    it = pd.read_csv(
        f,
        nrows=1,
        iterator=True,
    )
    for df in it:
        pass
Behavior:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[82], line 11
5 with BytesIO(csv) as f:
6 it = pd.read_csv(
7 f,
8 nrows=1,
9 iterator=True,
10 )
---> 11 for df in it:
12 pass
File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\io\parsers\readers.py:1843, in TextFileReader.__next__(self)
1841 def __next__(self) -> DataFrame:
1842 try:
-> 1843 return self.get_chunk()
1844 except StopIteration:
1845 self.close()
File ~\AppData\Local\anaconda3\Lib\site-packages\pandas\io\parsers\readers.py:1984, in TextFileReader.get_chunk(self, size)
1982 if self._currow >= self.nrows:
1983 raise StopIteration
-> 1984 size = min(size, self.nrows - self._currow)
1985 return self.read(nrows=size)
TypeError: '<' not supported between instances of 'int' and 'NoneType'
### Issue Description
`read_csv` raises a `TypeError` when `nrows` is combined with `iterator=True`: the reader is created without error, but the first iteration step fails.
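Judging from the last frame of the traceback, the failure looks like a plain `min()` call in which one argument is `None`. The sketch below reproduces only that comparison in isolation; `clamp_chunk_size` is a hypothetical stand-in for the clamping line in `get_chunk`, and the idea that `size` is `None` because no `chunksize` was passed is my assumption, not something I have confirmed in the pandas source.

```python
def clamp_chunk_size(size, nrows, currow):
    """Hypothetical stand-in for the clamping step shown in the traceback."""
    # Mirrors `size = min(size, self.nrows - self._currow)`.
    return min(size, nrows - currow)


# With an explicit integer size, the clamp limits the chunk to the rows left.
clamped = clamp_chunk_size(size=10, nrows=1, currow=0)

# With size=None (assumed: no chunksize given), the comparison itself fails.
message = None
try:
    clamp_chunk_size(size=None, nrows=1, currow=0)
except TypeError as exc:
    message = str(exc)

print(clamped)
print(message)
```

If that reading is right, the error is independent of the file contents and line endings, which matches what I see.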
My context: I want to convert a CSV to parquet. The CSV is larger than memory, so I want to use `iterator=True`. The CSV contains a footer. Since `skipfooter` is not supported by the fast engines (c or pyarrow), and `comment` can only be a single character, I want to skip the footer indirectly by using `nrows`. (I know the number of rows in advance.) My data uses `\r\n` line endings, although the error also occurs with plain `\n`.
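A possible workaround, sketched under the assumption that an explicit `chunksize` gives the reader an integer size to clamp against `nrows`: pass `chunksize` instead of `iterator=True`. The value `chunksize=1` is chosen only to fit this two-row example; a real conversion would use a much larger value.

```python
from io import BytesIO

import pandas as pd

csv = b"a,b\n1,2\n3,4"

with BytesIO(csv) as f:
    # Assumption: with an integer chunksize, the size clamped against
    # nrows is never None, so iteration does not hit the TypeError.
    reader = pd.read_csv(f, nrows=1, chunksize=1)
    chunks = list(reader)

# Only the first `nrows` data rows should come back, across all chunks.
result = pd.concat(chunks, ignore_index=True)
print(result)
```

This still reads the file chunk by chunk, so it should fit the larger-than-memory case, but it sidesteps the bug rather than fixing it.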
### Expected Behavior
The script runs without error. `nrows` rows of data are returned (across one chunk in this small example).
### Installed Versions
<details>
INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : 0.4.1
xlsxwriter : 3.2.0
lxml.etree : 5.2.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.13.1
sqlalchemy : 2.0.30
tables : 3.9.2
tabulate : 0.9.0
xarray : 2023.6.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None
</details>