read_csv python engine errors

The only change I made to my normally working reduction pipeline was to try engine="python". I wanted to use nrows for a smaller test read, but that fails as well, so I suspect the python engine may currently be buggy:

    $ python reduction.py ~/data/planet4/2015-06-21_planet_four_classifications.csv
    INFO:Starting reduction.
    Traceback (most recent call last):
      File "reduction.py", line 258, in <module>
        args.test_n_rows, args.remove_duplicates)
      File "reduction.py", line 182, in main
        data = [chunk for chunk in reader]
      File "reduction.py", line 182, in <listcomp>
        data = [chunk for chunk in reader]
      File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 697, in __iter__
        yield self.read(self.chunksize)
      File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 721, in read
        ret = self._engine.read(nrows)
      File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 1556, in read
        content = self._get_lines(rows)
      File "/Users/klay6683/miniconda3/lib/python3.4/site-packages/pandas-0.16.2_58_g01995b2-py3.4-macosx-10.5-x86_64.egg/pandas/io/parsers.py", line 2007, in _get_lines
        for _ in range(rows):
    TypeError: 'float' object cannot be interpreted as an integer

My function call is this:

    # as chunksize and nrows cannot be used together yet, i switch chunksize
    # to None if I want test_n_rows for a small test database:
    if test_n_rows:
        chunks = None
    else:
        chunks = 1e6
    # creating reader object with pandas interface for csv parsing
    # doing this in chunks as its faster. Also, later will do a split
    # into multiple processes to do this.
    reader = pd.read_csv(fname, chunksize=chunks, na_values=['null'],
                         usecols=analysis_cols, nrows=test_n_rows, engine='c')
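
One possible cause, offered as a sketch rather than a confirmed diagnosis: the traceback ends in `for _ in range(rows)`, and 1e6 is a float literal in Python, which range() rejects with exactly this TypeError. The snippet below uses a hypothetical stand-in function, get_lines (it is not the pandas code), to reproduce the message and to show that an integer value passes through. Whether pandas itself should coerce the value is a separate question.

```python
def get_lines(rows):
    """Hypothetical stand-in for the loop shown in the traceback:
    `for _ in range(rows)`. This is NOT the pandas implementation."""
    count = 0
    for _ in range(rows):
        count += 1
    return count


chunks = 1e6  # float literal, as in the call above

# range() only accepts integers, so a float chunk size fails here.
try:
    get_lines(chunks)
    float_error = None
except TypeError as exc:
    float_error = str(exc)

# Converting to int first avoids the error.
int_chunks = int(1e6)
n = get_lines(int_chunks)

print(float_error)
print(n)
```

If that is what happens here, setting chunks = int(1e6) (or 10**6) before passing it as chunksize would be a workaround worth trying.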

Using pandas-0.16.2_58_g01995b2-py3.4