read_csv() & extra trailing comma(s) cause parsing issues. · Issue #2886 · pandas-dev/pandas

I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

I believe these exceptions could be fixed by simply ignoring any extra trailing commas. This is how many CSV readers behave, for example Excel when opening a CSV. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, which cause read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas or, even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)

If a CSV has 'n' matched extra trailing columns and you do not specify any index_col, the parser will correctly assume that the first 'n' columns are the index. If you set index_col=False, it fails with: IndexError: list index out of range
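The 'n' in that description is just the difference between the width of the data rows and the width of the header. As a rough illustration (a simplified model of the behaviour described here, not pandas's actual implementation), the hypothetical function below derives 'n' from the first data row:

```python
import csv
import io


def implicit_index_width(text):
    """Number of leading columns that would be treated as the index.

    Simplified model, not pandas code: compares the header width with the
    width of the first data row only.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return 0
    header, first_row = rows[0], rows[1]
    # Extra fields beyond the header are what gets shifted into the index.
    return max(len(first_row) - len(header), 0)


# Same content as data in the session below: two matched extra commas per row.
data = "a,b,c\n4,apple,bat,,\n8,orange,cow,,"
n = implicit_index_width(data)
print(n)
```

For this data the header has 3 fields and each row has 5, so n is 2, which matches the two-level MultiIndex shown in the session below.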

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import StringIO

In [4]: pd.__version__
Out[4]: '0.10.1'

In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'  # matched extra commas

In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'  # mismatched extra commas

In [7]: print data
a,b,c
4,apple,bat,,
8,orange,cow,,

In [8]: print data2
a,b,c
4,apple,bat,,
8,orange,cow,,,

In [9]: df = pd.read_csv(StringIO.StringIO(data))

In [10]: df
Out[10]:
            a   b   c
4 apple   bat NaN NaN
8 orange  cow NaN NaN

In [11]: df.index
Out[11]: MultiIndex [(4, apple), (8, orange)]

In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

IndexError Traceback (most recent call last)
in ()
----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             # self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

IndexError: list index out of range

In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

CParserError Traceback (most recent call last)
in ()
----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
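The "skip malformed rows without raising" behaviour requested above could look roughly like the sketch below. It is only an illustration using the standard library: skip_malformed_rows is a hypothetical helper, not a pandas function, and it assumes (as the error message suggests) that the expected field count is taken from the first data row.

```python
import csv
import io


def skip_malformed_rows(text):
    """Split CSV text into kept rows and a report of skipped lines.

    Hypothetical helper, not part of pandas. The expected width is taken
    from the first data row, mirroring the message
    "Expected 5 fields in line 3, saw 6".
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    kept, skipped = [header], []
    expected = None
    for row in reader:
        if expected is None:
            expected = len(row)
        if len(row) == expected:
            kept.append(row)
        else:
            # reader.line_num is the 1-based number of the line just read.
            skipped.append((reader.line_num, expected, len(row)))
    return kept, skipped


# Same content as data2 in the session above.
data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"
kept, skipped = skip_malformed_rows(data2)
for line_no, expected, saw in skipped:
    print("Skipping line %d: expected %d fields, saw %d" % (line_no, expected, saw))
```

Collecting the skipped line numbers instead of raising lets a caller decide afterwards whether a handful of bad rows in a 100K-line file is acceptable.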
