read_csv fails for UTF-16 with BOM (maybe also other encodings with BOM) and skiprows · Issue #2298 · pandas-dev/pandas

��Name  Ad performance report                           
Type    Ad                                  
Frequency   One time                            
Date range  Custom date range                       
Dates   Sep 19, 2012-Nov 19, 2012                       
Account Day Campaign    Ad group    Ad ID   Client name Destination URL Impressions Clicks  Cost    Avg. position   Status  Conv. (1-per-click)
Categories 2    15.11.2012  something: ��;�C�7�:�8� [somethinglse]{test}: ��;�C�7�:�8�  16902484818 Categories 2    http://www.someurl?ad=291012    333 2   4.7 5.5 approved    0

I suspect that the bytes at the beginning of the file are the BOM and that this causes problems when rows are skipped. Without skiprows, everything gets read into one row, with the first column containing the BOM.

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns:
��Name\tAd performance report\t\t\t\t\t\t\t\t\t\t\t
...

pd.read_csv('/home/arthur/Desktop/client 139 - ads report/test_pandas.csv', sep='\t', skiprows=5)
/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, header, index_col, names, skiprows, skipfooter, skip_footer, na_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    361                     buffer_lines=buffer_lines)
    362 
--> 363         return _read(filepath_or_buffer, kwds)
    364 
    365     parser_f.__name__ = name

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    185 
    186     # Create the parser.
--> 187     parser = TextFileReader(filepath_or_buffer, **kwds)
    188 
    189     if nrows is not None:

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    465         self.options, self.engine = self._clean_options(options, engine)
    466 
--> 467         self._make_engine(self.engine)
    468 
    469     def _get_options_with_defaults(self, engine):

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _make_engine(self, engine)
    567     def _make_engine(self, engine='c'):
    568         if engine == 'c':
--> 569             self._engine = CParserWrapper(self.f, **self.options)
    570         else:
    571             if engine == 'python':

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, src, **kwds)
    787         ParserBase.__init__(self, kwds)
    788 
--> 789         self._reader = _parser.TextReader(src, **kwds)
    790 
    791         # XXX

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3579)()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4590)()

CParserError: Passed header=0 but only 0 lines in file
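
To illustrate the BOM hypothesis above, here is a minimal stdlib-only sketch of one way to handle such a file: sniff the BOM from the raw bytes, decode with the matching codec so the BOM is consumed, and only then skip the preamble rows. This is not how pandas parses files internally; `sniff_bom_encoding` and `read_rows_after_preamble` are hypothetical helper names, and the sample data is a shortened stand-in for the report above.

```python
import codecs
import csv

# BOM table, checked in this order so that the UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(raw, default="utf-8"):
    """Return a codec name based on the BOM at the start of raw bytes.

    Hypothetical helper: falls back to `default` when no BOM is found.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


def read_rows_after_preamble(raw, skiprows, sep="\t"):
    """Decode first (which consumes the BOM), then skip rows, then split."""
    text = raw.decode(sniff_bom_encoding(raw))
    lines = text.splitlines()[skiprows:]
    return list(csv.reader(lines, delimiter=sep))


# Shortened stand-in for the report: two preamble lines, header, one data row.
sample = (
    "Name\tAd performance report\n"
    "Type\tAd\n"
    "Account\tDay\n"
    "Categories 2\t15.11.2012\n"
)
raw = sample.encode("utf-16")  # the "utf-16" codec prepends a BOM on encode

rows = read_rows_after_preamble(raw, skiprows=2)
print(rows)
```

The key point of the sketch is the ordering: rows are counted on decoded text, so the BOM and the two-byte line terminators of UTF-16 cannot interfere with the skip. Whether passing an explicit encoding to read_csv (the parameter appears in the signature in the traceback) has the same effect in this pandas version is something I have not verified.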