Pandas v0.15.2 breaks read_csv with skiprows, delim_whitespace=True and explicit naming of columns · Issue #9079 · pandas-dev/pandas

Hi,

The latest pandas release, 0.15.2, can no longer read a file that 0.15.1 read without problems.

The file has the following format.

SMOSMANIA  SMOSMANIA       Narbonne          43.15000     2.95670  112.00    0.05    0.05 ThetaProbe-ML2X 
2007/01/01 01:00   0.2140 U M 
2007/01/01 02:00   0.2140 U M 
2007/01/01 03:00   0.2140 U M 

The file can be found at the URL assigned to fname below.

Since the first line does not have the same number of fields as the data rows below it, I use skiprows=1 and specify the column names explicitly.

import pandas as pd

fname='https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm'

pd.read_csv(fname, skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
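For anyone who wants to check this without network access, here is a minimal sketch that feeds the four lines quoted above to read_csv from memory. It is not part of my original script: the `sample` string (including the trailing space on each line, copied from the excerpt) and the use of `sep=r"\s+"` are assumptions on my part. `sep=r"\s+"` stands in for `delim_whitespace=True`, which is deprecated in recent pandas releases; on 0.15.x the original keyword would be used instead.

```python
import io

import pandas as pd

# The first four lines of the file as quoted above. The trailing space at
# the end of each line is copied from the excerpt and is an assumption
# about the raw file contents.
sample = (
    "SMOSMANIA  SMOSMANIA       Narbonne          43.15000     2.95670"
    "  112.00    0.05    0.05 ThetaProbe-ML2X \n"
    "2007/01/01 01:00   0.2140 U M \n"
    "2007/01/01 02:00   0.2140 U M \n"
    "2007/01/01 03:00   0.2140 U M \n"
)

names = ['date', 'time', 'variable', 'flag', 'orig_flag']

# Skip the station line, split on runs of whitespace, and name the columns
# explicitly. sep=r"\s+" is used here in place of delim_whitespace=True.
df = pd.read_csv(io.StringIO(sample), skiprows=1, sep=r"\s+", names=names)

print(df.shape)
print(df)
```

A parser that behaves as 0.15.1 did should report three rows and five columns here; an empty DataFrame would indicate the behaviour described in this issue.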

Please compare the two sessions below. The first uses pandas 0.15.2, the second 0.15.1.

0.15.2

In [1]: import pandas as pd

In [2]: from pandas.util.print_versions import show_versions

In [3]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

In [4]: pd.read_csv('https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm', skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
Out[4]: 
Empty DataFrame
Columns: [date, time, variable, flag, orig_flag]
Index: []

0.15.1

In [1]: import pandas as pd

In [2]: from pandas.util.print_versions import show_versions

In [3]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

In [4]: pd.read_csv('https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm', skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
Out[4]: 
           date   time  variable flag orig_flag
0    2007/01/01  01:00    0.2140    U         M
1    2007/01/01  02:00    0.2140    U         M
2    2007/01/01  03:00    0.2140    U         M
3    2007/01/01  04:00    0.2140    U         M
4    2007/01/01  05:00    0.2140    U         M
5    2007/01/01  06:00    0.2140    U         M
6    2007/01/01  07:00    0.2135    U         M
7    2007/01/01  08:00    0.2135    U         M
8    2007/01/01  09:00    0.2135    U         M
9    2007/01/01  10:00    0.2140    U         M
10   2007/01/01  11:00    0.2140    U         M
11   2007/01/01  12:00    0.2145    U         M
12   2007/01/01  13:00    0.2149    U         M
13   2007/01/01  14:00    0.2149    U         M
14   2007/01/01  15:00    0.2149    U         M
15   2007/01/01  16:00    0.2145    U         M
16   2007/01/01  17:00    0.2135    U         M
17   2007/01/01  18:00    0.2130    U         M
18   2007/01/01  19:00    0.2130    U         M
19   2007/01/01  20:00    0.2126    U         M
20   2007/01/01  21:00    0.2121    U         M
21   2007/01/01  22:00    0.2121    U       NaN
22   2007/01/01  23:00    0.2116    U         M
23   2007/01/02  00:00    0.2116    U         M
24   2007/01/02  01:00    0.2112    U         M
25   2007/01/02  02:00    0.2107    U         M
26   2007/01/02  03:00    0.2107    U         M
27   2007/01/02  04:00    0.2102    U         M
28   2007/01/02  05:00    0.2098    U         M
29   2007/01/02  06:00    0.2098    U         M
..          ...    ...       ...  ...       ...
711  2007/01/30  18:00    0.1538    U         M
712  2007/01/30  19:00    0.1534    U         M
713  2007/01/30  20:00    0.1534    U         M
714  2007/01/30  21:00    0.1534    U         M
715  2007/01/30  22:00    0.1534    U         M
716  2007/01/30  23:00    0.1534    U         M
717  2007/01/31  00:00    0.1534    U         M
718  2007/01/31  01:00    0.1531    U         M
719  2007/01/31  02:00    0.1531    U         M
720  2007/01/31  03:00    0.1527    U         M
721  2007/01/31  04:00    0.1527    U         M
722  2007/01/31  05:00    0.1524    U         M
723  2007/01/31  06:00    0.1524    U         M
724  2007/01/31  07:00    0.1524    U         M
725  2007/01/31  08:00    0.1521    U         M
726  2007/01/31  09:00    0.1521    U         M
727  2007/01/31  10:00    0.1521    U         M
728  2007/01/31  11:00    0.1524    U         M
729  2007/01/31  12:00    0.1527    U         M
730  2007/01/31  13:00    0.1534    U         M
731  2007/01/31  14:00    0.1541    U         M
732  2007/01/31  15:00    0.1545    U         M
733  2007/01/31  16:00    0.1541    U         M
734  2007/01/31  17:00    0.1538    U         M
735  2007/01/31  18:00    0.1534    U         M
736  2007/01/31  19:00    0.1531    U         M
737  2007/01/31  20:00    0.1527    U         M
738  2007/01/31  21:00    0.1527    U         M
739  2007/01/31  22:00    0.1524    U         M
740  2007/01/31  23:00    0.1524    U         M

[741 rows x 5 columns]