Pandas v0.15.2 breaks read_csv with skiprows, delim_whitespace=True and explicit naming of columns · Issue #9079 · pandas-dev/pandas
Hi,
pandas 0.15.2, the latest version, can no longer read a file that 0.15.1 read without problems.
The file has the following format.
SMOSMANIA SMOSMANIA Narbonne 43.15000 2.95670 112.00 0.05 0.05 ThetaProbe-ML2X
2007/01/01 01:00 0.2140 U M
2007/01/01 02:00 0.2140 U M
2007/01/01 03:00 0.2140 U M
The full file is available at the URL assigned to fname below.
Since the number of fields in the first line is unrelated to the number of columns in the data below it, I use skiprows=1 and specify the column names explicitly.
fname='https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm'
pd.read_csv(fname, skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
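For anyone who wants to reproduce this without network access, here is a minimal sketch that feeds the four lines quoted above to the same call through an in-memory buffer. The names SAMPLE, NAMES and read_sample are made up for this example. It also assumes that sep=r"\s+" is an acceptable stand-in for delim_whitespace=True, which newer pandas releases deprecate. On a version without the regression this should return the three data rows; on an affected version I would expect the same empty frame shown below.

```python
import io

import pandas as pd

# First lines of the file as quoted above: one header line with nine
# fields, followed by data lines with five fields each.
SAMPLE = (
    "SMOSMANIA SMOSMANIA Narbonne 43.15000 2.95670 112.00 0.05 0.05 ThetaProbe-ML2X\n"
    "2007/01/01 01:00 0.2140 U M\n"
    "2007/01/01 02:00 0.2140 U M\n"
    "2007/01/01 03:00 0.2140 U M\n"
)

# Column names passed explicitly, exactly as in the failing call.
NAMES = ['date', 'time', 'variable', 'flag', 'orig_flag']


def read_sample(text=SAMPLE):
    """Parse the sample, skipping the header line (hypothetical helper)."""
    # Assumption: sep=r"\s+" splits on runs of whitespace, like delim_whitespace=True.
    return pd.read_csv(io.StringIO(text), skiprows=1, sep=r"\s+", names=NAMES)


df = read_sample()
print(df)
```

Checking df.shape against the expected number of data rows is a quick way to tell whether an installed version is affected.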
Please compare the two sessions below. The first uses pandas 0.15.2, the second 0.15.1.
0.15.2
In [1]: import pandas as pd
In [2]: from pandas.util.print_versions import show_versions
In [3]: show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
In [4]: pd.read_csv('https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm', skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
Out[4]:
Empty DataFrame
Columns: [date, time, variable, flag, orig_flag]
Index: []
0.15.1
In [1]: import pandas as pd
In [2]: from pandas.util.print_versions import show_versions
In [3]: show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.1
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.6.0
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
In [4]: pd.read_csv('https://raw.githubusercontent.com/TUW-GEO/pytesmo/master/tests/test_ismn/test_data/format_header_values/SMOSMANIA/SMOSMANIA_SMOSMANIA_Narbonne_sm_0.050000_0.050000_ThetaProbe-ML2X_20070101_20070131.stm', skiprows=1, delim_whitespace=True, names=['date', 'time', 'variable','flag','orig_flag'])
Out[4]:
date time variable flag orig_flag
0 2007/01/01 01:00 0.2140 U M
1 2007/01/01 02:00 0.2140 U M
2 2007/01/01 03:00 0.2140 U M
3 2007/01/01 04:00 0.2140 U M
4 2007/01/01 05:00 0.2140 U M
5 2007/01/01 06:00 0.2140 U M
6 2007/01/01 07:00 0.2135 U M
7 2007/01/01 08:00 0.2135 U M
8 2007/01/01 09:00 0.2135 U M
9 2007/01/01 10:00 0.2140 U M
10 2007/01/01 11:00 0.2140 U M
11 2007/01/01 12:00 0.2145 U M
12 2007/01/01 13:00 0.2149 U M
13 2007/01/01 14:00 0.2149 U M
14 2007/01/01 15:00 0.2149 U M
15 2007/01/01 16:00 0.2145 U M
16 2007/01/01 17:00 0.2135 U M
17 2007/01/01 18:00 0.2130 U M
18 2007/01/01 19:00 0.2130 U M
19 2007/01/01 20:00 0.2126 U M
20 2007/01/01 21:00 0.2121 U M
21 2007/01/01 22:00 0.2121 U NaN
22 2007/01/01 23:00 0.2116 U M
23 2007/01/02 00:00 0.2116 U M
24 2007/01/02 01:00 0.2112 U M
25 2007/01/02 02:00 0.2107 U M
26 2007/01/02 03:00 0.2107 U M
27 2007/01/02 04:00 0.2102 U M
28 2007/01/02 05:00 0.2098 U M
29 2007/01/02 06:00 0.2098 U M
.. ... ... ... ... ...
711 2007/01/30 18:00 0.1538 U M
712 2007/01/30 19:00 0.1534 U M
713 2007/01/30 20:00 0.1534 U M
714 2007/01/30 21:00 0.1534 U M
715 2007/01/30 22:00 0.1534 U M
716 2007/01/30 23:00 0.1534 U M
717 2007/01/31 00:00 0.1534 U M
718 2007/01/31 01:00 0.1531 U M
719 2007/01/31 02:00 0.1531 U M
720 2007/01/31 03:00 0.1527 U M
721 2007/01/31 04:00 0.1527 U M
722 2007/01/31 05:00 0.1524 U M
723 2007/01/31 06:00 0.1524 U M
724 2007/01/31 07:00 0.1524 U M
725 2007/01/31 08:00 0.1521 U M
726 2007/01/31 09:00 0.1521 U M
727 2007/01/31 10:00 0.1521 U M
728 2007/01/31 11:00 0.1524 U M
729 2007/01/31 12:00 0.1527 U M
730 2007/01/31 13:00 0.1534 U M
731 2007/01/31 14:00 0.1541 U M
732 2007/01/31 15:00 0.1545 U M
733 2007/01/31 16:00 0.1541 U M
734 2007/01/31 17:00 0.1538 U M
735 2007/01/31 18:00 0.1534 U M
736 2007/01/31 19:00 0.1531 U M
737 2007/01/31 20:00 0.1527 U M
738 2007/01/31 21:00 0.1527 U M
739 2007/01/31 22:00 0.1524 U M
740 2007/01/31 23:00 0.1524 U M
[741 rows x 5 columns]