Incorrect skipping of lines with inline comments and printing warnings · Issue #16472 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
from io import StringIO
import numpy as np
import pandas as pd
test_input = u"""
1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line
# comment"""
df = pd.read_table(StringIO(test_input), comment='#', header=None, delimiter=r'\s+', skiprows=0, error_bad_lines=False)
print(df)
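# Expected: only lines with <= 2 fields should appear in df; the others should be skipped with a warning
expected = pd.DataFrame([[1, 2], [5, 2], [6, 2], [7, np.nan], [8, np.nan]], index=list(range(5)), columns=[0, 1])
# DataFrame.equals treats NaN values in the same position as equal; an elementwise == would not
assert df.equals(expected)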
Problem description
Only lines with <= 2 fields should appear in the df. All other lines should be skipped, and a warning for each skipped line should be printed to stderr.
Output
Skipping line 2: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 6
Skipping line 6: expected 2 fields, saw 4
   0  1
0  1  2
1  7  8
Problems:
- Lines that have more fields than expected and end with an inline comment are skipped, but the skip is never reported on stderr (lines 3-9).
- Lines that end with an inline comment preceded by a space are counted as having one more field than they actually contain, so the line is wrongly skipped or wrongly kept (lines 3, 5, 7, 9).
- The incorrect field count also caused lines 7 and 8 to be joined into a single row, which was not expected.
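To make the field accounting explicit, here is a minimal pure-Python sketch of the counting this report expects: drop everything from the comment character onward, then split the remainder on whitespace. `count_fields` is a hypothetical helper written for illustration only. It is not a pandas function and does not model how the pandas tokenizer works internally.

```python
def count_fields(line, comment="#"):
    """Count whitespace-separated fields after dropping an inline comment.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Keep only the text before the first comment character, if any.
    data = line.split(comment, 1)[0]
    # str.split() with no arguments splits on runs of whitespace and ignores
    # leading/trailing whitespace, so a space before '#' adds no extra field.
    return len(data.split())


sample_lines = [
    "3 2 3 # 3 fields",
    "4 2 3# 3 fields",
    "5 2 # 2 fields",
    "6 2# 2 fields",
    "7 # 1 field, NaN",
    "8# 1 field, NaN",
]
counts = [count_fields(line) for line in sample_lines]
print(counts)  # [3, 3, 2, 2, 1, 1]
```

Under this accounting, a space before the comment character makes no difference to the count, which is the behaviour the report asks for.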
Expected Output
Skipping line 2: expected 2 fields, saw 3
Skipping line 3: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 3
Skipping line 9: expected 2 fields, saw 3
   0    1
0  1  2.0
1  5  2.0
2  6  2.0
3  7  NaN
4  8  NaN
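The expected output above can be reproduced with a small reference parser in plain Python. This is only a sketch of the intended behaviour under the report's assumptions (two expected fields, whitespace delimiter, `#` comments, short rows padded with a missing value). `parse_expected` is a hypothetical name, `None` stands in for NaN, and the input is limited to the nine data lines of the test input, numbered from 1.

```python
import sys


def parse_expected(text, n_fields=2, comment="#"):
    """Return (rows, skipped) the way the report expects them.

    Hypothetical reference helper; it does not use or mirror pandas internals.
    """
    rows, skipped = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Drop the inline comment first, then split on whitespace.
        fields = line.split(comment, 1)[0].split()
        if not fields:
            continue  # blank or comment-only line: ignored silently
        if len(fields) > n_fields:
            # Too many fields: record the line number and the count seen.
            skipped.append((lineno, len(fields)))
            continue
        # Pad short rows with None, standing in for NaN.
        values = [int(f) for f in fields]
        rows.append(values + [None] * (n_fields - len(values)))
    return rows, skipped


data = """1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line"""

n_fields = 2
rows, skipped = parse_expected(data, n_fields=n_fields)
for lineno, seen in skipped:
    sys.stderr.write(
        "Skipping line %d: expected %d fields, saw %d\n" % (lineno, n_fields, seen)
    )
print(rows)
```

The sketch reports lines 2, 3, 4 and 9 as skipped and keeps rows 1, 5, 6, 7 and 8, matching the expected output listed in this report.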
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.14-200.fc25.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.3
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: 0.2.1