header argument in read_html() ignores empty trs, only within thead · Issue #21641 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

With a table that has no <thead>, read_html()'s header argument looks at the correct row:

>>> import pandas
>>> pandas.read_html('''<table><tbody><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[  Unnamed: 0 Unnamed: 1
0          A          B
1          a          b]

... but when a row is in a <thead>, Pandas behaves differently:

>>> pandas.read_html('''<table><thead><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr></thead><tbody><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[   A  B
0  a  b]

Problem description

Structurally, <thead> and <tbody> play the same role in HTML: both are containers for <tr> rows. It's unnatural to need a different header index depending on which element contains the rows. XPath and CSS selectors (e.g., //table//tr) count rows the same way in both cases; read_html() doesn't.
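To illustrate the consistency claim, here is a minimal sketch that walks both tables from the code sample in document order. It assumes the snippets are well-formed XML (they happen to be), so the standard library's ElementTree can stand in for a real XPath or CSS engine. The helper name rows_in_document_order is made up for this example.

```python
import xml.etree.ElementTree as ET

# The two tables from the code sample; both happen to be well-formed XML.
NO_THEAD = (
    "<table><tbody>"
    "<tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><td>a</td><td>b</td></tr>"
    "</tbody></table>"
)
WITH_THEAD = (
    "<table><thead>"
    "<tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr>"
    "</thead><tbody>"
    "<tr><td>a</td><td>b</td></tr>"
    "</tbody></table>"
)


def rows_in_document_order(markup):
    """Return each <tr> as a list of cell texts, in document order."""
    root = ET.fromstring(markup)
    # Element.iter("tr") visits descendants in document order, much like
    # //table//tr, regardless of whether a row sits in <thead> or <tbody>.
    return [[cell.text or "" for cell in tr] for tr in root.iter("tr")]


for name, markup in (("no thead", NO_THEAD), ("with thead", WITH_THEAD)):
    rows = rows_in_document_order(markup)
    print(name, "-> row 0:", rows[0], "| row 1:", rows[1])
```

In both tables the empty row is row 0 and the A/B row is row 1, which is the indexing one would expect header=0 to follow.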

In code, the error is that Pandas deletes empty <tr>s within <thead> before passing the remaining rows to pandas.io.html._data_to_frame(), which is where the header argument is parsed.
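The effect described above can be modelled with a toy function. This is not pandas code: header_row and its drop_empty_head_rows flag are hypothetical names, and the sketch only assumes what the paragraph states, namely that empty rows are dropped from the <thead> rows (and only those) before the header index is applied.

```python
def header_row(head_rows, body_rows, header=0, drop_empty_head_rows=True):
    """Toy model (not pandas internals): pick the row that `header` selects."""
    if drop_empty_head_rows:
        # Assumed behaviour: rows whose cells are all blank are removed,
        # but only from the <thead> rows, never from the <tbody> rows.
        head_rows = [row for row in head_rows if any(cell.strip() for cell in row)]
    return (head_rows + body_rows)[header]


EMPTY, LABELS, DATA = ["", ""], ["A", "B"], ["a", "b"]

# No <thead>: every row is a body row, so the empty row survives as row 0.
no_thead = header_row([], [EMPTY, LABELS, DATA])

# With <thead>: the empty row is dropped first, so row 0 becomes the A/B row.
with_thead = header_row([EMPTY, LABELS], [DATA])

# Without the dropping step, both layouts select the same row.
consistent = header_row([EMPTY, LABELS], [DATA], drop_empty_head_rows=False)

print(no_thead, with_thead, consistent)
```

Under these assumptions the model matches the two outputs in the code sample: the table without <thead> gets blank (Unnamed) column labels, while the table with <thead> gets A and B.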


Expected Output

Output of pd.show_versions()


pandas.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.11.3
scipy: None
pyarrow: 0.8.0
xarray: None
IPython: None
sphinx: 1.7.5
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None