header argument in read_html() ignores empty trs, only within thead · Issue #21641 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
With a table that has no <thead>, read_html()'s header argument looks at the correct row:
>>> pandas.read_html('''<table><tbody><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[ Unnamed: 0 Unnamed: 1
0 A B
1 a b]
... but when a row is in a <thead>, Pandas behaves differently:
>>> pandas.read_html('''<table><thead><tr><th></th><th></th></tr><tr><th>A</th><th>B</th></tr></thead><tbody><tr><td>a</td><td>b</td></tr></tbody></table>''', header=0)
[ A B
0 a b]
Problem description
Structurally, within HTML, <thead> and <tbody> serve the same purpose. It's unnatural to specify different <tr> indices depending on which element contains them. XPath and CSS selectors (e.g., //table//tr) are consistent, and read_html() isn't.
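To make the consistency point concrete, here is a minimal sketch that walks both sample tables with the limited XPath support in Python's standard library. It assumes the markup is well-formed XML (true for the two snippets above, not for HTML in general). The names NO_THEAD, WITH_THEAD and rows are illustrative and do not come from pandas.

```python
import xml.etree.ElementTree as ET

# The two tables from the code sample above, split across lines for readability.
NO_THEAD = (
    "<table><tbody>"
    "<tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr>"
    "<tr><td>a</td><td>b</td></tr>"
    "</tbody></table>"
)
WITH_THEAD = (
    "<table><thead>"
    "<tr><th></th><th></th></tr>"
    "<tr><th>A</th><th>B</th></tr>"
    "</thead><tbody>"
    "<tr><td>a</td><td>b</td></tr>"
    "</tbody></table>"
)


def rows(markup):
    """Return the cell text of every <tr>, in document order."""
    table = ET.fromstring(markup)
    # ".//tr" matches every <tr> descendant, whether it sits in <thead> or <tbody>.
    return [[cell.text or "" for cell in tr] for tr in table.findall(".//tr")]


print(rows(NO_THEAD))
print(rows(WITH_THEAD))
```

Both calls return the same three rows, with the empty row at index 0 and the A/B row at index 1, regardless of which element contains them.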
In code, the error is that Pandas deletes empty <tr>s within <thead> before passing them to pandas.io.html._data_to_frame(), which is where the header argument is parsed.
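The effect of that ordering can be modelled in a few lines. This is a hypothetical sketch, not the pandas implementation: pick_header and its drop_empty_head_rows flag are invented here purely to show how filtering empty header rows before applying the index shifts which row header=0 selects.

```python
def pick_header(head_rows, body_rows, header, drop_empty_head_rows):
    """Hypothetical model (not pandas code): choose the header row by index."""
    if drop_empty_head_rows:
        # Model of the reported behaviour: all-empty <thead> rows are
        # discarded before the header index is applied.
        head_rows = [row for row in head_rows if any(cell.strip() for cell in row)]
    all_rows = head_rows + body_rows
    return all_rows[header]


# Rows of the second sample table, split by containing element.
head = [["", ""], ["A", "B"]]
body = [["a", "b"]]

# Empty rows dropped first: index 0 lands on the A/B row, as reported.
print(pick_header(head, body, header=0, drop_empty_head_rows=True))

# Empty rows kept: index 0 lands on the empty row, matching the <tbody>-only table.
print(pick_header(head, body, header=0, drop_empty_head_rows=False))
```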
Expected Output
Output of pd.show_versions()
pandas.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.11.3
scipy: None
pyarrow: 0.8.0
xarray: None
IPython: None
sphinx: 1.7.5
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.4
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None