read_html() doesn't handle tables with multiple header rows · Issue #13434 · pandas-dev/pandas
The read_html() function seems to treat every <th> in a table as a column, even when the cells occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.
Code Sample, a copy-pastable example if possible
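import pandas as pd

# Build a small frame and make "Name" the index, so to_html() emits two header rows.
df = pd.DataFrame(
    columns=["Name", "Age", "Party"],
    data=[("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
df = df.set_index("Name")
html = df.to_html()
# Round-trip the generated HTML back through the parser.
df2 = pd.read_html(html)[0]
print(df2)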
This is the value of html, generated by the to_html() function on the original data frame:
| | Age | Party |
|---|---|---|
| Name | | |
| Hillary | 68 | D |
| Bernie | 74 | D |
| Donald | 69 | R |
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>Party</th>
</tr>
<tr>
<th>Name</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>Hillary</th>
<td>68</td>
<td>D</td>
</tr>
...
</tbody>
</table>
And this is the printed output of the newly-parsed dataframe df2:
  Unnamed: 0  Age Party  Name  Unnamed: 4  Unnamed: 5
0    Hillary   68     D   NaN         NaN         NaN
1     Bernie   74     D   NaN         NaN         NaN
2     Donald   69     R   NaN         NaN         NaN
What happens is that the to_html() function produces an HTML table with two header rows, one for the column names and one for the index name. However, the read_html() parser interprets each individual th cell as an expected column, resulting in twice the number of columns. Even worse, this produces a column with the same name as the original index but without any data.
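To make the doubling concrete, here is a minimal standard-library sketch that collects the th cells of the thead shown above, grouped by their tr. It is not pandas' parser: HeaderCollector, header_rows and HEADER_HTML are hypothetical names used only for illustration. Flattening the grouped cells gives six header cells, which matches the six columns in the output above, while grouping by row gives two rows of three.

```python
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect the text of <th> cells inside <thead>, grouped by <tr>.

    Hypothetical helper for illustration; not part of pandas.
    """

    def __init__(self):
        super().__init__()
        self.in_thead = False
        self.in_th = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self.in_thead = True
        elif tag == "tr" and self.in_thead:
            self.rows.append([])          # start a new header row
        elif tag == "th" and self.in_thead:
            self.in_th = True
            self.rows[-1].append("")      # empty cells stay as ""

    def handle_endtag(self, tag):
        if tag == "thead":
            self.in_thead = False
        elif tag == "th":
            self.in_th = False

    def handle_data(self, data):
        if self.in_thead and self.in_th:
            self.rows[-1][-1] += data.strip()


# The <thead> from the to_html() output in this report.
HEADER_HTML = """
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Age</th>
<th>Party</th>
</tr>
<tr>
<th>Name</th>
<th></th>
<th></th>
</tr>
</thead>
</table>
"""


def header_rows(html):
    """Return the header cells as a list of rows (one list per <tr>)."""
    collector = HeaderCollector()
    collector.feed(html)
    return collector.rows


rows = header_rows(HEADER_HTML)
flattened = [cell for row in rows for cell in row]

print(rows)            # [['', 'Age', 'Party'], ['Name', '', '']]
print(len(flattened))  # 6 cells when every <th> is treated as a column
```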
Expected Output
The read_html() parser could either handle the multi-row header correctly:
         Age Party
Name
Hillary   68     D
Bernie    74     D
Donald    69     R
Or it could just ignore any rows after the first one:
  Unnamed: 0  Age Party
0    Hillary   68     D
1     Bernie   74     D
2     Donald   69     R
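The two options could be sketched roughly as follows, assuming the header has already been split into one list of cell texts per tr. Both helpers (merge_index_row and first_row_only) are hypothetical and only illustrate the intended results; they do not describe how pandas resolves headers internally.

```python
def merge_index_row(header_rows):
    """Option 1: read the second header row as the index name.

    Assumes the layout shown in this report: the first row holds the
    column names after a blank corner cell, and the second row holds
    only the index name in its first cell.
    """
    first, second = header_rows[0], header_rows[1]
    index_name = second[0] or None
    columns = first[1:]
    return index_name, columns


def first_row_only(header_rows):
    """Option 2: ignore every header row after the first.

    Blank cells get a placeholder name in the same "Unnamed: N" style
    as the output above.
    """
    return [cell if cell else "Unnamed: %d" % i
            for i, cell in enumerate(header_rows[0])]


# Header cells of the example table, one list per <tr>.
header_rows = [["", "Age", "Party"], ["Name", "", ""]]

print(merge_index_row(header_rows))  # ('Name', ['Age', 'Party'])
print(first_row_only(header_rows))   # ['Unnamed: 0', 'Age', 'Party']
```

Option 1 round-trips the original frame, while option 2 loses the index name but at least keeps the column count equal to the number of data cells per row.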
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None