read_html() doesn't handle tables with multiple header rows

The read_html() function seems to treat every <th> in a table as a column, even if they occur in separate <tr>s. This means that it breaks even on simple tables generated by pandas' to_html() function.

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    columns=["Name", "Age", "Party"],
    data=[("Hillary", 68, "D"), ("Bernie", 74, "D"), ("Donald", 69, "R")])
df = df.set_index("Name")
# Round-trip the frame through HTML and parse the first table back
html = df.to_html()
df2 = pd.read_html(html)[0]
print(df2)

This is the value of html, generated by the to_html() function on the original data frame:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Age</th>
      <th>Party</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Hillary</th>
      <td>68</td>
      <td>D</td>
    </tr>
...
  </tbody>
</table>

And this is the printed output of the newly-parsed dataframe df2:

  Unnamed: 0  Age Party  Name  Unnamed: 4  Unnamed: 5
0    Hillary   68     D   NaN         NaN         NaN
1     Bernie   74     D   NaN         NaN         NaN
2     Donald   69     R   NaN         NaN         NaN

The to_html() function produces an HTML table with two header rows: one with the column names and one with the index name. However, the read_html() parser treats each individual <th> cell in the header as a separate column, which yields twice the expected number of columns. Even worse, one of those columns carries the name of the original index but contains no data.
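
To make the doubling concrete, here is a small standard-library sketch that groups the `<th>` cells of the header by `<tr>`. It is not how read_html() is implemented; `HeaderRowCollector` is a hypothetical helper written only for this illustration, and the HTML is the header portion of the to_html() output above.

```python
from html.parser import HTMLParser

# Header portion of the to_html() output shown above.
HTML = """
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Age</th>
      <th>Party</th>
    </tr>
    <tr>
      <th>Name</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
</table>
"""


class HeaderRowCollector(HTMLParser):
    """Hypothetical helper: collect <th> text inside <thead>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []  # one list of cell texts per header <tr>
        self._in_thead = False
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr" and self._in_thead:
            self.rows.append([])
        elif tag == "th" and self._in_thead and self.rows:
            self._in_th = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.rows[-1][-1] += data.strip()


collector = HeaderRowCollector()
collector.feed(HTML)
header_rows = collector.rows

# Flattening every <th> into one list mirrors the six-column header in df2.
flattened = [cell for row in header_rows for cell in row]
print(header_rows)
print(len(flattened))
```

Each header row has three cells, so a parser that keeps the rows separate sees three columns, while one that flattens all `<th>` cells sees six.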

Expected Output

The read_html parser could either treat the multi-row header fully correctly:

         Age Party
Name              
Hillary   68     D
Bernie    74     D
Donald    69     R

Or it could just ignore any rows after the first one:

  Unnamed: 0  Age Party
0    Hillary   68     D
1     Bernie   74     D
2     Donald   69     R
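
The two options above can be sketched on plain lists of header-cell text. This is only an illustration of the intended behaviour, not a proposed patch: `first_row_only` and `split_index_name` are hypothetical names, and the rule "a second row that fills only its first cell holds the index name" is an assumption based on the table in this report.

```python
def first_row_only(header_rows):
    """Second option: keep the first header row and ignore the rest."""
    return list(header_rows[0])


def split_index_name(header_rows):
    """First option: read a second row that fills only its first cell as the index name.

    Returns (index_name, column_names). index_name is None when the
    header does not match that pattern.
    """
    first = header_rows[0]
    if len(header_rows) == 2 and first and header_rows[1]:
        second = header_rows[1]
        only_first_filled = bool(second[0]) and not any(second[1:])
        if only_first_filled and not first[0]:
            return second[0], list(first[1:])
    return None, list(first)


# Header cell text from the table in this report, one list per <tr>.
header_rows = [["", "Age", "Party"], ["Name", "", ""]]

print(first_row_only(header_rows))
print(split_index_name(header_rows))
```

With the second option the empty first cell is left for the parser to label (it appears as "Unnamed: 0" in the output above); with the first option "Name" becomes the index name, and "Age" and "Party" remain the only data columns.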

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 21.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.1.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
