
Crash on read_html(url, flavor="bs4") if table has only one column · Issue #9178 · pandas-dev/pandas

I was trying to read a package tracking table from the Finnish post office's website and got the following traceback:

Traceback (most recent call last):
  File "./posti.py", line 69, in <module>
    dfs = read_html(html)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 851, in read_html
    parse_dates, tupleize_cols, thousands, attrs, encoding)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 721, in _parse
    infer_types, parse_dates, tupleize_cols, thousands))
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 609, in _data_to_frame
    _expand_elements(body)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 586, in _expand_elements
    lens = Series(lmap(len, body))
  File "/usr/lib/python3.4/site-packages/pandas/compat/__init__.py", line 87, in lmap
    return list(map(*args, **kwargs))
TypeError: len() of unsized object

I isolated the offending table into this script:

https://gist.github.com/boarpig/de4044f4188fac700c68

The problem seems to be in the _parse_raw_thead function:

def _parse_raw_thead(self, table):
    thead = self._parse_thead(table)
    res = []
    if thead:
        res = lmap(self._text_getter, self._parse_th(thead[0]))
    return np.array(res).squeeze() if res and len(res) == 1 else res

Here res contains ['Tapahtumat'], which np.array(res).squeeze() turns into a zero-dimensional array:

array('Tapahtumat', dtype='<U10')

This produces the error above, because len() is not defined for a zero-dimensional array.
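The behaviour can be reproduced without pandas. The sketch below copies only the return expression of _parse_raw_thead into a standalone function; the names squeeze_header and keep_header_as_list are made up for illustration and are not part of pandas. The second helper is just one possible way to avoid the problem, not necessarily how it should be fixed upstream.

```python
import numpy as np


def squeeze_header(res):
    # Mirrors the return expression of _parse_raw_thead quoted above.
    return np.array(res).squeeze() if res and len(res) == 1 else res


def keep_header_as_list(res):
    # Hypothetical alternative: never squeeze, so the result always has a length.
    return list(res)


header = squeeze_header(["Tapahtumat"])
print(header.shape)  # ()
print(header.ndim)   # 0

try:
    len(header)
    raised = False
except TypeError as exc:
    # A zero-dimensional array is "unsized", so len() fails here.
    raised = True
    print("TypeError:", exc)

# With two or more header cells the list is returned unchanged.
print(squeeze_header(["a", "b"]))                # ['a', 'b']

# Keeping the header as a list works for the single-column case too.
print(len(keep_header_as_list(["Tapahtumat"])))  # 1
```

With a single header cell, squeeze() removes the only axis, so the result has shape () and no length. Any table with two or more header cells skips the squeeze entirely, which would explain why only one-column tables hit the crash.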