Crash on read_html(url, flavor="bs4") if table has only one column · Issue #9178 · pandas-dev/pandas
I was trying to read a package tracking table from the Finnish post office's website and got the following traceback:
Traceback (most recent call last):
  File "./posti.py", line 69, in <module>
    dfs = read_html(html)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 851, in read_html
    parse_dates, tupleize_cols, thousands, attrs, encoding)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 721, in _parse
    infer_types, parse_dates, tupleize_cols, thousands))
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 609, in _data_to_frame
    _expand_elements(body)
  File "/usr/lib/python3.4/site-packages/pandas/io/html.py", line 586, in _expand_elements
    lens = Series(lmap(len, body))
  File "/usr/lib/python3.4/site-packages/pandas/compat/__init__.py", line 87, in lmap
    return list(map(*args, **kwargs))
TypeError: len() of unsized object
I isolated the offending table into this script:
https://gist.github.com/boarpig/de4044f4188fac700c68
The problem seems to be in the _parse_raw_thead function:
def _parse_raw_thead(self, table):
    thead = self._parse_thead(table)
    res = []
    if thead:
        res = lmap(self._text_getter, self._parse_th(thead[0]))
    return np.array(res).squeeze() if res and len(res) == 1 else res
Here res contains ['Tapahtumat'], which comes out of the np.array(res).squeeze() call as the zero-dimensional array
array('Tapahtumat', dtype='<U10')
This then produces the error shown above, because len() cannot be taken of a zero-dimensional array.
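The squeeze behaviour described above can be reproduced without pandas. The following is a minimal sketch that assumes only NumPy is installed. The variable names are illustrative, and the np.atleast_1d call at the end is just one possible way to restore a sized array, not necessarily how pandas resolved this issue.

```python
import numpy as np

# A header row with a single cell, as in the reported table.
res = ["Tapahtumat"]

# squeeze() drops every length-1 axis, so a one-element 1-D array
# collapses to a zero-dimensional array.
squeezed = np.array(res).squeeze()
print(squeezed.ndim, squeezed.shape)

# len() is undefined for zero-dimensional arrays and raises TypeError.
caught = False
try:
    len(squeezed)
except TypeError as exc:
    caught = True
    print("TypeError:", exc)

# Assumption: one way to get a sized array back is np.atleast_1d,
# which promotes a zero-dimensional array to shape (1,).
restored = np.atleast_1d(squeezed)
print(len(restored), restored[0])
```

With a header of two or more cells the condition len(res) == 1 is false, so the list is returned unchanged and the problem does not appear.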