REF/BUG/ENH/API: refactor read_html to use TextParser by cpcloud · Pull Request #4770 · pandas-dev/pandas (original) (raw)
closes #4697 (refactor issue) (REF/ENH)
closes #4700 (header inconsistency issue) (API)
closes #5029 (comma issue, added this data set, ordering issue) (BUG)
closes #5048 (header type conversion issue) (BUG)
closes #5066 (index_col issue) (BUG)
- [x] figure out `skiprows`, `header`, and `index_col` interaction (a somewhat longstanding `MultiIndex` sorting issue, I just took the long way to get there :))
- [x] spam url not working anymore (US gov "shutdown" is responsible for this, it correctly skips)
- [x] table ordering doc blurb/HTML gotchas (was an actual "bug", now fixed in this PR)
- [x] add tests for rows with a different length (this is already done by the existing tests)
@cpcloud Does this PR actually close #4679 ? That one is specific to Excel.
@jtratner Nope it doesn't, I think I must've mixed up 79 and 97. Thanks
this diff is very difficult to read ... sigh
I think you are trying to bump your stats with html files! lol
@jreback yep
last issue is for me to make sure that sticking with our old pal `codecs.open` with my chosen `errors` param is correct, and that we're not using plain `open` anymore. would be nice to handle encoding/decoding ... but that's for the future
@jreback nah ... i'm trying to avoid slowing down the test suite with a bunch of `@network` tests
raise Exception("invalid names passed _stack_arrays")
nitems, nstacked = len(items), len(stacked)
if nitems != nstacked:
    raise BadDataError('number of names in ref_items must equal the'
can you leave a note here that says "Caller must catch this error"
or I don't know, maybe not, just something like, if you think this could happen then you should catch this error and try to say something more meaningful.
done
I did a first pass. I'd probably like to go over it again and see what I see, but I'm sure it's good.
As an aside, can you help me understand how ordering works for the output tables? My assumption is that the ordering is deterministic. Does it follow the order in the HTML data that's passed in? (i.e., if you found all the line numbers of `<table>` elements, the order of output tables would be the same as the order of those line numbers [unless a table isn't parseable])
@cpcloud Maybe also add that note about table ordering to the html gotchas?
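For what it's worth, a quick check with two tables suggests document order is preserved (hypothetical markup, modern pandas API):

```python
from io import StringIO

import pandas as pd

# Two tables in the document; read_html should return them in source order.
html = """
<table><tr><th>first</th></tr><tr><td>1</td></tr></table>
<table><tr><th>second</th></tr><tr><td>2</td></tr></table>
"""
tables = pd.read_html(StringIO(html))
# tables[0] comes from the first <table> element, tables[1] from the second.
```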
look ok
on tupleize_cols your explanation is odd - no other functions have it as true (by default)
Probably a cycle in the HTML parse tree. Does copy-pasting just the table work?
Surprised that site works...
i think maybe a timeout parameter might be useful
I interrupted and this is the trace:
/home/alex/git/pandas/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
    838                          'data (you passed a negative value)')
    839     return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840                   parse_dates, tupleize_cols, thousands, attrs)

/home/alex/git/pandas/pandas/io/html.py in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
    700
    701     try:
--> 702         tables = p.parse_tables()
    703     except Exception as caught:
    704         retained = caught

/home/alex/git/pandas/pandas/io/html.py in parse_tables(self)
    172
    173     def parse_tables(self):
--> 174         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    175         return (self._build_table(table) for table in tables)
    176

/home/alex/git/pandas/pandas/io/html.py in _parse_tables(self, doc, match, attrs)
    396     def _parse_tables(self, doc, match, attrs):
    397         element_name = self._strainer.name
--> 398         tables = doc.find_all(element_name, attrs=attrs)
    399
    400         if not tables:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in find_all(self, name, attrs, recursive, text, limit, **kwargs)
   1165         if not recursive:
   1166             generator = self.children
-> 1167         return self._find_all(name, attrs, text, limit, generator, **kwargs)
   1168     findAll = find_all  # BS3
   1169     findChildren = find_all  # BS2

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in _find_all(self, name, attrs, text, limit, generator, **kwargs)
    483         # Optimization to find all tags with a given name.
    484         elif isinstance(name, basestring):
--> 485             return [element for element in generator
    486                     if isinstance(element, Tag) and element.name == name]
    487         else:

/usr/local/lib/python2.7/dist-packages/bs4/element.pyc in descendants(self)
   1182         current = self.contents[0]
   1183         while current is not stopNode:
-> 1184             yield current
   1185             current = current.next_element
   1186
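In the meantime, the timeout suggested above can be approximated by fetching the page yourself before handing the markup to `read_html` — a sketch using the modern (Python 3) stdlib; `read_html_with_timeout` is a hypothetical helper name, not part of pandas:

```python
import urllib.request
from io import StringIO

import pandas as pd


def read_html_with_timeout(url, timeout=10, **kwargs):
    # Fetch the page ourselves so we control the socket timeout,
    # then pass the downloaded markup to read_html as usual.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset)
    return pd.read_html(StringIO(html), **kwargs)
```

A hung server then raises `socket.timeout` instead of blocking the parse indefinitely.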
Yep ... @jseabold had a similar issue ... let me find it
essentially the borked html is in a cycle, in this case a node goes to its child then when the current node (the child) goes to the next element, it's actually the previous node (its parent) and on and on ...
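The cycle can be reproduced in miniature without bs4 — a hypothetical `Node` class standing in for the tags, plus the kind of visited-set guard that would break the loop:

```python
# Minimal sketch of the failure mode: a next-pointer walk that would loop
# forever on cyclic markup, and a visited-set guard that terminates it.
class Node:
    def __init__(self, name):
        self.name = name
        self.next_element = None


a, b = Node("parent"), Node("child")
a.next_element = b
b.next_element = a  # borked: the child points back at its parent


def walk(start):
    seen = set()
    cur = start
    # Stop as soon as we revisit a node instead of spinning forever.
    while cur is not None and id(cur) not in seen:
        seen.add(id(cur))
        yield cur
        cur = cur.next_element


names = [n.name for n in walk(a)]  # terminates despite the cycle
```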
Am I doing something wrong here:
pd.read_html("/home/alex/table.html",infer_types=False,header=[0])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0])
/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
838 'data (you passed a negative value)')
839 return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840 parse_dates, tupleize_cols, thousands, attrs)
/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
710 return [_data_to_frame(table, header, index_col, skiprows, infer_types,
711 parse_dates, tupleize_cols, thousands)
--> 712 for table in tables]
713
714
/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
600 skiprows=_get_skiprows(skiprows),
601 parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602 thousands=thousands)
603 df = tp.read()
604
/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
1173 """
1174 kwds['engine'] = 'python'
-> 1175 return TextFileReader(*args, **kwds)
1176
1177
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
485 self.options['has_index_names'] = kwds['has_index_names']
486
--> 487 self._make_engine(self.engine)
488
489 def _get_options_with_defaults(self, engine):
/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
601 elif engine == 'python-fwf':
602 klass = FixedWidthFieldParser
--> 603 self._engine = klass(self.f, **self.options)
604
605 def _failover_to_python(self):
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
1296 if len(self.columns) > 1:
1297 self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298 self.columns, self.index_names, self.col_names)
1299 else:
1300 self.columns = self.columns[0]
/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
736 # if we find 'Unnamed' all of a single level, then our header was too long
737 for n in range(len(columns[0])):
--> 738 if all([ 'Unnamed' in c[n] for c in columns ]):
739 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
740 "multi_index of columns" % ','.join([ str(x) for x in self.header ]))
TypeError: argument of type 'float' is not iterable
This works fine:
pd.read_html("/home/alex/table.html",infer_types=False,header=0)
this is an issue with TextParser
if you need to pass only a single header just use the second version
it doesn't really make a whole lot of sense to pass a singleton list if you just want the first row anyway
only pass a list if you need more than 1 row as a MultiIndex `header`
i'll open an issue about TextParser
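For a well-formed table the single-row vs list distinction looks like this (hypothetical markup, modern pandas API):

```python
from io import StringIO

import pandas as pd

# Hypothetical table with two header rows followed by one data row.
html = """
<table>
  <tr><td>x</td><td>y</td></tr>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""
# A plain int when you want a single header row...
single = pd.read_html(StringIO(html), header=0)[0]
# ...and a list only when you actually want a MultiIndex.
multi = pd.read_html(StringIO(html), header=[0, 1])[0]
```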
That might have been a poor example, I see this issue even with a proper list:
In [6]: pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-1b482817d4da> in <module>()
----> 1 pd.read_html("/home/alex/table.html",infer_types=False,header=[0,1])
/home/alex/git/pandas/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, infer_types, attrs, parse_dates, tupleize_cols, thousands)
838 'data (you passed a negative value)')
839 return _parse(flavor, io, match, header, index_col, skiprows, infer_types,
--> 840 parse_dates, tupleize_cols, thousands, attrs)
/home/alex/git/pandas/pandas/io/html.pyc in _parse(flavor, io, match, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands, attrs)
710 return [_data_to_frame(table, header, index_col, skiprows, infer_types,
711 parse_dates, tupleize_cols, thousands)
--> 712 for table in tables]
713
714
/home/alex/git/pandas/pandas/io/html.pyc in _data_to_frame(data, header, index_col, skiprows, infer_types, parse_dates, tupleize_cols, thousands)
600 skiprows=_get_skiprows(skiprows),
601 parse_dates=parse_dates, tupleize_cols=tupleize_cols,
--> 602 thousands=thousands)
603 df = tp.read()
604
/home/alex/git/pandas/pandas/io/parsers.pyc in TextParser(*args, **kwds)
1173 """
1174 kwds['engine'] = 'python'
-> 1175 return TextFileReader(*args, **kwds)
1176
1177
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
485 self.options['has_index_names'] = kwds['has_index_names']
486
--> 487 self._make_engine(self.engine)
488
489 def _get_options_with_defaults(self, engine):
/home/alex/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
601 elif engine == 'python-fwf':
602 klass = FixedWidthFieldParser
--> 603 self._engine = klass(self.f, **self.options)
604
605 def _failover_to_python(self):
/home/alex/git/pandas/pandas/io/parsers.pyc in __init__(self, f, **kwds)
1296 if len(self.columns) > 1:
1297 self.columns, self.index_names, self.col_names, _ = self._extract_multi_indexer_columns(
-> 1298 self.columns, self.index_names, self.col_names)
1299 else:
1300 self.columns = self.columns[0]
/home/alex/git/pandas/pandas/io/parsers.pyc in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
736 # if we find 'Unnamed' all of a single level, then our header was too long
737 for n in range(len(columns[0])):
--> 738 if all([ 'Unnamed' in c[n] for c in columns ]):
739 raise _parser.CParserError("Passed header=[%s] are too many rows for this "
740 "multi_index of columns" % ','.join([ str(x) for x in self.header ]))
TypeError: argument of type 'float' is not iterable
What does your table look like?
GitHub actually supports tables:
|  | Three months ended April 30 |  | Six months ended April 30 |  |
| --- | --- | --- | --- | --- |
| In millions | 2013 | 2012 | 2013 | 2012 |
| Net revenue: |  |  |  |  |
| Notebooks | $ 3,718 | $ 4,900 | $ 7,846 | $ 9,842 |
| Desktops | 3,103 | 3,827 | 6,424 | 7,033 |
| Workstations | 521 | 537 | 1,056 | 1,072 |
| Other | 242 | 206 | 462 | 415 |
| Personal Systems | 7,584 | 9,470 | 15,788 | 18,362 |
| Supplies | 4,122 | 4,060 | 8,015 | 8,139 |
| Commercial Hardware | 1,398 | 1,479 | 2,752 | 2,968 |
| Consumer Hardware | 561 | 593 | 1,240 | 1,283 |
| Printing | 6,081 | 6,132 | 12,007 | 12,390 |
| Printing and Personal Systems Group | 13,665 | 15,602 | 27,795 | 30,752 |
| Industry Standard Servers | 2,806 | 3,186 | 5,800 | 6,258 |
| Technology Services | 2,272 | 2,335 | 4,515 | 4,599 |
| Storage | 857 | 990 | 1,690 | 1,945 |
| Networking | 618 | 614 | 1,226 | 1,200 |
| Business Critical Systems | 266 | 421 | 572 | 826 |
| Enterprise Group | 6,819 | 7,546 | 13,803 | 14,828 |
| Infrastructure Technology Outsourcing | 3,721 | 3,954 | 7,457 | 7,934 |
| Application and Business Services | 2,278 | 2,535 | 4,461 | 4,926 |
| Enterprise Services | 5,999 | 6,489 | 11,918 | 12,860 |
| Software | 941 | 970 | 1,867 | 1,916 |
| HP Financial Services | 881 | 968 | 1,838 | 1,918 |
| Corporate Investments | 10 | 7 | 14 | 37 |
| Total segments | 28,315 | 31,582 | 57,235 | 62,311 |
| Eliminations of intersegment net revenue and other | (733) | (889) | (1,294) | (1,582) |
| Total HP consolidated net revenue | $ 27,582 | $ 30,693 | $ 55,941 | $ 60,729 |
im betting that if you try to get the part of the table after the header it will work ... otherwise i'll have to take a look later
Aside from the leading $ and trailing ) (which are like that in the HTML), GitHub does a great job of rendering that table.
So this seems to work better:
ret = pd.read_html("/home/alex/table.html",infer_types=False,skiprows=3)[0]
but I am still left with a lot of nan's:
0 1 2 3 4 \
0 Net revenue: nan nan nan nan
1 Notebooks nan $ 3718 nan
2 Desktops nan nan 3103 nan
3 Workstations nan nan 521 nan
4 Other nan nan 242 nan
5 nan nan nan nan nan
6 Personal Systems nan nan 7584 nan
7 nan nan nan nan nan
8 Supplies nan nan 4122 nan
9 Commercial Hardware nan nan 1398 nan
10 Consumer Hardware nan nan 561 nan
11 nan nan nan nan nan
12 Printing nan nan 6081 nan
13 nan nan nan nan nan
14 Printing and Personal Systems Group nan nan 13665 nan
15 nan nan nan nan nan
16 Industry Standard Servers nan nan 2806 nan
17 Technology Services nan nan 2272 nan
18 Storage nan nan 857 nan
19 Networking nan nan 618 nan
20 Business Critical Systems nan nan 266 nan
21 nan nan nan nan nan
22 Enterprise Group nan nan 6819 nan
23 nan nan nan nan nan
24 Infrastructure Technology Outsourcing nan nan 3721 nan
25 Application and Business Services nan nan 2278 nan
26 nan nan nan nan nan
27 Enterprise Services nan nan 5999 nan
28 nan nan nan nan nan
29 Software nan nan 941 nan
30 HP Financial Services nan nan 881 nan
31 Corporate Investments nan nan 10 nan
32 nan nan nan nan nan
33 Total segments nan nan 28315 nan
34 nan nan nan nan nan
35 Eliminations of intersegment net revenue and o... nan nan (733 )
36 nan nan nan nan nan
37 Total HP consolidated net revenue nan $ 27582 nan
38 nan nan nan nan nan
where those nans actually appear to be strings:
In [22]: ret[5][0]
Out[22]: u'nan'
yep as i suspected.
the string `nan`s are because you passed `infer_types=False`, which converts everything to a string (kind of kludgy i know, it's for back compat). `infer_types` will have no effect starting in 0.14
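Until then, the stringified output can be cleaned up by hand — a sketch on a tiny frame mimicking the output above, using `replace` and `pd.to_numeric` (the latter is a modern pandas function, not available in the 0.13-era code under discussion):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the output above: everything is a string,
# including the literal 'nan's.
ret = pd.DataFrame({0: ["Notebooks", "nan"], 2: ["3718", "nan"]})

# Turn the 'nan' strings back into real missing values...
ret = ret.replace("nan", np.nan)
# ...and coerce the numeric column ourselves.
ret[2] = pd.to_numeric(ret[2], errors="coerce")
```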
I think that `parse_dates` should have a little more documentation (I did look at the docs on `read_csv` as well).
It is not clear to me what the default value of `parse_dates` is.
Also, on `read_html`, `parse_dates` is listed as "bool" whereas on `read_csv` it is "boolean, list of ints or names, list of lists, or dict".
I am not sure what setting `parse_dates=False` does.
It seems that some values are being incorrectly parsed as dates (See columns 9-12):
In [38]: pd.read_html("/home/alex/table.html",skiprows=3,parse_dates=False)[0]
Out[38]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 Net revenue: NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
1 Notebooks NaN $ 3718 NaN $ 4900 NaN $ NaT NaN $ NaT NaN
2 Desktops NaN NaN 3103 NaN NaN 3827 NaN NaN NaT NaN NaN NaT NaN
3 Workstations NaN NaN 521 NaN NaN 537 NaN NaN NaT NaN NaN NaT NaN
4 Other NaN NaN 242 NaN NaN 206 NaN NaN NaT NaN NaN NaT NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
6 Personal Systems NaN NaN 7584 NaN NaN 9470 NaN NaN NaT NaN NaN NaT NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
8 Supplies NaN NaN 4122 NaN NaN 4060 NaN NaN NaT NaN NaN NaT NaN
9 Commercial Hardware NaN NaN 1398 NaN NaN 1479 NaN NaN NaT NaN NaN NaT NaN
10 Consumer Hardware NaN NaN 561 NaN NaN 593 NaN NaN NaT NaN NaN NaT NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
12 Printing NaN NaN 6081 NaN NaN 6132 NaN NaN NaT NaN NaN NaT NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
14 Printing and Personal Systems Group NaN NaN 13665 NaN NaN 15602 NaN NaN NaT NaN NaN NaT NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
16 Industry Standard Servers NaN NaN 2806 NaN NaN 3186 NaN NaN NaT NaN NaN NaT NaN
17 Technology Services NaN NaN 2272 NaN NaN 2335 NaN NaN NaT NaN NaN NaT NaN
18 Storage NaN NaN 857 NaN NaN 990 NaN NaN 1690-01-01 00:00:00 NaN NaN 1945-01-01 00:00:00 NaN
19 Networking NaN NaN 618 NaN NaN 614 NaN NaN NaT NaN NaN NaT NaN
20 Business Critical Systems NaN NaN 266 NaN NaN 421 NaN NaN NaT NaN NaN NaT NaN
21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
22 Enterprise Group NaN NaN 6819 NaN NaN 7546 NaN NaN NaT NaN NaN NaT NaN
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
24 Infrastructure Technology Outsourcing NaN NaN 3721 NaN NaN 3954 NaN NaN NaT NaN NaN NaT NaN
25 Application and Business Services NaN NaN 2278 NaN NaN 2535 NaN NaN NaT NaN NaN NaT NaN
26 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
27 Enterprise Services NaN NaN 5999 NaN NaN 6489 NaN NaN NaT NaN NaN NaT NaN
28 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
29 Software NaN NaN 941 NaN NaN 970 NaN NaN 1867-01-01 00:00:00 NaN NaN 1916-01-01 00:00:00 NaN
30 HP Financial Services NaN NaN 881 NaN NaN 968 NaN NaN 1838-01-01 00:00:00 NaN NaN 1918-01-01 00:00:00 NaN
31 Corporate Investments NaN NaN 10 NaN NaN 7 NaN NaN NaT NaN NaN NaT NaN
32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
33 Total segments NaN NaN 28315 NaN NaN 31582 NaN NaN NaT NaN NaN NaT NaN
34 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
35 Eliminations of intersegment net revenue and o... NaN NaN (733 ) NaN (889 ) NaN NaT ) NaN NaT )
36 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
37 Total HP consolidated net revenue NaN $ 27582 NaN $ 30693 NaN $ NaT NaN $ NaT NaN
38 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaT NaN NaN NaT NaN
What are we supposed to do if we want to use read_html and not have Pandas infer types then?
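One option in later pandas versions is the `converters` argument to `read_html`, which keeps cell text verbatim for the columns you name instead of letting type inference run (hypothetical one-column table):

```python
from io import StringIO

import pandas as pd

html = """
<table>
  <tr><th>code</th></tr>
  <tr><td>007</td></tr>
</table>
"""
# Without a converter, read_html would parse '007' as the integer 7;
# a str converter keeps the cell text exactly as written.
raw = pd.read_html(StringIO(html), converters={"code": str})[0]
```

For everything else, read normally and post-process with `astype`/`pd.to_numeric` as needed.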