Read html tables into DataFrames by cpcloud · Pull Request #3477 · pandas-dev/pandas
This PR adds new functionality for reading HTML tables from a URI, string, or file-like object into a DataFrame.
#3369
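For orientation, here is a minimal usage sketch (assuming the list-of-DataFrames return value and the match/flavor keywords discussed later in this thread; the URL is assumed to be the FDIC failed-bank list page, and any table-bearing page would work):

import pandas as pd

# every <table> on the page comes back as a list of DataFrames
url = "http://www.fdic.gov/bank/individual/failed/banklist.html"
dfs = pd.read_html(url)

# or pick out a particular table by text it contains and index the list
banks = pd.read_html(url, match="Bank Name")[0]
print(banks.head())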
@y-p Ready for those notes whenever you are.
First of all, thanks for all the work; much appreciated, especially being so thorough.
Here are some points, mostly I guess for consistency in the API; we can discuss what works best. I looked at the code a couple of days ago, so my apologies if some points are now stale:
1. The `io` arg might be a URL, in which case it should do the right thing. `pd.read_csv` does this (I voted this down originally and I was wrong). See the helper in `io.parsers._read` for a free lunch.
2. I agree with @jreback that you should return a list of DataFrames, and my vote goes for returning a singleton list even when a match is specified/only one table matches. I see this as `pd.read_html()`. The user is then free to index the returned list, so I suggest jettisoning `table_number`.
3. The docstring note on "expect some cleanup" is somewhat opaque. What does it mean?
4. `df.applymap` currently bombs on duplicate labels (#2786: df.applymap duplicates data when frame has dupe columns); need to work around that for now.
5. `_get_skiprows_iter` is not actually used anywhere.
6. It's not clear what other legal values for `flavor` are.
7. Dedicating all kwargs to be passed off somewhere is a back-compat killer, since if we need to change something later on we can't add anything to the signature; make it a named dict arg instead (perhaps yanked by name from `**kwargs`). Also, it doesn't make sense if you're using `lxml` directly, so if the user specified them but uses a different engine, it should probably raise in that case.
8. Name bikeshedding: s/convert/`infer_types`?
- URLs are now handled properly, insofar as lxml can handle them (lxml only handles the file, http, and ftp protocols). For bs4 there is a little bit of kludge around inferring the type of the input, since bs4 doesn't handle it as transparently as lxml.
- TODO
- TODO
- not used anymore
- used in _rows_to_frame
- flavor is gone because there's no reason to use bs4 if you have lxml (that might be too opinionated :)
- TODO
- TODO
One glaringly obvious thing I realized is that I haven't done thorough testing using bs4; I need to remove the CI line that installs lxml and run Travis again.
- The referenced `_read` wraps everything up into a .read()-able object, and abstracts that away from the underlying parse.
Are you aware that you can use detox/tox_prll.sh and scripts/use_build_cache.py to do all the testing locally and usually much faster?
Oh ok thanks. Didn't know that. Great.
See #3156. Read the instructions at the top of scripts/use_build_cache.py (create dir, export envvar), then run scripts/use_build_cache.py, which will monkey-patch setup.py, then run tox/tox_prll.sh/detox to launch the build; the second time around you only pay for running the tests. Runs py2.6/2.7/3.2/3.3.
I think I will print that out as a poster and frame it, it's so helpful.
The `**kwargs` in the case of lxml builds an expression that does what bs4 would do in that case; see `_build_node_xpath_expr`. Is that a good idea? It's sort of a hack, but I thought that was a "nice-to-have" from bs4. Basically, `[@attr1='value1' and @attr2='value2' and ... and @attrN='valueN']` is what is built.
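For reference, a tiny sketch of the idea (not the actual `_build_node_xpath_expr`, just an illustration of building that kind of predicate from a dict of attributes; `build_xpath_expr` is a hypothetical name):

import lxml.html

def build_xpath_expr(attrs):
    # e.g. {'class': 'wikitable', 'id': 'main'} ->
    # "[@class='wikitable' and @id='main']"
    if not attrs:
        return ''
    return '[%s]' % ' and '.join("@%s=%r" % (k, v) for k, v in attrs.items())

doc = lxml.html.fromstring("<table class='wikitable'><tr><td>x</td></tr></table>")
tables = doc.xpath('//table' + build_xpath_expr({'class': 'wikitable'}))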
It's fine, it should be a dict that's passed in, and not take over the entire kwargs by definition,
because that paints us into a corner wrt future changes. Just a general rule.
so not:

def a(foo, **kwargs):
    # delegating the entire **kwargs leaves no room to grow the signature later
    my_delegate(**kwargs)

but:

def a(foo, **kwargs):
    # pull the delegate's options out of a single named dict argument
    dkwds = kwargs.get("lxml_kwds", {})
    my_delegate(**dkwds)
@y-p Maybe I'm being a bit thick here, but `_read` tries to parse a delimited text file and thus fails when I pass in a URL/string/HTML blob. Is there something I'm missing? It doesn't return a `read`-able object; it actually tries to do the reading.
Yeah, my bad; TextFileReader is actually a line-oriented, delimited file reader, not an IO source abstraction, sorry.
Oh wow, tox where have you been all my life :D
You've obviously spent considerable time on this and it shows: clean code, great docstrings. Very nice.
In the "spam" example, I see that bs4 does a much better job than lxml in extracting text from nested tags, while lxml's ".text" doesn't go the extra mile.
I actually think the `flavor` argument was a good choice, since getting better data back trumps speed for many website-scraping situations. It's great that you did both, so might as well let the user enjoy whichever solves his/her problem best.
Also, have you checked the license on all the HTML pages you included in the PR?
Perhaps it's best to just leave everything as a @network test.
@y-p Cool. Thanks for the compliments. Will add `flavor` back in. Also tidying up the docstrings a bit; some of them are slightly inaccurate now. And yep, while bs4 is slower than lxml, it's more intuitive and generally gives back "what you want". I also wanted the challenge of learning some XPath along the way, so I thought why not do both?
Anyway, both are public domain, and I've only included those in the tests that are not @network, so they're okay.
Build is failing because of failed HDFStore test. I haven't touched that code. Investigating...
This is especially strange since I don't get the failure locally (I've rebased on upstream/master)
I have seen an occasional failure (not on this particular one), but can never reproduce them....prob some wonkiness with travis..
just rebase (squash some commits together in any event), and try again (btw...add PTF in your commit message to make it work faster)
ok, just did `git commit --amend -C HEAD` to create a new hash a la @y-p's tip in #3156. @jreback What is PTF?
Please Travis Faster... a y-p "hack" to make your whole testing run in about 4 min (it uses cached builds)... but you have to put it in a particular commit.
I think I'm done twiddling here. Sorry about that, just wanted to be totally thorough.
no problem, thanks for addressing all the issues. marking for merge in 0.11.1.
Need to tweak the docstring to mention that `io` accepts URLs. I suggest you make bs4 the default for better results out of the box and an easier install; I think the typical use case for parsing HTML values quality over speed.
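Something along those lines, i.e. let the user trade speed for leniency (a sketch of the suggested default, not the merged signature; the URL is an assumption):

import pandas as pd

url = "http://www.fdic.gov/bank/individual/failed/banklist.html"

# bs4: slower, but copes better with messy real-world markup (proposed default)
dfs = pd.read_html(url, match="Bank Name", flavor="bs4")

# lxml: much faster, but less forgiving of malformed pages
dfs = pd.read_html(url, match="Bank Name", flavor="lxml")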
Calling with the spam URL now works with lxml but fails with `flavor=bs4`.
@y-p can u post the error you're getting? The tests run successfully for both flavors over here.
Also, the `io` docstring mentions that it accepts URLs. Should I add more detail? Would be happy to do that.
@y-p Cool, thanks for the heads up. It's failing because you didn't provide a match, which I didn't write a test for. Doing that now and fixing up the bs4 issues with no match. It should return a list of DataFrames of all tables if no match is provided.
- bs4 doesn't deal well with no match provided; fix that
- make bs4 the default
- update the `flavor` docstring
- add tests for failing parameter values
- make docstrings even more awesome
The issue was that bs4 needs to be told to find the tables first, with attrs, and then search for text within the found tables, if any (it fails if none are found before searching for text, as per the docstring), whereas lxml can do it all at once via XPath; roughly as sketched below.
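Roughly, the two strategies compare like this (an illustrative sketch, not the PR's parser classes; `attrs` and the regex are stand-ins for the user-supplied search parameters):

import re
from bs4 import BeautifulSoup
from lxml.html import fromstring

html = "<table id='t'><tr><td>spam</td></tr></table>"
attrs, pattern = {'id': 't'}, re.compile('spam')

# bs4: find candidate tables by attrs first, then keep those containing the text
soup = BeautifulSoup(html, 'lxml')
tables = [t for t in soup.find_all('table', attrs=attrs) if t.find(text=pattern)]

# lxml: a single XPath expression does the attribute and text match at once
doc = fromstring(html)
tables = doc.xpath("//table[@id='t'][.//text()[re:test(., 'spam')]]",
                   namespaces={'re': 'http://exslt.org/regular-expressions'})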
@y-p This is ready to go.
Getting ready to merge this. What happens if `match` is provided and more than one table matches?
@y-p A list of `DataFrame`s is returned. That was the agreed-upon API, no? I can write a test quickly.
use import on 2.6
extra code from previous merges and importlib failure on tests
fix bs4 issues with no match provided
docstring storm!
markup slow tests (bank list data) and add tests for failing parameter values PTF
ok that is really it for docstring mania
add test for multiple matches
Yes, that's fine; it's hard to tell it's actually returning a singleton list, because the DataFrame repr expands.
Okay, cool. Test was added to show this. There are some examples in the documentation as well that note this.
merged as 6518c79. great functionality, thank you.
No problem :) I had fun hacking it out. Would love to contribute more.
Not a big deal, but a lot of your tests fail if you don't have bs4 installed... I installed it and all was fine (I had lxml installed though)... is this correct?
Hm, no. There's a function that raises `nose.SkipTest` after trying to import bs4 first, then lxml. Let me check this out.
Okay, got it. Wasn't skipping correctly.
Damn, not good. `bs4`'s fallback parser (`html.parser`) is not very lenient, thus tests are failing because it yields different results from lxml (which `bs4` uses by default if available, thus hiding this bug: I should have tested by mutually excluding lxml and bs4).
I will have this patch by this evening. It will support all of the parsers that bs4 supports.
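For context, the parser choice is just bs4's `features` argument (a sketch; the behaviour differences only show up on sloppier markup than this):

from bs4 import BeautifulSoup

markup = "<table><tr><td>a<td>b</table>"  # sloppy, real-world-style HTML

# what bs4 silently picks by default when lxml is installed
soup = BeautifulSoup(markup, features='lxml')
# the stdlib fallback used when it isn't; far less forgiving
soup = BeautifulSoup(markup, features='html.parser')
# html5lib builds a spec-valid HTML5 tree even from garbage input
soup = BeautifulSoup(markup, features='html5lib')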
Should I just raise the caught exception here instead? A UnicodeDecodeError occurs for some URLs with lxml (spam was one of them), and an IOError is thrown for invalid URLs, so: try to parse from a string if it's not a URL, and if that doesn't work, check to make sure the URL protocol is valid (for bs4); otherwise the only thing I can think of is that it's a faulty connection. Am I missing anything here? Also, why the heck isn't that parsing? Is it possible that Travis has a timeout on things using the network, or just on processes in general (obviously the OS has that)? Is there an additional mechanism?
OK, just decided to catch `UnicodeDecodeError`, and otherwise catch `Exception` and re-raise if the string is a URL and does not have a valid protocol.
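In pattern form, roughly (a sketch only; `_read_html_source`, `parser.parse`, `parser.parse_string`, and `_VALID_SCHEMES` are hypothetical names, not the PR's actual helpers):

from urlparse import urlparse  # Python 2; on Python 3 it's urllib.parse

_VALID_SCHEMES = ('http', 'https', 'ftp', 'file')

def _read_html_source(io, parser):
    try:
        return parser.parse(io)  # hypothetical backend call
    except UnicodeDecodeError:
        raise  # some URLs genuinely decode badly; surface that as-is
    except Exception:
        # if it looks like a URL but the scheme isn't supported, re-raise so the
        # user sees the real problem rather than a confusing string-parse error
        if '://' in io and urlparse(io).scheme not in _VALID_SCHEMES:
            raise
        # otherwise fall back to treating the input as a literal HTML blob
        return parser.parse_string(io)  # hypothetical backend call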
Current state of `read_html`

Compatibility Issues (check mark means something sane is returned)

- lxml
- bs4 + lxml
- bs4 + html5lib (actually gives more "valid" results since it returns a valid HTML5 page from a garbage page)

Additional Documentation

- Doc about what to do in the event of failure (will probably just say install both `bs4` and `lxml`).
- Doc the important user-facing results of the decisions made below:
  - The default column index now consists of the `<th>` elements of the `<thead>` element of a table; otherwise pandas defaults take over.
  - Any `<tfoot>` elements will be left as rows. An additional parameter, called `footer`, can be used to signify that the footer should be given a different index in the frame.
  - `<colgroup>` elements will not be processed (even though they could potentially be used to create a row `MultiIndex`).
  - bs4 parsing now checks for duplicate `<table>` elements using Python's `set` objects. So, for example, if a bunch of strings are found that have the same `<table>` parent, only one will be returned since their parent tables will all hash to the same value.
  - Elements that have empty strings are no longer removed.
  - `infer_types` now defaults to `False`.
  - `<th>` and `<td>` elements will be parsed from the header of a table.
Parsing Issues

- Deal with `<thead>`/`<tfoot>` elements that lack children. The issue here is that some HTML contains `<thead>` elements with no `<tr>` elements (but possibly containing text), thus the parser might leave out a row that is valid upon visual inspection (fewer surprises for users). As per the HTML spec for `<thead>` and `<tfoot>` elements, these elements can contain only 0 or more `<tr>` elements (obviously, this ignores nested tables, which I'm not going to deal with). This issue (of empty `<thead>`/`<tfoot>` elements) is then easily dealt with by parsing `<thead>`, `<tbody>` and `<tfoot>` elements separately (a rough sketch follows this list), which brings me to the next issue...
- Decide what to do with `<tfoot>` elements.
  - There's only one allowed in valid HTML; however, I will not raise an error if there is more than one, since most of the parsers (if you're using Python >= 2.7) will accept this (will add a test for this).
  - If there is a `<thead>` element, should a `MultiIndex` in the columns be created with the name 'footer'? I think no here.
  - If there's no `<thead>` but a `<tfoot>`, should `<tfoot>` become the column index? Again, I think no is the most practical choice here.
  - I think the least surprising option here would be to keep `<tfoot>` elements as rows, BUT:
    - Use standard indexing (integer based or something else if the user provides a related option).
    - Label the row containing the individual elements of the `<tfoot>` elements 'footer'.
- Make `<thead>` elements the default column index of the resulting `DataFrame`(s). Motivated by the default of `read_csv`. The `header` parameter in `read_html` would override this behavior.
- Include Ground Truth™ data in `banklist.csv` to compare against parsed data from `banklist.html`.
- No need to parse `colgroup` elements since those are only used for formatting.
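A rough sketch of the section-by-section parsing mentioned above (illustration only, using lxml; `_rows_by_section` is a hypothetical helper, not the PR's code):

from lxml.html import fromstring

def _rows_by_section(table):
    # pull rows out of <thead>, <tbody> and <tfoot> independently, so an
    # empty or text-only <thead> can't make us drop a visually valid row
    header = table.xpath('.//thead//tr')
    footer = table.xpath('.//tfoot//tr')
    body = table.xpath('.//tbody//tr') or [
        tr for tr in table.xpath('.//tr')
        if not tr.xpath('ancestor::thead or ancestor::tfoot')
    ]
    return header, body, footer

doc = fromstring("<table><thead><tr><th>a</th></tr></thead>"
                 "<tr><td>1</td></tr>"
                 "<tfoot><tr><td>total</td></tr></tfoot></table>")
header, body, footer = _rows_by_section(doc.xpath('//table')[0])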
Weirdness

- The 'Gold Canyon' row from the failed bank list is not parsed by `lxml` (by definition by the `bs4` + `lxml` combo as well); fix this. This is really annoying. This single element won't parse... arghhhhhhhh!!
Performance Comparison via vbench
- For users who are concerned about the relative performance of the 3 different parsers.
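For a quick local comparison in the meantime, stdlib `timeit` works (this is not the vbench suite itself; the local `banklist.html` path is assumed to exist):

import timeit

setup = "import pandas as pd; path = 'banklist.html'"  # assumed local copy of the test page

for flavor in ('lxml', 'bs4'):
    stmt = "pd.read_html(path, match='Bank Name', flavor=%r)" % flavor
    print(flavor, min(timeit.repeat(stmt, setup=setup, number=3, repeat=3)))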
Thanks guys for taking on this one, really great addition to pandas
Is the footer on a table usually a 'summary' type of row or something? One possibility is to return it as a separate DataFrame (as you are already returning a list of frames). You could also have an option to 'control' this, `footer='separate'`?
Yep, but they behave just like a regular row. Wouldn't it be more consistent to return a Series, since there can only be one unless someone is nesting other tables in their HTML? I'll probably go with the `footer='separate'` option.
I've decided not to support the native Python parsing library. It's god-awful (read: not lenient enough for the real world).
@jreback @y-p Is there a way to map a callable over the elements of a specific block type?
What I mean is, look at the `convert` method, which iterates over `items`.
@jreback @y-p What do you guys think about this named footer thing? I think it's not worth it... you can check out the branch if you want; it's called `read-html-named-footer`.
This pull request was closed.