Read html tables into DataFrames by cpcloud · Pull Request #3477 · pandas-dev/pandas


cpcloud

This PR adds new functionality for reading HTML tables from a URI, string, or file-like object into a DataFrame.
#3369
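
A minimal usage sketch of what this adds (assuming the eventual pandas API; the HTML here is made up for illustration):

    from io import StringIO
    import pandas as pd

    html = """
    <table>
      <tr><th>name</th><th>value</th></tr>
      <tr><td>spam</td><td>1</td></tr>
    </table>
    """
    # read_html accepts a URL, a string, or a file-like object, and
    # returns a list of DataFrames, one per parsed table
    dfs = pd.read_html(StringIO(html))
    print(dfs[0])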

@cpcloud

@y-p Ready for those notes whenever you are.

@ghost

First of all, thanks for all the work. Much appreciated, especially being so thorough.
Here are some points, mostly I guess for consistency in the API; we can discuss
what works best. I looked at the code a couple of days ago, my apologies if
some points are now stale:

@cpcloud

  1. URLs are now handled properly, insofar as lxml can handle them (lxml only handles the file, http, and ftp protocols). For bs4 there is a bit of a kludge around inferring the type of the input, since bs4 doesn't handle it as transparently as lxml (see the sketch after this list).
  2. TODO
  3. TODO
  4. not used anymore
  5. used in _rows_to_frame
  6. flavor is gone because there's no reason to use bs4 if you have lxml (that might be too opinionated :)
  7. TODO
  8. TODO
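
For illustration, a guess at the shape of that input-inference kludge (none of these names come from the PR, and the scheme list simply mirrors the protocols named in item 1):

    import os
    from urllib.parse import urlparse

    # protocols lxml can fetch on its own, per item 1 above
    _LXML_SCHEMES = ('http', 'ftp', 'file')

    def infer_source(obj):
        """Classify input as file-like, URL, path, or raw HTML string."""
        if hasattr(obj, 'read'):
            return 'file-like'
        if urlparse(obj).scheme in _LXML_SCHEMES:
            return 'url'
        if os.path.isfile(obj):
            return 'path'
        return 'html-string'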

One thing I realized that is glaringly obvious: I haven't done thorough testing using bs4. I need to remove the CI line that installs lxml and run Travis again.

@ghost

  1. the referenced read_ wraps everything up into a .read()-able object, and abstracts that
    away from the underlying parse.

Are you aware that you can use detox/tox_prll.sh and scripts/use_build_cache.py to
do all the testing locally and usually much faster?

@cpcloud

Oh ok thanks. Didn't know that. Great.

@ghost

see #3156.

read the instructions at the top of scripts/use_build_cache.py (create the dir, export the envvar), run scripts/use_build_cache.py, which will monkey-patch setup.py, then run tox/tox_prll.sh/detox
to launch the build; the second time around you only pay for running the tests.
Runs py2.6/2.7/3.2/3.3.

@cpcloud

I think I will print that out as a poster and frame it, it's so helpful.

@cpcloud

The **kwargs in the case of lxml builds an expression that does what bs4 would do in that case see _build_node_xpath_expr. Is that a good idea? It's sort of a hack, but I thought that was a "nice-to-have" from bs4. Basically, [@attr1='value1' and @attr2='value2' and ... and @attrN='valueN'] is what is built.
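
Something like this minimal sketch (the real helper is _build_node_xpath_expr; this body is an illustrative reconstruction, not the PR's code):

    def build_xpath_attr_expr(attrs):
        """Turn {'attr1': 'value1', ...} into "[@attr1='value1' and ...]"."""
        if not attrs:
            return ''
        conds = ' and '.join("@%s=%r" % (k, v) for k, v in attrs.items())
        return '[%s]' % conds

    print(build_xpath_attr_expr({'class': 'wikitable', 'id': 'main'}))
    # [@class='wikitable' and @id='main']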

@ghost

It's fine, it should be a dict that's passed in, and not take over the entire kwargs by definition,
because that paints us into a corner wrt future changes. Just a general rule.

so not:

    def a(foo, **kwargs):
        # hands the caller's entire kwargs to the delegate
        my_delegate(**kwargs)

but

    def a(foo, **kwargs):
        # only a designated dict is forwarded; the rest stays ours
        dkwds = kwargs.get("lxml_kwds", {})
        my_delegate(**dkwds)

@cpcloud

@y-p Maybe I'm being a bit thick here, but _read tries to parse a delimited text file and thus fails when I pass in a url/string/HTML blob. Is there something I'm missing? It doesn't return a readable object it actually tries to do the reading.

@ghost

Yeah, my bad, TextFileReader is actually a line-oriented, delimited file reader,
not an IO source abstraction, sorry.

@cpcloud

Oh wow, tox where have you been all my life :D

@ghost

You've obviously spent considerable time on this and it shows: clean code, great docstrings. Very nice.

in the "spam" example, I see that bs4 does a much better jobs then lxml in extracting
text from nested tags, while lxml's ".text" doesn't go the extra mile.
I actually think the flavor argument was a good choice, since getting better data back trumps
speed for many website scraping situations. It's great that you did both, so might as well let the
user enjoy whichever solves his/her problem best.

Also, have you checked the license on all the HTML pages you included in the PR?
Perhaps it's best to just leave everything as a @network test.

@cpcloud

@y-p Cool. Thanks for the compliments. Will add flavor back in. Also, I'm tidying up the docstrings a bit, since some of them are slightly inaccurate now. And yep, while bs4 is slower than lxml, it's more intuitive and generally gives back "what you want". I also wanted the challenge of learning some XPath along the way, so I thought, why not do both?

@cpcloud

Anyway, both are public domain, and I've only included those in the tests that are not @network, so they're okay.

@cpcloud

Build is failing because of a failed HDFStore test. I haven't touched that code. Investigating...

@cpcloud

This is especially strange since I don't get the failure locally (I've rebased on upstream/master)

@jreback

I have seen an occasional failure (not on this particular one), but can never reproduce them... probably some wonkiness with Travis.

just rebase (squash some commits together in any event) and try again (btw, add PTF in your commit message to make it run faster)

@cpcloud

ok. just did git commit --amend -C HEAD to create a new hash a la @y-p's tip in #3156. @jreback What is PTF?

@cpcloud

@jreback

Please Travis Faster... a y-p "hack" that makes your whole test run take about 4 minutes (it uses cached builds)... but you have to put it in a particular commit.

@cpcloud

I think I'm done twiddling here. Sorry about that, just wanted to be totally thorough.

@ghost

no problem, thanks for addressing all the issues. marking for merge in 0.11.1.

@ghost

need to tweak the docstring to mention that io accepts URLs; suggest you make bs4 the default for better results
out of the box and an easier install. I think the typical use case for parsing HTML values quality
over speed.

@ghost

calling with the spam URL now works with lxml but fails with flavor=bs4.

@cpcloud

@y-p can u post the error you're getting? The tests run successfully for both flavors over here.

@cpcloud

Also the io docstring mentions that it accepts urls. Should I add more detail? Would be happy to do that.

@cpcloud

@y-p Cool, thanks for the heads up. It's failing because you didn't provide a match, which I didn't write a test for. Doing that now and fixing up the bs4 issues with no match. It should return a list of DataFrames of all tables if no match is provided.

The issue was that bs4 needs to be told to find the tables first, with attrs and then search for text within the found tables if any (fails if none are found before searching for text, as per the docstring), whereas lxml can do it all at once via xpath.
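
A rough sketch of the contrast being described (helper names and details are illustrative, not the PR's actual code):

    import re
    from io import StringIO

    def bs4_matching_tables(html, match, attrs=None):
        from bs4 import BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        # step 1: find candidate tables by attributes
        tables = soup.find_all('table', attrs=attrs or {})
        if not tables:
            raise ValueError('no tables found matching %r' % (attrs,))
        # step 2: keep only the tables containing matching text
        pattern = re.compile(match)
        return [t for t in tables if t.find(string=pattern)]

    def lxml_matching_tables(html, match):
        from lxml.html import parse
        doc = parse(StringIO(html)).getroot()
        # lxml does both steps at once via XPath, using the EXSLT
        # regular-expressions extension
        expr = '//table[.//text()[re:test(., %r)]]' % match
        ns = {'re': 'http://exslt.org/regular-expressions'}
        return doc.xpath(expr, namespaces=ns)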

@cpcloud

@y-p This is ready to go.

@ghost

Getting ready to merge this. What happens if match is provided and more than one table matches?

@cpcloud

@y-p A list of DataFrames is returned. That was the agreed upon API, no? I can write a test quickly.
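
For example (a quick sketch against the agreed-upon API, assuming the eventual read_html signature):

    from io import StringIO
    import pandas as pd

    html = StringIO("""
    <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
    <table><tr><th>a</th></tr><tr><td>2</td></tr></table>
    """)
    # both tables contain text matching 'a', so both come back
    dfs = pd.read_html(html, match='a')
    assert isinstance(dfs, list) and len(dfs) == 2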

@cpcloud

use import on 2.6

extra code from previous merges and importlib failure on tests

fix bs4 issues with no match provided

docstring storm!

markup slow tests (bank list data) and add tests for failing parameter values PTF

ok that is really it for docstring mania

add test for multiple matches

@ghost

yes, that's fine; it's hard to tell it's actually returning a singleton list, because the DataFrame repr
expands.

@cpcloud

Okay, cool. Test was added to show this. There are some examples in the documentation as well that note this.

@ghost

merged as 6518c79. great functionality, thank you.

@cpcloud

No problem :) I had fun hacking it out. Would love to contribute more.


@jreback

not a big deal, but a lot of your tests fail if you don't have bs4 installed... I installed it and all was fine (I had lxml installed, though)... is this correct?

@cpcloud

Hm no. There's a function that raises nose.SkipTest after trying to import bs4 first, then lxml. Let me check this out.
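
The helper should look something like this sketch (illustrative, not the PR's exact code; nose was the test runner of that era):

    import nose

    def _skip_if_no_parser():
        """Skip the test unless at least one of bs4/lxml is importable."""
        for name in ('bs4', 'lxml'):
            try:
                __import__(name)
                return
            except ImportError:
                pass
        raise nose.SkipTest('no HTML parsing library installed')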

@cpcloud

Okay, got it. Wasn't skipping correctly.

@cpcloud

Damn, not good. bs4's fallback parser (html.parser) is not very lenient, so tests are failing because the stdlib parser yields different results from lxml (which bs4 uses by default if available, thus hiding this bug: I should have tested by mutually excluding lxml and bs4).

@cpcloud

I will have this patch by this evening. It will support all of the parsers that bs4 supports.

@jreback

@cpcloud

Should I just raise the caught exception here instead? UnicodeDecodeError occurs for some URLs with lxml (spam was one of them), and IOError is thrown for invalid URLs, so: try to parse from a string if it's not a URL, and then if that doesn't work, check that the URL protocol is valid (for bs4); otherwise the only thing I can think of is a faulty connection. Am I missing anything here? Also, why the heck isn't that parsing? Is it possible that Travis has a timeout on things using the network, or on processes in general (obviously the OS has that): is there an additional mechanism?

@cpcloud

OK, I just decided to catch UnicodeDecodeError, and otherwise catch Exception and re-raise if the string is a URL that does not have a valid protocol.
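
In sketch form, that policy looks something like this (the helper name and scheme list are assumptions, not the PR's code):

    from urllib.parse import urlparse

    _VALID_SCHEMES = ('http', 'ftp', 'file')  # assumed protocol set

    def _guarded_parse(io, parse):
        """Run `parse`, letting UnicodeDecodeError propagate untouched and
        flagging URLs whose protocol we can't handle."""
        try:
            return parse(io)
        except UnicodeDecodeError:
            raise  # bad bytes from a fetched page; nothing more to try
        except Exception:
            scheme = urlparse(io).scheme if isinstance(io, str) else ''
            if scheme and scheme not in _VALID_SCHEMES:
                raise ValueError('invalid URL protocol: %r' % scheme)
            raise  # anything else: surface the original error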

@cpcloud

Current state of read_html:

  * Compatibility Issues (a check mark means something sane is returned)
  * Additional Documentation
  * Parsing Issues
  * Weirdness
  * Performance Comparison via vbench

@wesm

Thanks guys for taking on this one, really great addition to pandas

@cpcloud

@jreback

is the footer on a table usually a 'summary' type of row or something? One possibility is to return it as a separate DataFrame (as you are already returning a list of frames). You could also have an option to 'control' this, footer='separate'?

@cpcloud

Yep, but they behave just like a regular row. Wouldn't it be more consistent to return a Series, since there can only be one (unless someone is nesting other tables in their HTML)? I'll probably go with the footer='separate' option.

@cpcloud

I've decided not to support the native Python parsing library. It's god-awful (read: not lenient enough for the real world).

@cpcloud

@jreback @y-p Is there a way to map a callable over the elements of a specific block type?

@jreback

what I mean is: look at the convert method, which iterates over items
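
Without reaching into the internals being pointed at, here is a column-level approximation of "map a callable over the elements of a specific block type" (all names here are made up for illustration):

    import numpy as np
    import pandas as pd

    def map_block_type(df, func, dtype=np.object_):
        """Apply `func` element-wise to every column of the given dtype."""
        out = df.copy()
        for col in df.columns[df.dtypes.values == dtype]:
            out[col] = df[col].map(func)
        return out

    df = pd.DataFrame({'a': [' 1 ', ' 2 '], 'b': [3, 4]})
    print(map_block_type(df, str.strip))  # strips only the object column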

@cpcloud

@jreback @y-p What do you guys think about this named footer thing? I think it's not worth it... you can check out the branch if you want; it's called read-html-named-footer.

This pull request was closed.