[Python-Dev] Fixing the XML batteries

Bill Janssen janssen at parc.com
Sat Dec 10 21:54:09 CET 2011


Stefan Behnel <stefan_ml at behnel.de> wrote:

Bill Janssen, 09.12.2011 19:15:
> I think another thing that might go into "refreshing the batteries" is a
> feature comparison of BeautifulSoup and HTML5lib against the stdlib
> competition, to see what needs to be added/revised. Having to switch to
> an outside package for parsing possibly invalid HTML is a pain.

Such a feature request should be worth a separate thread. Note, however, that html5lib is likely way too big to add it to the stdlib, and that BeautifulSoup lacks a parser for non-conforming HTML in Python 3, which would be the target release series for better HTML support. So, whatever library or API you would want to use for HTML processing is currently only the second question as long as Py3 lacks a real-world HTML parser in the stdlib, as well as a robust character detection mechanism. I don't think that can be fixed all that easily.

Sounds like it needs a PEP.

I'm only advocating spending some thought on what needs to be done -- whether outside libraries need to be adopted into the stdlib would be a step after that. But understanding why those libraries exist and are widely used should be a prerequisite to "refreshing" the stdlib's support.
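
As a starting point for that kind of comparison, here is a minimal sketch, assuming a current Python 3 stdlib, of what html.parser.HTMLParser reports when fed markup with unclosed tags. The TagLogger class and the tag_events function are hypothetical names made up for this illustration; only HTMLParser and its handler methods come from the stdlib.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records tag events in the order they are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag, whether or not it is ever closed.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once per closing tag that appears in the input.
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the stdlib tokenizer and return the recorded tag events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Two unclosed <p> tags and an unclosed <b>: non-conforming input.
    for event in tag_events("<p>one<p>two<b>bold</p>"):
        print(event)
```

The stdlib class only reports the tags as they occur in the input; it does not build a tree or close the dangling elements. Repairing the structure is left to the caller, which is one plausible reason people reach for the outside packages.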

Bill


