[Python-Dev] HTMLParser and HTML5

Matt mattbasta at gmail.com
Thu Jul 28 20:25:15 CEST 2011


Hello all,

I wanted to ask a few questions and start a discussion about HTML5 support within the HTMLParser class(es). Over on issue 670664, an inconsistency was discovered between the way browsers and HTMLParser parse script and style tags. Currently, HTMLParser adheres strictly to the HTML4 standard, which says that these elements should exit CDATA mode when the start of any closing tag is found. No browser, to my knowledge, has ever supported this (at least in the 21st century). Instead, all browsers implement the behavior described in the HTML5 spec, which states that script and style elements should exit their "raw text mode" only when the full closing tag for that element is encountered.
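To make the difference concrete, here is a rough sketch of the two termination rules as plain regular-expression scans. It is not HTMLParser's actual code; the function names and the sample script body are made up for illustration, and the HTML5 pattern is a simplification of the spec's tokenizer rules.

```python
import re

# HTML4 rule: raw content ends at the first "</" followed by a letter,
# whichever element that closing tag belongs to.
HTML4_END = re.compile(r"</[a-zA-Z]")


def html4_raw_text_end(text, start=0):
    """Return the index where the HTML4 rule ends CDATA content, or -1."""
    match = HTML4_END.search(text, start)
    return match.start() if match else -1


def html5_raw_text_end(text, tag, start=0):
    """Return the index where the HTML5 rule ends raw text for `tag`, or -1."""
    # Only the element's own closing tag counts, matched case-insensitively
    # and followed by whitespace, "/" or ">".
    pattern = re.compile(
        r"</%s(?=[\t\n\r\f />])" % re.escape(tag), re.IGNORECASE
    )
    match = pattern.search(text, start)
    return match.start() if match else -1


# Hypothetical content following an opening <script> tag.
body = 'document.write("<b>hi</b>");</script>'

html4_content = body[:html4_raw_text_end(body)]
html5_content = body[:html5_raw_text_end(body, "script")]

print(repr(html4_content))
print(repr(html5_content))
```

Under the HTML4 rule the script content is cut short at the "</b>" inside the string literal, while the HTML5 rule keeps the whole statement up to "</script>", which is what browsers do.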

The repercussions of adhering to the HTML4 standard in HTMLParser are somewhat serious: a good number of documents will raise exceptions for broken markup that is not actually broken. Libraries like Beautiful Soup (which depend on HTMLParser) are also affected, requiring the use of hacks just to get the document to parse at all.

Rather than bore you all with another paragraph about how HTML4 is terrible, feel free to look at the issue (http://bugs.python.org/issue670664), which quite thoroughly outlines the pros and cons of this particular change. Any feedback/input on the proposed changes is welcome.

So here are my questions:

Given the semi-backward-compatible nature of HTML5's syntax, this seems like a rather unique problem that could use some more discussion.

Thanks

Matt Basta


