[Python-Dev] HTMLParser and HTML5

Matt mattbasta at gmail.com
Thu Jul 28 20:25:15 CEST 2011


Hello all,

I wanted to ask a few questions and start a discussion about HTML5 support within the HTMLParser class(es). Over on issue 670664, an inconsistency was discovered between the way browsers and HTMLParser parse script and style tags. Currently, HTMLParser adheres strictly to the HTML4 standard, which says that these tags should exit CDATA mode when the start of any closing tag is found. To my knowledge, no browser has ever supported this (at least in the 21st century). Instead, all browsers implement the behavior described in the HTML5 spec, which states that script tags should exit their "raw text mode" only when the full closing tag for that element is encountered.
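
To make the difference between the two rules concrete, here is a minimal sketch. It is not HTMLParser's actual implementation: the regular expressions, the helper names (html5_end, raw_text) and the sample input are assumptions made up for illustration, and the patterns only approximate what each spec says about where raw text ends.

```python
import re

# HTML4-style rule (approximation): raw text ends at the start of any
# closing tag, i.e. "</" followed by a letter.
HTML4_END = re.compile(r"</[a-zA-Z]")


def html5_end(tag):
    """HTML5-style rule (approximation): raw text ends only at the
    closing tag of the same element, matched case-insensitively."""
    return re.compile(r"</%s\s*>" % re.escape(tag), re.IGNORECASE)


def raw_text(body, pattern):
    """Return the part of body that is treated as raw text, i.e.
    everything before the first match of pattern."""
    match = pattern.search(body)
    return body if match is None else body[:match.start()]


# Hypothetical input: everything that follows an opening <script> tag.
body = 'document.write("<b>hi</b>");</script><p>after</p>'

html4_text = raw_text(body, HTML4_END)
html5_text = raw_text(body, html5_end("script"))

print(repr(html4_text))  # script is cut off inside the string literal
print(repr(html5_text))  # script runs up to the real </script>
```

Under the HTML4-style rule the script content stops at the "</b" inside the string literal, so the rest of the script is parsed as markup. Under the HTML5-style rule the whole statement is kept as script content.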

The repercussions of adhering to the HTML4 standard in HTMLParser are somewhat serious: a good number of documents raise exceptions for broken markup that isn't actually broken. Libraries like Beautiful Soup (which depend on HTMLParser) are also affected, requiring the use of hacks just to get the document to parse at all.

Rather than bore you all with another paragraph about how HTML4 is terrible, feel free to look at the issue (http://bugs.python.org/issue670664), which quite thoroughly outlines the pros and cons of this particular change. Any feedback/input on the proposed changes is welcome.

So here are my questions:

What plans, if any, are there to support HTML5 parsing behaviors, since the HTML5 spec effectively describes current web browser behavior?
What policies are in place for keeping parity with other HTML parsers (such as those in web browsers)?

Given the semi-backward-compatible nature of HTML5's syntax, this seems like a fairly unusual problem that could use some more discussion.

Thanks

Matt Basta
