
On Thu, Jul 28, 2011 at 11:25, Matt <mattbasta@gmail.com> wrote:

Hello all,

I wanted to ask a few questions and start a discussion about HTML5
support within the HTMLParser class(es). Over on issue 670664, an
inconsistency with the way browsers and the HTMLParser parse script
and style tags was discovered. Currently, HTMLParser adheres strictly
to the HTML4 standard, which says that these tags should exit CDATA
mode when the start of *any* closing tag is found. No browsers, to my
knowledge, have ever supported this (at least in the 21st century).
Instead, all browsers implement the behavior described in the HTML5
spec, which states that script tags should exit their "raw text mode"
when the full closing tag for that element is encountered.
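
To make the difference between the two rules concrete, here is a minimal Python sketch. It is not HTMLParser's actual implementation; the function names `html4_cdata_end` and `html5_raw_text_end` are hypothetical, and the HTML5 rule is simplified (it ignores the spec's script escape states). It only compares where each rule would stop reading the contents of a script element.

```python
import re

def html4_cdata_end(source, start):
    """HTML4 rule (simplified): CDATA content ends at the first '</'
    that is followed by a letter, whatever tag name comes next."""
    match = re.compile(r"</[a-zA-Z]").search(source, start)
    return match.start() if match else len(source)

def html5_raw_text_end(source, start, tag):
    """HTML5 rule (simplified): raw text ends only at the closing tag
    of the element that opened it, matched case-insensitively."""
    pattern = re.compile(r"</%s(?=[\t\n\r\f />])" % re.escape(tag), re.I)
    match = pattern.search(source, start)
    return match.start() if match else len(source)

doc = '<script>document.write("<b>hi</b>");</script>'
start = len("<script>")  # offset just past the opening tag

html4_content = doc[start:html4_cdata_end(doc, start)]
html5_content = doc[start:html5_raw_text_end(doc, start, "script")]

print(html4_content)  # document.write("<b>hi
print(html5_content)  # document.write("<b>hi</b>");
```

Under the HTML4 rule the script body is cut off at the `</b>` inside the string literal, which is the kind of input that then looks like broken markup to the parser. Under the HTML5 rule the whole script body survives.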

The repercussions of adhering to the HTML4 standard in HTMLParser are
somewhat serious: a good number of documents will trigger exceptions
for broken markup (which isn't actually broken). Libraries like
Beautiful Soup (which depend on HTMLParser) are also affected,
requiring the use of hacks just to get the document to parse at all.

Rather than bore you all with another paragraph about how HTML4 is
terrible, feel free to look at the issue
(http://bugs.python.org/issue670664), which quite thoroughly outlines
the pros and cons of this particular change. Any feedback/input on
the proposed changes is welcome.

So here are my questions:

- What plans, if any, are there to support HTML5 parsing behaviors,
  since the HTML5 spec effectively describes current web browser
  behavior?

There are no specific plans that have been publicly brought up (to my knowledge).

- What policies are in place for keeping parity with other HTML
  parsers (such as those in web browsers)?

There aren't any beyond "it would be nice".

Given the semi-backward-compatible nature of HTML5's syntax, this
seems like a rather unique problem that could use some more
discussion.

It's more an issue of someone caring enough to do the coding work to bring the parser up to spec for HTML5 (or to introduce new code to live beside the HTML4 parsing code). IOW there are no policies specifically about this topic beyond the general desire to stay up-to-date with stable specs.