Issue 620243: HTMLParser: endtag events in comments

Logged In: YES user_id=3066

I'm not convinced this is a bug, but that's mainly due to details of the specification some people consider obscure, and to (ta da!)... the version of the HTML spec you look at!

Here's a quick synopsis; refer to the latest edition of the HTML 4 spec for more details.

There are two kinds of character data in HTML documents, PCDATA and CDATA. Most is PCDATA, which means all markup constructs are allowed. A few elements (SCRIPT and STYLE in particular) contain the more restrictive CDATA, in which the only markup recognized is an end tag. Since SCRIPT contains CDATA (in the more recent versions of HTML), comments are not recognized -- the characters '<!--' are just plain text, which HTMLParser gets right. The '</H1>' is an end tag, so it's a legal token at that position in the input document. It is not legal in the HTML syntax, though: SCRIPT must have an explicit end tag, and no H1 was open anyway. A proper HTML parser (based on SGML) would raise an error.
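To make the PCDATA/CDATA difference concrete, here is a minimal sketch. It assumes Python 3, where the module lives at html.parser, and the EventLogger class and parse helper are illustrative names of mine, not part of the library. It only looks at comment handling; how a stray end tag inside SCRIPT is reported may differ between releases of the module, so the sketch makes no claim about that.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Illustrative helper: record comment and data events for comparison."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.data = []

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        self.data.append(data)


def parse(markup):
    """Feed a complete document to a fresh parser and return it."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser


# Same comment text, once in PCDATA content (P) and once in CDATA content (SCRIPT).
in_paragraph = parse("<p><!-- note --></p>")
in_script = parse("<script><!-- note --></script>")

print(in_paragraph.comments)    # a comment event is reported here
print(in_script.comments)       # no comment event inside SCRIPT
print("".join(in_script.data))  # the '<!--' characters arrive as plain text
```

The data chunks are joined before printing because the parser is free to deliver text in more than one handle_data call.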

Now, the application we originally wrote HTMLParser for did not want to perform all the same checks, and the information provided is sufficient to allow an application to extend the parser to provide the right checks, so we figured that was good enough -- our app could enforce the checks it did care about, and otherwise not mess with the provided HTML (we wanted round-trippability and non-interference as much as possible).
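As a rough illustration of what "extend the parser" can look like, the sketch below layers one check on top of the event stream: it reports end tags that close no open element. This is not the check our application performed; EndTagChecker and its attributes are hypothetical names, and Python 3's html.parser is assumed.

```python
from html.parser import HTMLParser


class EndTagChecker(HTMLParser):
    """Hypothetical subclass: flag end tags that close no open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.problems = []       # (tag, (line, column)) for each stray end tag

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Pop up to and including the matching element, which
            # tolerates end tags omitted for elements nested inside it.
            while self.open_elements.pop() != tag:
                pass
        else:
            # Nothing to close: record it and leave the stack alone.
            self.problems.append((tag, self.getpos()))


checker = EndTagChecker()
checker.feed("<p>text</h1></p>")
checker.close()
print(checker.problems)  # one entry, for the stray 'h1' end tag
```

The base class keeps reporting everything it sees and the subclass decides what counts as an error, which is the split described above: an application adds only the checks it cares about.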

So here's the real catch: different versions of HTML deal with this differently. What's "right" depends on the version of the specification the input document is expected to conform to.
For the most part, applications shouldn't really need to care, but we're seeing here what happens when incredibly lenient implementations become the norm, as often happens when we talk about "Internet time." ;-(

HTML 3.2 and newer define SCRIPT and STYLE as CDATA, but earlier versions did not define them at all, so browsers (as the ultimately permissive parsers) simply ignored the tags and treated their content as PCDATA. So the comments were parsed as such. When the elements were added to the specification, there was a desire not to require escaping every random greater-than or less-than character in a script, so their content was made CDATA.

So the result is that we do the right thing... but only for HTML 3.2 and newer. The behavior you're expecting would be reasonable for HTML 2.0... unless we threw an exception because there was an undefined element in the document in some hypothetical "strict mode."

So it's not clear that anything needs to be changed.