msg152806 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-02-07 11:56 |
html.parser fails to handle the following invalid comments: "<! foo >", "<! bar -->", and "<! -- baz -->". The attached patch follows the HTML5 spec [0] and parses them as "bogus comments". Currently the patch fixes the problem only when strict=False, but it might be better to make this the default behavior and apply it to 2.7 too. [0]: http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state |
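
For readers who want to try the three inputs from this report, here is a minimal sketch. It assumes a current Python 3, where the `strict` argument no longer exists and the tolerant behavior is the only mode. `CommentCollector` and `collect_comments` are hypothetical names used for illustration; they are not part of the patch.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records the text of every comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called for regular comments and for "bogus comments" alike.
        self.comments.append(data)


def collect_comments(markup):
    """Feed markup to a fresh parser and return the comment texts it reported."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments


# The three invalid comments from the report, fed back to back.
broken = "<! foo ><! bar --><! -- baz -->"
comments = collect_comments(broken)
print(comments)
```

Each malformed declaration should come back as one comment whose text is everything between `<!` and the first `>`, so any stray `--` stays part of the reported text.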
|
|
msg152861 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2012-02-08 14:28 |
LGTM. What did our last discussion about following HTML5 rules for Python 2.7 lead to? I don’t remember if we agreed that “3.3 is soon enough” or “let’s fix the bugs with HTML5 as reference”. |
|
|
msg152869 - (view) |
Author: Éric Araujo (eric.araujo) *  |
Date: 2012-02-08 15:30 |
After rereading some emails, I’m +1 on porting the fixes to 2.7. 1) We agree that HTML5 is the reference specification. 2) I don’t think any sane code would break if a previously unparsable page became parsable (other HTML parsers could be an exception, but the obvious example, BeautifulSoup, does not use HTMLParser); HTMLParser is not a validating parser and never made any guarantee about the validity of handled pages. 3) Most people should be happy to have more pages handled by HTMLParser. 4) 2.7 is unique as the long-term support release and the last 2.x release. |
|
|
msg153032 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-02-10 08:10 |
I'll fix this for 3.x non-strict and then see if it can be backported to 2.7 (there are still other fixes that should be backported to 2.7 before this can be applied). |
|
|
msg153035 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-02-10 08:51 |
New changeset 242b697449d8 by Ezio Melotti in branch '3.2': #13960: HTMLParser is now able to handle broken comments when strict=False. http://hg.python.org/cpython/rev/242b697449d8 New changeset 44366541dd86 by Ezio Melotti in branch 'default': #13960: merge with 3.2. http://hg.python.org/cpython/rev/44366541dd86 |
|
|
msg153036 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-02-10 08:52 |
This is now fixed in 3.2/3.3; I'll wait for the 2.7 backport before closing it. On a side note, the empty <!> comment doesn't seem to be valid in HTML5. HTMLParser just ignores it and doesn't report it as an empty comment (so this should be fine). |
|
|
msg153271 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-02-13 14:10 |
New changeset 333e3acf2008 by Ezio Melotti in branch '2.7': #13960: HTMLParser is now able to handle broken comments. http://hg.python.org/cpython/rev/333e3acf2008 |
|
|
msg153272 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-02-13 14:14 |
I have now backported this to 2.7, together with some improvements to the handling of declarations that I committed on 3.2 (4c4ff9fd19b6) and 3.3 (06a6fed0da56). Apparently <!> is not a valid comment in HTML5, but it is considered a bogus comment and should still emit a "comment" with no content. This is now fixed too. |
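
The `<!>` case can be checked with a short sketch along the same lines. It assumes a current Python 3 `html.parser`; `EventLogger` is a hypothetical name introduced only for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: logs comment and text events in the order received."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("a<!>b")  # an empty bogus comment between two pieces of text
logger.close()
events = logger.events
print(events)
```

If the parser behaves as described in this message, the log contains a comment event with an empty string between the two text events, rather than the `<!>` being silently dropped.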
|
|