Issue 670664: HTMLParser.py - more robust SCRIPT tag parsing
Created on 2003-01-19 14:07 by fantoozler, last changed 2022-04-10 16:06 by admin. This issue is now closed.
Messages (39)
Author: j paulson (fantoozler)
Date: 2003-01-19 14:07
http://www.ebay.com contains a script element of the form
which is not enclosed in "<!-- -->" comments. The parser choked on that line, reporting a malformed end tag.
The changes are:
interesting_cdata is now a dict mapping start tag to an re matching the end tag, a "<--" or \Z
HTMLParser.set_cdata_mode takes an extra argument, the start tag
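As a rough illustration of the first change, the per-tag lookup could look like the sketch below. This is not the attached patch: the names `INTERESTING_CDATA` and `next_interesting` are hypothetical, the tag list is assumed to be just script and style, and the comment opener is assumed to be `<!--`.

```python
import re

# Hypothetical sketch, not the code from the attached patch.
# Each CDATA start tag maps to a regex that stops at its own end tag,
# at a comment opener, or at the end of the buffered input.
INTERESTING_CDATA = {
    tag: re.compile(r"</%s\s*>|<!--|\Z" % tag, re.IGNORECASE)
    for tag in ("script", "style")
}


def next_interesting(rawdata, start_tag, pos=0):
    """Return the offset at which the CDATA content of start_tag stops."""
    # \Z always matches at the end of the string, so search() never returns None.
    return INTERESTING_CDATA[start_tag].search(rawdata, pos).start()


# An end-tag lookalike split across a string concatenation is skipped,
# because it does not match the real end tag of the element.
sample = 'document.write("</SCR"+"IPT>");</script>'
stop = next_interesting(sample, "script")
content = sample[:stop]
print(content)
```

Keying the regex on the start tag is what lets a script body contain something that merely looks like another end tag without leaving CDATA mode.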
Author: j paulson (fantoozler)
Date: 2003-01-25 03:58
Found regression test, used it, found error, fixed it.
Author: Fred Drake (fdrake)
Date: 2003-01-28 22:24
From python-dev:
John Paulson wrote:
[...] A side-effect of this is that any "<!--" .. "-->" within a script/style will be parsed as a comment. If that behavior is incorrect, the regex can be modified.
Jerry Williams wrote: Does this mean that the following won't work:
That could be a problem, since this is commonly used to support browsers that don't understand ".
    def bs(input):
        pattern = re.compile('"+"')
        match = lambda x: ""
        massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
        massage.extend([(pattern, match)])
        return BeautifulSoup(input, markupMassage=massage)
Author: Yotam Medini (yotam) *
Date: 2010-09-30 21:50
HTMLParser.py fails inside script elements, where it can be fooled by JavaScript containing less-than ('<') conditional expressions. In the attached example:
    $ tar tvzf lt-in-script-example.tgz | cut -c24-
      796 2010-09-30 16:52 h2t.py
    23678 2010-09-30 16:39 t.html
here's what happens:
    $ python h2t.py t.html /tmp/t.txt
    HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
    Traceback (most recent call last):
      File "h2t.py", line 31, in <module>
        text = html2text(f_html.read())
      File "h2t.py", line 23, in html2text
        te = TextExtractor(html)
      File "h2t.py", line 15, in __init__
        self.feed(html)
      File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332
I have a suggested patch HTMLParser.diff fixing this problem, soon to be attached.
-- yotam
Author: Yotam Medini (yotam) *
Date: 2010-09-30 21:52
The attached suggested patch fixes the problems shown in .
Author: Éric Araujo (eric.araujo) *
Date: 2010-11-02 22:10
Would it be reasonable to add knowledge to html.parser to make it recognize script elements as CDATA and handle them correctly (that is, let “<” pass)?
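For reference, a minimal sketch of the behaviour being asked for, written against the Python 3 `html.parser` module. The `ScriptCollector` class and `script_text` helper are hypothetical names for illustration. As far as I know, current Python 3 releases already treat script content as CDATA, so the "<" in the comparison is delivered as plain data rather than parsed as a tag.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text found inside script elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # The parser may deliver script content in more than one call,
        # so the pieces are accumulated instead of assumed to be whole.
        if self.in_script:
            self.chunks.append(data)


def script_text(html):
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


text = script_text("<p>hi</p><script>if (a < b) { go(); }</script>")
print(text)
```

Joining the chunks rather than relying on a single `handle_data` call keeps the example independent of how a given parser version splits the data.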
Author: Yotam Medini (yotam) *
Date: 2011-01-02 20:50
Suggested fix for the attached cases: lt-in-script-example.tgz, endtag-space.html, and dollar-extra.html.
Author: Senthil Kumaran (orsenthil) *
Date: 2011-01-03 04:05
If you provide some tests augmenting the existing tests in test_htmlparser.py, and also ensure that no existing test breaks, it would be easier to review the patch. I do see some changes made to the regex and parsing, so tests would definitely help.
Author: Alexander (friday)
Date: 2011-03-08 11:00
This is a small patch for a similar bug that is not actually related to this one.
Author: Alexander (friday)
Date: 2011-03-08 11:28
And this patch fixes both bugs in a more elegant way.
Author: Ezio Melotti (ezio.melotti) *
Date: 2011-03-12 22:46
Thanks for the patch, however it would be better if you could get a clone of the CPython repo and make a patch against it. The patch should also include tests.
You can check http://docs.python.org/devguide/ for more information.
Author: Matt Basta (Matt.Basta)
Date: 2011-07-27 03:24
The number of problems produced by this bug can be greatly reduced by adding a relatively small check to the parser. Currently, ')
data: 'foobar' # this looks ok
myhp.feed('')
data: '
foo' # where's the
?
myhp.feed('')
data: 'foo' # some tags missing, 2 chunks received
data: 'bar'
myhp.feed("")
data: '
foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247
with the patch:
myhp.feed('')
data: 'foobar' # ok
myhp.feed('')
data: '
foo' # all the content is there, but why 2 chunks?
data: '
'
myhp.feed('')
data: 'foo' # same as previous
data: '
'
data: 'bar'
data: ''
myhp.feed("")
data: 'foo' # same
data: '
'
data: " '"
data: "</scr'+'ipt>"
data: "' bar"
data: ''
So my question is: is it normal that the data is passed to handle_data in chunks? AFAIU an HTML parser should see CDATA as a single chunk of bytes it doesn't care about, so the fact that further parsing happens on the content of script/style seems wrong to me. If I'm reading the code correctly, that's because the "interesting" regex is set to look for a closing tag ('</'), maybe assuming that the CDATA section doesn't contain any other tag (usually true in case of