Issue 670664: HTMLParser.py - more robust SCRIPT tag parsing

Created on 2003-01-19 14:07 by fantoozler, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Messages (39)

msg42474 - (view)

Author: j paulson (fantoozler)

Date: 2003-01-19 14:07

http://www.ebay.com contains a script element of the form

which is not enclosed in "<!-- -->" comments. The parser choked on that line, reporting a malformed end tag.

The changes are:

interesting_cdata is now a dict mapping each start tag to an re that matches the corresponding end tag, a "<!--", or \Z

HTMLParser.set_cdata_mode takes an extra argument, the start tag
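To make the first change concrete, here is a minimal sketch of what such a mapping could look like. It is not the attached patch: the helper names `build_interesting_cdata` and `next_interesting`, the exact regex, and the choice of `script` and `style` as the CDATA tags are all assumptions made for illustration.

```python
import re


def build_interesting_cdata(tags=("script", "style")):
    """Map each CDATA start tag to a regex for the next 'interesting' spot.

    Hypothetical helper: the pattern matches the matching end tag,
    the start of a comment ("<!--"), or the end of the buffer (\\Z).
    """
    return {
        tag: re.compile(r"</\s*%s\s*>|<!--|\Z" % re.escape(tag), re.IGNORECASE)
        for tag in tags
    }


interesting_cdata = build_interesting_cdata()


def next_interesting(tag, rawdata, pos=0):
    """Return (offset, matched_text) of the next interesting spot for `tag`."""
    match = interesting_cdata[tag].search(rawdata, pos)
    # \Z always matches at the end of the buffer, so `match` is never None.
    return match.start(), match.group(0)
```

With a per-tag regex, a bare "<" inside the script body is skipped over, and only the matching end tag, a comment opener, or the end of the buffer stops the scan. This also shows the side effect discussed later in the thread: a "<!--" inside a script or style element is picked up as a comment start.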

msg42475 - (view)

Author: j paulson (fantoozler)

Date: 2003-01-25 03:58

Found regression test, used it, found error, fixed it.

msg42476 - (view)

Author: Fred Drake (fdrake) (Python committer)

Date: 2003-01-28 22:24

From python-dev:

John Paulson wrote:

[...]  A side-effect of this is that
any "<!--" .. "-->" within a script/style will
be parsed as a comment.  If that behavior is
incorrect, the regex can be modified.

Jerry Williams wrote: Does this mean that the following won't work:

That could be a problem, since this is commonly used to support browsers that don't understand ".

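import copy
import re

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, which accepts markupMassage


def bs(input):
    # Strip the matched text before BeautifulSoup parses the markup.
    pattern = re.compile('"+"')
    match = lambda x: ""
    massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    massage.extend([(pattern, match)])
    return BeautifulSoup(input, markupMassage=massage)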

msg117762 - (view)

Author: Yotam Medini (yotam) *

Date: 2010-09-30 21:50

HTMLParser.py fails inside a script element, where it can be fooled by JavaScript containing less-than '<' conditional expressions. In the attached example:

$ tar tvzf lt-in-script-example.tgz | cut -c24-
796 2010-09-30 16:52 h2t.py
23678 2010-09-30 16:39 t.html

here's what happens:

$ python h2t.py t.html /tmp/t.txt
HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
Traceback (most recent call last):
  File "h2t.py", line 31, in <module>
    text = html2text(f_html.read())
  File "h2t.py", line 23, in html2text
    te = TextExtractor(html)
  File "h2t.py", line 15, in __init__
    self.feed(html)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332

I have a suggested patch HTMLParser.diff fixing this problem, soon to be attached.

-- yotam

msg117763 - (view)

Author: Yotam Medini (yotam) *

Date: 2010-09-30 21:52

The attached suggested patch fixes the problems shown in .

msg120265 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2010-11-02 22:10

Would it be reasonable to add knowledge to html.parser to make it recognize script elements as CDATA and handle it correctly (that is let “<” pass)?
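As an illustration of the behavior being asked for, here is a minimal sketch using Python 3's html.parser, which treats script content as CDATA. The class name `ScriptDataCollector` and the helper `collect_script_text` are hypothetical, and the sample markup is invented for the example rather than taken from the attachments.

```python
from html.parser import HTMLParser


class ScriptDataCollector(HTMLParser):
    """Hypothetical subclass that records tag events and script text."""

    def __init__(self):
        super().__init__()
        self.events = []          # ("start" | "end", tag) tuples, in order
        self.script_chunks = []   # data chunks seen while inside <script>
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        self.events.append(("end", tag))
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chunks.append(data)


def collect_script_text(html):
    """Parse `html` and return (joined script text, list of tag events)."""
    parser = ScriptDataCollector()
    parser.feed(html)
    parser.close()
    # Join the chunks: the parser may deliver data in more than one call.
    return "".join(parser.script_chunks), parser.events
```

Joining the chunks rather than assuming a single `handle_data` call keeps the sketch independent of how a given Python version splits the data.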

msg125096 - (view)

Author: Yotam Medini (yotam) *

Date: 2011-01-02 20:50

Suggested fix for the attached cases: lt-in-script-example.tgz endtag-space.html dollar-extra.html

msg125154 - (view)

Author: Senthil Kumaran (orsenthil) * (Python committer)

Date: 2011-01-03 04:05

If you provide some tests augmenting the existing tests in test_htmlparser.py, and also ensure that no existing test breaks, it would be easier to review the patch. I do see some changes made to the regex and parsing, so tests would definitely help.

msg130319 - (view)

Author: Alexander (friday)

Date: 2011-03-08 11:00

This is a small patch for a similar bug that is not actually related to this one.

msg130326 - (view)

Author: Alexander (friday)

Date: 2011-03-08 11:28

And this patch fixes both bugs in a more elegant way.

msg130702 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2011-03-12 22:46

Thanks for the patch, however it would be better if you could get a clone of the CPython repo and make a patch against it. The patch should also include tests.

You can check http://docs.python.org/devguide/ for more information.

msg141204 - (view)

Author: Matt Basta (Matt.Basta)

Date: 2011-07-27 03:24

The number of problems produced by this bug can be greatly reduced by adding a relatively small check to the parser. Currently, ')
data: 'foobar' # this looks ok
myhp.feed('')
data: '
foo' # where's the ?
myhp.feed('')
data: '
foo' # some tags missing, 2 chunks received
data: 'bar'
myhp.feed("")
data: '
foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247

with the patch:

myhp.feed('')
data: 'foobar' # ok
myhp.feed('')
data: '
foo' # all the content is there, but why 2 chunks?
data: '
'
myhp.feed('')
data: '
foo' # same as previous
data: '
'
data: 'bar'
data: ''
myhp.feed("")
data: '
foo' # same
data: '
'
data: " '"
data: "</scr'+'ipt>"
data: "' bar"
data: ''

So my question is: is it normal that the data is passed to handle_data in chunks? AFAIU HTML parsers should treat CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me. If I'm reading the code correctly, that's because the "interesting" regex is set to look for a closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in case of