Issue 12629: HTMLParser silently stops parsing with malformed attributes

Created on 2011-07-24 18:35 by teoryn, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
test.py	teoryn, 2011-07-24 18:35	Example of the broken behavior
issue12629.diff	ezio.melotti, 2011-11-01 13:21	Failing test

Messages (8)
msg141051 - (view)	Author: Kevin Stock (teoryn)	Date: 2011-07-24 18:35
Given the input '<x><y z=""o"" /></x>', HTMLParser only detects the opening x tag, and then stops parsing. Ideally this should behave like the case '<x><y z="""" /></x>', which raises an error and can then continue parsing the closing x tag.
msg141174 - (view)	Author: Kevin Stock (teoryn)	Date: 2011-07-26 18:07
A workaround is to call close() after feed(), which I suppose I should have done anyway. However, this does not resolve the issue that the two cases behave so differently. The code that causes the difference is lines 351-355 of parser.py, which also has a misleading comment stating that it detects the / in a /> ending (which is actually done at line 334).
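
The feed()-then-close() workaround described above can be sketched as follows. This is a minimal illustration, assuming Python 3's html.parser module rather than the parser.py revision discussed in the report; TagLogger and parse_all are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every tag event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def parse_all(markup):
    """Feed the whole document, then close the parser."""
    parser = TagLogger()
    parser.feed(markup)  # may leave unprocessed text in the internal buffer
    parser.close()       # forces processing of whatever is still buffered
    return parser.events
```

Calling close() tells the parser that no more input is coming, so it no longer waits for data that might complete a pending construct. It does not change how a malformed tag is interpreted.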
msg146774 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-01 13:21
I think <y z=""o"" /> should be parsed as <y z="">, and the o"" should be ignored. <y z="""" /> should be parsed as <y z="">, and the last two "" should be ignored. This is what Firefox seems to do. Currently the parser doesn't handle extraneous data in the start tag well, because the locatestarttagend_tolerant regex looks for (more or less) well-formed attributes. Attached is a patch for test_htmlparser with the two examples provided by Kevin.
msg146848 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2011-11-02 16:53
> This is what Firefox seems to do.
I think more confidence would be good. Doesn’t the HTML5 spec define that? Have you found their test suite? Do you have more than one browser known to be compliant (trick: not sure there is even one)?
msg146852 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-02 17:08
I haven't found anything in the HTML5 spec, but I haven't looked closely. I'll do some more research when I start working on an actual patch.
msg147192 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-06 22:51
http://www.w3.org/TR/html5/tokenization.html#before-attribute-name-state
msg147612 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-11-14 16:57
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/3c3009f63700

New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/16ed15ff0d7c

New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
#1745761, #755670, #13357, #12629, #1200313: merge with 3.2.
http://hg.python.org/cpython/rev/426f7a2b1826
msg147620 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-14 17:12
Fixed, thanks for the report! Apparently the correct way to parse <y z=""o"" /> is:
  starttag y
    attribute z with value ""
    attribute o"" with no value
So this is what HTMLParser does now.
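
The parse described in the closing message can be checked with a small event collector. This is a sketch, assuming a Python 3 interpreter whose html.parser includes the fix; EventCollector and collect_events are hypothetical names, and the input wraps the y tag in an x element so that continued parsing after the malformed tag is visible.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records tag events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <y ... />.
        self.events.append(("startendtag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def collect_events(markup):
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in collect_events('<x><y z=""o"" /></x>'):
        print(event)
```

With the fix in place, the y tag should be reported with the attribute list [('z', ''), ('o""', None)], followed by the end tag for x, which shows that parsing no longer stops at the malformed attribute.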

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56838
2011-11-14 17:12:07	ezio.melotti	set	status: open -> closed; versions: + Python 2.7; messages: +; resolution: fixed; stage: needs patch -> resolved
2011-11-14 16:57:13	python-dev	set	nosy: + python-dev; messages: +
2011-11-14 12:44:28	ezio.melotti	set	assignee: ezio.melotti
2011-11-06 22:51:14	ezio.melotti	set	messages: +
2011-11-02 17:08:43	ezio.melotti	set	messages: +
2011-11-02 16:53:00	eric.araujo	set	messages: +
2011-11-01 13:21:41	ezio.melotti	set	files: + issue12629.diff; nosy: + ezio.melotti; messages: +; keywords: + patch; stage: needs patch
2011-07-29 16:24:57	eric.araujo	set	nosy: + eric.araujo, r.david.murray
2011-07-26 18:07:46	teoryn	set	messages: +
2011-07-24 18:35:07	teoryn	create