Issue 25258: HtmlParser doesn't handle void element tags correctly
Created on 2015-09-28 19:26 by Chenyun Yang, last changed 2022-04-11 14:58 by admin.
Messages (15) | ||
---|---|---|
msg251792 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-09-28 19:26 |
For void elements such as (, |
||
msg251813 - (view) | Author: Josh Rosenberg (josh.r) * |
Date: 2015-09-29 02:44 |
The example for "Parsing an element with a few attributes and a title:" in https://docs.python.org/2/library/htmlparser.html#examples demonstrates this as expected behavior, so I'm not sure it can be changed: >>> parser.feed('![]() Python') Start tag: h1 Data : Python End tag : h1 |
||
msg251821 - (view) | Author: Xiang Zhang (xiang.zhang) * |
Date: 2015-09-29 05:22 |
According to the specification, a void element has no end tag, so I don't think this behaviour can be called incorrect. For a void element, only handle_starttag is called. For a start tag that ends with '/>', HTMLParser actually calls handle_startendtag, which invokes handle_starttag and handle_endtag. I think there are two solutions: filter void elements in the library and then invoke handle_startendtag, or filter void elements in the application in handle_starttag and then invoke handle_endtag. | ||
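The second option (filtering in the application) could look roughly like the sketch below. It is an illustration only, not a patch from this thread: the names `VoidAwareParser` and `events_for` are hypothetical, the code assumes Python 3's `html.parser`, and the void-element set is copied from the list quoted later in this issue.

```python
from html.parser import HTMLParser

# Void elements, copied from the list quoted later in this thread.
# Assumption: good enough for illustration; the HTML spec is the authority.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


class VoidAwareParser(HTMLParser):
    """Hypothetical subclass: emits start + end for <img> and <img/> alike."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
        if tag in VOID_ELEMENTS:
            # A void element never gets an end tag, so synthesize one here.
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            # For void tags handle_starttag already emitted the end event.
            self.handle_endtag(tag)


def events_for(markup):
    """Parse markup and return the recorded (kind, tag) events."""
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

With this subclass, both spellings of a void element produce the same event sequence, at the cost of keeping the void-element list in application code.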
msg251883 - (view) | Author: Martin Panter (martin.panter) * |
Date: 2015-09-29 20:27 |
Also applies to Python 3, though I’m not sure I would consider it a bug. | ||
msg251891 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-09-29 21:35 |
I think the bug is mostly about inconsistent behavior: |
||
msg251987 - (view) | Author: Martin Panter (martin.panter) * |
Date: 2015-10-01 02:05 |
My thinking is that the knowledge that
|
||
msg252150 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 19:18 |
the example you give for
|
||
msg252152 - (view) | Author: Ezio Melotti (ezio.melotti) * |
Date: 2015-10-02 19:42 |
Note that HTMLParser tries to follow the HTML5 specs, and for this case they say [0]: "Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token." So it seems that for |
||
msg252154 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 20:16 |
I am fine with either handle_startendtag or handle_starttag, The issue is that the behavior is consistent for the two equally valid syntax ( |
||
msg252156 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-02 20:21 |
Correction to my previous comment: consistent -> not consistent |
||
msg252168 - (view) | Author: Ezio Melotti (ezio.melotti) * |
Date: 2015-10-02 22:13 |
> this inconsistent cannot be fixed from the inherited class as (handle_*
> calls are dispatched in the internal method of HTMLParser)

You can override handle_startendtag() like this:

>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('start', tag)
...     def handle_endtag(self, tag):
...         print('end', tag)
...     def handle_startendtag(self, tag, attrs):
...         self.handle_starttag(tag, attrs)
...
>>> parser = MyHTMLParser()
>>> parser.feed(' |
||
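A self-contained variant of the override above, recording events instead of printing them, might look like the following sketch. The names `StartOnlyParser` and `events_for` are hypothetical, and Python 3's `html.parser` is assumed.

```python
from html.parser import HTMLParser


class StartOnlyParser(HTMLParser):
    """Hypothetical subclass: '<tag/>' is reported as a start tag only."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Drop the implicit end event that the default implementation adds.
        self.handle_starttag(tag, attrs)


def events_for(markup):
    """Parse markup and return the recorded (kind, tag) events."""
    parser = StartOnlyParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Here `events_for('<br>')` and `events_for('<br/>')` both yield a single start event. The override is not limited to void elements, though: a self-closed non-void tag such as `<a/>` is also reported as a start tag with no end event.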
msg252214 - (view) | Author: R. David Murray (r.david.murray) * |
Date: 2015-10-03 14:47 |
I suspect that calling startendtag is also backward incompatible, in that there may be parsers out there that are depending on starttag getting called for |
||
msg252234 - (view) | Author: Chenyun Yang (Chenyun Yang) | Date: 2015-10-03 19:43 |
handle_startendtag is also called for non-void elements, such as , so the override example will break in those situations. The compatible patch I am proposing right now is just a one-line check:

# http://www.w3.org/TR/html5/syntax.html#void-elements
_VOID_ELEMENT_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])

class HTMLParser.HTMLParser:
    # Internal -- handle starttag, return end or -1 if not terminated
    def parse_starttag(self, i):
        #...
        if end.endswith('/>'):
            # XHTML-style empty tag:
            self.handle_startendtag(tag, attrs)
        ############# PATCH #################
        elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
            self.handle_startendtag(tag, attrs)
        ############# PATCH #################
| ||
msg384362 - (view) | Author: karl (karlcow) * | Date: 2021-01-05 01:29 |
The parsing rules for tokenization of html are at https://html.spec.whatwg.org/multipage/parsing.html#tokenization In the stack of open elements, there are specific rules for certain elements. https://html.spec.whatwg.org/multipage/parsing.html#special from a DOM point of view, there is indeed no difference in between |
||
msg384363 - (view) | Author: karl (karlcow) * | Date: 2021-01-05 01:34 |
I wonder if the confusion comes from the name. The HTMLParser is kind of a tokenizer more than a full HTML parser, but that's probably a detail. It doesn't create a DOM Tree which you can access, but could help you to build a DOM Tree (!= DOM Document object) https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model > Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification. |
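To illustrate the point that HTMLParser behaves like a tokenizer that can help build a tree, here is a minimal sketch of a tree builder layered on its callbacks. It is not the HTML spec's tree-construction algorithm: `Node`, `TreeBuilder`, `build` and `shape` are hypothetical names, the void-element set is copied from the list quoted earlier in this thread, and error recovery is reduced to ignoring unmatched end tags.

```python
from html.parser import HTMLParser

# Void elements, copied from the list quoted earlier in this thread.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


class Node:
    """Hypothetical element node: a tag, its attributes, and children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or text strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node('#root')
        self._stack = [self.root]  # stack of open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_ELEMENTS:
            # Only non-void elements can have children, so only they stay open.
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self._stack[-1].children.append(data)


def build(markup):
    """Parse markup and return the root Node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def shape(node):
    """Reduce a tree to nested (tag, children) tuples for easy comparison."""
    if isinstance(node, str):
        return node
    return (node.tag, [shape(child) for child in node.children])
```

Because the builder decides from the tag name whether an element stays open, `<br>` and `<br/>` end up as identical leaf nodes. The default `handle_startendtag` is left in place, so a self-closed non-void tag also becomes a leaf, which is a simplification relative to the spec.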
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:21 | admin | set | github: 69445 |
2021-01-05 01:34:11 | karlcow | set | messages: + |
2021-01-05 01:29:11 | karlcow | set | nosy: + karlcow; messages: + |
2015-10-03 19:43:29 | Chenyun Yang | set | messages: + |
2015-10-03 14:47:47 | r.david.murray | set | nosy: + r.david.murray; messages: + |
2015-10-02 22:13:55 | ezio.melotti | set | messages: + |
2015-10-02 20:21:29 | Chenyun Yang | set | messages: + |
2015-10-02 20:16:02 | Chenyun Yang | set | messages: + |
2015-10-02 19:42:35 | ezio.melotti | set | type: behavior; messages: + |
2015-10-02 19:18:01 | Chenyun Yang | set | messages: + |
2015-10-01 02:05:15 | martin.panter | set | messages: + |
2015-09-29 21:35:43 | Chenyun Yang | set | messages: + |
2015-09-29 20:27:04 | martin.panter | set | nosy: + martin.panter; messages: +; versions: + Python 3.4, Python 3.5, Python 3.6 |
2015-09-29 05:22:58 | xiang.zhang | set | nosy: + xiang.zhang; messages: + |
2015-09-29 02:44:39 | josh.r | set | nosy: + josh.r; messages: + |
2015-09-28 19:29:41 | r.david.murray | set | nosy: + ezio.melotti |
2015-09-28 19:26:34 | Chenyun Yang | create |