Issue 25258: HtmlParser doesn't handle void element tags correctly

Created on 2015-09-28 19:26 by Chenyun Yang, last changed 2022-04-11 14:58 by admin.

Messages (15)
msg251792 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-09-28 19:26
For void elements such as <link> and <img>, the XHTML empty-element syntax ('/>') is not required. HTMLParser, which relies on the XHTML empty-element syntax, fails to handle this situation.

    from HTMLParser import HTMLParser

    # create a subclass and override the handler methods
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print "Encountered a start tag:", tag
        def handle_endtag(self, tag):
            print "Encountered an end tag :", tag
        def handle_data(self, data):
            print "Encountered some data :", data

    >>> parser = MyHTMLParser()
    >>> parser.feed('<link><img>')
    Encountered a start tag: link
    Encountered a start tag: img
    >>> parser.feed('<link/><img/>')
    Encountered a start tag: link
    Encountered an end tag : link
    Encountered a start tag: img
    Encountered an end tag : img

Reference:
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py
http://www.w3.org/TR/html5/syntax.html#void-elements
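For readers on Python 3, the report can be reproduced with a sketch along the following lines. The class name `EventRecorder`, the helper `events_for`, and the bare `<link>`/`<img>` inputs are illustrative assumptions; the attributes used in the original report were not preserved.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record (event, tag) pairs instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def events_for(markup):
    # Hypothetical helper: parse one string and return the recorded events.
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


html_style = events_for("<link><img>")      # HTML syntax, no trailing slash
xhtml_style = events_for("<link/><img/>")   # XHTML-style empty-element syntax

print(html_style)
print(xhtml_style)
```

The two lists differ only in the extra "end" events produced for the '/>' form, which is the inconsistency discussed in the rest of this thread.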
msg251813 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-09-29 02:44
The example "Parsing an element with a few attributes and a title:" in https://docs.python.org/2/library/htmlparser.html#examples demonstrates this as expected behavior, so I'm not sure it can be changed:

    >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
    Start tag: img
         attr: ('src', 'python-logo.png')
         attr: ('alt', 'The Python logo')
    >>>
    >>> parser.feed('<h1>Python</h1>')
    Start tag: h1
    Data     : Python
    End tag  : h1
msg251821 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2015-09-29 05:22
According to the specification, a void element has no end tag, so I don't think this behaviour can be called incorrect. For a void element, only handle_starttag is called. For a start tag that ends with '/>', HTMLParser actually calls handle_startendtag, which invokes handle_starttag and handle_endtag. I think there are two solutions: filter void elements in the library and then invoke handle_startendtag, or filter void elements in the application in handle_starttag and then invoke handle_endtag.
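The second, application-side option might look roughly like the sketch below. It is only an illustration under stated assumptions: `VoidAwareParser` and `VOID_ELEMENTS` are hypothetical names, and the tag list is the HTML5 void-element list quoted later in this issue. Because the default handle_startendtag already calls handle_endtag, the sketch also overrides it so that '<img/>' does not produce two end events.

```python
from html.parser import HTMLParser

# Void elements, taken from the list quoted later in this issue.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


class VoidAwareParser(HTMLParser):
    """Report one start and one end event per void element,
    whether it is written as <img> or <img/>."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
        if tag in VOID_ELEMENTS:
            # Application-level filtering: synthesize the end event here.
            self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        self.handle_starttag(tag, attrs)
        if tag not in VOID_ELEMENTS:
            # Non-void '<x/>' keeps the library's default start+end pairing.
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # ignore a stray </img>; the end was already reported
        self.events.append(("end", tag))


def parse(markup):
    parser = VoidAwareParser()
    parser.feed(markup)
    parser.close()
    return parser.events


plain = parse("<p><img></p>")
slashed = parse("<p><img/></p>")
print(plain)
print(slashed)
```

With this filtering in place both spellings yield the same event list, at the cost of keeping the void-element list in application code.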
msg251883 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-09-29 20:27
Also applies to Python 3, though I’m not sure I would consider it a bug.
msg251891 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-09-29 21:35
I think the bug is mostly about inconsistent behavior: <img> and <img/> shouldn't be parsed differently. This causes a problem because the parser cannot consistently tell whether it has finished visiting a tag. I propose one fix: in the `parse_internal' method call, check for void elements and call `handle_startendtag'.
msg251987 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-10-01 02:05
My thinking is that the knowledge that <img> does not have a closing tag is at a higher level than the current HTMLParser class. It is similar to knowing where the following HTML implicitly closes the <li> elements:

    <ul><li>Item A<li>Item B</ul>

In both cases I would not expect the HTMLParser to report “virtual” empty <img/> or closing </li> tags. I don’t think it should report an empty <img/> or closing </img> tag just because that is easy to do, because it would be inconsistent with other implied HTML tags. But maybe see what other people say.

I don’t know your particular use case, but I would suggest if you need to parse non-XML HTML tags, use the handle_starttag() method and don’t rely on the end tag :)
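The comparison with implied </li> tags can be checked with a short sketch. The recorder class and variable names here are illustrative assumptions; the input is the list markup discussed in this message.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Collect start, end and data events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = Recorder()
recorder.feed("<ul><li>Item A<li>Item B</ul>")
recorder.close()
list_events = recorder.events
print(list_events)
```

No ("end", "li") event appears: HTMLParser reports only the tags that are literally present, and leaves implied end tags to a higher layer.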
msg252150 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 19:18
The example you give for <li> is a different case. <img> and <link> are void elements which are allowed to have no close tag; <li> without </li> is a browser implementation detail, and most browsers autocomplete </li>. Unless the parser calls handle_endtag(), the client code which uses HTMLParser won't be able to know whether the traversal of an element is finished.

Do you have a strong reason why we should not include the knowledge of void elements in the HTMLParser at this line?
https://github.com/python/cpython/blob/bdfb14c688b873567d179881fc5bb67363a6074c/Lib/html/parser.py#L341

    if end.endswith('/>') or (end.endswith('>') and tag in VOID_ELEMENTS)
msg252152 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-02 19:42
Note that HTMLParser tries to follow the HTML5 specs, and for this case they say [0]: "Set the self-closing flag of the current tag token. Switch to the data state. Emit the current tag token."

So it seems that for <img/>, only the <img> (and not the closing </img>) should be emitted. HTMLParser has no way to set the self-closing flag, so calling handle_startendtag seems the most reasonable thing to do, since it allows tree-builders to set the flag themselves. That said, the default implementation of handle_startendtag should indeed just call handle_starttag; however, this would be a backward-incompatible change.

[0]: http://www.w3.org/TR/html5/syntax.html#self-closing-start-tag-state
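As a rough illustration of how a tree-builder could set the flag itself, the sketch below records one token per start tag and marks it as self-closing when handle_startendtag is called. The `TokenRecorder` class and the token dictionary layout are assumptions made for this example, not part of html.parser.

```python
from html.parser import HTMLParser


class TokenRecorder(HTMLParser):
    """Emit one start-tag token per tag, carrying a self-closing flag."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append({"type": "start", "name": tag, "self_closing": False})

    def handle_startendtag(self, tag, attrs):
        # Called for '<tag ... />'; emit only the start token, with the flag set,
        # rather than a start token followed by a synthetic end token.
        self.tokens.append({"type": "start", "name": tag, "self_closing": True})

    def handle_endtag(self, tag):
        self.tokens.append({"type": "end", "name": tag})


def tokenize(markup):
    recorder = TokenRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.tokens


tokens = tokenize("<img><img/>")
print(tokens)
```

Both spellings then produce a single start token, and the consumer can decide what, if anything, the flag means for a given element.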
msg252154 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 20:16
I am fine with either handle_startendtag or handle_starttag.

The issue is that the behavior is consistent for the two equally valid syntaxes (<img> and <img/> are handled differently); this inconsistency cannot be fixed from the inherited class, as the handle_* calls are dispatched in an internal method of HTMLParser.
msg252156 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-02 20:21
Correction to my previous comment: consistent -> not consistent
msg252168 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-10-02 22:13
> this inconsistent cannot be fixed from the inherited class as (handle_*
> calls are dispatched in the internal method of HTMLParser)

You can override handle_startendtag() like this:

    >>> class MyHTMLParser(HTMLParser):
    ...     def handle_starttag(self, tag, attrs):
    ...         print('start', tag)
    ...     def handle_endtag(self, tag):
    ...         print('end', tag)
    ...     def handle_startendtag(self, tag, attrs):
    ...         self.handle_starttag(tag, attrs)
    ...
    >>> parser = MyHTMLParser()
    >>> parser.feed('<link><img/>')
    start link
    start img

(P.S. please don't quote the whole message in your reply)
msg252214 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-10-03 14:47
I suspect that calling startendtag is also backward incompatible, in that there may be parsers out there that are depending on starttag getting called for <img>, and endtag not getting called (that is, endtag getting called for it will cause them to break). I would hope that this would not be the case, but I'm worried about it.
msg252234 - (view) Author: Chenyun Yang (Chenyun Yang) Date: 2015-10-03 19:43
handle_startendtag is also called for non-void elements written with the '/>' syntax, so the override example will break in those situations.

The compatible patch I propose right now is just a one-line check:

    # http://www.w3.org/TR/html5/syntax.html#void-elements
    _VOID_ELEMENT_TAGS = frozenset([
        'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
        'link', 'meta', 'param', 'source', 'track', 'wbr'])

    class HTMLParser:  # i.e. HTMLParser.HTMLParser
        # Internal -- handle starttag, return end or -1 if not terminated
        def parse_starttag(self, i):
            #...
            if end.endswith('/>'):
                # XHTML-style empty tag: <span attr="value" />
                self.handle_startendtag(tag, attrs)
            ############# PATCH #################
            elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
                self.handle_startendtag(tag, attrs)
            ############# PATCH #################
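To make the effect of the proposed branch easier to see, the dispatch decision can be isolated into two small functions. This is only a model of the rule, not code from html.parser: `current_handler` and `patched_handler` are hypothetical names, and `end` stands for the trailing part of the start tag ('>' or '/>') as in the snippet above.

```python
# Void elements, as listed in the proposed patch.
_VOID_ELEMENT_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


def current_handler(tag, end):
    """Name of the handler chosen by the existing rule."""
    if end.endswith('/>'):
        return 'handle_startendtag'
    return 'handle_starttag'


def patched_handler(tag, end):
    """Name of the handler chosen once the proposed branch is added."""
    if end.endswith('/>'):
        return 'handle_startendtag'
    elif end.endswith('>') and tag in _VOID_ELEMENT_TAGS:
        return 'handle_startendtag'
    return 'handle_starttag'


# Compare the two rules on void and non-void tags in both spellings.
for tag, end in [('img', '>'), ('img', '/>'), ('div', '>'), ('div', '/>')]:
    print(tag, end, current_handler(tag, end), patched_handler(tag, end))
```

Under this model the only case that changes is a void element written without the slash, which is also the case the backward-compatibility concern above is about.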
msg384362 - (view) Author: karl (karlcow) * Date: 2021-01-05 01:29
The parsing rules for tokenization of HTML are at https://html.spec.whatwg.org/multipage/parsing.html#tokenization

In the stack of open elements, there are specific rules for certain elements: https://html.spec.whatwg.org/multipage/parsing.html#special

From a DOM point of view, there is indeed no difference between <img src="somewhere"> and <img src="somewhere"/>:
https://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%0A%3Cimg%20src%3D%22somewhere%22%3E%3Cimg%20src%3D%22somewhere%22%2F%3E
msg384363 - (view) Author: karl (karlcow) * Date: 2021-01-05 01:34
I wonder if the confusion comes from the name. The HTMLParser is kind of a tokenizer more than a full HTML parser, but that's probably a detail. It doesn't create a DOM Tree which you can access, but could help you to build a DOM Tree (!= DOM Document object).

https://html.spec.whatwg.org/multipage/parsing.html#overview-of-the-parsing-model

> Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
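A minimal sketch of that division of labour: HTMLParser supplies the tokens, and a small tree builder on top of it holds the knowledge about void elements. `Node`, `TreeBuilder`, `build` and `outline` are hypothetical names for this example, and the builder handles only well-nested input, not the full HTML tree-construction rules.

```python
from html.parser import HTMLParser

# Void elements, from the list quoted earlier in this issue.
VOID_ELEMENTS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen',
    'link', 'meta', 'param', 'source', 'track', 'wbr'])


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a simple element tree from HTMLParser's start/end events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_ELEMENTS:
            self.current = node  # only non-void elements can have children

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # void elements are never left open, nothing to close
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent  # close the nearest matching element


def outline(node):
    """Return the tree as nested (tag, children) tuples."""
    return (node.tag, [outline(child) for child in node.children])


def build(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return outline(builder.root)


tree = build('<p><img src="somewhere"><img src="somewhere"/></p>')
print(tree)
```

In the resulting tree the two spellings of <img> are indistinguishable, which matches the DOM-level observation in the previous message.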
History
Date User Action Args
2022-04-11 14:58:21 admin set github: 69445
2021-01-05 01:34:11 karlcow set messages: +
2021-01-05 01:29:11 karlcow set nosy: + karlcow; messages: +
2015-10-03 19:43:29 Chenyun Yang set messages: +
2015-10-03 14:47:47 r.david.murray set nosy: + r.david.murray; messages: +
2015-10-02 22:13:55 ezio.melotti set messages: +
2015-10-02 20:21:29 Chenyun Yang set messages: +
2015-10-02 20:16:02 Chenyun Yang set messages: +
2015-10-02 19:42:35 ezio.melotti set type: behavior; messages: +
2015-10-02 19:18:01 Chenyun Yang set messages: +
2015-10-01 02:05:15 martin.panter set messages: +
2015-09-29 21:35:43 Chenyun Yang set messages: +
2015-09-29 20:27:04 martin.panter set nosy: + martin.panter; messages: +; versions: + Python 3.4, Python 3.5, Python 3.6
2015-09-29 05:22:58 xiang.zhang set nosy: + xiang.zhang; messages: +
2015-09-29 02:44:39 josh.r set nosy: + josh.r; messages: +
2015-09-28 19:29:41 r.david.murray set nosy: + ezio.melotti
2015-09-28 19:26:34 Chenyun Yang create