Issue 17410: Generator-based HTMLParser - Python tracker

Created on 2013-03-13 18:09 by flying sheep, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
htmltokenizer.patch	flying sheep, 2013-03-13 19:28	version 1.0.0.1 of the patch. tests still pass.

Messages (10)
msg184096 - (view)	Author: (flying sheep) *	Date: 2013-03-13 18:09
hi, i have an idea on how to make an internal change to html.parser.HTMLParser, which would expose a token generator interface.

after that, we would be able to do e.g.

    list(HTMLParser().tokenize(data))

or even

    parser = HTMLParser()
    for chunk in pipe_in_html():
        yield from parser.tokenize(chunk)

---

the changes affect exclusively HTMLParser’s methods and would unfortunately require a behavior change to most (internal) parse_* methods. the changes go as follows:

1. the tokenize(data=None, end=False) method is added. it contains mainly goahead’s body, with a prepended snippet that appends the passed data to rawdata, and with all handle_* calls changed to "yield token, data".
2. all parse_* methods which returned an int and called one handle_* method are changed to return an (int, token) tuple (so that tokenize can yield the tokens).
3. goahead is changed to a skeleton implementation that traverses the tokens produced by tokenize, so its behavior is unchanged.

all changes would only affect the behavior of the parse_* methods, plus the addition of the tokenize method: the tokens are discarded if goahead, feed, or close are called. (this can of course be changed if advisable)

---

since this is my first contribution, i’m unsure if i shall already add the patch, not knowing whether the changes to the internal parse_* methods are acceptable at all. what do you say?

PS: the tokens are named like the handle_* methods, and the current goahead implementation basically calls getattr(self, 'handle_' + token)(data) for each (token, data) tuple. This can be changed to a token: method dict or a classic “switch” elif stack.
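
For readers who want to try the interface described above without the patch, here is a minimal sketch. It is not the patch's approach: instead of changing the parse_* methods, it subclasses the unmodified HTMLParser and buffers handler calls as (token, data) tuples. The names TokenizingParser and dispatch are hypothetical, only four handlers are covered, and packing the handler arguments into a tuple is an assumption made for illustration.

```python
from collections import deque
from html.parser import HTMLParser


class TokenizingParser(HTMLParser):
    """Hypothetical helper: records handler calls as (token, data) tuples."""

    def __init__(self):
        super().__init__()
        self._tokens = deque()

    # Each handler stores its name (minus the "handle_" prefix) together
    # with a tuple of the arguments it received.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(("starttag", (tag, attrs)))

    def handle_endtag(self, tag):
        self._tokens.append(("endtag", (tag,)))

    def handle_data(self, data):
        self._tokens.append(("data", (data,)))

    def handle_comment(self, data):
        self._tokens.append(("comment", (data,)))

    def tokenize(self, data=None, end=False):
        """Feed optional data, then yield every token buffered so far."""
        if data is not None:
            self.feed(data)
        if end:
            self.close()
        while self._tokens:
            yield self._tokens.popleft()


def dispatch(tokens, handler):
    """Replay tokens on any object with handle_* methods (the PS's getattr idea)."""
    for token, data in tokens:
        getattr(handler, "handle_" + token)(*data)
```

Usage would look like `list(TokenizingParser().tokenize(html, end=True))`, or calling `tokenize(chunk)` once per incoming chunk and `tokenize(end=True)` at the end. Text may be split across several "data" tokens when the input arrives in chunks.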
msg184100 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-13 18:15
If you have a patch you can post it, however new features are allowed only in Python 3.4, and they must be backward compatible (run "python -m test test_htmlparser" to check that).
msg184101 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-03-13 18:32
I think that in order to maintain backward compatibility the existing parse_ names should continue to have the same signature, but they could be re-implemented in terms of new versions that return the token. That way, if an application overrides the methods for some reason, that existing code should continue to work.
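
A rough sketch of that compatibility pattern, using a toy class rather than the real HTMLParser internals. ToyParser, _parse_word_token, parse_word, and handle_word are all hypothetical names chosen for illustration; the point is only that the old method keeps its int-returning signature while delegating to a new token-returning one.

```python
class ToyParser:
    """Toy stand-in (not HTMLParser): treats space-terminated words as tokens."""

    def __init__(self):
        self.rawdata = ""
        self.seen = []

    def handle_word(self, word):
        self.seen.append(word)

    def _parse_word_token(self, i):
        """New-style method: return (new_index, token) and call no handler."""
        j = self.rawdata.find(" ", i)
        if j < 0:
            return -1, None  # incomplete input, nothing to report yet
        return j + 1, ("word", self.rawdata[i:j])

    def parse_word(self, i):
        """Old-style method: same signature as before, returns only an int."""
        j, token = self._parse_word_token(i)
        if token is not None:
            self.handle_word(token[1])
        return j


class LoggingParser(ToyParser):
    """An application subclass that overrides the old method still works."""

    def parse_word(self, i):
        j = super().parse_word(i)
        self.seen.append("parsed up to %d" % j)
        return j
```

A generator-style tokenizer would call `_parse_word_token` directly, while existing callers and overrides keep using `parse_word` unchanged.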
msg184103 - (view)	Author: karl (karlcow) *	Date: 2013-03-13 18:50
flying sheep: do you plan to make it easier to use the HTML5 algorithm? http://www.w3.org/TR/html5/syntax.html#parsing
msg184104 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-13 18:52
HTMLParser already parses HTML5, producing the correct result in most cases.
msg184105 - (view)	Author: karl (karlcow) *	Date: 2013-03-13 18:58
Ezio: I'm talking about the "HTML5 Parsing algorithm", not about parsing html* documents. :) The only Python parser I know of that is closer to the HTML5 parsing algorithm is https://code.google.com/p/html5lib/
msg184106 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-13 19:08
Well, I'm not sure what's the point of implementing that specific algorithm if the end result is the same. HTMLParser implementation also has the advantage of being much simpler, and probably faster too. If for some reason you want that specific algorithm you can always use html5lib. Also if you find places where HTMLParser is not doing the right thing you can report new issues (I know a few corner cases where this happens, but they are so obscure that I intentionally left them unfixed to keep the code simple).
msg184107 - (view)	Author: (flying sheep) *	Date: 2013-03-13 19:24
no, i didn’t change anything that didn’t have to be changed to expose the tokens. i kept the changes as minimal as possible. and the tests pass! i attached the patch. --- aside thoughts: i had to change _markupbase.py, too, but i wonder why it’s even a separate module: it is only ever imported by html.parser and its only content, ParserBase, is only subclassed once (by HTMLParser). both classes are so intertwined and dependent on each other (ParserBase calls HTMLParser methods that it itself doesn’t even define) that i think _markupbase should just be scrapped and included into HTMLParser.
msg184108 - (view)	Author: (flying sheep) *	Date: 2013-03-13 19:28
whoops, left my editor modeline in. i knew that was going to happen.
msg196179 - (view)	Author: Alyssa Coghlan (ncoghlan) *	Date: 2013-08-26 04:52
The event generation API for ElementTree being discussed in issue 17741 is potentially relevant here. I think that style of API is preferable, as it doesn't alter how data is fed into the parser, just how it is extracted.
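
As an illustration of that style of API, the sketch below uses xml.etree.ElementTree.XMLPullParser, which is available in Python 3.4 and later. Data is pushed in with feed() as usual and events are pulled out separately with read_events(). The helper name iter_events is hypothetical, and the example is for XML, not HTML.

```python
import xml.etree.ElementTree as ET


def iter_events(chunks):
    """Feed chunks to a pull parser and yield (event, tag) pairs as they appear."""
    parser = ET.XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)  # feeding is unchanged from a push-style parser
        for event, elem in parser.read_events():
            yield event, elem.tag
    parser.close()
    # Events still pending when the parser is closed can be read afterwards.
    for event, elem in parser.read_events():
        yield event, elem.tag
```

How many events become available after each feed() call can vary with the underlying expat version, so the sketch also drains the queue once more after close().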

History
Date	User	Action	Args
2022-04-11 14:57:42	admin	set	github: 61612
2013-08-26 04:52:06	ncoghlan	set	nosy: + ncoghlan; messages: +
2013-08-24 08:30:43	ezio.melotti	set	nosy: + scoder
2013-03-13 20:49:39	flying sheep	set	files: - htmltokenizer.patch
2013-03-13 19:28:25	flying sheep	set	files: + htmltokenizer.patch; messages: +
2013-03-13 19:24:32	flying sheep	set	files: + htmltokenizer.patch; keywords: + patch; messages: +
2013-03-13 19:08:40	ezio.melotti	set	messages: +
2013-03-13 18:58:19	karlcow	set	messages: +
2013-03-13 18:52:42	ezio.melotti	set	messages: +
2013-03-13 18:50:00	karlcow	set	nosy: + karlcow; messages: +
2013-03-13 18:32:12	r.david.murray	set	nosy: + r.david.murray; messages: +
2013-03-13 18:15:17	ezio.melotti	set	versions: + Python 3.4; nosy: + ezio.melotti; messages: +; components: + Library (Lib), - XML
2013-03-13 18:10:50	flying sheep	set	type: enhancement; components: + XML
2013-03-13 18:09:52	flying sheep	create