Created on 2016-01-04 17:35 by jason_s, last changed 2022-04-11 14:58 by admin.
Files
File name | Uploaded | Description
test2.py | jason_s, 2016-01-04 17:36 |
test1.html | jason_s, 2016-01-04 17:44 |
Messages (4) |
|
|
msg257472 - (view) |
Author: Jason Sachs (jason_s) * |
Date: 2016-01-04 17:35 |
The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) lacks a few features needed to reconstruct its input exactly. For the most part it can do this, but I found two cases where it falls short (there may be others):

- There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. instead of </p> it is </P> or </ P >.
- The effect of the parse_bogus_comment() internal method is to call handle_comment(), so subclasses of HTMLParser cannot distinguish content like <! I AM BOGUS > from actual comments.

Suggested changes:

- Add a get_endtag_text() method to return the exact end tag text.
- Change parse_bogus_comment() to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.
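The two suggested hooks can be approximated today in a subclass. The following is only a rough sketch, written for Python 3's html.parser rather than the Python 2 module linked above, and it is not the attached test2.py. HookedHTMLParser, VerbatimSketch, get_endtag_text(), handle_bogus_comment() and the _bogus_prefix attribute are hypothetical names that do not exist in the standard library. It also overrides the internal parse_endtag() and parse_bogus_comment() methods, whose behavior may differ between Python versions, and it assumes an end tag ends at the first ">".

```python
from html.parser import HTMLParser


class HookedHTMLParser(HTMLParser):
    """Hypothetical base class emulating the two proposed hooks."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._endtag_text = None   # raw text of the most recent end tag
        self._bogus_prefix = None  # "<!" or "</" of the most recent bogus comment

    # Proposed API (NOT part of the real HTMLParser).
    def get_endtag_text(self):
        return self._endtag_text

    def handle_bogus_comment(self, data):
        # Default keeps the existing behavior: report as a normal comment.
        self.handle_comment(data)

    # Overrides of internal methods; these are implementation details.
    def parse_endtag(self, i):
        # Assumption: the end tag text runs up to the first ">".
        end = self.rawdata.find(">", i)
        self._endtag_text = self.rawdata[i:end + 1] if end != -1 else None
        return super().parse_endtag(i)

    def parse_bogus_comment(self, i, report=1):
        pos = self.rawdata.find(">", i + 2)
        if pos == -1:
            return -1  # incomplete; wait for more data
        if report:
            self._bogus_prefix = self.rawdata[i:i + 2]
            self.handle_bogus_comment(self.rawdata[i + 2:pos])
        return pos + 1


class VerbatimSketch(HookedHTMLParser):
    """Rebuilds tags, comments and text; entities are not handled here."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(self.get_endtag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_bogus_comment(self, data):
        self.out.append(self._bogus_prefix + data + ">")


src = "<p>Hi</P><! I AM BOGUS ><!-- real -->"
parser = VerbatimSketch()
parser.feed(src)
parser.close()
print("".join(parser.out) == src)
```

Because the default handle_bogus_comment() simply forwards to handle_comment(), an existing subclass that only defines handle_comment() would keep seeing bogus comments as before. Self-closing tags such as <br/> also trigger handle_endtag(), so a fuller version would need to treat them separately.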
|
|
msg257473 - (view) |
Author: Jason Sachs (jason_s) * |
Date: 2016-01-04 17:36 |
Sample file test2.py attached, containing VerbatimParser.
|
|
msg257475 - (view) |
Author: Jason Sachs (jason_s) * |
Date: 2016-01-04 17:44 |
Sample file test1.html attached. When running test2.py on it, the output is identical except for two things:

- test1.html contains ... test1b.html contains ...
- test1.html contains end tags that are capitalized or have spaces, e.g. </ goober >. test1b.html contains end tags that are canonicalized to lowercase and without spaces, e.g. </goober>.
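The loss of information described here can be seen with the stock parser. This is a small illustration, assuming Python 3's html.parser and a hypothetical EventLogger subclass (it is not the attached test2.py): the handlers receive only the lowercased tag name and the comment body, so the original spelling of the end tag and the bogus-comment syntax are no longer available to the subclass.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records what the handlers actually receive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_endtag(self, tag):
        # Only the canonical (lowercased) name arrives, not the raw "</P>".
        self.events.append(("endtag", tag))

    def handle_comment(self, data):
        # Bogus comments and real comments both arrive here.
        self.events.append(("comment", data))


logger = EventLogger()
logger.feed("<p>x</P><! I AM BOGUS ><!-- real -->")
logger.close()
for event in logger.events:
    print(event)
```

Both the bogus comment and the real one are logged with the same event kind, which is why a subclass cannot tell them apart without an extra hook.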
|
|
msg257770 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2016-01-08 17:46 |
What is your use case? Also note that new features can only go on 3.6. |
|
|
History
Date | User | Action | Args
2022-04-11 14:58:25 | admin | set | github: 70197
2016-01-08 18:26:03 | terry.reedy | set | type: behavior -> enhancement; stage: test needed
2016-01-08 17:46:11 | ezio.melotti | set | nosy: + ezio.melotti; messages: +; versions: + Python 3.6, - Python 2.7
2016-01-04 17:44:59 | jason_s | set | files: + test1.html; messages: +
2016-01-04 17:36:45 | jason_s | set | files: + test2.py; messages: +
2016-01-04 17:35:38 | jason_s | create |