msg106938 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2010-06-03 11:14 |
In markupbase.py's ParserBase.parse_declaration, an unexpected character is caught like this:

    else:
        self.error("unexpected %r char in declaration" % rawdata[j])

However, the position (j) isn't updated, which means that error() will be called again on the same character once it returns. For example, this declaration: (which I think is generated by MS Office) will trigger this behaviour.

Two possible resolutions:
1) increment j and try the next character in this case
2) document that error() is not recoverable; i.e., it MUST raise an exception.

My preference is strongly for #1 (as HTML parsing should be forgiving, and HTMLParser is based upon markupbase). |
|
|
msg106996 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2010-06-03 22:39 |
Just to be clear -- if error() returns, it will cause an infinite loop. |
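To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is a toy scanner, not the actual markupbase code: the class name ToyScanner, the advance_on_error flag, and the max_steps guard are all hypothetical and exist only so the stall can be observed without hanging the interpreter.

```python
class ToyScanner:
    """Hypothetical stand-in for a declaration scan loop (not markupbase itself)."""

    def __init__(self, advance_on_error, max_steps=50):
        self.advance_on_error = advance_on_error  # True models resolution #1
        self.max_steps = max_steps                # guard so the demo terminates
        self.errors = 0

    def error(self, message):
        # Non-raising override: records the problem and returns to the caller.
        self.errors += 1

    def scan(self, rawdata):
        j = 0
        steps = 0
        while j < len(rawdata):
            steps += 1
            if steps > self.max_steps:
                return "stalled"  # the real loop would spin forever here
            c = rawdata[j]
            if c == ">":
                return "done"
            if c.isalnum() or c.isspace():
                j += 1
            else:
                self.error("unexpected %r char in declaration" % c)
                if self.advance_on_error:
                    j += 1  # skip the bad character and keep going


        return "incomplete"


stuck = ToyScanner(advance_on_error=False)
fixed = ToyScanner(advance_on_error=True)
print(stuck.scan("DOCTYPE html ! >"), stuck.errors)
print(fixed.scan("DOCTYPE html ! >"), fixed.errors)
```

Without the increment, the scanner keeps reporting the same character until the guard trips; with it, the bad character is reported once and scanning reaches the closing ">".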
|
|
msg107109 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2010-06-04 23:11 |
Neither markerbase nor markupbase is in the list of 2.6 stdlib modules at http://docs.python.org/modindex.html even with all packages [+] listings expanded to [-]. So I have to guess this is a third-party module. If so, please close and report to *its* authors, not here. |
|
|
msg107114 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2010-06-05 00:18 |
http://svn.python.org/view/python/trunk/Lib/markupbase.py?view=log |
|
|
msg107457 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-06-10 12:02 |
"This module is used as a foundation for the HTMLParser and sgmllib modules (indirectly, for htmllib as well). It has no documented public API and should not be used directly." So, #2 is not relevant unless you are talking about a docstring update or comment in ParserBase. Do you have a test case using one of the consumer modules that demonstrates a bug? markupbase has no test suite of its own (which probably should be fixed someday :) |
|
|
msg107518 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2010-06-11 01:45 |
I'm using it from HTMLParser. Try to parse a document containing the DTD given above when error() is overridden with something like:

    def error(self, msg):
        self.errors += 1

and it will loop. |
|
|
msg107519 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2010-06-11 01:48 |
Attaching test case. |
|
|
msg124525 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2010-12-23 00:59 |
I verified the looping behavior of the test case in both 2.7.1 and, with minor mods, 3.1.3 and 3.2b1, so this is a valid issue.

The HTMLParser docs (2.7, 3.2) do not mention the .error method. The default is

    def error(self, message):
        raise HTMLParseError(message, self.getpos())

If this is *not* intended to be part of the API and overridden, the name should be changed to ._error and .error deprecated. If it is, it should be documented.

I think the self.error call should be followed either by j+=1, so parsing continues with the next char, or by a raise statement, so it is definitely stopped. |
|
|
msg158786 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-04-20 00:30 |
HTMLParser shouldn't raise errors anymore, so the "error" method (and probably the HTMLParseError exception too) should be deprecated along with the non-strict mode on 3.3. |
|
|
msg158789 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-04-20 00:43 |
s/non-strict/strict/ |
|
|
msg158836 - (view) |
Author: Mark Nottingham (mnot) |
Date: 2012-04-20 15:17 |
Why remove 2.7? It'd be an easy bug fix if j is incremented. |
|
|
msg158853 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-04-20 17:22 |
Because even on 2.7 the parser is now able to handle broken markup, so "error" won't be called anymore. |
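As a rough check of that claim on a current Python 3, the sketch below feeds a malformed declaration to html.parser.HTMLParser with a non-raising error() override like the one from the report. The class name CountingParser and the sample input are assumptions for illustration; the input is not the declaration from the original report.

```python
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical subclass: counts error() calls instead of raising."""

    def __init__(self):
        super().__init__()
        self.errors = 0
        self.decls = []

    def error(self, message):
        # Same shape as the override in the report; expected to go unused.
        self.errors += 1

    def handle_decl(self, decl):
        self.decls.append(decl)


parser = CountingParser()
parser.feed("<!DOCTYPE html !><p>text</p>")  # made-up malformed declaration
parser.close()
print(parser.errors, parser.decls)
```

If the lenient parser behaves as described, feed() returns normally, the declaration text is passed to handle_decl as-is, and the error counter stays at zero.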
|
|