Issue 2892: improve cElementTree iterparse error handling

It is unfortunate that any error in the XML chunk seen by the buffer prevents the events generated before the error from being delivered. For example, valid XML is sometimes embedded in a larger file or stream, and it is useful to be able to ignore any text that follows the closing root tag.

The iterparse API and expat itself make this possible, but it doesn't work in practice: when a parsing exception occurs, iterparse doesn't deliver the events generated before the exception. A simple change to iterparse fixes this, and I would like to share it for possible inclusion in a future release. Note that this change shouldn't affect the semantics of iterparse: the exception is still delivered to the caller. The only difference is that the events generated by expat before the exception are not forgotten.
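The idea can be sketched roughly as follows. This is not the attached patch; it is a minimal illustration written against Python 3's `xml.etree.ElementTree` rather than cElementTree, and the names `_EventRecorder` and `iterparse_keep_events` are hypothetical. The point it shows is that events recorded by the parser callbacks are handed to the caller before the parse error is re-raised.

```python
import io
import xml.etree.ElementTree as ET


class _EventRecorder:
    """Hypothetical parser target: builds the tree and records events."""

    def __init__(self, events):
        self._builder = ET.TreeBuilder()
        self._wanted = set(events)
        self.pending = []  # events recorded but not yet delivered

    def start(self, tag, attrib):
        elem = self._builder.start(tag, attrib)
        if "start" in self._wanted:
            self.pending.append(("start", elem))
        return elem

    def end(self, tag):
        elem = self._builder.end(tag)
        if "end" in self._wanted:
            self.pending.append(("end", elem))
        return elem

    def data(self, text):
        self._builder.data(text)

    def close(self):
        return self._builder.close()


def iterparse_keep_events(source, events=("end",), chunk_size=16 * 1024):
    """Yield (event, element) pairs; on a parse error, deliver the
    events recorded so far and only then raise the error."""
    target = _EventRecorder(events)
    parser = ET.XMLParser(target=target)
    while True:
        chunk = source.read(chunk_size)
        error = None
        try:
            if chunk:
                parser.feed(chunk)
            else:
                parser.close()
        except ET.ParseError as exc:
            error = exc  # remember it, but flush pending events first
        while target.pending:
            yield target.pending.pop(0)
        if error is not None:
            raise error
        if not chunk:
            break


seen = []
try:
    source = io.StringIO("<x></x>rubbish")
    for event, elem in iterparse_keep_events(source, events=("start", "end")):
        seen.append((event, elem.tag))
except ET.ParseError:
    parse_failed = True
else:
    parse_failed = False
```

The caller still sees the exception, so existing error handling keeps working; the events for the well-formed part of the input simply arrive first.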

I am attaching a diff between the current implementation of iterparse, and a modified one that fixes this problem.

Here is a small test case that demonstrates the problem, expected behavior and actual behavior:

{{{
import xml.etree.cElementTree
from StringIO import StringIO

for ev in xml.etree.cElementTree.iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
    print ev
}}}

The above code should first print the two events (start and end), and then raise the exception. In Python 2.7 it runs like this:

{{{
>>> for ev in xml.etree.cElementTree.iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
...     print ev
...
Traceback (most recent call last):
  File "", line 1, in
  File "", line 84, in next
cElementTree.ParseError: junk after document element: line 1, column 7
}}}

Expected behavior, obtained with my patch, is that it runs like this:

{{{
>>> for ev in my_iterparse(StringIO('<x></x>rubbish'), events=('start', 'end')):
...     print ev
...
('start', <Element 'x' at 0xb771cba8>)
('end', <Element 'x' at 0xb771cba8>)
Traceback (most recent call last):
  File "", line 1, in
  File "", line 26, in iter
cElementTree.ParseError: junk after document element: line 1, column 7
}}}

The difference is, of course, only visible when the events are consumed as they arrive, for example by printing them. A side-effect-free operation, such as building a list using list(iterparse(...)), would behave exactly the same before and after the change.
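To make that distinction concrete, here is a small stand-in that does not involve the XML parser at all. The generator `events_then_error` is hypothetical and merely mimics an iterator that delivers two events and then raises. A loop that acts on each item observes both events, while `list()` never returns a value because the exception propagates out of it.

```python
def events_then_error():
    """Stand-in for an iterator that yields events, then raises."""
    yield "start"
    yield "end"
    raise ValueError("junk after document element")


# Consuming incrementally: each event is observed before the error.
seen = []
try:
    for ev in events_then_error():
        seen.append(ev)
except ValueError:
    pass

# Building a list: the exception escapes list(), so no list is produced.
try:
    result = list(events_then_error())
except ValueError:
    result = None
```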

Note that this was fixed in upstream 1.3 (and verified by the selftests), but the fix and its test were apparently lost when that code was merged into 2.7. Since 2.7 is supposed to ship with 1.3, this is a regression, not a feature request.

(But 2.7 is in rc, and I'm on vacation, so I guess it's a bit too late to do anything about that. I'll leave the final decision to flox and the python-dev crowd.)