Issue 4100: xml.etree.ElementTree does not read xml-text over page boundaries

Created on 2008-10-10 14:53 by roland, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
bug.py	roland,2008-10-10 14:54
bug.xml	roland,2008-10-10 14:54
bug.py	ocean-city,2008-10-10 21:26
fix_cross_boundary_on_ElementTree_v2.patch	ocean-city,2008-10-11 01:50
test_v2.py	ocean-city,2008-10-11 01:51
ElementTree_iterparse_doc.patch	ocean-city,2008-11-02 00:18

Messages (6)
msg74635	Author: roland rehmnert (roland)	Date: 2008-10-10 14:53
XML text fields are not read properly when they are encountered in a 'start' event. During a 'start' event, elem.text returns None if the text string crosses a page boundary of the file. (This is platform dependent; a typical value is 8K, i.e. 8192 bytes.)

This line causes an error if the page size is 8192:

this is a text where X has position 8192 in the file

In most cases this erroneous behaviour can be avoided, because elem.text always returns the proper value at the 'end' event.

Two files are submitted:
bug.py: An excerpted file that produces an error with the submitted XML file.
bug.xml: An XML file, a little more than 8200 bytes. If the page size is greater than 8K, the file should be enlarged. What matters is that the text crosses the page boundary. Tags, attributes, and attribute values are OK.

I might have misunderstood the documentation of etree, because there are situations that I have not tested.

/roland
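The situation reported above can be approximated without the attached files. The following is only a sketch: the document layout, the padding element, and the helper names are made up for illustration, and the chunk size that iterparse actually reads depends on the Python version, so the value seen at 'start' may differ from run to run or release to release.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bug.xml: padding pushes the <item> text across
# the 8192-byte mark mentioned in the report.
BOUNDARY = 8192
TEXT = "this is a text that crosses the boundary"


def build_document(boundary=BOUNDARY, text=TEXT):
    """Return XML bytes whose <item> text straddles `boundary`."""
    head = b"<root><pad>"
    item_open = b"</pad><item>"
    # Size the padding so the text starts five bytes before the boundary.
    pad_len = boundary - len(head) - len(item_open) - 5
    return head + b"p" * pad_len + item_open + text.encode() + b"</item></root>"


def collect(data):
    """Record elem.text for <item> at both the 'start' and 'end' events."""
    seen = {}
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if elem.tag == "item":
            seen[event] = elem.text
    return seen


result = collect(build_document())
print("start:", repr(result["start"]))  # undefined: None, partial, or complete
print("end:  ", repr(result["end"]))    # the complete text
```

Only the value recorded at 'end' is reliable; whatever is printed for 'start' should be treated as an accident of buffering.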
msg74645	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-10-10 21:26
The minimal script to reproduce this issue is the "bug.py" I've attached. I think this issue can be fixed with "fix_cross_boundary_on_ElementTree.patch". I'll attach the test case for this issue as "test.py". (I wanted to integrate the test into test_xml_etree_c.py, but it uses doctest, which I'm not familiar with.)

/////////////////////////
// Cause of issue

TreeBuilder#start() and TreeBuilder#end() are handlers driven by self._parser.feed(data) in iterparse.next(), and iterparse stores the elements returned by these functions. But the element is not yet fully initialized at that moment: nobody can determine element.text when the start tag is found, and likewise element.tail when the end tag is found. We can say "the element is initialized" only once the next element is encountered or the TreeBuilder is closed. So iterparse's _events queue may contain uninitialized elements, and my patch waits until the element has been initialized.
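The feed-driven behaviour described above can be observed with xml.etree.ElementTree.XMLPullParser, which is available in later Python versions (3.4 and up) and was not part of the code discussed in this issue. This is a rough sketch, assuming two made-up chunks that stand in for consecutive reads from a file; which text is visible at each 'start' event depends on the parser build, so no particular intermediate value should be expected.

```python
import xml.etree.ElementTree as ET

# Two chunks standing in for consecutive reads from a file; the split falls
# in the middle of the <item> text (the chunk contents are made up).
CHUNKS = [b"<root><item>first half, ", b"second half</item></root>"]


def snapshots(chunks):
    """Feed chunks one at a time and record elem.text as each event is read."""
    parser = ET.XMLPullParser(events=("start", "end"))
    log = []
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            log.append((event, elem.tag, elem.text))
    parser.close()
    # Events still queued when the parser is closed can be read afterwards.
    for event, elem in parser.read_events():
        log.append((event, elem.tag, elem.text))
    return log


log = snapshots(CHUNKS)
for entry in log:
    print(entry)
```

The text recorded with the 'start' entry for item reflects only what the builder had flushed when the event was read, while the 'end' entry carries the text from both chunks.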
msg74677	Author: roland rehmnert (roland)	Date: 2008-10-13 08:50
We have to be careful about how we handle this.

http://effbot.org/zone/element-iterparse.htm

A note on this site says the following:

Note: The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a “start” event for an element, the builder may already have filled that element with content. You cannot rely on this, though - a “start” event can only be used to inspect the attributes, not the element content. For more details, see this message: http://mail.python.org/pipermail/xml-sig/2005-January/010838.html

I do understand that elem.text may be undefined at 'start'. I have not investigated how iterparse handles this situation over boundaries:

text text text
msg75447	Author: Fredrik Lundh (effbot) *	Date: 2008-11-01 18:31
Roland's right - "iterparse" only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.
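One way to follow this advice is to read attributes at 'start' and content at 'end'. The snippet below is a minimal sketch under stated assumptions: the sample document, the "entry" tag, and the read_entries helper are invented for illustration, and the helper assumes that elements with the chosen tag are not nested inside each other.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample input for the sketch.
SAMPLE = b'<log><entry id="1">alpha</entry><entry id="2">beta</entry></log>'


def read_entries(source, tag="entry"):
    """Yield (attributes, text) for each `tag` element, reading text at 'end'."""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != tag:
            continue
        if event == "start":
            # Attributes are defined here; text, tail, and children are not.
            attrs = dict(elem.attrib)
        else:
            # The element is fully populated once its end tag has been seen.
            yield attrs, elem.text
            elem.clear()  # release content that is no longer needed


entries = list(read_entries(io.BytesIO(SAMPLE)))
print(entries)
```

If only the content is needed, requesting events=("end",) alone (the default for iterparse) avoids the question entirely.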
msg75451	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-11-02 00:18
I propose noting this behavior in the documentation. I'll attach the patch. (I just inserted your comment into the documentation.)
msg78696	Author: Georg Brandl (georg.brandl) *	Date: 2009-01-01 11:46
Thanks, applied in r68116.

History
Date	User	Action	Args
2022-04-11 14:56:40	admin	set	github: 48350
2009-01-01 11:46:58	georg.brandl	set	status: open -> closed; assignee: effbot -> georg.brandl; messages: +; resolution: fixed; nosy: + georg.brandl
2008-11-02 00:18:23	ocean-city	set	status: closed -> open; files: + ElementTree_iterparse_doc.patch; resolution: not a bug -> (no value); messages: +; components: + Documentation, - Library (Lib), XML
2008-11-02 00:00:01	ocean-city	set	status: open -> closed; resolution: not a bug
2008-11-01 18:31:55	effbot	set	messages: +
2008-11-01 17:45:12	ocean-city	set	assignee: effbot; nosy: + effbot
2008-10-13 08:50:24	roland	set	messages: +
2008-10-11 01:51:05	ocean-city	set	files: + test_v2.py
2008-10-11 01:50:41	ocean-city	set	files: - test.py
2008-10-11 01:50:30	ocean-city	set	files: + fix_cross_boundary_on_ElementTree_v2.patch
2008-10-11 01:04:42	ocean-city	set	files: - fix_cross_boundary_on_ElementTree.patch
2008-10-10 21:27:30	ocean-city	set	files: + test.py
2008-10-10 21:27:12	ocean-city	set	files: + fix_cross_boundary_on_ElementTree.patch; keywords: + patch
2008-10-10 21:26:38	ocean-city	set	files: + bug.py; nosy: + ocean-city; messages: +; components: + XML; versions: + Python 2.6, Python 3.0
2008-10-10 14:54:51	roland	set	files: + bug.xml
2008-10-10 14:54:19	roland	set	files: + bug.py
2008-10-10 14:53:37	roland	create