Issue 10026: xml.dom.pulldom strange behavior

Created on 2010-10-05 10:17 by vojta.rylko, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (5)
msg117999 - Author: Vojtěch Rylko (vojta.rylko) Date: 2010-10-05 10:17
Hi, I have a file with 10 000 records of the same element, item (always identical):

    $ head test.xml
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter

I run a simple program that prints the content of the section element:

    $ python pulldom.py test.xml | head
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter
    Twitter

It seems to work fine:

    $ python pulldom.py test.xml | wc -l
    10000

But in two cases out of 10 000 it gives me just "Twi" instead of "Twitter":

    $ python pulldom.py test.xml | grep -v Twitter
    Twi
    Twi

Why? This example program demonstrates a big problem in my real application: xml.dom.pulldom is cutting off the content of some elements.

Thanks for any advice,
Vojta Rylko

---------------------------
Python 2.5.4 (r254:67916, Feb 10 2009, 14:58:09) [GCC 4.2.4] on linux2
---------------------------
pulldom.py:
---------------------------
    import sys
    from xml.dom import pulldom

    file = open(sys.argv[1])
    events = pulldom.parse(file)
    for event, node in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == 'item':
                # Pull the whole <item> subtree into the DOM node
                events.expandNode(node)
                # Prints only the first text node of <section>
                print node.getElementsByTagName('section').item(0).firstChild.data
msg118002 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-10-05 11:16
Please read http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#xml.etree.ElementTree.iterparse

At START_ELEMENT, the element is not guaranteed to be fully populated; you should handle the END_ELEMENT event instead. This should be documented for the pulldom module as well, though.
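To illustrate the advice above, here is a minimal sketch (written for Python 3, not code from this thread) that reads each item only on its "end" event with xml.etree.ElementTree.iterparse. The item and section names come from the report; the root element name, the helper name section_texts, and the sample data are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def section_texts(source):
    """Yield the text of the first <section> child of every <item>."""
    # Only "end" events are requested, so each element is complete
    # by the time it is inspected.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            section = elem.find("section")
            if section is not None:
                yield section.text
            # Drop the item's children to keep memory use flat.
            elem.clear()


# Hypothetical sample document standing in for test.xml.
sample = b"<root>" + b"<item><section>Twitter</section></item>" * 3 + b"</root>"
texts = list(section_texts(io.BytesIO(sample)))
```

With a real file, a path or an open binary file object can be passed instead of the BytesIO used here.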
msg118004 - Author: Vojtěch Rylko (vojta.rylko) Date: 2010-10-05 11:38
The program below also splits two of the 10 000 elements into two rows. Is this acceptable behavior?

OUTPUT (faulty part)
=============
    <DOM Text node "u'Twitter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twi'">
    <DOM Text node "u'tter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twitter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twitter'">

PROGRAM
=============
    for event, node in events:
        if event == pulldom.CHARACTERS:
            print node.data
msg118006 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-10-05 12:41
Yes, SAX parsers may split CHARACTERS events. See also the discussion: http://www.mail-archive.com/xml-sig@python.org/msg00234.html

Again, the END_ELEMENT event is guaranteed to return the complete node.
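One possible way to cope with split text while still using expandNode, sketched here for Python 3 and not taken from this thread, is to call normalize() on the expanded node before reading firstChild. normalize() is the standard DOM method that merges adjacent Text nodes. The helper name section_texts, the root element, and the sample data are assumptions. The tiny bufsize is only there to encourage the parser to deliver text in pieces, and whether it actually does so depends on the parser.

```python
import io
from xml.dom import pulldom


def section_texts(stream, bufsize=None):
    """Yield the full text of the first <section> in every <item>."""
    events = pulldom.parse(stream, bufsize=bufsize)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Pull the rest of the <item> subtree into this DOM node.
            events.expandNode(node)
            # Merge adjacent Text nodes such as 'Twi' + 'tter'.
            node.normalize()
            section = node.getElementsByTagName("section").item(0)
            if section is not None and section.firstChild is not None:
                yield section.firstChild.data


# Hypothetical sample document standing in for test.xml.
sample = b"<root>" + b"<item><section>Twitter</section></item>" * 4 + b"</root>"
# A small read size makes chunk boundaries fall inside the text.
texts = list(section_texts(io.BytesIO(sample), bufsize=5))
```

This sketch assumes the section contains only character data. For mixed content, the text of all child nodes would have to be joined instead of reading firstChild alone.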
msg140870 - Author: Myrosia Dzikovska (Myrosia.Dzikovska) Date: 2011-07-22 11:37
I have the same problem, and I tried the solution suggested here, namely expanding the node at END_ELEMENT. It does not work and raises the following exception:

    Traceback (most recent call last):
      File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 163, in <module>
        main(sys.argv[1:])
      File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 130, in main
        events.expandNode(node)
      File "/usr/lib/python2.6/site-packages/_xmlplus/dom/pulldom.py", line 248, in expandNode
        parents[-1].appendChild(cur_node)
    IndexError: list index out of range

The code fragment was:

    events = xml.dom.pulldom.parse( outName )
    for (event,node) in events:
        if (event == xml.dom.pulldom.END_ELEMENT) and (node.tagName == "message"):
            events.expandNode(node)
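For readers who hit the same exception: expandNode consumes the events that follow the given node, and the example in the library documentation calls it on a START_ELEMENT event. A hedged alternative that avoids expandNode altogether is to buffer CHARACTERS events between the start and end of the element and join them at END_ELEMENT. The sketch below is written for Python 3. The helper name element_texts and the sample document are assumptions; only the "message" tag comes from the fragment above.

```python
import io
from xml.dom import pulldom


def element_texts(stream, tag, bufsize=None):
    """Yield the concatenated character data of every <tag> element.

    Assumes <tag> elements are not nested inside each other. Text of
    descendant elements is included in the result.
    """
    parts = None  # None while outside a <tag> element
    for event, node in pulldom.parse(stream, bufsize=bufsize):
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            parts = []
        elif event == pulldom.CHARACTERS and parts is not None:
            # The parser may deliver one run of text in several pieces.
            parts.append(node.data)
        elif event == pulldom.END_ELEMENT and node.tagName == tag:
            if parts is not None:
                yield "".join(parts)
            parts = None


# Hypothetical log document; only the "message" tag is from the report.
sample = b"<log><message>hello world</message><message>second</message></log>"
messages = list(element_texts(io.BytesIO(sample), "message", bufsize=4))
```

Because the pieces are joined only when the element ends, the result does not depend on where the parser happens to split the text.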
History
Date                 User                Action  Args
2022-04-11 14:57:07  admin               set     github: 54235
2011-07-22 11:37:23  Myrosia.Dzikovska   set     nosy: + Myrosia.Dzikovska; messages: +
2010-10-06 05:11:54  georg.brandl        set     status: open -> closed; resolution: works for me
2010-10-05 12:41:06  amaury.forgeotdarc  set     messages: +
2010-10-05 11:38:14  vojta.rylko         set     messages: +
2010-10-05 11:16:17  amaury.forgeotdarc  set     assignee: docs@python; messages: +; nosy: + amaury.forgeotdarc, docs@python
2010-10-05 10:17:32  vojta.rylko         create