Issue 10026: xml.dom.pulldom strange behavior
Created on 2010-10-05 10:17 by vojta.rylko, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (5) | ||
---|---|---|
msg117999 - (view) | Author: Vojtěch Rylko (vojta.rylko) | Date: 2010-10-05 10:17 |
Hi,

I have a file with 10 000 records of the same element, item (always identical):

    $ head test.xml | wc -l
    10000

But in two cases out of 10 000, the program gives me just "Twi" instead of "Twitter":

    $ python pulldom.py test.xml | grep -v Twitter
    Twi
    Twi

Why? This example program demonstrates a serious problem in my real application: xml.dom.pulldom is cutting off the content of some elements.

Thanks for any advice,
Vojta Rylko

---------------------------
Python 2.5.4 (r254:67916, Feb 10 2009, 14:58:09) [GCC 4.2.4] on linux2
---------------------------
pulldom.py:
---------------------------
    import sys
    from xml.dom import pulldom

    # Parse the file named on the command line as a stream of events.
    file = open(sys.argv[1])
    events = pulldom.parse(file)
    for event, node in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == 'item':
                # Build the full subtree under <item> before querying it.
                events.expandNode(node)
                print node.getElementsByTagName('section').item(0).firstChild.data
msg118002 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2010-10-05 11:16 |
Please read http://docs.python.org/library/xml.etree.elementtree.html?highlight=elementtree#xml.etree.ElementTree.iterparse At START_ELEMENT, the element is not guaranteed to be fully populated; you should handle the END_ELEMENT event instead. This should be documented for the pulldom module as well, though. | ||
msg118004 - (view) | Author: Vojtěch Rylko (vojta.rylko) | Date: 2010-10-05 11:38 |
The program below also splits two of the 10 000 elements into two rows. Is this acceptable behavior?

OUTPUT (faulty part)
=============
    <DOM Text node "u'Twitter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twi'">
    <DOM Text node "u'tter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twitter'">
    <DOM Text node "u'\n'">
    <DOM Text node "u'Twitter'">

PROGRAM
=============
    for event, node in events:
        if event == pulldom.CHARACTERS:
            print node.data
msg118006 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2010-10-05 12:41 |
Yes, SAX parsers may split CHARACTERS events. See also the discussion: http://www.mail-archive.com/xml-sig@python.org/msg00234.html Again, the END_ELEMENT event is guaranteed to return the complete node. | ||
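
A minimal Python 3 sketch of one way to cope with split CHARACTERS events: buffer the text seen inside an element and emit it only once END_ELEMENT arrives. The helper name `iter_section_text`, the sample document, and the small `bufsize` are assumptions made for illustration and do not come from the original report. The sketch also assumes that `section` elements are not nested.

```python
import io
from xml.dom import pulldom


def iter_section_text(stream, bufsize=None):
    """Yield the complete text of each <section>, one string per element."""
    chunks = None  # None while outside <section>; a list while inside one
    for event, node in pulldom.parse(stream, bufsize=bufsize):
        if event == pulldom.START_ELEMENT and node.tagName == "section":
            chunks = []
        elif event == pulldom.CHARACTERS and chunks is not None:
            # One run of text may arrive as several CHARACTERS events.
            chunks.append(node.data)
        elif event == pulldom.END_ELEMENT and node.tagName == "section":
            yield "".join(chunks)
            chunks = None


# Hypothetical sample data; a tiny bufsize feeds the parser in small pieces.
sample = (
    b"<items>"
    b"<item><section>Twitter</section></item>"
    b"<item><section>Twitter</section></item>"
    b"</items>"
)
texts = list(iter_section_text(io.BytesIO(sample), bufsize=8))
print(texts)
```

Because the pieces are joined only at END_ELEMENT, the result should not depend on where the parser happens to cut the character data.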
msg140870 - (view) | Author: Myrosia Dzikovska (Myrosia.Dzikovska) | Date: 2011-07-22 11:37 |
I have the same problem, and I tried the solution suggested here, namely expanding the node at END_ELEMENT. It does not work and raises the following exception:

    Traceback (most recent call last):
      File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 163, in <module>
        main(sys.argv[1:])
      File "/group/project/onrbee/data/beetle2-eval-09/annotation_tools/logTools/add_start_times.py", line 130, in main
        events.expandNode(node)
      File "/usr/lib/python2.6/site-packages/_xmlplus/dom/pulldom.py", line 248, in expandNode
        parents[-1].appendChild(cur_node)
    IndexError: list index out of range

The code fragment was:

    events = xml.dom.pulldom.parse(outName)
    for (event, node) in events:
        if (event == xml.dom.pulldom.END_ELEMENT) and (node.tagName == "message"):
            events.expandNode(node)
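
A possible explanation, offered as an assumption and not as a confirmed diagnosis: `expandNode` reads forward through the events that follow until it reaches the end of the given node, so by the time END_ELEMENT is delivered there is nothing left to expand, and the call keeps consuming events belonging to the parent. The Python 3 sketch below expands at START_ELEMENT instead and then calls `normalize()` to merge adjacent Text nodes. The helper name `iter_messages`, the `text` child element, and the sample log are hypothetical.

```python
import io
from xml.dom import pulldom


def iter_messages(stream, bufsize=None):
    """Yield each fully expanded <message> element as a minidom node."""
    events = pulldom.parse(stream, bufsize=bufsize)
    for event, node in events:
        # Expand on START_ELEMENT: expandNode consumes the events that follow
        # until it reaches the matching end of this node.
        if event == pulldom.START_ELEMENT and node.tagName == "message":
            events.expandNode(node)
            node.normalize()  # merge adjacent Text nodes left by split events
            yield node


# Hypothetical sample log; the element names are made up for illustration.
sample = (
    b"<log>"
    b"<message><text>Twitter</text></message>"
    b"<message><text>Twitter</text></message>"
    b"</log>"
)
bodies = [
    message.getElementsByTagName("text").item(0).firstChild.data
    for message in iter_messages(io.BytesIO(sample), bufsize=8)
]
print(bodies)
```

After `normalize()`, reading `firstChild.data` should return the whole text even when the parser delivered it in several pieces.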
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:07 | admin | set | github: 54235 |
2011-07-22 11:37:23 | Myrosia.Dzikovska | set | nosy: + Myrosia.Dzikovska; messages: + |
2010-10-06 05:11:54 | georg.brandl | set | status: open -> closed; resolution: works for me |
2010-10-05 12:41:06 | amaury.forgeotdarc | set | messages: + |
2010-10-05 11:38:14 | vojta.rylko | set | messages: + |
2010-10-05 11:16:17 | amaury.forgeotdarc | set | assignee: docs@python; messages: +; nosy: + amaury.forgeotdarc, docs@python |
2010-10-05 10:17:32 | vojta.rylko | create |