Message 160266 - Python tracker

I can confirm a serious memory leak in ElementTree when using the iterparse() function. Memory use grows disproportionately, to dozens of GB, when parsing a large XML file.

For further information, see the discussion at: http://www.gossamer-threads.com/lists/python/bugs/912164?do=post_view_threaded#912164 Note, however, that the comments attributing the problem to the OS are quite off the mark.

To replicate the problem, try this on a Wikipedia dump:

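import sys
from xml.etree import ElementTree

file = sys.argv[1]  # path to the Wikipedia XML dump
page_id = title = None
for event, elem in ElementTree.iterparse(file):  # "end" events only by default
    if elem.tag.endswith("title"):
        title = elem.text
    elif elem.tag.endswith("id") and not page_id:
        # keep only the first id seen for a page (the page id itself)
        page_id = elem.text
    elif elem.tag.endswith("text"):
        # elem.text is None for an empty <text/> element
        print("%s %s %s" % (page_id, title, (elem.text or "")[:20]))
        page_id = None  # reset so the next page's id is captured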
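
For comparison, here is a minimal sketch of the same loop that clears each page element once it has been handled. It is not part of the original report: iter_pages is a hypothetical helper, the assumption that every record in the dump is wrapped in a <page> element is mine, and the SAMPLE document is made up so the snippet runs without a real dump. If memory still climbs with this variant on a real file, the tree that iterparse() builds up is not enough to explain the growth.

```python
import io
from xml.etree import ElementTree


def iter_pages(source):
    """Yield (page_id, title, text_prefix) per page, clearing each page afterwards.

    Hypothetical helper: assumes each record is wrapped in a <page> element.
    """
    page_id = title = None
    text_prefix = ""
    for event, elem in ElementTree.iterparse(source):  # "end" events only by default
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if tag == "title":
            title = elem.text
        elif tag == "id" and page_id is None:
            page_id = elem.text  # first id inside the page is the page id
        elif tag == "text":
            text_prefix = (elem.text or "")[:20]  # empty <text/> has text None
        elif tag == "page":
            yield page_id, title, text_prefix
            page_id = title = None
            text_prefix = ""
            # Drop the page's children and text so they can be reclaimed.
            # The root still references the emptied <page> element itself.
            elem.clear()


# Made-up two-page document standing in for a real dump.
SAMPLE = (
    b"<mediawiki>"
    b"<page><title>A</title><id>1</id>"
    b"<revision><id>9</id><text>hello world</text></revision></page>"
    b"<page><title>B</title><id>2</id>"
    b"<revision><id>8</id><text/></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

On a real dump, pass the file path instead of the in-memory sample, e.g. iter_pages(sys.argv[1]), and watch the process size while it runs.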