msg90465 - (view) |
Author: Mitchell Model (MLModel) |
Date: 2009-07-13 00:55 |
I can't quite sort this out, because it's difficult to see what is intended. The documentation of xml.etree.ElementTree (19.11 in the Library doc) uses terms like "iterator", "tree iterator", "iterable", and "list" in vague and perhaps not quite accurate ways. I can't tell from the documentation which functions/methods return lists, which return a generator, which return an unspecified kind of iterable, and so on.

Moreover, the results differ between ElementTree and cElementTree. In particular, getiterator() returns a list in ElementTree and a generator in cElementTree. This can make a substantial difference in performance when iterating over a large number of nodes (in addition to cElementTree's parsing being what appears to be about 10x faster).

I think someone should go over the page, sort this out, and make it clear what the user can expect. (I don't think it's fair to overgeneralize to things like "iterables" if the module is really meant to be making a commitment to a list or a generator.) I also think that the differences between the results returned by methods in the Python and C versions of the module should be highlighted.

I stumbled on this trying to parse and extract individual bits of information out of large XML files. I fully realize there are better ways to do this (SAX, e.g.) and better ways to search than just iterating over all the tags of the type I'm interested in, but I should still know what to expect from ElementTree, especially because it is so wonderful! |
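To make the list-versus-iterator distinction concrete, here is a minimal sketch. It assumes a current Python 3 interpreter, where Element.iter() is the documented spelling (getiterator() no longer exists there), and it uses a made-up sample document. It is not meant to reproduce the 2009 behaviour described above, only to show how calling code can avoid depending on which kind of object comes back.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the report.
root = ET.fromstring("<root><a/><b><a/></b></root>")

# Element.iter() is documented as creating a tree iterator: elements are
# produced on demand and the object can only be walked once.
lazy = root.iter("a")
is_iterator = iter(lazy) is lazy   # true for iterators, false for lists

# When list semantics are needed (len(), indexing, repeated passes),
# materialize explicitly instead of relying on the implementation.
materialized = list(root.iter("a"))
count = len(materialized)
```

Materializing costs memory proportional to the number of matches, which is the performance concern raised above for large documents.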
|
|
msg95990 - (view) |
Author: Milko Krachounov (milko.krachounov) |
Date: 2009-12-05 13:19 |
This isn't just a documentation issue. A function named getiterator(), for which the docs say that it returns an iterator, should return an iterator, not just an iterable. They have different semantics and can't be used interchangeably, so the behaviour of getiterator() in ElementTree is wrong. I was using this in my program:

    iterator = element.getiterator()
    next(iterator)
    subelement = next(iterator)

Which broke when I tried switching to ElementTree from cElementTree, even though the docs tell me that I'll get an iterator there.

Also, for findall() and friends, is there any reason why we can't stick to either an iterator or list, and not both? The API will be more clear if findall() always returned a list, or always an iterator, regardless of the implementation. It is currently not clear what will happen if I do:

    for x in tree.findall(path):
        mutate_tree(tree, x) |
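Both concerns can be sidestepped defensively in calling code. The sketch below is an assumption-laden illustration, not the fix adopted in this issue: it uses Element.iter() from current Python 3 in place of getiterator(), a made-up document, and root.remove() as a stand-in for the mutate_tree() call in the message.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring("<root><item/><item/><other/></root>")

# iter() returns an iterator unchanged and builds a fresh one from a list,
# so next() works whichever kind of object the library hands back.
iterator = iter(root.iter())
next(iterator)                 # skip the root element itself
subelement = next(iterator)    # first descendant in document order

# Take a snapshot before mutating, so the loop never walks a structure
# that is being modified underneath it.
for child in list(root.findall("item")):
    root.remove(child)

remaining = [child.tag for child in root]
```

The explicit list() copy makes the loop's behaviour independent of whether findall() returns a list or a lazy iterator.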
|
|
msg96000 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-05 19:32 |
There are many differences between the two implementations. I don't know if we can live with them or not.

~ $ ./python
Python 3.1.1+ (release31-maint:76650, Dec 3 2009, 17:14:50)
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.etree import ElementTree as ET, cElementTree as cET
>>> from io import StringIO
>>> SAMPLE = ''
>>> IO_SAMPLE = StringIO(SAMPLE)

With ElementTree

>>> elt = ET.XML(SAMPLE)
>>> elt.getiterator()
[<Element root at 15cb920>]
>>> elt.findall('') # or '.'
[<Element root at 15cb920>]
>>> elt.findall('./')
[<Element root at 15cb920>]
>>> elt.items()
dict_items([])
>>> elt.keys()
dict_keys([])
>>> elt[:]
[]
>>> IO_SAMPLE.seek(0)
>>> next(ET.iterparse(IO_SAMPLE))
('end', <Element root at 15d60d0>)
>>> IO_SAMPLE.seek(0)
>>> list(ET.iterparse(IO_SAMPLE))
[('end', <Element root at 15583e0>)]

With cElementTree

>>> elt_c = cET.XML(SAMPLE)
>>> elt_c.getiterator()
<generator object getiterator at 0x15baae0>
>>> elt_c.findall('')
[]
>>> elt_c.findall('./')
[<Element 'root' at 0x15cf3a0>]
>>> elt_c.items()
[]
>>> elt_c.keys()
[]
>>> elt_c[:]
Traceback (most recent call last):
TypeError: sequence index must be integer, not 'slice'
>>> IO_SAMPLE.seek(0)
>>> next(cET.iterparse(IO_SAMPLE))
Traceback (most recent call last):
TypeError: iterparse object is not an iterator
>>> IO_SAMPLE.seek(0)
>>> list(cET.iterparse(IO_SAMPLE))
[(b'end', <Element 'root' at 0x15cf940>)] |
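A comparison like the session above can be automated with a small probe that classifies what each call returns. This is only a sketch: kind() is a hypothetical helper, the one-element document "<root/>" is an assumption (the SAMPLE literal did not survive in the transcript), and the code targets a current Python 3 where Element.iter() replaces getiterator() and cElementTree is no longer a separate module.

```python
import io
import xml.etree.ElementTree as ET
from collections.abc import Iterable, Iterator, Sequence


def kind(obj):
    """Classify an object as iterator, sequence, plain iterable, or other."""
    if isinstance(obj, Iterator):
        return "iterator"
    if isinstance(obj, Sequence):
        return "sequence"
    if isinstance(obj, Iterable):
        return "iterable"
    return "other"


# Assumed sample: a single empty root element.
elt = ET.XML("<root/>")

report = {
    "iter()": kind(elt.iter()),
    "findall('.')": kind(elt.findall(".")),
    # keys()/items() may be reported differently depending on whether the
    # C accelerator or the pure-Python implementation is in use.
    "keys()": kind(elt.keys()),
    "items()": kind(elt.items()),
    "elt[:]": kind(elt[:]),
    "iterparse()": kind(ET.iterparse(io.BytesIO(b"<root/>"))),
}

for name, result in report.items():
    print(f"{name:14} -> {result}")
```

Running the same probe against two implementations and diffing the two reports would surface discrepancies of the kind listed in this message.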
|
|
msg96023 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-06 11:51 |
Proposed patch fixes most of the discrepancies between both implementations. It restores some features that were lost with Python 3:

* cElement slicing and extended slicing
* iterparse, cET.getiterator and cET.findall return an iterator (as documented)

Some tests were added to check these issues. |
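For readers unfamiliar with the slicing feature mentioned in the list above, the following sketch shows plain slicing, extended slicing, and slice assignment on an element's children. It assumes a current Python 3 interpreter and a made-up document; it illustrates the feature and is not code from the patch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with four children.
root = ET.fromstring("<root><a/><b/><c/><d/></root>")

# Plain slice: the first two children.
first_two = [child.tag for child in root[:2]]

# Extended slice: every second child.
every_other = [child.tag for child in root[::2]]

# Slice assignment: replace the two middle children with one new element.
root[1:3] = [ET.Element("x")]
after_assignment = [child.tag for child in root]
```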
|
|
msg96040 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-06 21:16 |
I fixed it differently, using the upstream modules (Thank you Fredrik).

* ElementTree 1.3a3-20070912
* cElementTree 1.0.6-20090110

It works. And it closes , too. |
|
|
msg96048 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-12-07 11:10 |
The patch should have doc updates for new functionality, if any. |
|
|
msg96049 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-07 12:38 |
I see some new features in the changelog. I will try to update the documentation during the week. (patch "py3k" fixed: support assignment of arbitrary sequences) |
|
|
msg96181 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-09 21:30 |
Patch for the documentation. (source: upstream documentation) |
|
|
msg96373 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2009-12-14 08:27 |
Small update of the patch for 3.2: the __cmp__ method is replaced with the __eq__ method (on CommentProxy and PIProxy). |
|
|
msg97607 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-01-11 21:40 |
It would be nice to upgrade ElementTree for 2.7 and 3.2, at least. |
|
|
msg99137 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-02-09 22:28 |
Patch updated, with upstream packages:

* ElementTree 1.3a3-20070912
* cElementTree 1.0.6-20090110

Now all tests are identical for the ElementTree part:

- ElementTree 2.x
- cElementTree 2.x
- ElementTree 3.x
- cElementTree 3.x

Waiting for some developer kind enough to review and merge in 2.7 and 3.2. |
|
|
msg99138 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-02-09 23:22 |
Given the size of the patch, it's very difficult to review properly. In any case, could you upload it to http://codereview.appspot.com/ ? |
|
|
msg99139 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-02-09 23:31 |
Ok, will do the upload to Rietveld. In addition to the straight review of the patch itself, you could:

- diff against the upstream source code (very few changes)
- diff between 2.x and 3.x
- review the test suite (there are only additions, no real change)
- hunt refleaks

Btw, I've backported the last tests (#2746, #6233) to all 4 test files (ET and cET, 2.x and 3.x). |
|
|
msg99140 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-02-09 23:51 |
Here it is: * http://codereview.appspot.com/207048/show |
|
|
msg99449 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-02-16 23:21 |
Updated the 2.x patch with the last version uploaded to Rietveld (patch set 5). Improved test coverage with upstream tests and test cases provided by Neil on issue #6232. Note: the patch for 3.x is obsolete. |
|
|
msg99466 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-02-17 11:48 |
Strip out the experimental C API. |
|
|
msg100856 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-03-11 14:40 |
Fixed on trunk with r78838. Some extra work is required to port it to 3.x. Thank you Fredrik and Antoine for reviewing this patch. |
|
|
msg100881 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2010-03-11 19:02 |
W00t! |
|
|
msg100928 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-03-12 12:03 |
Patch to merge ElementTree 1.3 in 3.x. |
|
|
msg101037 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-03-14 01:45 |
Merged in 3.x with r78942 and r78945. See #8047 for a discussion about the `encoding` argument of the serializer (used for the .write() method and the tostring() and tostringlist() functions). Currently the output is not encoded by default in 3.1 and 3.x. It is encoded to ASCII in 2.6 and 2.x. |
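The effect of the `encoding` argument can be seen with a short sketch. It reflects the behaviour of tostring() in current Python 3 releases as far as I know, not necessarily the 3.1 state described above, and the element name and text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element containing one non-ASCII character.
elem = ET.Element("greeting")
elem.text = "caf\u00e9"

# Default: bytes, with non-ASCII characters written as character references.
as_bytes = ET.tostring(elem)

# encoding="unicode": a str, with the character kept as-is.
as_text = ET.tostring(elem, encoding="unicode")

# An explicit byte encoding: bytes in that encoding.
as_utf8 = ET.tostring(elem, encoding="utf-8")
```

Passing `encoding` explicitly in calling code avoids depending on a default that has differed between versions.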
|
|