[Python-Dev] please back out changeset f903cf864191 before alpha-2 (original) (raw)

Stefan Behnel stefan_ml at behnel.de
Sat Aug 24 00:57:48 CEST 2013


Hi,

ticket 17741 has introduced a new feature in the xml.etree.ElementTree module that was added without any major review.

http://bugs.python.org/issue17741

http://hg.python.org/cpython/rev/f903cf864191

I only recently learned about this addition and after taking a couple of closer looks, I found that it represents a rather seriously degradation of the ET module API. Since I am not a core developer, it is rather easy for the original committer to close the ticket and to sit and wait for the next releases to come in order to get the implementation out. So I'm sorry for having to ask publicly on this list for the code to be removed, before it does any harm in the wild.

Let me explain why the addition is such a disaster. I'm personally involved in this because I will eventually have to implement features that occur in ElementTree also in the external lxml.etree package. And this is not an API that I can implement with a clear conscience.

There are two parts to parsing in the ET API. The first is the parser, and the second is the target object (similar to a SAX handler). The parser has an incremental parsing API consisting of the functions .feed() and .close(). When it receives data through the .feed() method, it parses it and passes events on to the target. The target is commonly a TreeBuilder that builds an in-memory tree, but is not limited to that. Calling the .close() method tells the parser that the parsing is done and that it should finish it up.

The class that was now added is called "IncrementalParser". It has two methods for passing data in: "data_received()" and "eof_received()". So the first thing to note is that this addition is only a copy of the existing API and functionality, but under different names. It is hard to understand for me how anyone could consider this a consistent design.

Then, the purpose of this addition was to provide a way to collect parse events. That is the obvious role of the target object. In the new implementation, the target object is being instantiated, but not actually meant to collect the events. Instead, it's the parser collecting the events, based on what the target object returns (which it doesn't currently have to do at all). This is totally backwards. Instead, it should be up to the target object to decide which events to collect, how to process them and how to present them to the user. This is clearly how the API was originally designed.

Also, the IncrementalParser doesn't directly collect the events itself but gets them through a sort of backdoor in the underlying parser. That parser object is actually being passed into the IncrementalParser as a parameter, which means that user provided parser objects will also have to implement that backdoor now, even though they may not actually be able to provide that functionality.

My proposal for fixing these obvious design problems is to let each part of the parsing chain do what it's there for. Use the existing XMLParser (or an HTMLParser, as in lxml.etree) to feed in data incrementally, and let the target object process and collect the events. So, instead of replacing the parser interface with a new one, there should be a dedicated target object (or maybe just a wrapper for a TreeBuilder) that collects the parse events in this specific way.

Since the time is running in favour of the already added implementation, I'm asking to back it out for the time being. I specifically do not want to rush in a replacement. Once there is an implementation that matches the established API, I'm all in favour of adding it, because the feature itself is a good idea. But keeping a wrong design in, just because "it's there", even before anyone has actually started using it, is just asking for future deprecation hassle. It's not too late for removal now, but it will be in a couple of weeks if it is not done now.

Stefan



More information about the Python-Dev mailing list