Issue 8583: Hardcoded namespace_separator in the cElementTree.XMLParser

Created on 2010-04-30 22:57 by dmtr, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue-8583.patch	dmtr, 2010-04-30 23:02	Issue 8583.patch Target: cElementTree-1.0.5-20051216

Messages (13)
msg104671 - (view)	Author: Dmitry Chichkov (dmtr)	Date: 2010-04-30 22:57
The namespace_separator parameter is hard coded in the cElementTree.XMLParser class disallowing the option of ignoring XML Namespaces with cElementTree library. Here's the code example:

    from xml.etree.cElementTree import iterparse
    from StringIO import StringIO
    xml = """"""
    for event, elem in iterparse(StringIO(xml)):
        print event, elem

It produces:

    end <Element '{http://www.very_long_url.com}child' at 0xb7ddfa58>
    end <Element '{http://www.very_long_url.com}root' at 0xb7ddfa40>

In the current implementation local tags get forcibly concatenated with URIs often resulting in the ugly code on the user's side and performance degradation (at least due to extra concatenations and extra lengthy compare operations in the elements matching code).

Internally cElementTree uses EXPAT parser, which is doing namespace processing only optionally, enabled by providing a value for namespace_separator argument. This argument is hard-coded in the cElementTree:

    self->parser = EXPAT(ParserCreate_MM)(encoding, &memory_handler, "}");

Well, attached is a patch exposing this parameter in the cElementTree.XMLParser() arguments. This parameter is optional and the default behavior should be unchanged. Here's the test code:

    import cElementTree
    x = """text"""

    parser = cElementTree.XMLParser()
    parser.feed(x)
    elem = parser.close()
    print elem

    parser = cElementTree.XMLParser(namespace_separator="}")
    parser.feed(x)
    elem = parser.close()
    print elem

    parser = cElementTree.XMLParser(namespace_separator=None)
    parser.feed(x)
    elem = parser.close()
    print elem

The resulting output:

    <Element '{http://www.very_long_url.com}root' at 0xb7e885f0>
    <Element '{http://www.very_long_url.com}root' at 0xb7e88608>
    <Element 'root' at 0xb7e88458>
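The optional namespace processing in Expat described above can be observed through the standard xml.parsers.expat module, whose ParserCreate() accepts a namespace_separator argument. The following is a minimal Python 3 sketch, not part of the attached patch. The sample document is an assumption, since the XML literals in the message did not survive, and element_names is a hypothetical helper name.

```python
import xml.parsers.expat

# Assumed sample document (not taken from the original message).
SAMPLE = '<root xmlns="http://www.very_long_url.com"><child>text</child></root>'


def element_names(data, namespace_separator=None):
    """Hypothetical helper: collect the names Expat reports for start tags."""
    names = []
    parser = xml.parsers.expat.ParserCreate(namespace_separator=namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)  # True marks this as the final chunk
    return names


# Without a separator Expat does no namespace processing at all.
raw_names = element_names(SAMPLE)
# With "}" Expat joins the namespace URI and the local name using that character.
qualified_names = element_names(SAMPLE, "}")

print(raw_names)
print(qualified_names)
```

With the separator set, Expat reports names of the form uri}local; the leading "{" seen in ElementTree tags is added by the ElementTree layer on top of that.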
msg104676 - (view)	Author: Dmitry Chichkov (dmtr)	Date: 2010-04-30 23:25
And obviously iterparse can be either overridden in the local user code or patched in the library. Here's the iterparse code/test code:

    import cElementTree
    from cStringIO import StringIO

    class iterparse(object):
        root = None
        def __init__(self, file, events=None, namespace_separator = "}"):
            if not hasattr(file, 'read'):
                file = open(file, 'rb')
            self._file = file
            self._events = events
            self._namespace_separator = namespace_separator
        def __iter__(self):
            events = []
            b = cElementTree.TreeBuilder()
            p = cElementTree.XMLParser(b, namespace_separator= \
                                       self._namespace_separator)
            p._setevents(events, self._events)
            while 1:
                data = self._file.read(16384)
                if not data:
                    break
                p.feed(data)
                for event in events:
                    yield event
                del events[:]
            root = p.close()
            for event in events:
                yield event
            self.root = root

    x = """text"""

    context = iterparse(StringIO(x), events=("start", "end", "start-ns"))
    for event, elem in context:
        print event, elem

    context = iterparse(StringIO(x), events=("start", "end", "start-ns"), namespace_separator = None)
    for event, elem in context:
        print event, elem

It produces:

    start-ns ('', 'http://www.very_long_url.com')
    start <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
    start <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
    end <Element '{http://www.very_long_url.com}child' at 0xb7ccf5a8>
    end <Element '{http://www.very_long_url.com}root' at 0xb7ccf650>
    start <Element 'root' at 0xb7ccf620>
    start <Element 'child' at 0xb7ccf458>
    end <Element 'child' at 0xb7ccf458>
    end <Element 'root' at 0xb7ccf620>

Note the absence of URIs and the ignored start-ns events in the 'namespace_separator = None' version.
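Without the patch, a similar result in user code can be approximated by wrapping the stock iterparse and rewriting tags after the fact. This is a rough Python 3 sketch under stated assumptions: iterparse_local is a hypothetical name, the sample document is assumed, and attribute names are left untouched. Unlike the patch, it does not avoid the concatenation cost inside the parser; it only changes what the calling code sees.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample document (not taken from the original message).
SAMPLE = b'<root xmlns="http://www.very_long_url.com"><child>text</child></root>'


def iterparse_local(source, events=("end",)):
    """Hypothetical helper: like ET.iterparse, but with '{uri}' stripped from tags."""
    for event, elem in ET.iterparse(source, events=events):
        # Only "start" and "end" carry an Element; namespace events carry other values.
        if event in ("start", "end") and isinstance(elem.tag, str):
            # rpartition leaves tags without a namespace unchanged.
            elem.tag = elem.tag.rpartition("}")[2]
        yield event, elem


seen = [
    (event, elem.tag)
    for event, elem in iterparse_local(io.BytesIO(SAMPLE), events=("start", "end"))
]
print(seen)
```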
msg104733 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2010-05-01 17:56
Namespaces are a fundamental part of the XML information model (both xpath and infoset) and all modern XML document formats, so I'm not sure what problem you're trying to solve by pretending that they don't exist. It's a bit like modifying "import foo" to work like "from foo import *"...
msg104764 - (view)	Author: Dmitry Chichkov (dmtr)	Date: 2010-05-02 02:55
This patch does not modify the existing behavior of the library. The namespace_separator parameter is optional. The parameter already exists in the EXPAT library, but it is hard coded in the cElementTree.XMLParser code. Fredrik, yes, namespaces are a fundamental part of the XML information model. Yet the option of having them ignored is a very valuable one in performance-critical code.
msg104795 - (view)	Author: Stefan Behnel (scoder) *	Date: 2010-05-02 17:30
There is at least one valid use case: code that needs to deal with HTML and XHTML currently has to normalise the tag names in some way, which usually means that it will want to remove the namespaces from XHTML documents to make it look like plain HTML. It would be nice if the library could do this efficiently right in the parser by simply removing all namespace declarations. However, this doesn't really apply to (c)ElementTree where the parser does not support HTML parsing. I'm -1 on the interface that the proposed patch adds. The keyword argument name and its semantics are badly chosen. A boolean flag will work much better. The proposed feature will have to be used with great care by users. Code that depends on it is very fragile and will fail when an input document uses unexpected namespaces, e.g. to embed foreign content, or because it is actually written in a different XML language that just happens to have similar local tag names. This kind of code is rather hard to fix, as fixing it means that it will stop accepting documents that previously passed without problems. Rejecting broken input early is a virtue. All in all, I'm -0.5 on this feature as I'd expect most use cases to be premature optimisations with potentially dangerous side effects more than anything else.
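The fragility concern can be illustrated with a small Python 3 sketch. Both documents and both helper names are assumptions made up for illustration: two inputs share the local name "title" but live in different namespaces, and only the namespace-aware lookup tells them apart.

```python
import xml.etree.ElementTree as ET

# Two assumed documents with the same local tag names in different namespaces.
XHTML = '<html xmlns="http://www.w3.org/1999/xhtml"><title>Page</title></html>'
OTHER = '<html xmlns="http://example.com/not-xhtml"><title>Not a page</title></html>'

XHTML_TITLE = "{http://www.w3.org/1999/xhtml}title"


def titles_qualified(doc):
    """Match only title elements in the XHTML namespace."""
    return [elem.text for elem in ET.fromstring(doc).iter(XHTML_TITLE)]


def titles_local(doc):
    """Match on the local name alone, ignoring the namespace."""
    return [
        elem.text
        for elem in ET.fromstring(doc).iter()
        if elem.tag.rpartition("}")[2] == "title"
    ]


print(titles_qualified(XHTML), titles_qualified(OTHER))
print(titles_local(XHTML), titles_local(OTHER))
```

The namespace-aware version returns nothing for the foreign document, while the local-name version silently accepts it.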
msg104815 - (view)	Author: Dmitry Chichkov (dmtr)	Date: 2010-05-03 04:38
I agree that the argument name choice is poor. But it has already been made by whoever coded the EXPAT parser which cElementTree.XMLParser wraps. So there is not much room here. As to 'proposed feature will have to be used with great care by users' - this is already taken care of. If you look - the cElementTree.XMLParser class is a rather obscure one. As I understand it, it is only being used by users requiring high-performance XML parsing for large datasets (10GB - 10TB range) in data-mining applications.
msg104816 - (view)	Author: Dmitry Chichkov (dmtr)	Date: 2010-05-03 05:03
Interestingly, in precisely these applications you often don't care about namespaces at all. Often all you need is to extract 'text' or 'name' elements regardless of the namespace.
msg137104 - (view)	Author: (library.engine)	Date: 2011-05-28 01:55
I second the request for tag names not prefixed with a root namespace in Python, mostly because of ugly code, as performance degradation is negligible on relatively small files. But this ubiquitous repetition (even if you're appending a variable to every tag name) is just against the DRY principle, and I don't like it. I think an extra option to pass a list of namespaces that should NOT be prepended to the tag names would be sufficient.
msg137106 - (view)	Author: Stefan Behnel (scoder) *	Date: 2011-05-28 07:04
I don't see this having much to do with the DRY principle. It's "explicit is better than implicit" and "better safe than sorry" that applies here.
msg137164 - (view)	Author: (library.engine)	Date: 2011-05-29 01:54
What is so implicit in passing a list of undesired namespaces to the parse function? This is quite explicit, in my humble opinion, and it lets you avoid repeating yourself for each and every tag you want to find in the tree, as well.
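The "list of undesired namespaces" idea can be sketched in user code as a post-processing step. This is only an illustration in Python 3, not an existing ElementTree option: parse_without is a hypothetical helper and the sample document is assumed.

```python
import xml.etree.ElementTree as ET


def parse_without(text, ignored_namespaces):
    """Hypothetical helper: parse text, then drop '{uri}' for the listed URIs only."""
    root = ET.fromstring(text)
    prefixes = ["{%s}" % uri for uri in ignored_namespaces]
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue  # skip anything whose tag is not a plain name
        for prefix in prefixes:
            if elem.tag.startswith(prefix):
                elem.tag = elem.tag[len(prefix):]
                break
    return root


# Assumed sample document with one namespace to ignore and one to keep.
DOC = (
    '<root xmlns="http://www.very_long_url.com" xmlns:x="http://example.com/other">'
    "<child/><x:extra/></root>"
)

root = parse_without(DOC, ["http://www.very_long_url.com"])
tags = [elem.tag for elem in root.iter()]
print(tags)
```

Tags in namespaces that are not listed keep their full qualified form, so unexpected foreign content stays distinguishable.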
msg166024 - (view)	Author: Florent Xicluna (flox) *	Date: 2012-07-21 14:05
See also issue 13378 which proposes custom namespace maps for serializing.
msg235725 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-02-11 04:06
See also issue 18304 for more discussion on simplifying namespaces.
msg340999 - (view)	Author: Stefan Behnel (scoder) *	Date: 2019-04-27 16:32
Closing as a duplicate of the more general issue 18304.

History
Date	User	Action	Args
2022-04-11 14:57:00	admin	set	github: 52829
2019-04-27 16:32:07	scoder	set	status: open -> closed; superseder: ElementTree -- provide a way to ignore namespace in tags and searches; messages: +; resolution: duplicate; stage: resolved
2015-02-11 04:06:49	martin.panter	set	nosy: + martin.panter; messages: +
2012-07-21 14:05:36	flox	set	versions: + Python 3.4, - Python 3.2; nosy: + eli.bendersky; messages: +; components: + XML
2011-05-29 01:54:35	library.engine	set	messages: +
2011-05-28 07:28:52	loewis	set	messages: -
2011-05-28 07:27:47	loewis	set	nosy: + loewis; messages: +
2011-05-28 07:04:35	scoder	set	messages: +
2011-05-28 01:55:24	library.engine	set	nosy: + library.engine; messages: +
2010-08-04 20:11:46	terry.reedy	set	type: performance -> enhancement; versions: + Python 3.2, - Python 2.6, Python 2.5, Python 2.7
2010-05-03 05:03:47	dmtr	set	messages: +
2010-05-03 04:38:35	dmtr	set	messages: +
2010-05-02 17:30:22	scoder	set	nosy: + scoder; messages: +
2010-05-02 02:55:03	dmtr	set	messages: +
2010-05-01 17:56:49	effbot	set	nosy: + effbot; messages: +
2010-04-30 23:25:56	dmtr	set	messages: +
2010-04-30 23:02:12	dmtr	set	files: + issue-8583.patch; keywords: + patch
2010-04-30 23:00:52	brian.curtin	set	nosy: + flox
2010-04-30 22:57:27	dmtr	create