Issue 5166: ElementTree and minidom don't prevent creation of not well-formed XML

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Ben Spiller, benspiller, effbot, eli.bendersky, flox, jwilk, martin.panter, nvetoshkin, ods, santoso.wijaya, scoder, strangefeatures
Priority:	normal	Keywords:

Created on 2009-02-06 11:13 by ods, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (11)
msg81259 - (view)	Author: Denis S. Otkidach (ods) *	Date: 2009-02-06 11:13
ElementTree and minidom allow creation of not well-formed XML that can't be parsed:

>>> from xml.etree import ElementTree
>>> element = ElementTree.Element('element')
>>> element.text = u'\0'
>>> xml = ElementTree.tostring(element, encoding='utf-8')
>>> ElementTree.fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

>>> from xml.dom import minidom
>>> doc = minidom.getDOMImplementation().createDocument(None, None, None)
>>> element = doc.createElement('element')
>>> element.appendChild(doc.createTextNode(u'\0'))
<DOM Text node "">
>>> doc.appendChild(element)
<DOM Element: element at 0xb7ca688c>
>>> xml = doc.toxml(encoding='utf-8')
>>> minidom.parseString(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, colum

I believe they should raise some exception when characters not allowed in XML (http://www.w3.org/TR/REC-xml/#NT-Char) are used in attribute values, text nodes and CDATA sections.
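For readers on Python 3, the ElementTree half of the report can be checked with a small round-trip helper. This is only a sketch: `round_trips` is a hypothetical name, not part of the standard library, and it relies on `ElementTree.fromstring` raising `ParseError` for input that is not well-formed.

```python
from xml.etree import ElementTree as ET


def round_trips(text):
    """Return True if an element holding `text` survives tostring/fromstring.

    Hypothetical helper for illustration; not part of xml.etree.
    """
    element = ET.Element("element")
    element.text = text
    # Serialisation does not validate the characters in the text.
    data = ET.tostring(element, encoding="utf-8")
    try:
        return ET.fromstring(data).text == text
    except ET.ParseError:
        # The serialised bytes were not well-formed XML.
        return False


print(round_trips("hello"))
print(round_trips("\0"))
```

The first call should report success, while the second should report failure because the NUL character is written out unchanged and then rejected by the parser.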
msg89685 - (view)	Author: Fredrik Lundh (effbot) *	Date: 2009-06-24 21:53
For ET, that's very much on purpose. Validating data provided by every single application would kill performance for all of them, even if only a small minority would ever try to serialize data that cannot be represented in XML.
msg89699 - (view)	Author: Denis S. Otkidach (ods) *	Date: 2009-06-25 07:33
Every blog engine I've ever seen so far passes comments from untrusted users through to RSS/Atom feeds without proper validation, causing broken XML in feeds. Sure, this is a bug in the web applications, but DOM manipulation packages should prevent the creation of broken XML to help detect errors earlier.
msg95684 - (view)	Author: Andy (strangefeatures)	Date: 2009-11-24 16:09
I'm also of the opinion that this would be a valuable feature to have. I think it's a reasonable expectation that an XML library produces valid XML. It's particularly strange that ET would output XML that it can't itself read. Surely the job of making the input valid falls on the XML creator - that's the point of using libraries in the first place, to abstract away from details like not being able to use characters in the 0-32 range, in the same way that ampersands etc are auto-escaped. Granted, it's not as clear-cut here since the low-range ASCII characters are likely to be less frequent and the strategy to handle them is less clear. I think the sanest behaviour would be to raise an exception by default, although a user-configurable option to replace or omit the characters would also make sense. If impacting performance is a concern, maybe it would make sense to be off by default, but I would have thought that the single regex that could perform the check would have relatively minimal impact - and it seems to be an acceptable overhead on the parsing side, so why not on generation?
msg95689 - (view)	Author: Denis S. Otkidach (ods) *	Date: 2009-11-24 17:26
Here is a regexp I use to clean up text (note that I don't touch "compatibility characters" that are also not recommended in XML; some other developers remove them too):

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
_char_tail = ''
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    return _nontext_sub(replacement, text)
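The snippet above is Python 2 code (`unichr`, `ur''` literals). A possible Python 3 port is sketched below; it assumes a Python 3 `str`, where the full code point range is always available, so the `sys.maxunicode` branch is no longer needed. The constant name `_NONTEXT_RE` is illustrative only.

```python
import re

# Complement of the XML 1.0 Char production:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NONTEXT_RE = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_nontext(text, replacement="\uFFFD"):
    """Replace every character outside the XML 1.0 Char range."""
    return _NONTEXT_RE.sub(replacement, text)


print(repr(replace_nontext("a\x00b")))
```

As in the original, "compatibility characters" are left untouched.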
msg101158 - (view)	Author: Vetoshkin Nikita (nvetoshkin)	Date: 2010-03-16 08:10
What about this example?

>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> el = doc.createElement("Test")
>>> el.setAttribute("with space", "False")
>>> doc.appendChild(el)
<DOM Element: Test at 0xba1440>
>>>
>>> #nahhh
... minidom.parseString(doc.toxml())
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 33
>>>

Is it worth making another bug report?
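Since `setAttribute` accepts the name as given, one application-level guard is to check attribute names before setting them. The sketch below is an assumption-laden illustration: `set_checked_attribute` and `_SIMPLE_NAME_RE` are hypothetical names, and the pattern deliberately accepts only an ASCII subset of the XML Name production, so it rejects some names that XML itself would allow.

```python
import re
from xml.dom import minidom

# Assumption: an ASCII-only subset of the XML Name production. The real
# production also permits many non-ASCII characters, which this rejects.
_SIMPLE_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:\-]*")


def set_checked_attribute(element, name, value):
    """Set an attribute only if `name` looks like a simple XML name."""
    if not _SIMPLE_NAME_RE.fullmatch(name):
        raise ValueError("not a simple XML name: %r" % (name,))
    element.setAttribute(name, value)


doc = minidom.Document()
el = doc.createElement("Test")
doc.appendChild(el)
set_checked_attribute(el, "with_space", "False")
```

With this guard, a call such as `set_checked_attribute(el, "with space", "False")` raises `ValueError` at the point where the bad name enters the tree, rather than failing later in the parser.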
msg111603 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-26 12:01
In msg89685 it's stated that this behaviour is deliberate for ET. Could somebody please comment on the minidom aspects?
msg258343 - (view)	Author: Martin Panter (martin.panter) *	Date: 2016-01-16 00:44
Issue 12129 is open about this sort of problem with xml.dom (which would also apply to minidom I think). If someone wants to suggest a clarification for the Element Tree documentation, that might work. But I tend to agree about not bogging down the implementation.
msg324922 - (view)	Author: Ben Spiller (benspiller) *	Date: 2018-09-10 12:36
Hi, it's been a few years now since this was reported and it's still a problem; any chance of a fix for this? The API gives the impression that if you pass Python strings to the XML API then the library will generate valid XML. It takes care of the charset/encoding and entity escaping aspects of XML generation, so it would be logical for it to take care of control characters in some way too, especially as silently generating unparseable XML is a somewhat dangerous failure mode. I think there's a strong case for some built-in functionality to replace/ignore the control characters (perhaps as a configurable option, in case of performance worries) rather than just throwing an exception, since it's very common to have an arbitrary string generated by some other program or user input that needs to be written into an XML file (and a lot less common to be 100% sure in all cases what characters your string might contain). For those common use cases, the current situation where every Python developer needs to implement their own workaround to sanitize strings isn't ideal, especially as it's not trivial to get it right and likely a lot of the community who end up 'rolling their own' are getting it wrong in some way. [On the other hand, if you guys decide this really isn't going to be fixed, then at the very least I'd suggest that the API documentation should prominently state that it is up to the users of these libraries to implement their own sanitization of control characters, since I'm sure none of us want people using Python to end up with buggy applications.]
msg328040 - (view)	Author: Ben Spiller (benspiller) *	Date: 2018-10-19 11:28
To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:

import re

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every Python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so it would be better to have the escape_xml_illegal_chars function in the Python standard library (maybe alongside xml.sax.saxutils.escape, which notably does _not_ escape all the Unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility that unparseable XML documents will be generated in return for improved performance in the common case where these characters are not present, but not having the capability at all just means most Python applications that do XML generation without special-casing this have a bug. I suggest we definitely need some clear warnings about this in the doc.
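The same workaround can be wrapped in one function on Python 3, where `toxml()` without an `encoding` argument returns a `str`. This is a sketch under that assumption; `serialize_sanitized` and `_ILLEGAL_RE` are hypothetical names, and the character class is the one from the workaround above. Note that replacing characters after serialisation silently changes the data.

```python
import re
from xml.dom import minidom

# Character class taken from the workaround above: C0 controls other than
# tab/LF/CR, the surrogate block, and U+FFFE/U+FFFF.
_ILLEGAL_RE = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def serialize_sanitized(document, replace_with="?"):
    """Serialise a minidom document to UTF-8 bytes, replacing illegal chars."""
    text = document.toxml()  # str when no encoding is passed (Python 3)
    return _ILLEGAL_RE.sub(replace_with, text).encode("utf-8")


doc = minidom.getDOMImplementation().createDocument(None, None, None)
element = doc.createElement("element")
element.appendChild(doc.createTextNode("a\x00b"))
doc.appendChild(element)

data = serialize_sanitized(doc)
reparsed = minidom.parseString(data)  # parses, because NUL was replaced
```

Sanitising once in a helper like this avoids repeating the encode/decode pattern at every call site, though it still leaves the choice of replacement policy to the application.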
msg340981 - (view)	Author: Stefan Behnel (scoder) *	Date: 2019-04-27 11:39
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it. I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
Note that simply replacing invalid characters by the replacement character is not a good solution, at least not in the general case, since it silently corrupts data. It's probably a better solution for users to make their code scream out loudly when it has to deal with data that it cannot serialise in the end, and to do that early on input (where it's easy to debug) rather than late on serialisation where it might be difficult to understand how the data became what it is. Trying to serialise a null-character seems only a symptom of a more important problem somewhere else in the processing pipeline.
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
So, if someone finds a way to augment the text escaping procedure with a bit of character validation without making it slower (especially for the extremely common very short strings), then I think we can reconsider this as an enhancement.
Until then, and seeing that no-one has come up with a patch in the last 10 years, I'll close this as "won't fix".
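The "fail loudly on input" approach recommended in this message could look roughly like the following in application code. It is a sketch, not anything provided by ElementTree: `require_xml_text` is a hypothetical helper, and the pattern is the complement of the XML 1.0 Char production cited earlier in the thread.

```python
import re
from xml.etree import ElementTree as ET

# Anything outside the XML 1.0 Char production
# (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
_INVALID_XML_CHAR_RE = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def require_xml_text(text):
    """Return `text` unchanged, or raise ValueError at the first bad character."""
    match = _INVALID_XML_CHAR_RE.search(text)
    if match is not None:
        raise ValueError(
            "character %r at index %d cannot be represented in XML 1.0"
            % (match.group(), match.start())
        )
    return text


element = ET.Element("element")
element.text = require_xml_text("user supplied text")
```

Calling the check where untrusted text enters the program keeps the error close to its source, and leaves ElementTree's own serialisation path untouched.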

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49416
2019-04-27 11:39:42	scoder	set	status: open -> closed; dependencies: - Document Object Model API - validation; versions: + Python 3.8, - Python 3.4, Python 3.5; nosy: + scoder; messages: +; resolution: wont fix; stage: resolved
2018-11-07 17:21:29	Ben Spiller	set	nosy: + Ben Spiller
2018-10-19 11:28:13	benspiller	set	messages: +
2018-09-10 12:36:41	benspiller	set	nosy: + benspiller; messages: +; versions: + Python 3.5, Python 3.6, Python 3.7
2016-01-16 00:44:53	martin.panter	set	dependencies: + Document Object Model API - validation; messages: +
2015-03-12 19:48:58	ned.deily	link	issue23650 superseder
2014-12-13 01:58:30	martin.panter	set	nosy: + martin.panter
2014-02-03 17:01:35	BreamoreBoy	set	nosy: - BreamoreBoy
2013-09-02 21:19:53	eli.bendersky	set	nosy: + eli.bendersky
2013-09-02 21:19:25	eli.bendersky	link	issue18850 superseder
2012-07-21 13:43:06	flox	set	assignee: effbot ->; components: + XML; versions: + Python 3.4, - Python 2.7, Python 3.2
2011-04-08 18:10:16	santoso.wijaya	set	nosy: + santoso.wijaya
2010-07-26 12:01:14	BreamoreBoy	set	nosy: + BreamoreBoy; messages: +
2010-03-16 08:10:25	nvetoshkin	set	nosy: + nvetoshkin; messages: +
2010-02-16 14:44:14	jwilk	set	nosy: + jwilk
2010-02-16 14:02:11	flox	set	priority: normal; nosy: + flox; type: behavior -> enhancement; versions: + Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2010-02-16 13:50:26	flox	link	issue7599 superseder
2009-11-24 17:26:33	ods	set	messages: +
2009-11-24 16:09:13	strangefeatures	set	nosy: + strangefeatures; messages: +
2009-06-25 07:33:33	ods	set	messages: +
2009-06-24 21:53:38	effbot	set	messages: +
2009-02-06 21:51:10	georg.brandl	set	assignee: effbot; nosy: + effbot
2009-02-06 11:13:43	ods	create