Issue 9692: UnicodeDecodeError in ElementTree.tostring()

Created on 2010-08-26 14:42 by uis, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (7)
msg114980 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 14:42
The following code leads to a UnicodeError in Python 2.7, while it works fine in 2.6 and 2.5:

# -*- coding: latin-1 -*-
import xml.etree.cElementTree as ElementTree
oDoc = ElementTree.fromstring( '' )
oDoc.set( "ATTR", "ÄÖÜ" )
print ElementTree.tostring( oDoc , encoding="iso-8859-1" )
msg114984 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-08-26 15:17
IMO the code is not correct: how does ElementTree know which encoding is used for the attribute value? Even 2.5 prints a different content when the script is saved with a different encoding. The line should look like: oDoc.set( "ATTR", u"ÄÖÜ" ) or use ascii-only characters.
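A minimal sketch of the suggestion above, assuming the attribute value is held as text rather than as an encoded byte string. The `<ROOT/>` document is a placeholder (the document from the report is not shown), and the value is written with escapes so the example does not depend on the source file's encoding.

```python
import xml.etree.ElementTree as ElementTree

# "<ROOT/>" is a placeholder document, not the one from the report.
root = ElementTree.fromstring("<ROOT/>")

# Text value: A-umlaut, O-umlaut, U-umlaut, written as escapes.
root.set("ATTR", u"\xc4\xd6\xdc")

# Latin-1 can represent all three characters, so each becomes one byte.
as_latin1 = ElementTree.tostring(root, encoding="iso-8859-1")

# ASCII cannot, so each is written as a numeric character reference.
as_ascii = ElementTree.tostring(root, encoding="us-ascii")

print(as_latin1)
print(as_ascii)
```

With a text value, the serializer knows which characters are meant and can pick a representation that fits the requested output encoding.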
msg115002 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 16:21
Of course, if you use a unicode string it works, and of course it would be easy to switch to unicode for this demo code. Unfortunately, the affected application is a little more complex and it is not that easy to switch to unicode. I just wonder why the tostring() method does not assume that internal strings are encoded in the explicitly provided encoding. Is ElementTree restricted to the use of unicode strings? Anyway, why was it working (as expected) with python 2.5 & python 2.6?
msg115003 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-08-26 16:26
Testing with python 2.5: oDoc.set("ATTR", "ÄÖÜ") uses the encoding used by the source code (with "# -*- coding:"). If I use utf-8 instead, the output is: which contains the numbers of the 3 pairs of surrogates.
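The dependence on the source encoding can be illustrated without ElementTree. The sketch below is an assumption about the mechanism, not a reproduction of the output reported above: the same three characters occupy three bytes in Latin-1 but six in UTF-8, and if the UTF-8 bytes are read as Latin-1 they turn into six separate characters, each of which then gets its own numeric reference.

```python
# The intended text: A-umlaut, O-umlaut, U-umlaut.
text = u"\xc4\xd6\xdc"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes per character

# Reading the UTF-8 bytes as Latin-1 yields six unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

# Serializing those six characters to ASCII gives six numeric references.
refs = misread.encode("ascii", "xmlcharrefreplace")

print(len(latin1_bytes), len(utf8_bytes), len(misread))
print(refs)
```

A byte string alone does not say which of the two readings is correct, which is why the library cannot guess it.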
msg115012 - (view)	Author: Ulrich Seidl (uis)	Date: 2010-08-26 17:59
Well, the output of the print is not that interesting as long as ElementTree is able to restore the former attribute's value when reading it in again. The print was just used to illustrate that a UnicodeDecodeError appears. Think about doing an ElementTree.fromstring( ... ).get( "ATTR" ).encode( "iso-8859-1" ).
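The round trip described here can be sketched as follows, assuming the application decodes its byte string once before handing it to ElementTree. `ENCODING`, `original` and the `<ROOT/>` document are illustrative names and values, not taken from the report.

```python
import xml.etree.ElementTree as ElementTree

ENCODING = "iso-8859-1"

# Bytes as the application might hold them (Latin-1 encoded umlauts).
original = b"\xc4\xd6\xdc"

# "<ROOT/>" is a placeholder document.
root = ElementTree.fromstring("<ROOT/>")

# Decode once at the boundary so ElementTree receives text.
root.set("ATTR", original.decode(ENCODING))

# Serialize, parse again, and encode the attribute back to bytes.
serialized = ElementTree.tostring(root, encoding=ENCODING)
restored = ElementTree.fromstring(serialized).get("ATTR").encode(ENCODING)

print(restored == original)
```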
msg126005 - (view)	Author: Ulrich Seidl (uis)	Date: 2011-01-11 13:33
I would suggest adding an additional except branch to (at least) the following functions of ElementTree.py:
* _encode,
* _escape_attrib, and
* _escape_cdata

The except branch could look like:

except (UnicodeDecodeError):
    return text.decode( encoding ).encode( encoding, "xmlcharrefreplace")
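The idea behind the proposed branch can be sketched as a standalone function. This is not ElementTree code: `encode_for_xml` is a hypothetical helper, and it uses an explicit type check instead of catching UnicodeDecodeError so that it also runs where byte strings have no encode() method.

```python
def encode_for_xml(text, encoding):
    """Hypothetical helper: encode text for XML output in `encoding`.

    Byte strings are assumed to already be in `encoding` and are
    decoded first; this mirrors the assumption made in the proposal.
    """
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # Characters the target encoding cannot represent become &#NNN;.
    return text.encode(encoding, "xmlcharrefreplace")


# Latin-1 bytes pass through unchanged.
print(encode_for_xml(b"\xc4\xd6\xdc", "iso-8859-1"))

# The euro sign is not in Latin-1, so it becomes a numeric reference.
print(encode_for_xml(u"\u20ac", "iso-8859-1"))
```

The sketch shares the weakness of the proposal: if the byte string is actually in a different encoding than the one passed in, the result is silently wrong rather than an error.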
msg166023 - (view)	Author: Florent Xicluna (flox) *	Date: 2012-07-21 13:33
I propose to close this as won't fix. The upgrade to ElementTree 1.3 brought some consistency when dealing with Unicode and encodings. The reported behavior was only seen in Python 2.7, when using bytes improperly.

History
Date	User	Action	Args
2022-04-11 14:57:05	admin	set	github: 53901
2012-07-21 13:33:29	flox	set	status: open -> closed, nosy: + eli.bendersky, messages: +, resolution: wont fix
2011-01-11 13:33:53	uis	set	messages: +
2010-08-26 17:59:21	uis	set	messages: +
2010-08-26 16:26:50	amaury.forgeotdarc	set	messages: +
2010-08-26 16:21:04	uis	set	messages: +
2010-08-26 15:18:00	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc, messages: +
2010-08-26 14:47:14	brian.curtin	set	nosy: + flox, type: behavior, stage: needs patch
2010-08-26 14:42:54	uis	create