Issue 6233: ElementTree (py3k) doesn't properly encode characters that can't be represented in the specified encoding (original) (raw)

Created on 2009-06-07 21:30 by Neil Muller, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue6233_py3k.diff Neil Muller,2009-06-07 21:47 Simple patch
issue6233_py3k_with_test.diff Neil Muller,2009-06-18 13:12 Combined patch with test case
issue6233-escape_entities.diff jcsalterego,2009-06-23 05:37 Excp handling in _encode + prev submitted test
issue6233-encode_cdata.diff jcsalterego,2009-06-24 23:56 _encode & effbot's _escape_cdata, w/ test
Messages (16)
msg89058 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-07 21:30
In py3k, ElementTree no longer correctly converts characters to entities when they can't be represented in the requested output encoding. Python 2: >>> import xml.etree.ElementTree as ET >>> e = ET.XML("t\xe3t") >>> ET.tostring(e, 'ascii') "\ntãt" Python 3: >>> import xml.etree.ElementTree as ET >>> e = ET.XML("t\xe3t") >>> ET.tostring(e, 'ascii') ..... UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128) It looks like _encode_entity isn't ever called inside ElementTree anymore - it probably should be called as part of _encode for characters that can't be represented.
msg89059 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-07 21:47
Simple possible patch uploaded This doesn't give the expected answer for the test above, but does work when starting from an XML file in utf-8 encoding. I still need to determine why this happens.
msg89276 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-12 12:50
> This doesn't give the expected answer for the test above Which is obviously due to not comparing apples with apples, as I should be using a byte-string in the py3k example. >>> import xml.etree.ElementTree as ET >>> e = ET.XML(b"t\xe3t") >>> ET.tostring(e, 'ascii') Fails without the patch, behaves as expected with the patch.
msg89504 - (view) Author: Neil Muller (Neil Muller) Date: 2009-06-18 13:12
Updated patch - adds a test for this.
msg89505 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-18 13:25
This regression is probably annoying enough to make it a blocker.
msg89583 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 21:33
Umm. Isn't _encode used to encode tags and attribute names? The charref syntax is only valid in CDATA sections and attribute values, which are encoded by the corresponding _escape functions. I suspect this patch will make things blow up on a non-ASCII tag/attribute name.
msg89585 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-21 21:42
Did you look at the 1.3 alpha code base when you came up with this idea? Unfortunately, 1.3's _encode is used for a different purpose... I don't have time to test it tonight, but I suspect that 1.3's escape_data/escape_attrib functions might work better under 3.X; they do the text.replace dance first, and then an explicit text.encode(encoding, "xmlcharrefreplace") at the end. E.g. def _escape_cdata(text, encoding): # escape character data try: # it's worth avoiding do-nothing calls for strings that are # shorter than 500 character, or so. assume that's, by far, # the most common case in most applications. if "&" in text: text = text.replace("&", "&") if "<" in text: text = text.replace("<", "<") if ">" in text: text = text.replace(">", ">") return text.encode(encoding, "xmlcharrefreplace") except (TypeError, AttributeError): _raise_serialization_error(text)
msg89623 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-23 05:37
The attached patch includes Neil's original additions to test_xml_etree.py. I also noticed that _encode_entity wasn't being called in ElementTree in py3k, with the important bit being the nested function escape_entities(), in conjunction with _escape and _escape_map. In 2.x, _encode_entity() is used after _encode() throws Unicode exceptions [1], so I figured it would make sense to take the core functionality of _escape_entities() and integrate it into _encode in the same fashion -- when an exception is thrown. Basically, I: - changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]" - extracted _encode_entity.escape_entities() and made it _escape_entities of module scope - removed _encode_entity() - added UnicodeEncodeError exception in _encode() I'm not sure what the expected outcome is supposed to be when the text is not type bytes but str. With this patch, the output has b"tãt" rather than b"tãt". Hope this is a step in the right direction. [1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941
msg89684 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-24 21:51
That's backwards, unless I'm missing something here: charrefs represent Unicode characters, not UTF-8 byte values. The character "LATIN SMALL LETTER A WITH TILDE" with the character value 227 should be represented as "ã" if serialized to an encoding that doesn't support non-ASCII characters. And there's no need to use RE:s to filter things under 3.X; those parts of ET 1.2 are there for pre-2.0 compatibility. Did you try running the tests with the escape function I posted?
msg89690 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-24 23:56
Thanks for the explanation -- looks like I was way off base on that one. I took a look at the code you provided but it doesn't work as a drop-in replacement for _escape_cdata, since that function returns a string rather than bytes. However taking your code, calling it _encode_cdata and then refactoring all calls _encode(_escape_cdata(x), encoding) to _encode_cdata(x, encoding) seems to do the trick and passes the tests. Specific example: - file.write(_encode(_escape_cdata(node.text), encoding)) + file.write(_encode_cdata(node.text, encoding)) One minor modification is to return the string as is if encoding=None, just like _encode: def _encode_cdata(text, encoding): # escape character data try: text = text.replace("&", "&") text = text.replace("<", "<") text = text.replace(">", ">") if encoding: return text.encode(encoding, "xmlcharrefreplace") else: return text except (TypeError, AttributeError): _raise_serialization_error(text)
msg89715 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-06-25 21:26
effbot, do you have an opinion about the latest patch? It'd be nice to not have to delay the release for this.
msg89718 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-06-26 01:12
I disagree with this report being classified as release-critical - it is *not* a regression over 3.0 (i.e. 3.0 already behaved in the same way). That it is a regression relative to 2.x should not make it release-critical - we can still fix such regressions in 3.2. In addition, there is an easy work-around for applications that run into the problem - just use utf-8 as the output encoding always: py> e = ET.XML(b"t\xe3t") py> ET.tostring(e,encoding='utf-8') b't\xc3\xa3t'
msg89722 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2009-06-26 03:30
+1 for Py3.1.1
msg89728 - (view) Author: Jerry Chen (jcsalterego) Date: 2009-06-26 14:21
Either way, it would be nice to get feedback so we can iterate on the patch or close out this issue already :-)
msg95045 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-11-08 20:43
The patch looks ok to me.
msg99125 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-09 16:56
Committed in r78123 (py3k) and r78124 (3.1). I've also removed _escape_cdata() since it wasn't used anymore. Thanks Jerry for the patch.
History
Date User Action Args
2022-04-11 14:56:49 admin set github: 50482
2010-02-09 16:56:25 pitrou set status: open -> closedresolution: fixedmessages: + stage: resolved
2009-11-08 20:43:14 pitrou set messages: +
2009-06-26 14:21:19 jcsalterego set messages: +
2009-06-26 11:13:39 pitrou set priority: release blocker -> criticalversions: + Python 3.2, - Python 3.0
2009-06-26 03:30:55 rhettinger set nosy: + rhettingermessages: +
2009-06-26 01:12:28 loewis set nosy: + loewismessages: +
2009-06-25 21:26:50 benjamin.peterson set nosy: + benjamin.petersonmessages: +
2009-06-24 23:56:04 jcsalterego set files: + issue6233-encode_cdata.diffmessages: +
2009-06-24 21:51:25 effbot set messages: +
2009-06-23 05:37:25 jcsalterego set files: + issue6233-escape_entities.diffnosy: + jcsalteregomessages: +
2009-06-21 21:42:03 effbot set messages: +
2009-06-21 21:33:01 effbot set messages: +
2009-06-18 13:25:14 pitrou set priority: release blockernosy: + pitroumessages: +
2009-06-18 13:12:25 Neil Muller set files: + issue6233_py3k_with_test.diffmessages: +
2009-06-12 12:50:24 Neil Muller set messages: +
2009-06-07 21:47:42 Neil Muller set files: + issue6233_py3k.diffkeywords: + patchmessages: +
2009-06-07 21:30:58 Neil Muller create