Issue 15866: encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs (original) (raw)

Issue15866

Created on 2012-09-05 18:54 by wiml, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
issue15866.diff ezio.melotti,2012-09-06 09:53 Patch against 3.2.
issue15866_2.patch serhiy.storchaka,2013-03-11 12:54 Patch against 3.2. review
Messages (11)
msg169886 - (view) Author: Wim (wiml) Date: 2012-09-05 18:54
Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, will produce two XML entities for surrogate codepoints instead of one entity for the actual character. Here's a transcript (Python 2.7.3, x86_64): >>> b = '\xf0\x9f\x92\x9d' >>> u = b.decode('utf8') >>> u u'\U0001f49d' >>> u.encode('ascii', errors='xmlcharrefreplace') '��' >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') '��' >>> list(u) [u'\ud83d', u'\udc9d'] >>> u.encode('utf8', errors='xmlcharrefreplace') '\xf0\x9f\x92\x9d' The utf8 bytestring is correctly decoded, and the print representation shows one single Unicode character. Encoding using xmlcharrefreplace produces two XML entities, which is wrong[1]: a single non-BMP character should be represented in XML as a single entity reference, in this case presumably '💝'. As the last two lines show, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to utf8 does the right thing, and emits a single UTF8 sequence representing the supplementary-plane codepoint. (FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.) [1] http://www.w3.org/International/questions/qa-escapes#bytheway
msg169909 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-09-06 09:53
Attached patch against 3.2 seems to fix the problem.
msg169910 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-09-06 10:26
Note that there's similar code in charmap_encoding_error, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, and PyUnicode_EncodeDecimal, however I'm not sure how to reach these paths.
msg169911 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-09-06 11:06
Thanks to the PEP 393, this issue is already fixed in Python 3.3. $ ./python Python 3.3.0rc1+ (default:ba2c1def3710+, Sep 3 2012, 23:20:25) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') b'💝'
msg183446 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-04 13:20
I prefer a little different (simpler for me) form: for (p = collstart; p < collend;) { Py_UCS4 ch = *p++; if ((0xD800 <= ch && ch <= 0xDBFF) && (p < collend) && (0xDC00 <= *p && *p <= 0xDFFF)) { ch = ((((ch & 0x03FF) << 10) | ((Py_UCS4)*p++ & 0x03FF)) + 0x10000); } str += sprintf(str, "&#%d;", (int)ch); } And please look at the loop above ("determine replacement size"). It should be corrected too. It will be simpler to use a buffer with static size (``char buffer[2+29+1+1];``) as in charmap encoder. Perhaps charmap encoder should be fixed too (and common code extracted to separate function). I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as b'💝' and not as b'��'? This will be compatible with narrow build but will break compatibility with 3.3+. What is less evil?
msg183482 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-04 18:10
> I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as > b'💝' and not as b'��'? I don't think so. IIRC surrogates are invalid in UTF-32, and certainly shouldn't be recombined. > This will be compatible with narrow build but will break compatibility > with 3.3+. What is less evil? I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Do you want to propose a new patch?
msg183524 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-05 13:33
> I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Well, we should not introduce new divergence between 3.2 wide build and 3.3. > Do you want to propose a new patch? I will do it.
msg183956 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-11 12:54
Here is a patch which fixes xmlcharrefreplace error handling in other places. Unfortunately multibyte asian encoders are broken yet. I'll open a separate issue for this.
msg186921 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-04-14 14:28
Should we really invest time to fix bugs related to astral (non-BMP) characters with rare codecs and error handlers (CJK codecs, xmlcharrefreplace error handler)? Python 3.3 is released and has a much better support of astral characters (in many places). I don't know for CJK codecs: Python 3.3 still uses the legacy Unicode API for CJK codecs and so depend on the wchar_t type (which is 16 bits on Windows). I just fixed Python 3.4 to use the new Unicode API (PEP 393), which always support astral characters support and don't depend on the size of the wchar_t type. I'm not against fixing Python 2.7, I'm just not interested.
msg186934 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-04-14 17:37
I tend to agree with Victor: if you want to fix 2.7 go ahead, but if that's too much work it's OK with me to close this issue.
msg194546 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-08-06 13:57
New changeset 719ee60fc5e2 by Serhiy Storchaka in branch '2.7': Issue #15866: The xmlcharrefreplace error handler no more produces two XML http://hg.python.org/cpython/rev/719ee60fc5e2
History
Date User Action Args
2022-04-11 14:57:35 admin set github: 60070
2013-08-06 13:58:20 serhiy.storchaka set status: open -> closedassignee: serhiy.storchakaresolution: fixedstage: patch review -> resolved
2013-08-06 13:57:03 python-dev set nosy: + python-devmessages: +
2013-04-14 17:37:30 ezio.melotti set messages: +
2013-04-14 14:28:38 vstinner set messages: +
2013-04-13 17:58:08 serhiy.storchaka set versions: - Python 3.2
2013-03-11 12:54:58 serhiy.storchaka set files: + issue15866_2.patchmessages: +
2013-03-05 13:33:53 serhiy.storchaka set messages: +
2013-03-04 18:10:44 ezio.melotti set messages: +
2013-03-04 13:20:33 serhiy.storchaka set messages: +
2013-03-02 13:43:03 ezio.melotti set nosy: + serhiy.storchaka
2012-09-06 11:06:50 vstinner set nosy: + vstinnermessages: +
2012-09-06 10:26:14 ezio.melotti set nosy: + lemburgmessages: +
2012-09-06 09:53:58 ezio.melotti set files: + issue15866.diffversions: + Python 3.2messages: + keywords: + patchstage: patch review
2012-09-05 18:54:33 wiml create