msg169886 - (view) |
Author: Wim (wiml) |
Date: 2012-09-05 18:54 |
Encoding a (well-formed) Unicode string containing a non-BMP character, using the xmlcharrefreplace error handler, will produce two XML entities for surrogate codepoints instead of one entity for the actual character. Here's a transcript (Python 2.7.3, x86_64): >>> b = '\xf0\x9f\x92\x9d' >>> u = b.decode('utf8') >>> u u'\U0001f49d' >>> u.encode('ascii', errors='xmlcharrefreplace') '��' >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') '��' >>> list(u) [u'\ud83d', u'\udc9d'] >>> u.encode('utf8', errors='xmlcharrefreplace') '\xf0\x9f\x92\x9d' The utf8 bytestring is correctly decoded, and the print representation shows one single Unicode character. Encoding using xmlcharrefreplace produces two XML entities, which is wrong[1]: a single non-BMP character should be represented in XML as a single entity reference, in this case presumably '💝'. As the last two lines show, I'm using a narrow build (so the unicode strings are represented internally in UTF-16, I guess). Converting the string back to utf8 does the right thing, and emits a single UTF8 sequence representing the supplementary-plane codepoint. (FWIW, the backslashreplace error handler also emits a surrogate pair, but I don't know if there is a complete specification for what that handler does, so it's possible that it's not wrong.) [1] http://www.w3.org/International/questions/qa-escapes#bytheway |
|
|
msg169909 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-09-06 09:53 |
Attached patch against 3.2 seems to fix the problem. |
|
|
msg169910 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2012-09-06 10:26 |
Note that there's similar code in charmap_encoding_error, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, and PyUnicode_EncodeDecimal, however I'm not sure how to reach these paths. |
|
|
msg169911 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-09-06 11:06 |
Thanks to the PEP 393, this issue is already fixed in Python 3.3. $ ./python Python 3.3.0rc1+ (default:ba2c1def3710+, Sep 3 2012, 23:20:25) [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace') b'💝' |
|
|
msg183446 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-03-04 13:20 |
I prefer a little different (simpler for me) form: for (p = collstart; p < collend;) { Py_UCS4 ch = *p++; if ((0xD800 <= ch && ch <= 0xDBFF) && (p < collend) && (0xDC00 <= *p && *p <= 0xDFFF)) { ch = ((((ch & 0x03FF) << 10) | ((Py_UCS4)*p++ & 0x03FF)) + 0x10000); } str += sprintf(str, "&#%d;", (int)ch); } And please look at the loop above ("determine replacement size"). It should be corrected too. It will be simpler to use a buffer with static size (``char buffer[2+29+1+1];``) as in charmap encoder. Perhaps charmap encoder should be fixed too (and common code extracted to separate function). I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as b'💝' and not as b'��'? This will be compatible with narrow build but will break compatibility with 3.3+. What is less evil? |
|
|
msg183482 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-03-04 18:10 |
> I doubt about '\ud83d\udc9d' on wide build. Is it right to encode it as > b'💝' and not as b'��'? I don't think so. IIRC surrogates are invalid in UTF-32, and certainly shouldn't be recombined. > This will be compatible with narrow build but will break compatibility > with 3.3+. What is less evil? I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Do you want to propose a new patch? |
|
|
msg183524 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-03-05 13:33 |
> I think it's better to be compatible with 3.3+. This is anyway a rather obscure corner case. Well, we should not introduce new divergence between 3.2 wide build and 3.3. > Do you want to propose a new patch? I will do it. |
|
|
msg183956 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-03-11 12:54 |
Here is a patch which fixes xmlcharrefreplace error handling in other places. Unfortunately multibyte asian encoders are broken yet. I'll open a separate issue for this. |
|
|
msg186921 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2013-04-14 14:28 |
Should we really invest time to fix bugs related to astral (non-BMP) characters with rare codecs and error handlers (CJK codecs, xmlcharrefreplace error handler)? Python 3.3 is released and has a much better support of astral characters (in many places). I don't know for CJK codecs: Python 3.3 still uses the legacy Unicode API for CJK codecs and so depend on the wchar_t type (which is 16 bits on Windows). I just fixed Python 3.4 to use the new Unicode API (PEP 393), which always support astral characters support and don't depend on the size of the wchar_t type. I'm not against fixing Python 2.7, I'm just not interested. |
|
|
msg186934 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-04-14 17:37 |
I tend to agree with Victor: if you want to fix 2.7 go ahead, but if that's too much work it's OK with me to close this issue. |
|
|
msg194546 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2013-08-06 13:57 |
New changeset 719ee60fc5e2 by Serhiy Storchaka in branch '2.7': Issue #15866: The xmlcharrefreplace error handler no more produces two XML http://hg.python.org/cpython/rev/719ee60fc5e2 |
|
|