msg193787 |
Author: Steven D'Aprano (steven.daprano) |
Date: 2013-07-27 16:12 |
The documentation for string escapes suggests that \uxxxx escapes can be used to generate characters in the Supplementary Multilingual Planes by using surrogate pairs:

"Individual code units which form parts of a surrogate pair can be encoded using this escape sequence."

http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

E.g. in Python 3.2:

py> '\uD80C\uDC80' == '\U00013080'
True

but that is no longer the case in Python 3.3. I suggest the documentation should just remove that note. |
|
|
msg193790 |
Author: Terry J. Reedy (terry.reedy) |
Date: 2013-07-27 20:03 |
3.3.2:

>>> '\uD80C\uDC80' == '\U00013080'
False

The statement that surrogate code units can be encoded this way is still true. Indeed, it is now the only way to get such code units into a string. The suggestion that a pair will make an astral char is now false. The sentence could be changed to

"Individual surrogate code units can be encoded using this escape sequence."

On the other hand, the same is true of *any* BMP char, including all the *other* non-graphic chars that can only be entered this way. So I think the sentence, if not deleted, should be replaced by what seems to me a more useful (complete) statement:

"Any Basic Multilingual Plane (BMP) codepoint can be encoded using this escape sequence." |
|
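As an illustration of the behaviour discussed above, here is a minimal sketch that assumes Python 3.3 or later. It only inspects lengths and code point values, because printing lone surrogates can fail on some terminals. The variable names are illustrative and do not come from the thread.

```python
# Sketch, assuming Python 3.3+: \u escapes produce individual code points.
pair = '\uD80C\uDC80'    # two surrogate code points, not one character
astral = '\U00013080'    # a single non-BMP code point

print(len(pair), len(astral))        # lengths of the two strings
print([hex(ord(c)) for c in pair])   # the two surrogate values
print(pair == astral)                # the comparison from the report

# Any BMP code point, graphic or not, can be written with \u.
bmp_examples = ['\u00e9', '\u200b', '\ufeff']
print([hex(ord(c)) for c in bmp_examples])
```

On a 3.3+ interpreter the pair stays two separate code points, which is why the equality test no longer holds.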
|
msg193860 |
Author: R. David Murray (r.david.murray) |
Date: 2013-07-29 12:27 |
Python 3.2.3 (default, Jun 15 2013, 14:13:52)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> '\uD80C\uDC80'
'\ud80c\udc80'
>>> '\uD80C\uDC80' == '\U00013080'
False |
|
|
msg193870 |
Author: Steven D'Aprano (steven.daprano) |
Date: 2013-07-29 15:03 |
On 29/07/13 22:27, R. David Murray wrote:
> >>> '\uD80C\uDC80' == '\U00013080'
> False

Are you running a wide build? In a narrow build, it returns True. |
|
|
msg193881 |
Author: R. David Murray (r.david.murray) |
Date: 2013-07-29 16:58 |
Probably. I think the default build on Gentoo is wide. That seems to make the existing text even more incorrect :) |
|
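One way to tell which kind of build an interpreter is, sketched here under the assumption that `sys.maxunicode` is the relevant indicator: it is 0xFFFF on narrow builds of Python 3.2 and earlier, and 0x10FFFF on wide builds and on every build from 3.3 onward. The helper name `describe_unicode_build` is hypothetical.

```python
import sys

def describe_unicode_build():
    """Return a rough label for this interpreter's str storage (illustrative helper)."""
    if sys.maxunicode == 0xFFFF:
        # Narrow build (3.2 and earlier): non-BMP characters are stored as
        # surrogate pairs, so the pair compares equal to the \U escape.
        return "narrow"
    # Wide build, or any 3.3+ build: one code point per string element.
    return "wide or flexible (3.3+)"

pair_equals_astral = '\uD80C\uDC80' == '\U00013080'
print(describe_unicode_build(), pair_equals_astral)
```

This matches the two results reported in the thread: True on a narrow 3.2 build, False on a wide 3.2 build and on 3.3.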
|
msg194671 |
Author: Ezio Melotti (ezio.melotti) |
Date: 2013-08-08 13:34 |
I think it's OK to remove the sentence. Converting a surrogate pair to a non-BMP char is something that works only while decoding a UTF-16 byte sequence. Surrogates are invalid in UTF-8/32, and while dealing with Unicode strings, surrogates have no special meaning and are no different from any other codepoint, whether they are lone or paired. |
|
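The distinction drawn above can be sketched as follows, assuming Python 3.4 or later, where the `surrogatepass` error handler also works with the UTF-16 codecs. The helper names `roundtrip_utf16` and `strict_encode_fails` are hypothetical and used only for illustration.

```python
PAIR = '\uD80C\uDC80'     # two surrogate code points in a str
ASTRAL = '\U00013080'     # the non-BMP character they would represent in UTF-16

def roundtrip_utf16(text):
    """Encode to UTF-16 (letting surrogates through), then decode again."""
    raw = text.encode('utf-16-le', 'surrogatepass')
    # The UTF-16 decoder is the step that combines a valid surrogate pair.
    return raw.decode('utf-16-le')

def strict_encode_fails(text, encoding):
    """Return True if a strict encode of text raises UnicodeEncodeError."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False

print(roundtrip_utf16(PAIR) == ASTRAL)     # pair becomes one character after decoding
print(strict_encode_fails(PAIR, 'utf-8'))  # surrogates are rejected by strict UTF-8
print(strict_encode_fails(PAIR, 'utf-32')) # and by strict UTF-32
```

Inside a `str`, the pair is just two code points. Only the UTF-16 decoding step gives it a combined meaning.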
|
msg264080 |
Author: Roundup Robot (python-dev)  |
Date: 2016-04-24 00:13 |
New changeset 79e7808c3941 by Berker Peksag in branch '3.5':
Issue #18572: Remove redundant note about surrogates in string escape doc
https://hg.python.org/cpython/rev/79e7808c3941

New changeset ee815d3535f5 by Berker Peksag in branch 'default':
Issue #18572: Remove redundant note about surrogates in string escape doc
https://hg.python.org/cpython/rev/ee815d3535f5 |
|
|
msg264081 |
Author: Berker Peksag (berker.peksag) |
Date: 2016-04-24 00:14 |
I removed the sentence in the 3.5 and default branches. |
|
|