Issue 9200: Make the str.is* methods work with non-BMP chars on narrow builds (original) (raw)

Created on 2010-07-08 13:50 by amaury.forgeotdarc, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
join-surrogates.patch amaury.forgeotdarc,2010-07-09 17:13 review
issue9200.diff belopolsky,2010-11-27 01:57 review
issue9200.diff ezio.melotti,2011-08-18 13:59 review
issue9200-2.diff ezio.melotti,2011-08-19 15:54 review
Messages (18)
msg109542 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-08 13:50
On narrow unicode builds: unicodedata.category(chr(0x10000)) == 'Lo' # correct Py_UNICODE_ISPRINTABLE(0x10000) == 1 # correct str.isprintable(chr(0x10000)) == False # inconsistent On narrow unicode builds, large code points are stored with a surrogate pair. But str.isprintable() simply loops over the Py_UNICODE array, and test the surrogates separately. There should be a way to walk a unicode string in C, character by character, and the str methods (str.is*, str.to*) should use it.
msg109766 - (view) Author: Adam Olsen (Rhamphoryncus) Date: 2010-07-09 16:38
There should be a way to walk the unicode string in Python too. Afaik there isn't.
msg109775 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-09 17:13
A "proof of concept" patch, which shows the macros used to walk a unicode string and uses them in unicode_repr() (should not change behaviour) and in unicode_isprintable() (should fix the issue). Other functions should be changed the same way, of course.
msg122465 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-26 16:35
AFAICT, all ctype methods (isalpha, isdigit, etc.) have the same problem. I posted a patch at that introduces a Py_UNICODE_NEXT() macro that can help fixing all these methods. I am adding #10542 as a dependency and if there are no objections, I will change the title to extend the scope of this issue to cover all ctype methods.
msg122498 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-27 01:57
Attached patch fixes isprintable and other ctype-like methods. I left isspace() out for now because I could not find a test character outside of BMP to test with, but I think we should fix that for completeness as well. At this point the goal is mostly to showcase Py_UNICODE_NEXT(), not completeness. See . I also noticed that upper/lower/swapcase methods have the same issue: >>> '\N{MATHEMATICAL BOLD CAPITAL A}'.lower() == '\N{MATHEMATICAL BOLD CAPITAL A}' True This will be a subject of a separate issue.
msg142116 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-15 11:01
I closed #12730 as a duplicate of this and updated the title of this issue.
msg142316 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-18 13:49
The attached patch fixes all the str.is* methods and makes them work on narrow builds with non-BMP chars. It also includes the _Py_UNICODE_NEXT macro proposed in #10542.
msg142318 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-18 13:57
(Note: I copied the macros from the other patch without changing the name. If the approach is good I'll get rid of the prefixes and separate the words in IS{HIGH|LOW}SURROGATE with an _.)
msg142355 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-18 16:18
I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER(). _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT. The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless. It looks like the macro can be simplified to something like: #define _Py_UNICODE_NEXT(ptr, end) \ (_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \ ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \ (Py_UCS4)*(ptr)++) (you don't need two "a?b:c") I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as: #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) | Py_UNICODE_ISDECIMAL(ch)
msg142363 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-08-18 16:38
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > > I don't think that macros specific to unicodeobject.c should get the _PY_UNICODE_ prefix. "_Py_" prefix is reserved to exported symbols, but symbols reserved for the Python interprefer itself. For _Py_UNICODE_NEXT, you can call it NEXT_CHARACTER(). Can we please stick to the things we discussed on . The macros are intended to be public starting with 3.3, not private and they are meant to be used outside the interpreter as well. They will only be private for patch level release patches. For those you don't need the Py-prefix, but it also doesn't hurt having it there. > _Py_UNICODE_ISHIGHSURROGATE,_Py_UNICODE_ISLOWSURROGATE and _Py_UNICODE_JOIN_SURROGATES are only used once, I would prefer to see them inlined in _Py_UNICODE_NEXT. > > The first cast to Py_UCS4 in _Py_UNICODE_JOIN_SURROGATES is useless. > > It looks like the macro can be simplified to something like: > > #define _Py_UNICODE_NEXT(ptr, end) \ > (_Py_UNICODE_ISHIGHSURROGATE(*(ptr)) && (ptr) < (end)) && _Py_UNICODE_ISLOWSURROGATE((ptr)[1] ? \ > ((ptr) += 2,_Py_UNICODE_JOIN_SURROGATES((ptr)[-2], (ptr)[-1])) : \ > (Py_UCS4)*(ptr)++) > > (you don't need two "a?b:c") > > I would prefer to see _Py_UNICODE_NEXT as a function instead of a macro because it has border effect (ptr++ or ptr += 2). You cannot write Py_UNICODE_ISALNUM(_Py_UNICODE_NEXT(p, e)) for example, because Py_UNICODE_ISALNUM is defined as: > #define Py_UNICODE_ISALNUM(ch) (Py_UNICODE_ISALPHA(ch) | Py_UNICODE_ISDECIMAL(ch)
msg142468 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-19 15:54
Here's a new version of the patch. I decided to leave the prefix anyway, for consistency with what I'll commit to 3.3 and because without the prefix NEXT() looks ambiguous (and it's not entirely clear if it's private or not). I rewrote the macro as Victor suggested and tested that it still works (I also added a test with surrogates). The macros are now called _Py_UNICODE_IS_{LOW|HIGH}_SURROGATE, with '_'s. I also tried the implementation proposed in #12751 and benchmarked with: $ ./python -m timeit -s 's = "\uD800\uD8000\uDFFF\uDFFF\uDFFF"*1000' 's.islower()' and got "1000 loops, best of 3: 345 usec per loop" on both, so I left the old version because I think it's more readable. Finally, I rewrote the comment about the macro, adding a note about its side effects.
msg142720 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 12:13
It turned out that this can't be fixed in 2.7 unless we backport the patch in #5127 (it's in 3.2/3.3 but not in 2.7). IIUC the macro works fine and joins surrogate pairs to a Py_UCS4 char, but since the Py_UNICODE_IS* macros still expect Py_UCS2 on narrow builds on 2.7, the higher bits gets truncated and the macros return wrong results. So, for example >>> u'\ud800\udc42'.isupper() True because \ud800 + \udc42 = \U000100429 → \U000100429 gets truncated to \u0429 → \u0429 is the CYRILLIC CAPITAL LETTER SHCHA → .isupper() returns True. The current behavior is instead broken in another way, because it checks that u'\ud800'.isupper() and u'\udc42'.isupper() separately. Would it make sense to backport #5127 or should I just give up and leave it broken?
msg142736 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-08-22 17:31
New changeset 06b30c5bcc3d by Ezio Melotti in branch '3.2': #9200: The str.is* methods now work with strings that contain non-BMP characters even in narrow Unicode builds. http://hg.python.org/cpython/rev/06b30c5bcc3d New changeset e3be2941c834 by Ezio Melotti in branch 'default': #9200: merge with 3.2. http://hg.python.org/cpython/rev/e3be2941c834
msg142744 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-08-22 20:03
This issue and #5127 should not be backported to 2.7: narrow builds don't even accept unichar(0x10000). Only python 3 can slowly pretend to implement utf-16 features.
msg142745 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-08-22 20:20
> This issue and #5127 should not be backported to 2.7: > narrow builds don't even accept unichar(0x10000). I agee.
msg142746 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-08-22 20:46
New changeset 75a4941d4d61 by Ezio Melotti in branch '2.7': #9200: backport tests but run them on wide builds only. http://hg.python.org/cpython/rev/75a4941d4d61
msg142747 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 20:52
Backporting #5127 is not possible anyway, because it would be necessary to recompile. I backported only the tests, skipping them on wide builds.
msg142748 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-08-22 20:55
s/skipping them on wide builds/skipping them on narrow builds/ On wide builds they work fine, on narrow builds they don't work and they can't be fixed.
History
Date User Action Args
2022-04-11 14:57:03 admin set github: 53446
2011-08-22 20:55:17 ezio.melotti set messages: +
2011-08-22 20:52:25 ezio.melotti set status: open -> closedversions: - Python 2.7messages: + dependencies: - Py_UNICODE_NEXT and other macros for surrogatesresolution: fixedstage: commit review -> resolved
2011-08-22 20:46:41 python-dev set messages: +
2011-08-22 20:20:51 vstinner set messages: +
2011-08-22 20:03:23 amaury.forgeotdarc set messages: +
2011-08-22 17:31:32 python-dev set nosy: + python-devmessages: +
2011-08-22 12:13:58 ezio.melotti set messages: + title: Make str methods work with non-BMP chars on narrow builds -> Make the str.is* methods work with non-BMP chars on narrow builds
2011-08-22 11:08:07 ezio.melotti set assignee: belopolsky -> ezio.melotti
2011-08-19 15:55:02 ezio.melotti set files: + issue9200-2.diffstage: patch review -> commit reviewmessages: + versions: + Python 2.7, Python 3.3
2011-08-18 16:38:11 lemburg set messages: +
2011-08-18 16🔞45 vstinner set nosy: + vstinnermessages: +
2011-08-18 13:59:04 ezio.melotti set files: + issue9200.diff
2011-08-18 13:58:54 ezio.melotti set files: - issue9200.diff
2011-08-18 13:57:53 ezio.melotti set messages: +
2011-08-18 13:49:52 ezio.melotti set files: + issue9200.diffmessages: +
2011-08-15 19:35:01 Arfrever set nosy: + Arfrever
2011-08-15 11:02:22 ezio.melotti set nosy: + tchrist
2011-08-15 11:01:48 ezio.melotti set messages: + title: str.isprintable() is always False for large code points -> Make str methods work with non-BMP chars on narrow builds
2011-08-15 10:59:05 ezio.melotti link issue12730 superseder
2010-11-27 01:57:14 belopolsky set files: + issue9200.diffmessages: +
2010-11-26 16:35:11 belopolsky set assignee: belopolskydependencies: + Py_UNICODE_NEXT and other macros for surrogatestype: behaviorcomponents: + Interpreter Corenosy: + belopolskymessages: + stage: patch review
2010-07-09 17:13:08 amaury.forgeotdarc set files: + join-surrogates.patchkeywords: + patchmessages: +
2010-07-09 16:38:12 Rhamphoryncus set nosy: + Rhamphoryncusmessages: +
2010-07-08 13:50:01 amaury.forgeotdarc create