[Python-Dev] UCS2/UCS4 default

Amaury Forgeot d'Arc amauryfa at gmail.com
Thu Jul 3 17:31:57 CEST 2008


Hello,

2008/7/3 Guido van Rossum <guido at python.org>:

> I don't see an answer there to the question of whether the length() method of a Java String object containing a single surrogate pair returns 1 or 2; I suspect it returns 2. Python 3 supports things like chr(0x12345) and ord("\U00012345"). (And so does Python 2, using unichr and unicode literals.)

Python 2.6 support for supplementary characters is not ideal:

    >>> unichr(0x2f81a)
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)
    >>> ord(u'\U0002F81A')
    TypeError: ord() expected a character, but string of length 2 found.

The \Uxxxxxxxx escape seems to be the only way to enter these characters. Python 3.0 is much better and passes the two tests above.
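For reference, here is a small sketch of the arithmetic behind the "string of length 2" message: on a narrow build, a code point above 0xFFFF is stored as two UTF-16 code units (a surrogate pair). The helper names below are made up for illustration and are not part of any Python API; the code is written for Python 3 and only uses integer arithmetic plus the standard "utf-16-be" codec as a cross-check.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+2F81A, the character used in the tests above, needs two code units.
print([hex(u) for u in to_surrogate_pair(0x2F81A)])   # ['0xd87e', '0xdc1a']
print(len(utf16_code_units("\U0002F81A")))            # 2
```

A length counted in UTF-16 code units is therefore 2 for this character, which is what the narrow-build ord() complains about.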

The unicodedata module gives good results in both versions:

    >>> unicodedata.name(u'\U0002F81A')
    'CJK COMPATIBILITY IDEOGRAPH-2F81A'
    [34311 refs]
    >>> unicodedata.category(u'\U0002F81A')
    'Lo'

With Python 3.0, I found only two places that refuse large code points on narrow builds: the "%c" format, and the 'C' format of Py_BuildValue(). They should be fixed.

> The one thing that may be missing from Python is things like interpretation of surrogates by functions like isalpha() and I'm okay with adding that (since those have to loop over the entire string anyway).

In this case, a new .isascii() method would be needed for some uses.
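To make the idea of a surrogate-aware isalpha() concrete, here is a rough sketch in Python 3. It is not how CPython implements anything; it models a narrow-build string as a list of UTF-16 code units (integers), and the names iter_code_points and isalpha_utf16 are invented for this example. A lone surrogate is passed through unchanged and is then rejected, because it is not alphabetic.

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; anything else is yielded as-is.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def isalpha_utf16(units):
    """Like str.isalpha(), but applied per code point, not per code unit."""
    code_points = list(iter_code_points(units))
    return bool(code_points) and all(chr(cp).isalpha() for cp in code_points)


# U+2F81A (category 'Lo') stored as a surrogate pair counts as alphabetic,
# while a lone high surrogate does not.
print(isalpha_utf16([0xD87E, 0xDC1A]))
print(isalpha_utf16([0xD87E]))
```

The loop already visits every code unit, so combining pairs adds little cost, which matches the remark that these methods have to scan the whole string anyway.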

-- Amaury Forgeot d'Arc


