[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 16:46:48 CEST 2008


On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:

> You seem to be suggesting that len(u"\U00012345") should return 1 on a system that internally uses UTF-16 and hence represents this string as a surrogate pair.

From a Unicode and UTF-16 point of view that makes the most sense. So yes, I am suggesting that.
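To make the two possible answers concrete, here is a minimal sketch in Python 3 (an assumption; the thread predates it) that counts both the UTF-16 code units and the code points of the same string. The helper names utf16_units and count_code_points are hypothetical, and the counting trick assumes well-formed UTF-16 input.

```python
import struct

def utf16_units(s):
    """Return the 16-bit code units of s when encoded as UTF-16."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def count_code_points(units):
    """Count code points in well-formed UTF-16 by skipping low surrogates."""
    # Every code point contributes exactly one unit that is NOT a low
    # (trailing) surrogate, so counting those units counts code points.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

units = utf16_units("\U00012345")
print(len(units))                 # number of code units (the surrogate pair)
print(count_code_points(units))   # number of code points
```

The first number is what a length based on UTF-16 storage reports; the second is the length argued for above.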

> This is not going to happen. You may as well complain to the authors of the Java standard about the corresponding problem there.

Why would I need to complain to them? They already fixed it since 1.5.0.

Java 1.5.0's release notes (http://java.sun.com/developer/technicalArticles/releases/j2se15/):

Supplementary Character Support

32-bit supplementary character support has been carefully added to the platform as part of the transition to Unicode 4.0 support. Supplementary characters are encoded as a special pair of UTF16 values to generate a different character, or codepoint. A surrogate pair is a combination of a high UTF16 value and a following low UTF16 value. The high and low values are from a special range of UTF16 values.

In general, when using a String or sequence of characters, the core API libraries will transparently handle the new supplementary characters for you.
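The "special pair" in the quoted release notes follows the standard UTF-16 arithmetic. A short sketch of that mapping, using hypothetical function names and written in Python rather than Java for illustration only:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

The high value always falls in 0xD800..0xDBFF and the low value in 0xDC00..0xDFFF, which is the "special range" the notes refer to.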

See also http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
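For comparison, a rough Python analogue of an int-accepting isLetter check, assuming Python 3 and treating every Unicode general category starting with "L" as a letter. The function name is_letter is hypothetical and this is only an approximation of the Java method, not its implementation.

```python
import unicodedata

def is_letter(code_point):
    """Return True if the code point's general category is a letter (L*)."""
    # chr() accepts any code point up to 0x10FFFF in Python 3, so
    # supplementary characters need no surrogate handling here.
    return unicodedata.category(chr(code_point)).startswith("L")

print(is_letter(0x2F81A))   # the CJK compatibility ideograph from the Java docs
print(is_letter(ord("1")))  # a digit, for contrast
```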

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Life can only be understood backwards, but it must be lived forwards...


