[Python-Dev] UCS2/UCS4 default
Steve Holden steve at holdenweb.com
Thu Jul 3 18:35:29 CEST 2008
Paul Moore wrote:
On 03/07/2008, Guido van Rossum <guido at python.org> wrote:
I don't see an answer there to the question of whether the length() method of a Java String object containing a single surrogate pair returns 1 or 2; I suspect it returns 2.

It appears you're right:

    type testucs.java
    class testucs {
        public static void main(String[] args) {
            StringBuilder s = new StringBuilder("Hello, ");
            s.appendCodePoint(0x2F81A);
            System.out.println(s); // Display the string.
            System.out.println(s.length());
        }
    }

    java testucs
    Hello, ?
    9

    java -version
    java version "1.6.0_05"
    Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
    Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)

Python 3 supports things like chr(0x12345) and ord("\U00012345"). (And so does Python 2, using unichr and unicode literals.)

And Java doesn't appear to - that appendCodePoint() method was wonderfully hard to find :-)

There's also the issue of indexing the Unicode strings. If we are going to insist that len(u) counts surrogate pairs as one character then random access to the characters of a string is going to be an extremely inefficient operation.
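
To make the indexing cost concrete, here is a minimal sketch in Python 3 of what "count a surrogate pair as one character" implies when the storage is UTF-16 code units. The helper names utf16_units and codepoint_at are hypothetical and written only for this illustration; they are not part of Python or of any proposal in this thread. Finding the n-th code point means scanning from the start of the string, because every earlier unit may or may not be half of a pair.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def codepoint_at(units, index):
    """Return the code point at code-point position `index`.

    This is a linear scan: each earlier unit must be inspected to see
    whether it starts a surrogate pair (hypothetical helper).
    """
    pos = 0  # position counted in code points
    i = 0    # position counted in code units
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = unit
            step = 1
        if pos == index:
            return cp
        pos += 1
        i += step
    raise IndexError(index)

text = "Hello, " + chr(0x2F81A)
units = utf16_units(text)
print(len(units))                   # 9 code units, matching the Java output
print(hex(codepoint_at(units, 7)))  # 0x2f81a
```

An implementation could cache offsets to avoid rescanning, but that is extra machinery that a fixed-width representation does not need.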
Surely it's desirable under all circumstances that
len(u) == sum(1 for c in u)
and that
[c for c in u] == [u[i] for i in range(len(u))]
How would that play under Jeroen's proposed change?
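
For anyone who wants to try the two invariants, a small check in Python 3 might look like the sketch below. The function name check_invariants is made up for this example. Note that both invariants also hold on a build that counts UTF-16 code units, provided iteration, len() and indexing all use the same unit; they only break if len() counts code points while indexing stays per code unit, or the other way round.

```python
def check_invariants(u):
    """Return True if len(), iteration and indexing agree for u (hypothetical helper)."""
    same_length = len(u) == sum(1 for c in u)
    same_items = [c for c in u] == [u[i] for i in range(len(u))]
    return same_length and same_items

u = "Hello, \U0002F81A"
print(check_invariants(u))
print(len(u))  # 8 on CPython 3.3 and later, where str is indexed by code point
```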
regards Steve
Steve Holden +1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/