[Python-Dev] UCS2/UCS4 default
Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 18:51:40 CEST 2008
-On [20080703 17:03], Guido van Rossum (guido at python.org) wrote:
> I don't see an answer there to the question of whether the length() method of a Java String object containing a single surrogate pair returns 1 or 2; I suspect it returns 2.
As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/CharSequence.html#length() states:
int length()
Returns the length of this character sequence. The length is the number of 16-bit chars in the sequence.
But when Java switched to full UTF-16 support in 1.5.0, they extended their API with new methods, since the existing ones had probably become too ingrained.
E.g. codePointCount() http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#codePointCount(char[],%20int,%20int)
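For comparison, the same distinction can be sketched in Python. This is only an illustration, assuming Python 3.3 or later, where len() on a str counts code points regardless of build; the helper names utf16_length and code_point_length are made up for this example and are not part of any library.

```python
def utf16_length(text: str) -> int:
    """Count 16-bit code units, the quantity Java's length() reports.

    Assumes well-formed text: a lone surrogate would make encode() raise.
    """
    # utf-16-le emits no byte-order mark, so every 2 bytes is one code unit.
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text: str) -> int:
    """Count code points, the quantity Java's codePointCount() reports."""
    # On Python 3.3+ a str is a sequence of code points.
    return len(text)


# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
deseret = "\U00010400"
print(utf16_length(deseret))       # 2
print(code_point_length(deseret))  # 1
```

So a single non-BMP character gives 2 under the length() definition and 1 under the codePointCount() definition.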
> The one thing that may be missing from Python is things like interpretation of surrogates by functions like isalpha() and I'm okay with adding that (since those have to loop over the entire string anyway).
Those would be welcome already, yes. I'll see if I can help out.
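A rough sketch of what surrogate-aware behaviour could look like, written against Python 3 purely for illustration. The names combine_surrogates and isalpha_surrogate_aware are hypothetical, and this is not how CPython implements or would implement it; it only shows the idea of pairing up surrogates in the same loop that classifies the characters.

```python
def combine_surrogates(text: str) -> str:
    """Merge each high/low surrogate pair into the code point it encodes."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        hi = ord(text[i])
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= hi <= 0xDBFF and i + 1 < n:
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        # Anything else, including a lone surrogate, is kept unchanged.
        out.append(text[i])
        i += 1
    return "".join(out)


def isalpha_surrogate_aware(text: str) -> bool:
    """isalpha() that treats a surrogate pair as a single character."""
    return combine_surrogates(text).isalpha()


# U+10400 written as two separate surrogate code units.
pair = "\ud801\udc00"
print(pair.isalpha())                 # False
print(isalpha_surrogate_aware(pair))  # True
```

Lone surrogates are deliberately left alone here, so a string containing one still fails isalpha().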
--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Fallen into ever-mourn, with these wings so torn, after your day my dawn...