[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 12:48:13 CEST 2008

Previous message: [Python-Dev] UCS2/UCS4 default
Next message: [Python-Dev] UCS2/UCS4 default
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

My apologies for hammering on this, but I think it is quite important and currently Python 3.0 seems confused about UCS-2 versus UTF-16.

-On [20080702 20:47], Guido van Rossum (guido at python.org) wrote:

> No, Python already is aware of surrogates. I meant applications processing non-BMP text should beware of them.

Just to make sure people are fully aware of the distinctions:

UCS-2 uses 16 bits to encode Unicode data, does NOT support surrogate pairs and therefore CANNOT represent data beyond U+FFFF (thus only supporting the Basic Multilingual Plane, BMP). It is a fixed-length character encoding.

UTF-16 also uses 16 bits to encode Unicode data, but DOES support surrogate pairs and therefore CAN represent data beyond U+FFFF by using said surrogate pairs (thus supporting all planes). It is a variable-length character encoding.

So a string representation in UCS-2 means every character occupies 16 bits. A string representation in UTF-16 means a character can occupy either 16 bits or 32 bits.

If one stays within the BMP then all is well, but when you move beyond the BMP (U+10000 - U+10FFFF) Python needs to correctly check the string for surrogate pairs and deal with them internally.
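To make the surrogate mechanism concrete, here is a minimal sketch of how UTF-16 splits a code point beyond the BMP into two 16-bit code units. The function names are made up for illustration and are not part of any Python API; the arithmetic itself is the standard UTF-16 encoding rule.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not beyond the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_units_for(code_point):
    """Return the list of 16-bit code units UTF-16 uses for one code point.

    Sketch only: it does not reject lone surrogate values (U+D800 - U+DFFF).
    """
    if code_point < 0x10000:
        return [code_point]            # BMP: one unit, same as UCS-2
    return list(to_surrogate_pair(code_point))


if __name__ == "__main__":
    print([hex(u) for u in utf16_units_for(0x20045)])
    print([hex(u) for u in utf16_units_for(0x942A)])
```

A BMP character yields one code unit and anything above U+FFFF yields two, which is exactly the variable-length property that UCS-2 lacks.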

> If you find places where the Python core or standard library is doing Unicode processing that would break when surrogates are present you should file a bug. However this does not mean that every bit of code that slices a string at an arbitrary point (and hence risks slicing in the middle of a surrogate) is incorrect -- it all depends on what is done next with the slice.

Basically everything but string forming or string printing seems to be broken for surrogate pairs, from what I can tell. Also, I think you are confused about slicing in the middle of a surrogate pair: from a UTF-16 perspective a surrogate pair is 1 codepoint! As such, Python needs to treat it as one character/codepoint in a string and deal with slicing as appropriate. The way you currently describe it, UTF-16 strings will be treated as UCS-2 when it comes to slicing and the like.

From a UTF-16 point of view such slicing can NEVER occur unless you are bit or byte slicing instead of character/codepoint slicing.

The documentation for len() says: Return the length (the number of items) of an object.

I think it can be fairly said that an item in a string is a character or codepoint. Take for example the following string:

>>> a = '\U00020045\u942a'  # Two hanzi/kanji/hanja

From a Unicode perspective we are looking at two characters/codepoints. When we use a 4-byte Python 3.0 binary we get (as expected):

>>> len(a)
2

When we use a 2-byte Python 3.0 binary (the default) we get (not as expected):

>>> len(a)
3

From a UTF-16 perspective a surrogate pair is one character/codepoint and as such len() should have reported 2 as well. That the sequence is stored internally as 0xd840 0xdc45 0x942a and occupies three 16-bit code units (6 bytes) is not interesting. But it seems as if len() is treating the string as being in UCS-2 (fixed-length), which is the only logical explanation for the number 3, instead of treating it as UTF-16 (variable-length) and reporting the number 2.
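The difference between the two counts can be sketched as follows. This is an illustration under my own assumptions, not how len() is implemented: the helper names are hypothetical, and the string is encoded to UTF-16 explicitly so the code units are visible regardless of how the interpreter stores strings.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")    # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_code_points(units):
    """Count code points, treating each valid surrogate pair as one item."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1       # a pair consumes two code units
        count += 1
    return count


if __name__ == "__main__":
    a = '\U00020045\u942a'
    units = utf16_code_units(a)
    print([hex(u) for u in units])     # the UCS-2 style view
    print(len(units), count_code_points(units))
```

Counting code units gives the UCS-2 answer, while counting with surrogate pairs collapsed gives the answer a UTF-16 aware len() would report.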

Subsequently, getting U+942A (鐪) with print(a[1]) actually requires a[2] on the 2-byte Python 3.0. As such, the code you write for 2-byte and 4-byte Python 3.0 differs when you have to deal with the same Unicode strings! This cannot be the desired situation, can it?
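As a rough sketch of what codepoint-based indexing over UTF-16 storage would involve, the following walks a list of 16-bit code units and returns the n-th code point. The function names and the list-of-integers representation are assumptions made for illustration; this is not how Python stores or indexes strings.

```python
def iter_code_points(units):
    """Yield code points from 16-bit code units, joining surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        if (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        ):
            # Recombine the two 10-bit halves and add back the 0x10000 offset.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit                 # BMP unit (or a lone surrogate)
            i += 1


def code_point_at(units, index):
    """Return the code point at a codepoint index (linear scan)."""
    for position, code_point in enumerate(iter_code_points(units)):
        if position == index:
            return code_point
    raise IndexError("code point index out of range")


if __name__ == "__main__":
    units = [0xD840, 0xDC45, 0x942A]   # the string a from the example
    print(hex(units[1]))               # code unit index 1: a low surrogate
    print(hex(code_point_at(units, 1)))
```

The scan is linear in the index, which is the price of codepoint indexing on a variable-length encoding.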

Two more examples:

>>> a.find('鐪')  # 4-byte
1
>>> a.find('鐪')  # 2-byte
2

>>> import re  # 4-byte
>>> m = re.search('鐪', a)
>>> m.start()
1
>>> import re  # 2-byte
>>> m = re.search('鐪', a)
>>> m.start()
2
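The offsets reported by find() and re on the 2-byte build are code unit offsets. A minimal sketch of translating such an offset into a codepoint offset could look like this; the function name is hypothetical and the string is again modelled as a list of 16-bit code units.

```python
def code_point_offset(units, unit_offset):
    """Convert a UTF-16 code unit offset into a codepoint offset.

    Raises ValueError if unit_offset falls inside a surrogate pair,
    since no codepoint boundary exists there.
    """
    if not 0 <= unit_offset <= len(units):
        raise IndexError("code unit offset out of range")
    offset = 0
    i = 0
    while i < unit_offset:
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        offset += 1
    if i != unit_offset:
        raise ValueError("offset splits a surrogate pair")
    return offset


if __name__ == "__main__":
    units = [0xD840, 0xDC45, 0x942A]   # the string a from the example
    print(code_point_offset(units, 2)) # the 2-byte result mapped back
```

Applied to the value 2 reported by the 2-byte build, this gives the same offset the 4-byte build reports directly.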

This, in my opinion, has nothing to do with the application writers, but more with Python's internals being confused about UCS-2 and UTF-16. We accept full 32-bit codepoints with the \U escape in strings, and we may even store it as UTF-16 internally, but we clearly do not deal with it properly as UTF-16, but rather as UCS-2, when it comes to using said strings with core functions and modules.

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーンラウフロックヴァンデルウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
For wouldst thou not carve at my Soul with thine sword of Supreme Truth?
