[Python-Dev] UCS2/UCS4 default

"Martin v. Löwis" martin at v.loewis.de
Thu Jul 3 19:31:14 CEST 2008


Basically everything but string forming or string printing seems to be broken for surrogate pairs, from what I can tell.

We probably disagree about what "it works correctly" means. I think everything works correctly.

Also, I think you are confused about slicing in the middle of a surrogate pair. From a UTF-16 perspective this is 1 codepoint!

Yes, but it is two code units. Python's UTF-16 implementation operates on code units, not code points.

And as such Python needs to treat it as one character/codepoint in a string, dealing with slicing as appropriate.

It does. However, functions such as len, and all indexing, operate in code units, not code points.
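The difference between counting code units and counting code points can be sketched as follows. This is only an illustration: it assumes a current Python 3, where str is indexed by code point, so the UTF-16 view is simulated by encoding to "utf-16-le". The helper name utf16_code_units is hypothetical, and this is not how the interpreter stores strings internally.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    # Each code unit is 16 bits, so there are len(data) // 2 of them.
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
s = "a\U00010400b"
units = utf16_code_units(s)

print(len(s))                   # 3 code points
print(len(units))               # 4 code units
print([hex(u) for u in units])  # ['0x61', '0xd801', '0xdc00', '0x62']
```

A len or an index that works on code units sees four items here, and items 1 and 2 are the two halves of a single character.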

The way you currently describe it is that UTF-16 strings will be treated as UCS-2 when it comes to slicing and the like.

No. In UCS-2, the surrogate range is reserved (for UTF-16). In Python, it's not reserved, but interpreted as UTF-16.
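What "interpreted as UTF-16" means for a pair of code units from the surrogate range can be written out with the standard UTF-16 arithmetic. This is a sketch of the encoding form's rule, with hypothetical function names; it is not taken from the interpreter's source.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the code point above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def split_code_point(cp):
    """Map a code point above U+FFFF to its (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(hex(combine_surrogates(0xD801, 0xDC00)))     # 0x10400
print([hex(u) for u in split_code_point(0x10400)]) # ['0xd801', '0xdc00']
```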

From a UTF-16 point of view such slicing can NEVER occur unless you are bit or byte slicing instead of character/codepoint slicing.

It most certainly can. UTF-16 is not a character set, but a character encoding form (unlike UCS-2, which is a coded character set). Slicing can occur at the code unit level. When UTF-16 is understood as a character encoding scheme (by means of the BOM), slicing can occur even on the byte level.
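Both kinds of slicing can be shown on the encoded bytes. This is a small sketch assuming a current Python 3 and the "utf-16-le" codec (chosen so that no BOM is involved); the helper decodes_cleanly is a hypothetical name used only for this illustration.

```python
def decodes_cleanly(data):
    """Return True if data is well-formed UTF-16-LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

data = "\U00010400".encode("utf-16-le")  # one code point, 2 code units, 4 bytes

unit_slice = data[:2]  # cut between the two code units of the pair
byte_slice = data[:3]  # cut in the middle of the second code unit

# The code-unit slice holds a lone high surrogate.
print(hex(int.from_bytes(unit_slice, "little")))  # 0xd801

print(decodes_cleanly(data))        # True
print(decodes_cleanly(unit_slice))  # False
print(decodes_cleanly(byte_slice))  # False
```

Neither slice is well-formed UTF-16 any more, but both are perfectly possible operations at the code unit and byte level.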

I think it can be fairly said that an item in a string is a character or codepoint.

Not in Python - it's a code unit.

Regards, Martin


