[Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg mal at egenix.com
Mon Oct 24 10:40:28 CEST 2005

Neil Hodgson wrote:

Guido van Rossum:

Folks, please focus on what Python 3000 should do. I'm thinking about making all character strings Unicode (possibly with different internal representations a la NSString in Apple's Objective C) and introducing a separate mutable bytes array data type. But I could use some validation or feedback on this idea from actual practitioners.

I'd like to more tightly define Unicode strings for Python 3000. Currently, Unicode strings may be implemented with either 2-byte (UCS-2) or 4-byte (UTF-32) elements. Python should allow strings to contain any Unicode character and should be indexable, yielding characters rather than half characters. Therefore Python strings should appear to be UTF-32. There could still be multiple implementations (using UTF-16 or UTF-8) to save space, but all implementations should appear to be the same apart from speed and memory use.
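To make the "half characters" concern concrete, here is a minimal sketch. It assumes a present-day Python 3 interpreter (not the Python of this thread), and the example character is chosen purely for illustration. It counts how many storage elements one non-BMP character needs in UTF-16 versus UTF-32:

```python
# A character outside the Basic Multilingual Plane:
# U+1D11E MUSICAL SYMBOL G CLEF (illustrative choice).
clef = "\U0001D11E"

# UTF-16 needs a surrogate pair, i.e. two 16-bit elements.
utf16_units = len(clef.encode("utf-16-le")) // 2

# UTF-32 needs exactly one 32-bit element.
utf32_units = len(clef.encode("utf-32-le")) // 4

print(utf16_units, utf32_units)
```

With 2-byte elements, indexing by element can land on one half of that surrogate pair, which is the case the paragraph above wants to rule out.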

There seems to be a general misunderstanding here: even if you have UCS4 storage, it is still possible to slice a Unicode string in a way that makes it impossible to render correctly.

Unicode has the concept of combining code points, e.g. you can store an "é" (e with an acute accent) as "e" followed by a combining acute accent (U+0301). Now if you slice off the accent, you'll break the character that you encoded using combining code points.
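A small sketch of that breakage, assuming a present-day Python 3 interpreter and using only the standard unicodedata module (the variable names are just for illustration):

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"

# The same character as a single precomposed code point, U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)

# Slicing by code point keeps the base letter and silently drops the accent.
sliced = decomposed[:1]

print(len(decomposed), len(precomposed), repr(sliced))
```

The slice is a perfectly valid string of code points, yet it no longer represents the character that was stored, regardless of how wide the internal storage is.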

Note that combining code points are rather common in encodings of Asian scripts, so this is not an artificial example.

Some time ago I proposed a new module called unicodeindex to help with indexing. It would solve most of the indexing issues you run into when dealing with Unicode. I've attached it to this email for reference.
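As a rough illustration of the kind of indexing help meant here, the sketch below groups each base code point with the combining marks that follow it. It is not the unicodeindex API: the function name combining_clusters is hypothetical, the code assumes present-day Python 3, and it relies only on unicodedata.combining(). That makes it a simplification, since it ignores marks with combining class 0 and the fuller segmentation rules a real implementation would need.

```python
import unicodedata


def combining_clusters(text):
    """Split text into runs of one base code point plus its trailing combining marks.

    Hypothetical helper for illustration only.
    """
    clusters = []
    for ch in text:
        # A non-zero combining class means ch attaches to the preceding code point.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301s"  # "cafés" with a decomposed é
clusters = combining_clusters(word)

# Slicing the cluster list, not the raw string, keeps the accent with its base letter.
first_four = "".join(clusters[:4])
print(len(word), len(clusters), repr(first_four))
```

Indexing into the list of clusters instead of the raw string means a slice boundary can never fall between a base letter and its accent, at the cost of an extra pass over the text.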