[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 10:36:43 +0200


Just a small note on the subject of a character being atomic, which seems to have been forgotten by the parties to this discussion:

Unicode itself can be understood as a multi-word character encoding, just like UTF-8. The reason is that Unicode entities can be combined to produce single display characters (e.g. u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). Slicing such a combined Unicode string will have the same effect as slicing UTF-8 data.
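To make the parallel concrete, here is a small sketch (written for a current Python 3 interpreter, which is an assumption on my part; the variable names are only illustrative). It slices the combined string from the example above and, for comparison, slices the UTF-8 bytes of the precomposed character U+00E9:

```python
# "e" followed by U+0301 COMBINING ACUTE ACCENT: one display character,
# but two Unicode entities.
combined = u"e" + u"\u0301"
entity_count = len(combined)

# Slicing after the first entity silently drops the accent.
head = combined[:1]

# The same kind of damage at the byte level: U+00E9 takes two bytes in UTF-8.
utf8_bytes = u"\u00e9".encode("utf-8")
truncated = utf8_bytes[:1]

try:
    truncated.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # A lone lead byte is not valid UTF-8.
    decode_failed = True

print(entity_count, repr(head), utf8_bytes, decode_failed)
```

One difference is worth noting: the truncated UTF-8 data fails loudly on decoding, while the sliced Unicode string is still a valid string that simply shows the wrong character.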

Most Latin-1 proponents seem to have single display characters in mind. While the same is true for many Unicode entities, there are quite a few cases of combining characters in Unicode 3.0, and the Unicode normalization algorithm uses these as the basis for its work.
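As a rough illustration of how normalization relates to combining characters, the following sketch uses the `unicodedata` module from the standard library of a current Python 3 interpreter (an assumption; the variable names are only illustrative). It composes and decomposes the "e" plus U+0301 pair:

```python
import unicodedata

decomposed = u"e" + u"\u0301"   # base letter + combining acute accent
precomposed = u"\u00e9"         # LATIN SMALL LETTER E WITH ACUTE

# NFC composes a base letter and its combining mark where a
# precomposed character exists.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD does the reverse: it decomposes a precomposed character.
nfd = unicodedata.normalize("NFD", precomposed)

# The two spellings compare unequal until they are normalized the same way.
equal_raw = decomposed == precomposed
equal_nfc = nfc == unicodedata.normalize("NFC", precomposed)

print(len(nfc), len(nfd), equal_raw, equal_nfc)
```

Normalizing to a composed form shortens many such sequences to a single entity, but it does not remove combining characters in general, since not every base-plus-mark combination has a precomposed equivalent.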

So in the end the "UTF-8 doesn't slice" argument holds for Unicode itself too, just as it does for many Asian multi-byte variable-length character encodings, image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work" without some knowledge about the data you are slicing.
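For the combining-character case, "some knowledge about the data" could look roughly like the sketch below. The helper `split_clusters` is hypothetical and deliberately simplified: it only attaches characters with a non-zero combining class to the preceding base character, which is much less than a full treatment of display characters would require. It assumes a current Python 3 interpreter and its standard `unicodedata` module.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only; it looks at nothing but
    the canonical combining class of each character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the previous cluster.
            clusters[-1] += ch
        else:
            # Base character (or a mark with nothing before it): new cluster.
            clusters.append(ch)
    return clusters

word = u"cafe\u0301s"            # displays as five characters
naive = word[:4]                 # cuts between "e" and its accent
clusters = split_clusters(word)
aware = u"".join(clusters[:4])   # keeps "e" and its accent together

print(repr(naive), repr(aware))
```

The naive slice counts entities and loses the accent; the cluster-based slice counts display characters, at the price of a pass over the string.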

-- Marc-Andre Lemburg


Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/