[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 11:56:21 +0200


Fredrik Lundh wrote:

M.-A. Lemburg <mal@lemburg.com> wrote:
> Just a small note on the subject of a character being atomic
> which seems to have been forgotten by the discussing parties:
>
> Unicode itself can be understood as multi-word character
> encoding, just like UTF-8. The reason is that Unicode entities
> can be combined to produce single display characters (e.g.
> u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> Slicing such a combined Unicode string will have the same
> effect as slicing UTF-8 data.

really? does it result in a decoder error? or does it just result in a rendering error, just as if you slice off any trailing character without looking...

In the example, if you cut off the u"\u0301", the "e" would appear without the acute accent. Cutting off the u"e" would probably result in a rendering error or, worse, put the accent over the next character to the left.

UTF-8 is better in this respect: it warns you about the error by raising an exception when the truncated data is converted to Unicode.
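
To make the difference concrete, here is a minimal sketch of both kinds of slicing. It assumes a current Python 3 interpreter (where str is Unicode and bytes holds the UTF-8 data), not the 1.6 API under discussion; the variable names are only illustrative.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one display character.
combined = "e\u0301"

# Slicing between the two code points succeeds silently.
base = combined[:1]   # plain 'e', the accent is lost
mark = combined[1:]   # a lone combining mark with nothing to attach to
mark_name = unicodedata.name(mark)

# The precomposed character U+00E9 takes two bytes in UTF-8.
utf8_data = "\u00e9".encode("utf-8")   # b'\xc3\xa9'

# Slicing inside the byte sequence is detected when decoding.
try:
    utf8_data[:1].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The Unicode slice yields two valid strings and no error, so any damage only shows up at rendering time, whereas the truncated UTF-8 data is rejected by the decoder.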

> It seems that most Latin-1 proponents seem to have single
> display characters in mind. While the same is true for
> many Unicode entities, there are quite a few cases of
> combining characters in Unicode 3.0 and the Unicode
> normalization algorithm uses these as basis for its
> work.

do we support automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about the "automatic" though.

FYI: Normalization is needed to make comparing Unicode strings robust, e.g. u"é" should compare equal to u"e\u0301".
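
As a rough illustration of why normalization matters for comparison, the sketch below uses unicodedata.normalize() from the standard library of later Python releases. That function is an assumption relative to this thread: it is not part of 1.6, and nothing here is applied automatically.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT, two code points

# Without normalization the comparison is done code point by code point.
raw_equal = composed == decomposed

# NFC composes to the precomposed form; NFD decomposes to base + mark.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
```

Here raw_equal is False while nfc_equal and nfd_equal are both True, so either form works for comparison as long as both operands are normalized the same way.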

-- Marc-Andre Lemburg


Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/