[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 17:24:24 +0200


Just van Rossum wrote:

At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this example? I know similar behaviors are specified for certain languages/scripts, but I didn't know it did that for latin.

The details are on the www.unicode.org web-site, buried in some of the tech reports on normalization and collation.
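As a rough illustration of what those normalization reports describe (today published as Unicode Standard Annex #15), here is a minimal sketch in present-day Python 3. The `unicodedata` module and the variable names are assumptions of this sketch; they are not taken from the thread, which predates them.

```python
import unicodedata

decomposed = "e" + "\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE

# Compared code point by code point, the two strings are different ...
same_code_points = (decomposed == precomposed)

# ... but normalization maps each form onto the other.
nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose

# A non-zero combining class marks U+0301 as a combining character.
accent_class = unicodedata.combining("\u0301")
```

So the combination is defined for Latin letters as well: a renderer that honors it shows both forms as the same single character.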

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.

Huh? You will always get an exception when you convert a broken UTF-8 sequence to Unicode. This is by design of UTF-8 itself, which uses the top bit to identify multi-byte character encodings.

Or can you give an example (perhaps you've found a bug that needs fixing) ?
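The two positions can be compared side by side. The following is only a sketch in present-day Python 3 (an assumption; the thread concerns the Unicode support then under development), with illustrative variable names:

```python
combined = "e" + "\u0301"               # two code points, one displayed character
utf8_bytes = "\u00e9".encode("utf-8")   # b'\xc3\xa9': two bytes, one character

# Slicing the combined string between its code points raises nothing;
# the result is a bare "e" and the accent is silently lost.
sliced_text = combined[:1]

# Slicing the UTF-8 bytes mid-sequence raises nothing by itself either ...
sliced_bytes = utf8_bytes[:1]

# ... the error only appears once the truncated bytes are decoded.
try:
    sliced_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

Under these assumptions both statements hold in their own sense: the slice itself never raises, while converting the broken UTF-8 sequence to Unicode does.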

[ Speaking of exceptions,

after I sent off my previous post I realized Guido's non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception argument can easily be turned around, backfiring at utf-8:

Defaulting to utf-8 when going from Unicode to 8-bit and back only gives the illusion things "just work", since it will silently "work", even if utf-8 is not the desired 8-bit encoding -- as shown by Fredrik's excellent "fun with Unicode, part 1" example. Defaulting to Latin-1 will warn the user much earlier, since it'll barf when converting a Unicode string that contains any character code > 255. So there. ]

>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
>work.

Still, two combining characters are still two input characters for the renderer! They may result in one glyph, but trust me, that's an entirely different can of worms.
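The default-encoding argument in the bracketed aside above can be sketched as follows. This uses explicit `encode`/`decode` calls in present-day Python 3 rather than the implicit default conversion under debate, and the sample string is hypothetical:

```python
text = "caf\u00e9 \u20ac"   # contains U+20AC EURO SIGN, a code point > 255

# Latin-1 covers only code points 0..255, so the conversion fails loudly.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# UTF-8 can encode any code point, so the round trip always succeeds ...
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# ... but a receiver expecting Latin-1 misreads the same bytes without any error.
misread = utf8_bytes.decode("latin-1")
```

The Latin-1 failure surfaces at the point of conversion, whereas the UTF-8 mismatch only shows up later as wrong characters.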

No. Please see my other post on the subject...

However, if you were talking about Unicode surrogates, you'd definitely have a point. How do Java/Perl/Tcl deal with surrogates?

Good question... anybody know the answers ?

-- Marc-Andre Lemburg

Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
