[I18n-sig] Re: [Python-Dev] Unicode debate

Just van Rossum just@letterror.com
Wed, 3 May 2000 07:47:07 +0100


[MAL vs. PP]

> FYI: Normalization is needed to make comparing Unicode strings robust, e.g. u"é" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I think we need to get the bytes/characters level right and then we can worry about display-equivalent characters (or leave that to the Python programmer to figure out...). I just wanted to point out that the argument "slicing doesn't work with UTF-8" is moot.

And failed...

I asked two Unicode gurus I happen to know about the normalization issue (which is indeed not relevant to the current discussion, but it's fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" ("=E8") is u"\350", "ö" ("=F6") is u"\366")

John Jenkins replied: """ Well, I'm not sure you want to hear the answer -- but it really depends on what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated the same as "è". This is a semantic equivalence -- that is, they mean the same thing -- and doesn't depend on the display engine to be true. Unicode also provides a default collation algorithm (http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string comparison and collation are complicated, language-specific problems requiring a lot of work and interaction with the user to do right.

From the perspective of a programming language, it would best be served IMHO by implementing the contents of TR10 for string comparison and collation. That would make "e`" and "è" come out as equivalent. """

Dave Opstad replied: """ Unicode talks about "canonical decomposition" in order to make it easier to answer questions like yours. Specifically, in the Unicode 3.0 standard, rule D24 in section 3.6 (page 44) states that:

"Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. For example, the sequences <o, combining-diaeresis> and <ö> are canonical equivalents. Canonical equivalence is a Unicode property. It should not be confused with language-specific collation or matching, which may add additional equivalencies."

So they still have language-specific differences, even if Unicode sees them as canonically equivalent.

You might want to check this out:

http://www.unicode.org/unicode/reports/tr15/tr15-18.html

It's the latest technical report on these issues, which may help clarify things further. """
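[Editorial sketch, not part of the original exchange.] The rule quoted above can be tried out in present-day Python, assuming the standard library's unicodedata.normalize(), which was not available when this thread was written. The helper name canonically_equivalent is hypothetical and only illustrates "compare the full canonical decompositions"; it is not a proposal for how builtin comparison should behave.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b have identical
    full canonical decompositions (normalization form NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00f6"   # o-diaeresis as a single code point
decomposed = "o\u0308"   # "o" followed by COMBINING DIAERESIS

# The builtin comparison works code point by code point.
print(precomposed == decomposed)                        # False

# Comparing canonical decompositions treats them as equivalent.
print(canonically_equivalent(precomposed, decomposed))  # True
print(canonically_equivalent("\u00e8", "e\u0300"))      # True
```

This only covers canonical equivalence. Language-specific collation in the sense of TR10 is a separate problem that this sketch does not attempt.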

It's very deep stuff, which seems more appropriate for an extension than for builtin comparisons to me.

Just