Issue 5200: unicode.normalize gives wrong result for some characters


Created on 2009-02-10 10:45 by PeterL, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
unnamed PeterL,2009-02-10 20:03
unnamed PeterL,2009-02-10 20:50
unnamed PeterL,2009-02-11 08:24
unnamed PeterL,2009-02-11 19:26
Messages (10)
msg81536 - Author: Peter Landgren (PeterL) Date: 2009-02-10 10:45
If any of the Swedish characters "åäöÅÄÖ" are input to unicodedata.normalize(form, ustr) with form = "NFD" or "NFKD", the result will be "aaoAAO". "åäöÅÄÖ" are normal characters and should be the same after normalization. They are not connected to aaoAAO other than for historical reasons, and not in the modern languages. It's a common misinterpretation that the dots and circle above them are diacritical signs, but those letters should behave like the (Danish) "Ø", which is normalized correctly.

From Wikipedia: Å is often perceived as an A with a ring, interpreting the ring as a diacritical mark. However, in the languages that use it, the ring is not considered a diacritic but part of the letter. The letter Ö in the Swedish and Icelandic alphabets historically arises from the Germanic umlaut, but it is considered a separate letter from O.

See http://en.wikipedia.org/wiki/%C3%85

I think this is probably impossible to solve, as it will be mixed up with "umlaut" and you don't know what language the specific word is connected to.
msg81580 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 18:59
It is not true that normalize produces "aaoAAO". Instead, it produces

    u'a\u030aa\u0308o\u0308A\u030aA\u0308O\u0308'

This is the correct result, according to the Unicode specification. It would be incorrect to leave them unchanged under Unicode Normalization Form D (for decomposed); the decomposed form of 'LATIN SMALL LETTER A WITH RING ABOVE' (for example) is 'LATIN SMALL LETTER A' + 'COMBINING RING ABOVE'.

The Wikipedia article is irrelevant; refer to the Unicode specification for a normative reference.

Closing as invalid.
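
The result quoted above can be checked with a minimal sketch. It assumes Python 3 (the thread itself uses Python 2 `u''` literals), and the variable names are only illustrative. The input is written with escapes so that it is certain to be in precomposed form.

```python
import unicodedata

# "åäöÅÄÖ" as precomposed code points.
swedish = "\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"

# NFD splits each precomposed letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", swedish)
print(ascii(decomposed))
print(len(swedish), len(decomposed))

# Name the first two code points of the decomposed string.
for ch in decomposed[:2]:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# NFC recomposes the sequence, so no information is lost.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == swedish)
```

A terminal usually renders the decomposed string identically to the original, so comparing lengths or printing with `ascii()` is the easiest way to see the difference.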
msg81595 - Author: Peter Landgren (PeterL) Date: 2009-02-10 20:03
Thanks for the fast response. I understand that Python follows the Unicode specification. I think the Unicode standard is not correct in this case for the Swedish letters. I have asked unicode.org for an explanation.

Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD normalizations?

Regards,
Peter Landgren
msg81596 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 20:15
> Should not the Danish letter "Ø" be normalized as "O"? I get "Ø" for all NFC/NFD/NFKC/NFKD
> normalizations?

I think you have a fundamental misunderstanding of what a "decomposition" is. "Ø" should *not* be decomposed as "O", because clearly, "Ø" and "O" are different letters. If anything, it would be decomposed as "O" plus some combining mark.

Now, in the specific case of

    00D8;LATIN CAPITAL LETTER O WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER O SLASH;;;00F8;

no canonical decomposition is specified. Compare this to

    00D5;LATIN CAPITAL LETTER O WITH TILDE;Lu;0;L;004F 0303;;;;N;LATIN CAPITAL LETTER O TILDE;;;00F5;

which decomposes to U+004F followed by U+0303, i.e. LATIN CAPITAL LETTER O followed by COMBINING TILDE.

If "Ø" were to be decomposed, it should use a mark COMBINING STROKE, but no such combining mark exists in Unicode. I don't know why that is; you would have to ask the Unicode consortium. In any case, Unicode guarantees stability with respect to decompositions, so even if some combining mark gets added later on, the existing decompositions remain stable.
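
The two UnicodeData entries above can be inspected from Python. This is a minimal sketch, assuming Python 3; `no_mapping` and `tilde_mapping` are illustrative names. `unicodedata.decomposition` returns the decomposition field as a string of hex code points, or an empty string when the character has none.

```python
import unicodedata

# U+00D8 (Ø) has no decomposition mapping; U+00D5 (Õ) maps to O + COMBINING TILDE.
no_mapping = unicodedata.decomposition("\u00d8")
tilde_mapping = unicodedata.decomposition("\u00d5")
print(repr(no_mapping))
print(repr(tilde_mapping))

# Without a mapping, every normalization form leaves Ø unchanged.
unchanged = all(
    unicodedata.normalize(form, "\u00d8") == "\u00d8"
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(unchanged)
```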
msg81598 - Author: Peter Landgren (PeterL) Date: 2009-02-10 20:50
The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O", which are also different letters, just as "Ø" and "O" are. ("Ø" is the Danish version of "Ö".) Maybe not in the Unicode world, but in real life. That's why I'm a little confused. I will wait and see what, if anything, the Unicode people say.

In any case, thanks for the discussion.

Regards,
/Peter
msg81603 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-10 21:32
> The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> which are also different letters, just as "Ø" and "O" are.

Sure. And rightfully, "Å" is *not* (I repeat: not) normalized as "A" under NFD:

    py> unicodedata.normalize("NFD", u"Å")
    u'A\u030a'

> Maybe not in the Unicode world, but in real life.

They are different letters also in the Unicode world.

> That's why I'm a little confused.

I think the confusion comes from your assumption that normalizing "Å" produces "A". It does not. Really not.
msg81632 - Author: Peter Landgren (PeterL) Date: 2009-02-11 08:24
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized as "A" under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world, but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.

Yes, you are right. However, the confusion/problem shows up when it is used in the application to build an alphabet and group, for example, all versions of E, É, È, Ë, Ê together under E. The first character in the result of normalize is used to build alphabet labels for surnames:

    letter = normalize("NFD", surname)[0].upper()
    if letter != last_letter:
        last_letter = letter
        ....

and this is why I get "A" when the surname begins with "Å". This way it works for all variations of E to be grouped under "E", but fails as "Å" is shown under the label "A", not the "A" at the beginning of the alphabet but after "Z", where "ÅÄÖ" come. So the previous sorting of the surnames works correctly. (The Swedish alphabet has 29 letters: A,B,C... X,Y,Z,Å,Ä,Ö)

Can you think of any solution to this conflict?

    u'\xd8'
    u'A\u030a'
    u'\xc5'

This is obviously the result of how the Unicode spec is written, interpreting "Å" as a variation of "A", which it is not. I have asked the Unicode people, but have not got any answer yet.

The application is GRAMPS: http://gramps-project.org/

Once again, thanks for making some of the Unicode stuff clear!

Regards,
Peter Landgren
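
The labelling behaviour described in this message can be reproduced with a short sketch. It assumes Python 3; the function name `first_letter_label` and the sample surnames are hypothetical and are not taken from GRAMPS.

```python
import unicodedata

def first_letter_label(surname):
    """First code point of the NFD form, uppercased, as in the snippet above."""
    return unicodedata.normalize("NFD", surname)[0].upper()

# Hypothetical surnames: Edberg, Émond, Åberg, Østby.
surnames = ["Edberg", "\u00c9mond", "\u00c5berg", "\u00d8stby"]
labels = {name: first_letter_label(name) for name in surnames}
for name, label in labels.items():
    print(label, name)
```

The accented E and the Å both lose their combining mark because only the first code point is kept, while Ø survives because it has no decomposition.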
msg81654 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-11 18:32
> Can you think of any solution to this conflict?

I don't quite understand why you want to place É, È, Ë, Ê all along with E, yet Å,Ä,Ö after Z. Because that's what the Swedish alphabet says?

Please understand that collation varies across languages. For example in German, we also have Ä, but it does *not* come after Z. Instead, there are two ways to collate Ä (telephone book vs. dictionary):

1. Ä sorts exactly like A
2. Ä sorts as if it was transcribed as Ae

So there is no one true collation of Ä, but you have to take into account what language rules you want to follow. If you want to implement Swedish rules, why then do you also want to support É, È, Ë, Ê? Do you have these letters in Swedish at all?

If you want to use obscure collation rules, you might have to implement the collation algorithm yourself. For example, assign each letter a unique number (different from the Unicode ordinal), and then sort by these numbers.

Take a look at ICU, which already includes collation algorithms for many locales.
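
The suggestion to assign each letter its own number and sort by those numbers could be sketched as follows. This is only an illustration, assuming Python 3 and the 29-letter alphabet given earlier in the thread; `SWEDISH_ALPHABET`, `RANK`, `swedish_key`, and the sample names are hypothetical, and real Swedish collation has more rules than this.

```python
import unicodedata

# A..Z followed by Å, Ä, Ö, as listed earlier in the thread.
SWEDISH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c5\u00c4\u00d6"
RANK = {letter: i for i, letter in enumerate(SWEDISH_ALPHABET)}

def swedish_key(name):
    """Map a name to a list of letter ranks (assumed rules, not a full collation)."""
    key = []
    for ch in unicodedata.normalize("NFC", name).upper():
        if ch not in RANK:
            # Fold other accented letters (É, È, ...) onto their base letter.
            ch = unicodedata.normalize("NFD", ch)[0]
        # Anything still unknown sorts after the alphabet (a simplification).
        key.append(RANK.get(ch, len(RANK)))
    return key

# Hypothetical surnames: Öman, Zetterberg, Åberg, Émond, Andersson, Ek.
names = ["\u00d6man", "Zetterberg", "\u00c5berg", "\u00c9mond", "Andersson", "Ek"]
ordered = sorted(names, key=swedish_key)
print(ordered)
```

Because the key is built per language, a different table would be needed for German or Turkish rules, which is why a library with per-locale data such as ICU is the more general route.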
msg81656 - Author: Peter Landgren (PeterL) Date: 2009-02-11 19:26
The È... comes from French surnames, and our French developer wants to group all versions of E together. The É... can be found in French surnames in Sweden as well as in Germany. The program, GRAMPS, is a genealogy program used in about 20 languages, so there is no preferred language.

I know. However, Swedish telephone books and dictionaries are sorted the same: A,B,C... X,Y,Z,Å,Ä,Ö.

True. I agree. GRAMPS runs in the locale of the user, but must be able to handle information coming from many other languages/countries. That's why it's hard to be universal.

We can have them in names. See above.

I think we have found a solution that can handle most cases. We treat surnames beginning with "ÅÄÖ" specially. I don't think that there are many surnames outside the Nordic countries that start with any of these three letters.

Vielen Dank!
/Peter
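
One way the special treatment of "ÅÄÖ" might look is sketched below. This is an assumption about the approach, not the actual GRAMPS code; it assumes Python 3, and `KEEP_WHOLE` and `alphabet_label` are hypothetical names.

```python
import unicodedata

# Letters that keep their own label; an assumption matching the Swedish alphabet.
KEEP_WHOLE = frozenset("\u00c5\u00c4\u00d6")  # Å, Ä, Ö

def alphabet_label(surname):
    """Return the alphabet label for a non-empty surname."""
    # Compose first so that "A" + COMBINING RING ABOVE is seen as one letter.
    first = unicodedata.normalize("NFC", surname)[0].upper()
    if first in KEEP_WHOLE:
        return first
    # Every other letter is reduced to its base letter, so É, È, Ë, Ê become E.
    return unicodedata.normalize("NFD", first)[0]

# Hypothetical surnames: åberg, Émond, Öman.
for name in ["\u00e5berg", "\u00c9mond", "\u00d6man"]:
    print(alphabet_label(name), name)
```

The set of letters kept whole would have to depend on the user's locale to cover the other languages mentioned in this thread.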
msg81661 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-02-11 19:54
> The È... comes from French surnames, and our French developer wants to group all versions
> of E together. The É... can be found in French surnames in Sweden as well as in Germany.
> The program, GRAMPS, is a genealogy program used in about 20 languages, so there is no
> preferred language.

I think you'll find that you have to think much harder about collation, then. If you assume that the Unicode ordinal order will give the right collation, it will be wrong many times, I predict. For example, it appears that Croatian puts Dž as a single letter between D and Đ.

> I think we have found a solution that can handle most cases.
> We treat surnames beginning with "ÅÄÖ" specially. I don't think that there are many surnames
> outside the Nordic countries that start with any of these three letters.

It seems they are also common in Turkish (Öksüz, Ölcüm, Önal, ..., taken from the Berlin phonebook), and Turkish puts Ö after O. Hungarian also uses Ö and Ü (as well as Ó, Ú, Ő, Ű), but I don't know how common they are as first letters of surnames.
History
Date User Action Args
2022-04-11 14:56:45 admin set github: 49450
2009-02-11 19:54:24 loewis set messages: +
2009-02-11 19:26:20 PeterL set files: + unnamed; messages: +
2009-02-11 18:32:32 loewis set messages: +
2009-02-11 08:24:05 PeterL set files: + unnamed; messages: +
2009-02-10 21:32:17 loewis set messages: +
2009-02-10 20:50:09 PeterL set files: + unnamed; messages: +
2009-02-10 20:15:00 loewis set messages: +
2009-02-10 20:03:36 PeterL set files: + unnamed; messages: +
2009-02-10 18:59:22 loewis set status: open -> closed; resolution: not a bug; messages: +; nosy: + loewis
2009-02-10 10:45:56 PeterL create