Issue 17252: Latin Capital Letter I with Dot Above
Created on 2013-02-20 09:14 by firatozgul, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (19) | ||
---|---|---|
msg182485 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 09:14 |
The lower() method of strings gives different output for 'Latin Capital Letter I with Dot Above' on Python 3.2 and Python 3.3.
On Python 3.2 (Windows XP):
>>> "\u0130".lower()
'i' #this is correct
On Python 3.3 (Windows XP):
>>> "\u0130".lower()
'i\u0307' #this is wrong
Why is there a difference? This breaks code, because 'i' and 'i\u0307' are different letters. | ||
msg182486 - (view) | Author: R. David Murray (r.david.murray) | Date: 2013-02-20 10:57 |
I thought this would just be a difference in the unicode database, but that appears not to be the case. Ezio, this is related to the infamous Turkic dotless lower case i problem (see, e.g., http://mail.python.org/pipermail/python-bugs-list/2005-October/030686.html). The SpecialCasing.txt file entries for these characters seem to be the same in 6.0.0 (3.2) and 6.1.0 (3.3). So the question is, why did the Python behavior change, and is it indeed a bug? What python3.3 is returning is the canonical version, which would seem to be correct. Have we been buggy up to this point and something got fixed? And, referencing that thread above, how does one do a locale dependent lower case? | ||
msg182487 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 11:31 |
In Python, things like lowercasing-uppercasing and sorting were always problematic with regard to the Turkish language. For instance, whatever the locale is, you cannot lowercase the word 'KADIN' (woman) in Turkish correctly::
>>> "KADIN".lower()
'kadin'
... which is wrong. That should be 'kadın' ('kad\u0131n'). Likewise 'kitap' (book)::
>>> "kitap".upper()
'KITAP'
... which is wrong. That should be 'KİTAP' ('K\u0130TAP'). As for this thread, in 3.3, Python does a completely different thing::
>>> "KİTAP".lower()
'ki\u0307tap' #wrong
In Python 3.2, this was::
>>> "KİTAP".lower()
'kitap' #correct
'i' and 'i\u0307' are not the same. Turkish Python programmers define their own upper(), lower(), title(), swapcase() and casefold() methods and use their own sorting techniques. | ||
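The kind of hand-written helper described in this message might look like the sketch below. The names `tr_lower` and `tr_upper` are hypothetical and not part of Python. The sketch assumes precomposed input and handles only the two dotted/dotless pairs, so it is a simplification and not a full implementation of the Turkic rules in SpecialCasing.txt.

```python
# Hypothetical helpers for Turkish casing; not part of the standard library.
# Assumption: input uses precomposed characters (no combining dot above).

def tr_lower(text):
    # Map the two Turkish capital I letters first, then defer to str.lower().
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def tr_upper(text):
    # Map the two Turkish small i letters first, then defer to str.upper().
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


print(tr_lower("KADIN"))        # dotless small i in the result
print(tr_upper("kitap"))        # dotted capital I in the result
print(tr_lower("K\u0130TAP"))   # plain 'i', no combining dot
```

Doing the replacements before calling the built-in method keeps the generic mapping from ever seeing the four letters whose Turkish case pairs differ.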
msg182491 - (view) | Author: R. David Murray (r.david.murray) | Date: 2013-02-20 12:00 |
Right, and the unicode consortium says that that weird thing 3.3 is doing is the "canonical" lowercasing, and this is the case exactly because in 3.3 "\u0130".lower().upper() == "\u0130". Which is why I asked Ezio if we ever came up with a way to do lower/upper in a locale specific manner. The behavior change is an issue, but I'm thinking the 3.3 behavior is probably the "correct" behavior per the unicode standard. | ||
msg182494 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 12:24 |
r.david.murray: '(...) because in 3.3 "\u0130".lower().upper() == "\u0130"' Do you mean in Python 3.3 "\u0130".lower() returns "\u0130"? If you are saying so, this is not the case, because in Python 3.3::
>>> '\u0130'.lower()
'i\u0307' | ||
msg182495 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2013-02-20 12:28 |
Yes, I think 3.3 is correct here. I think it was Benjamin who fixed/improved the behaviour of casing methods. Compare 3.3:
>>> "ß".upper()
'SS'
with 3.2:
>>> "ß".upper()
'ß'
Also, 3.2 loses information:
>>> "KİTAP".lower().upper()
'KITAP'
>>> ascii("KİTAP".lower().upper())
"'KITAP'"
while 3.3 retains it:
>>> "KİTAP".lower().upper()
'KİTAP'
>>> ascii("KİTAP".lower().upper())
"'KI\\u0307TAP'"
You can get the combined form again with unicodedata.normalize:
>>> import unicodedata
>>> unicodedata.normalize("NFC", "KİTAP".lower().upper())
'KİTAP' | ||
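The round trip described in this message can be checked with a short script. This is a minimal sketch assuming Python 3.3 or later; the variable names are illustrative only.

```python
import unicodedata

word = "K\u0130TAP"             # "KİTAP" with the precomposed U+0130
lowered = word.lower()          # generic mapping: 'i' followed by U+0307
round_trip = lowered.upper()    # 'I' followed by U+0307 (decomposed form)
recomposed = unicodedata.normalize("NFC", round_trip)

print(ascii(lowered))
print(ascii(round_trip))
print(ascii(recomposed))
print(recomposed == word)
```

NFC can recompose 'I' + U+0307 into U+0130, but Unicode has no precomposed character for 'i' + U+0307, so normalizing the lowercase result leaves it as two code points.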
msg182497 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 12:36 |
Don't you think that there is a problem here?
>>> "KİTAP".lower().upper()
'KİTAP'
>>> ascii("KİTAP".lower().upper())
"'KI\\u0307TAP'"
"İ" is not "i\u0307". That's a different letter. "i\u0307" is 'i with combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter I with Dot Above). | ||
msg182498 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 12:44 |
ascii("KİTAP".lower().upper()) should return "K\u0130TAP". Yes, Python 3.2 loses information, but Python 3.3 inserts faulty information, which, I think, is much worse than losing information. | ||
msg182499 - (view) | Author: R. David Murray (r.david.murray) | Date: 2013-02-20 12:45 |
Ah, you are right, I did not decode it to see what the actual characters were. That does contradict what I said, but I'm way out of my depth on unicode at this point, so we'll have to wait for someone more expert to weigh in. | ||
msg182502 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 13:20 |
Excerpt from http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
# Turkish and Azeri
# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
So the code 0130 should be 0069 in lowercase; 0130 in uppercase; 0130 in titlecase; and again 0130 in uppercase. | ||
msg182504 - (view) | Author: Benjamin Peterson (benjamin.peterson) | Date: 2013-02-20 13:50 |
Notice the lines you pulled have "tr" and "az" at the end of them meaning they only apply for Turkish and Azeri. Since the lower() method has no idea whether the user intends to be in a Turkish or Azeri locale or not, we just have to use the generic lowering mapping which simply preserves canonical equivalence. | ||
msg182505 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 13:59 |
Even if you set Turkish locale, the output is still "generic". Furthermore, does "canonical equivalence" really dictate that 'Latin Capital Letter I with Dot Above' should be mapped to 'I With Combining Dot Above' in lowercase? Note: 'Uppercase Dotted i' only exists in Turkish and Azeri. | ||
msg182509 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 14:25 |
Whatever the behavior of Python is in 'generic' terms, I believe, we should be able to do locale-dependent uppercasing-lowercasing, which we cannot do at the moment. | ||
msg182514 - (view) | Author: R. David Murray (r.david.murray) | Date: 2013-02-20 14:40 |
Yes, earlier in that file is the generic translation:
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
You see that Python is following the standard, here. Agreed about the locale-aware upper/lower, etc, but that's a feature request. There's been some discussion about this kind of thing, but I don't remember what the status is. A search of the python-ideas and/or python-dev mailing lists might yield some clues. It's a discussion for one of those mailing lists rather than the bug tracker, in any case. | ||
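To make the difference between the generic entry and the `tr`/`az` entries concrete, the three SpecialCasing.txt lines quoted in this thread can be parsed as below. This is only a sketch: `parse_entry` and `to_text` are hypothetical helpers, and the field order (code; lower; title; upper; optional condition) is assumed from the header of SpecialCasing.txt. It does not reflect how CPython stores these mappings internally.

```python
# The three SpecialCasing.txt entries quoted in this thread.
# Assumed field order: code; lower; title; upper; optional condition; # comment
ENTRIES = """\
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def to_text(field):
    """Turn a field such as '0069 0307' into the string it denotes."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_entry(line):
    """Hypothetical helper: split one entry into a small dict."""
    data = line.split("#", 1)[0]                  # drop the trailing comment
    fields = [part.strip() for part in data.split(";")]
    code, lower, title, upper, condition = fields[:5]
    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
        "condition": condition,                   # '' means unconditional
    }


entries = [parse_entry(line) for line in ENTRIES.splitlines()]
generic = [e for e in entries if not e["condition"]]

for e in entries:
    print(e["condition"] or "(any locale)", ascii(e["lower"]))

# On Python 3.3+, str.lower() matches the unconditional entry.
print(generic[0]["lower"] == "\u0130".lower())
```

Only the entry with an empty condition applies when no language is known, which is the situation str.lower() is in.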
msg182517 - (view) | Author: Firat Ozgul (firatozgul) | Date: 2013-02-20 14:49 |
Apparently, what Python did wrong in the past was somewhat good for Turkish Python developers! This means Turkish developers now have one more problem to solve. Bad. | ||
msg182518 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2013-02-20 14:52 |
> "İ" is not "i\u0307". That's a different letter. "i\u0307" is 'i with
> combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter
> I with Dot Above).
Did you actually read my message? You can reconcile the two using unicodedata.normalize(). | ||
msg182519 - (view) | Author: Benjamin Peterson (benjamin.peterson) | Date: 2013-02-20 14:58 |
The "locale" module does not affect Unicode operations. That's the C locale; I'm talking about the concept of a Unicode locale, which Python doesn't currently know anything about. I agree it would be useful to customize the locale of various unicode operations. That's a much broader language-level issue, though, in need of careful design. As for the useless generic mapping of LATIN CAPITAL LETTER I WITH DOT ABOVE, the idea is that there is no LATIN SMALL LETTER I WITH DOT ABOVE, so the generic lower casing comes from decomposing the character and then lowering the Latin one. | ||
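The "decompose, then lower the Latin letter" idea from this message can be reproduced by hand with unicodedata. This is an illustrative sketch assuming Python 3.3 or later; it shows why the result has two code points and is not a description of CPython's internal lookup.

```python
import unicodedata

capital = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Canonical decomposition: 'I' followed by COMBINING DOT ABOVE.
decomposed = unicodedata.normalize("NFD", capital)

# Lowering the decomposed form only changes the base letter.
lowered = decomposed.lower()

names = [unicodedata.name(ch) for ch in lowered]
for ch, name in zip(lowered, names):
    print("U+%04X %s" % (ord(ch), name))

# The manual route agrees with the generic mapping used by str.lower().
print(lowered == capital.lower())
```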
msg182520 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2013-02-20 15:13 |
On 20.02.2013 15:58, Benjamin Peterson wrote:
>
> Benjamin Peterson added the comment:
>
> The "locale" module does not affect Unicode operations. That's C locale; I'm talking about concept of Unicode locale, which Python doesn't currently know anything about.
>
> I agree it would be useful to customize the locale of various unicode operations. That's a much broader language-level issue, though, in need of careful design.
We'd need to add the CLDR for locale aware operations and a Python interface for it: http://cldr.unicode.org/
The Babel project provides such an interface: http://babel.edgewall.org/
The project appears to have stalled, though. | ||
msg182521 - (view) | Author: Christian Heimes (christian.heimes) | Date: 2013-02-20 15:21 |
In the meantime you can use PyICU https://pypi.python.org/pypi/PyICU for locale aware transformations:
>>> from icu import UnicodeString, Locale
>>> tr = Locale("TR")
>>> s = UnicodeString("KADIN")
>>> print(unicode(s.toLower(tr)))
kadın
>>> unicode(s.toLower(tr))
u'kad\u0131n' |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:42 | admin | set | github: 61454 |
2019-01-04 21:31:09 | terry.reedy | link | issue35639 superseder |
2018-03-22 06:09:55 | serhiy.storchaka | link | issue33108 superseder |
2013-02-20 15:21:37 | christian.heimes | set | nosy: + christian.heimes; messages: + |
2013-02-20 15:13:47 | lemburg | set | messages: + |
2013-02-20 14:58:18 | benjamin.peterson | set | messages: + |
2013-02-20 14:52:14 | pitrou | set | messages: + |
2013-02-20 14:49:34 | firatozgul | set | messages: + |
2013-02-20 14:40:46 | r.david.murray | set | messages: + |
2013-02-20 14:25:33 | firatozgul | set | messages: + |
2013-02-20 13:59:24 | firatozgul | set | messages: + |
2013-02-20 13:50:17 | benjamin.peterson | set | status: open -> closed; resolution: works for me; messages: + |
2013-02-20 13:20:15 | firatozgul | set | messages: + |
2013-02-20 13:14:20 | firatozgul | set | status: closed -> open; resolution: not a bug -> (no value) |
2013-02-20 12:45:16 | r.david.murray | set | messages: + |
2013-02-20 12:44:43 | firatozgul | set | messages: + |
2013-02-20 12:36:42 | firatozgul | set | messages: + |
2013-02-20 12:28:22 | pitrou | set | status: open -> closed; nosy: + lemburg, pitrou, vstinner, benjamin.peterson; messages: +; resolution: not a bug |
2013-02-20 12:24:09 | firatozgul | set | messages: + |
2013-02-20 12:00:48 | r.david.murray | set | messages: + |
2013-02-20 11:31:59 | firatozgul | set | messages: + |
2013-02-20 10:57:33 | r.david.murray | set | nosy: + ezio.melotti, r.david.murray; messages: +; components: + Unicode |
2013-02-20 09:14:15 | firatozgul | create |