Issue 17252: Latin Capital Letter I with Dot Above

Created on 2013-02-20 09:14 by firatozgul, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (19)
msg182485 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 09:14
The lower() method of strings gives different output for 'Latin Capital Letter I with Dot Above' on Python 3.2 and Python 3.3.

On Python 3.2 (Windows XP):

    >>> "\u0130".lower()
    'i' #this is correct

On Python 3.3 (Windows XP):

    >>> "\u0130".lower()
    'i\u0307' #this is wrong

Why is there a difference? This breaks code, because 'i' and 'i\u0307' are different letters.
msg182486 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 10:57
I thought this would just be a difference in the Unicode database, but that appears not to be the case. Ezio, this is related to the infamous Turkic dotless lowercase i problem (see, e.g., http://mail.python.org/pipermail/python-bugs-list/2005-October/030686.html). The SpecialCasing.txt entries for these characters seem to be the same in 6.0.0 (3.2) and 6.1.0 (3.3).

So the question is, why did the Python behavior change, and is it indeed a bug? What Python 3.3 is returning is the canonical version, which would seem to be correct. Have we been buggy up to this point and something got fixed? And, referencing the thread above, how does one do a locale-dependent lowercase?
msg182487 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 11:31
In Python, things like lowercasing, uppercasing and sorting have always been problematic for the Turkish language. For instance, whatever the locale is, you cannot lowercase the word 'KADIN' (woman) in Turkish correctly::

    >>> "KADIN".lower()
    'kadin'

... which is wrong. That should be 'kadın' ('kad\u0131n').

Likewise 'kitap' (book)::

    >>> "kitap".upper()
    'KITAP'

... which is wrong. That should be 'KİTAP' ('K\u0130TAP').

As for this thread, in 3.3, Python does a completely different thing::

    >>> "KİTAP".lower()
    'ki\u0307tap' #wrong

In Python 3.2, this was::

    >>> "KİTAP".lower()
    'kitap' #correct

'i' and 'i\u0307' are not the same.

Turkish Python programmers define their own upper(), lower(), title(), swapcase() and casefold() methods and use their own sorting techniques.
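The kind of hand-written helper mentioned in this message could look roughly like the sketch below. The names `tr_lower` and `tr_upper` are hypothetical and not part of the standard library. The sketch handles only the four Turkish/Azeri I letters in precomposed form, so decomposed input such as 'I' followed by U+0307 is not covered.

```python
# Hypothetical helpers for Turkish/Azeri casing; not a standard-library API.
# Map the Turkic-specific letters first, then let str.lower()/str.upper()
# handle every other character with the generic Unicode rules.
_TR_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def tr_lower(text):
    """Lowercase text with dotted/dotless I treated as Turkish case pairs."""
    return text.translate(_TR_LOWER).lower()


def tr_upper(text):
    """Uppercase text with dotted/dotless I treated as Turkish case pairs."""
    return text.translate(_TR_UPPER).upper()


print(ascii(tr_lower("KADIN")))        # 'kad\u0131n'
print(ascii(tr_upper("kitap")))        # 'K\u0130TAP'
print(ascii(tr_lower("K\u0130TAP")))   # 'kitap'
```

Translating before calling the built-in method keeps the generic mapping from ever seeing the four Turkic letters, so the result does not depend on the Python version.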
msg182491 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 12:00
Right, and the Unicode Consortium says that the weird thing 3.3 is doing is the "canonical" lowercasing, and this is the case exactly because in 3.3 "\u0130".lower().upper() == "\u0130". Which is why I asked Ezio if we ever came up with a way to do lower/upper in a locale-specific manner. The behavior change is an issue, but I'm thinking the 3.3 behavior is probably the "correct" behavior per the Unicode standard.
msg182494 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:24
r.david.murray: '(...) because in 3.3 "\u0130".lower().upper() == "\u0130"'

Do you mean in Python 3.3 "\u0130".lower() returns "\u0130"? If you are saying so, this is not the case, because in Python 3.3::

    >>> '\u0130'.lower()
    'i\u0307'
msg182495 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-20 12:28
Yes, I think 3.3 is correct here. I think it was Benjamin who fixed/improved the behaviour of casing methods. Compare 3.3:

    >>> "ß".upper()
    'SS'

with 3.2:

    >>> "ß".upper()
    'ß'

Also, 3.2 loses information:

    >>> "KİTAP".lower().upper()
    'KITAP'
    >>> ascii("KİTAP".lower().upper())
    "'KITAP'"

while 3.3 retains it:

    >>> "KİTAP".lower().upper()
    'KİTAP'
    >>> ascii("KİTAP".lower().upper())
    "'KI\\u0307TAP'"

You can get the combined form again with unicodedata.normalize:

    >>> unicodedata.normalize("NFC", "KİTAP".lower().upper())
    'KİTAP'
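The round trip described in this message can be checked with a small script. This is a sketch that assumes Python 3.3 or later; the variable names are illustrative only.

```python
import unicodedata

word = "K\u0130TAP"            # 'KİTAP' with the precomposed U+0130
lowered = word.lower()         # generic (non-Turkic) lowercase mapping
round_trip = lowered.upper()

print(ascii(lowered))          # 'ki\u0307tap'
print(ascii(round_trip))       # 'KI\u0307TAP'
print(round_trip == word)      # False: decomposed vs. precomposed

# NFC recombines 'I' + COMBINING DOT ABOVE into U+0130.
recomposed = unicodedata.normalize("NFC", round_trip)
print(recomposed == word)      # True

# There is no precomposed lowercase counterpart, so NFC leaves the
# lowercase form as two code points.
print(len(unicodedata.normalize("NFC", lowered)))  # 6
```

Normalization restores the uppercase letter, but the lowercase result still contains the combining dot, which is the part the reporter objects to.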
msg182497 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:36
Don't you think that there is a problem here?

    >>> "KİTAP".lower().upper()
    'KİTAP'
    >>> ascii("KİTAP".lower().upper())
    "'KI\\u0307TAP'"

"İ" is not "i\u0307". That's a different letter. "i\u0307" is 'i with combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter I with Dot Above).
msg182498 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 12:44
ascii("KİTAP".lower().upper()) should return "K\u0130TAP". Yes, Python 3.2 loses information, but Python 3.3 inserts faulty information, which, I think, is much worse than losing information.
msg182499 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 12:45
Ah, you are right, I did not decode it to see what the actual characters were. That does contradict what I said, but I'm way out of my depth on unicode at this point, so we'll have to wait for someone more expert to weigh in.
msg182502 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 13:20
Excerpt from http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

    # Turkish and Azeri
    # I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
    # The following rules handle those cases.
    0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
    0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

So the code 0130 should be 0069 in lowercase; 0130 in uppercase; 0130 in titlecase; and again 0130 in uppercase.
msg182504 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-02-20 13:50
Notice the lines you pulled have "tr" and "az" at the end of them meaning they only apply for Turkish and Azeri. Since the lower() method has no idea whether the user intends to be in a Turkish or Azeri locale or not, we just have to use the generic lowering mapping which simply preserves canonical equivalence.
msg182505 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 13:59
Even if you set Turkish locale, the output is still "generic". Furthermore, does "canonical equivalence" really dictate that 'Latin Capital Letter I with Dot Above' should be mapped to 'I With Combining Dot Above' in lowercase? Note: 'Uppercase Dotted i' only exists in Turkish and Azeri.
msg182509 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 14:25
Whatever the behavior of Python is in 'generic' terms, I believe, we should be able to do locale-dependent uppercasing-lowercasing, which we cannot do at the moment.
msg182514 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-02-20 14:40
Yes, earlier in that file is the generic translation:

    # Preserve canonical equivalence for I with dot. Turkic is handled below.
    0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

You see that Python is following the standard, here.

Agreed about the locale-aware upper/lower, etc, but that's a feature request. There's been some discussion about this kind of thing, but I don't remember what the status is. A search of the python-ideas and/or python-dev mailing lists might yield some clues. It's a discussion for one of those mailing lists rather than the bug tracker, in any case.
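To compare the generic entry with the Turkic one, it helps to split the lines by field. According to the header of SpecialCasing.txt, the fields are the code point, then its lowercase, titlecase and uppercase mappings, then an optional condition list. The parser below is a minimal sketch; `parse_special_casing` is a hypothetical helper, and it only handles data lines of the shape quoted in this thread.

```python
def parse_special_casing(line):
    """Parse one SpecialCasing.txt data line into a dict (illustrative only)."""
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]

    def to_text(code_points):
        # Turn "0069 0307" into the two-character string 'i\u0307'.
        return "".join(chr(int(cp, 16)) for cp in code_points.split())

    return {
        "char": to_text(fields[0]),
        "lower": to_text(fields[1]),
        "title": to_text(fields[2]),
        "upper": to_text(fields[3]),
        # Empty for unconditional entries, e.g. ['tr'] for Turkish-only ones.
        "conditions": fields[4].split() if len(fields) > 4 else [],
        "name": comment.strip(),
    }


generic = parse_special_casing(
    "0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE"
)
turkish = parse_special_casing(
    "0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE"
)

print(ascii(generic["lower"]), generic["conditions"])  # 'i\u0307' []
print(ascii(turkish["lower"]), turkish["conditions"])  # 'i' ['tr']
```

Only the entry with an empty condition list applies when no language is known, which is the mapping that str.lower() returns in Python 3.3.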
msg182517 - (view) Author: Firat Ozgul (firatozgul) Date: 2013-02-20 14:49
Apparently, what Python did wrong in the past was somewhat good for Turkish Python developers! This means Turkish developers now have one more problem to solve. Bad.
msg182518 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-20 14:52
> "İ" is not "i\u0307". That's a different letter. "i\u0307"is 'i with
> combining dot above'. However, "İ" is "\u0130" (Latin Capital Letter
> I with Dot Above).

Did you actually read my message? You can reconcile the two using unicodedata.normalize().
msg182519 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-02-20 14:58
The "locale" module does not affect Unicode operations. That's the C locale; I'm talking about the concept of a Unicode locale, which Python doesn't currently know anything about.

I agree it would be useful to customize the locale of various Unicode operations. That's a much broader language-level issue, though, in need of careful design.

As for the useless generic mapping of LATIN CAPITAL LETTER I WITH DOT ABOVE, the idea is that there is no LATIN SMALL LETTER I WITH DOT ABOVE, so the generic lowercasing comes from decomposing the character and then lowering the Latin one.
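The "decompose, then lower the Latin letter" explanation can be reproduced step by step with unicodedata. This is a sketch assuming Python 3.3 or later; it illustrates the reasoning and is not how the interpreter implements the mapping internally.

```python
import unicodedata

dotted_capital_i = "\u0130"

# Canonical decomposition: 'I' followed by a combining dot above.
decomposed = unicodedata.normalize("NFD", dotted_capital_i)
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']

# Lowering the decomposed form only changes the base letter.
lowered_parts = decomposed.lower()
print(ascii(lowered_parts))                        # 'i\u0307'

# The generic mapping of the precomposed letter gives the same result.
print(lowered_parts == dotted_capital_i.lower())   # True
```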
msg182520 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-02-20 15:13
On 20.02.2013 15:58, Benjamin Peterson wrote:
> The "locale" module does not affect Unicode operations. That's C locale; I'm talking about concept of Unicode locale, which Python doesn't currently know anything about.
>
> I agree it would be useful to customize the locale of various unicode operations. That's a much broader language-level issue, though, in need of careful design.

We'd need to add the CLDR for locale aware operations and a Python interface for it:

http://cldr.unicode.org/

The Babel project provides such an interface:

http://babel.edgewall.org/

The project appears to have stalled, though.
msg182521 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-02-20 15:21
In the meantime you can use PyICU https://pypi.python.org/pypi/PyICU for locale aware transformations:

    >>> from icu import UnicodeString, Locale
    >>> tr = Locale("TR")
    >>> s = UnicodeString("KADIN")
    >>> print(unicode(s.toLower(tr)))
    kadın
    >>> unicode(s.toLower(tr))
    u'kad\u0131n'
History
Date User Action Args
2022-04-11 14:57:42 admin set github: 61454
2019-01-04 21:31:09 terry.reedy link issue35639 superseder
2018-03-22 06:09:55 serhiy.storchaka link issue33108 superseder
2013-02-20 15:21:37 christian.heimes set nosy: + christian.heimes; messages: +
2013-02-20 15:13:47 lemburg set messages: +
2013-02-20 14:58:18 benjamin.peterson set messages: +
2013-02-20 14:52:14 pitrou set messages: +
2013-02-20 14:49:34 firatozgul set messages: +
2013-02-20 14:40:46 r.david.murray set messages: +
2013-02-20 14:25:33 firatozgul set messages: +
2013-02-20 13:59:24 firatozgul set messages: +
2013-02-20 13:50:17 benjamin.peterson set status: open -> closed; resolution: works for me; messages: +
2013-02-20 13:20:15 firatozgul set messages: +
2013-02-20 13:14:20 firatozgul set status: closed -> open; resolution: not a bug -> (no value)
2013-02-20 12:45:16 r.david.murray set messages: +
2013-02-20 12:44:43 firatozgul set messages: +
2013-02-20 12:36:42 firatozgul set messages: +
2013-02-20 12:28:22 pitrou set status: open -> closed; nosy: + lemburg, pitrou, vstinner, benjamin.peterson; messages: +; resolution: not a bug
2013-02-20 12:24:09 firatozgul set messages: +
2013-02-20 12:00:48 r.david.murray set messages: +
2013-02-20 11:31:59 firatozgul set messages: +
2013-02-20 10:57:33 r.david.murray set nosy: + ezio.melotti, r.david.murray; messages: +; components: + Unicode
2013-02-20 09:14:15 firatozgul create