msg325646 - (view) |
Author: Dogan (zamsalak) |
Date: 2018-09-18 14:02 |
Hey there, I believe I've come across a bug. It occurs when you try to lower() the Turkish uppercase letter "İ". Gonna explain it with example code since it's easier:

>>> len("Ş")
1
>>> len("Ş".lower())
1
>>> len("Ğ")
1
>>> len("Ğ".lower())
1
>>> len("Ö")
1
>>> len("Ö".lower())
1
>>> len("Ç")
1
>>> len("Ç".lower())
1
>>> len("İ")
1
>>> len("İ".lower())
2

When you lower() the Turkish uppercase letter “İ”, it returns a 2-character string with the first character being “i”, and the second being chr(775). Should it not simply return “i”? |
|
|
msg325649 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-09-18 14:15 |
> Should it not simply return “i”?

Python implements the Unicode standard.

>>> import unicodedata
>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database. U+0130 takes the EXTENDED_CASE_MASK branch, which uses the _PyUnicode_ExtendedCase secondary database for "extended case".

In the end, Python uses the following data file from the Unicode standard: https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:

"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""

If you want to convert strings differently for the special case of Turkish, you need to use a different standard than Unicode... I'm closing the issue as NOT A BUG. |
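For readers who need the Turkish behaviour anyway, here is a minimal sketch of a workaround. It assumes that only the dotted/dotless I pairs need special treatment and that the input is precomposed (no "I" followed by U+0307). `turkish_lower` and `turkish_upper` are hypothetical helper names, not part of the standard library; str.lower() itself takes no locale and always applies the default mapping quoted above.

```python
# Hypothetical helpers (not in the standard library): map the Turkish and
# Azerbaijani dotted/dotless I pairs first, then apply the default casing.
_TR_LOWER = str.maketrans({
    "\u0130": "i",       # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
    "I": "\u0131",       # I -> LATIN SMALL LETTER DOTLESS I
})
_TR_UPPER = str.maketrans({
    "i": "\u0130",       # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",       # LATIN SMALL LETTER DOTLESS I -> I
})


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish I mappings (assumption: precomposed input)."""
    return text.translate(_TR_LOWER).lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish I mappings."""
    return text.translate(_TR_UPPER).upper()


print(len("\u0130".lower()))         # 2 (default mapping: i + COMBINING DOT ABOVE)
print(len(turkish_lower("\u0130")))  # 1
```

Doing the translation before calling lower() keeps every other character on the default Unicode mapping, so only the I pairs differ from the built-in behaviour.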
|
|
msg359514 - (view) |
Author: Philippe Ombredanne (pombredanne) * |
Date: 2020-01-07 15:40 |
There is a weird thing though (using Python 3.6.8):

>>> [x.lower() for x in 'İ']
['i̇']
>>> [x for x in 'İ'.lower()]
['i', '̇']

I would expect that the results would be the same in both cases. (And this is a source of a bug for some code of mine) |
|
|
msg359518 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2020-01-07 16:34 |
> I would expect that the results would be the same in both cases.

They are not. Read my previous comment again.

>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
 |
|
|
msg359519 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2020-01-07 16:39 |
PS: The first entry of the result is a decomposed string, too:

>>> r = [x.lower() for x in 'İ']
>>> hex(ord(r[0][0]))
'0x69'
>>> hex(ord(r[0][1]))
'0x307'
 |
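A short sketch of why the two list comprehensions print differently even though they hold the same code points: the first yields one string of length 2, the second yields two strings of length 1. The variable names are illustrative only.

```python
s = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

per_char = [x.lower() for x in s]  # one element, itself two code points long
whole = [x for x in s.lower()]     # two elements, one code point each

print([len(item) for item in per_char])     # [2]
print([len(item) for item in whole])        # [1, 1]
print("".join(per_char) == "".join(whole))  # True
```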
|
|
msg359538 - (view) |
Author: Philippe Ombredanne (pombredanne) * |
Date: 2020-01-07 20:34 |
Thanks for the (re)explanation. Unicode is tough! Basically, this is the issue I really have in the end with the folding: what used to be a proper alpha string is no longer one after a lower(), because the second code point is a combining mark rather than a letter. I use a regex split on the \W class, which then behaves differently once the string is lowercased, since there is an extra non-word character to break on. I will find a way around these (rare) cases alright! Sorry for the noise.

```
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False
```
 |
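One possible way around the \W split problem, sketched under the assumption that dropping the combining dot after a lowercased "i" is acceptable for the application. `lower_keep_alpha` is a hypothetical helper name; note that it also rewrites any "i" + U+0307 sequence that was already present in the input.

```python
import re

COMBINING_DOT_ABOVE = "\u0307"


def lower_keep_alpha(text: str) -> str:
    """Hypothetical helper: lowercase, then drop the dot that lower() adds for U+0130."""
    return text.lower().replace("i" + COMBINING_DOT_ABOVE, "i")


word = "\u0130stanbul"

print(re.split(r"\W+", word.lower()))            # ['i', 'stanbul']
print(re.split(r"\W+", lower_keep_alpha(word)))  # ['istanbul']
print(lower_keep_alpha("\u0130").isalpha())      # True
```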
|
|
msg374323 - (view) |
Author: Şahin Kureta (Şahin Kureta) |
Date: 2020-07-26 15:19 |
I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'. [Here is the document](https://www.unicode.org/Public/14.0.0/ucd/NamesList-14.0.0d1.txt)

> 0049 LATIN CAPITAL LETTER I
> * Turkish and Azerbaijani use 0131 for lowercase
> 0069 LATIN SMALL LETTER I
> * Turkish and Azerbaijani use 0130 for uppercase
 |
|
|
msg374367 - (view) |
Author: Philippe Ombredanne (pombredanne) * |
Date: 2020-07-27 08:47 |
Şahin Kureta you wrote:

> I know it is not finalized and released yet but are you going to
> implement Version 14.0.0 of the Unicode Standard?
> It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Thank you for the pointer! I guess this spec could likely be under consideration for Python when it becomes final (but unlikely before?). |
|
|
msg374370 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2020-07-27 09:26 |
We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0. |
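To check which Unicode database version a given interpreter actually ships with, the unicodedata module exposes it directly. A minimal sketch (the variable name `version` is illustrative):

```python
import sys
import unicodedata

# unidata_version is a string such as "13.0.0", fixed for a given Python build.
version = unicodedata.unidata_version
print("Python %d.%d uses Unicode %s" % (sys.version_info[0], sys.version_info[1], version))
```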
|
|
msg396779 - (view) |
Author: (qdinar) |
Date: 2021-06-30 15:01 |
Şahin Kureta said: "I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'." This reads as if Unicode version 14 introduces something new on this point. It does not: it is the same as version 13. Compare https://www.unicode.org/Public/13.0.0/ucd/SpecialCasing.txt and https://www.unicode.org/Public/14.0.0/ucd/SpecialCasing-14.0.0d8.txt (if that gives a 404, try browsing from https://www.unicode.org/Public/14.0.0/ucd/). |
|
|