Issue 34723: lower() on Turkish letter "İ" returns a 2-chars-long string

Created on 2018-09-18 14:02 by zamsalak, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (10)
msg325646 - Author: Dogan (zamsalak) Date: 2018-09-18 14:02
Hey there, I believe I've come across a bug. It occurs when you try to lower() the Turkish uppercase letter "İ". Gonna explain it with example code since it's easier:

>>> len("Ş")
1
>>> len("Ş".lower())
1
>>> len("Ğ")
1
>>> len("Ğ".lower())
1
>>> len("Ö")
1
>>> len("Ö".lower())
1
>>> len("Ç")
1
>>> len("Ç".lower())
1
>>> len("İ")
1
>>> len("İ".lower())
2

When you lower() the Turkish uppercase letter “İ”, it returns a 2 chars long string with the first character being “i”, and the second being chr(775). Should it not simply return “i”?
msg325649 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-18 14:15
> Should it not simply return “i”?

Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the character up in Python's internal Unicode database. The U+0130 character takes the EXTENDED_CASE_MASK branch, which uses the secondary _PyUnicode_ExtendedCase database for "extended case" mappings.

In the end, Python uses the following data file from the Unicode standard: https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:

"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""

If you want to convert strings differently for the special case of Turkish, you need to use a different standard than Unicode... I close the issue as NOT A BUG.
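For readers who need Turkish-style lowercasing anyway, here is a minimal sketch of one possible workaround. It assumes only the two Turkic special cases matter (İ to i, and I to ı) and handles them before deferring to str.lower(), which is not locale-aware. The helper name `turkish_lower` is hypothetical; it is not part of Python and was not proposed in this issue.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkic mappings for the two I letters.

    Hypothetical helper for illustration only, not a standard library API.
    """
    # U+0130 (İ) becomes plain "i" instead of "i" + U+0307.
    # ASCII "I" becomes U+0131 (ı, dotless i), as Turkish expects.
    prepared = text.replace("\u0130", "i").replace("I", "\u0131")
    # Everything else follows the default Unicode mapping.
    return prepared.lower()


print(len("\u0130".lower()))         # default mapping: two code points
print(len(turkish_lower("\u0130")))  # Turkic mapping: one code point
```

This should only be applied to text known to be Turkish (or another Turkic language); applied to English text it would turn "I" into "ı".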
msg359514 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 15:40
There is a weird thing though (using Python 3.6.8):

>>> [x.lower() for x in 'İ']
['i̇']
>>> [x for x in 'İ'.lower()]
['i', '̇']

I would expect that the results would be the same in both cases. (And this is a source of a bug for some code of mine)
msg359518 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-07 16:34
> I would expect that the results would be the same in both cases.

It's not. Read my previous comment again.

>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']
msg359519 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-01-07 16:39
PS: The first entry of the result is a decomposed string, too:

>>> r = [x.lower() for x in 'İ']
>>> hex(ord(r[0][0]))
'0x69'
>>> hex(ord(r[0][1]))
'0x307'
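The two list comprehensions discussed above can be compared directly. This short sketch (variable names are illustrative) shows that both hold the same two code points; the lists differ only in how those code points are grouped into elements.

```python
# Lowercase each character of the input: one input character, one list element.
per_char = [x.lower() for x in "\u0130"]

# Lowercase the whole string first, then iterate over the resulting code points.
per_code_point = [x for x in "\u0130".lower()]

print(len(per_char), len(per_code_point))

# Joined back together, both are "i" followed by COMBINING DOT ABOVE.
print("".join(per_char) == "".join(per_code_point))
```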
msg359538 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-01-07 20:34
Thanks for the (re)explanation. Unicode is tough! Basically, this is the issue I really have in the end with the folding: what used to be a proper alpha string is no longer one after a lower(), because the second codepoint is a punctuation, and I use a regex split on the \W word class that then behaves differently when the string is lowercased, as we then have an extra punctuation to break on. I will find a way around these (rare) cases alright! Sorry for the noise.

```
>>> 'İ'.isalpha()
True
>>> 'İ'.lower().isalpha()
False
```
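The splitting problem described above can be reproduced in a few lines. In this sketch the sample text and variable names are made up for illustration. Note that unicodedata reports U+0307 as a nonspacing mark (category Mn); it is not a word character for the re module, so \W matches it. One possible way around it, assuming the tokens themselves do not need to survive a later re-split, is to split first and lowercase each token afterwards.

```python
import re
import unicodedata

text = "\u0130stanbul \u0130zmir"  # illustrative sample: "İstanbul İzmir"

# U+0307 COMBINING DOT ABOVE is a nonspacing mark, not a letter.
print(unicodedata.category("\u0307"))

# Lowercasing first introduces U+0307, which \W then treats as a separator.
split_after_lower = re.split(r"\W+", text.lower())

# Splitting first keeps each word whole; lowercase the tokens afterwards.
split_before_lower = [token.lower() for token in re.split(r"\W+", text)]

print(split_after_lower)
print(split_before_lower)
```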
msg374323 - Author: Şahin Kureta (Şahin Kureta) Date: 2020-07-26 15:19
I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'. [Here is the document](https://www.unicode.org/Public/14.0.0/ucd/NamesList-14.0.0d1.txt)

> 0049 LATIN CAPITAL LETTER I * Turkish and Azerbaijani use 0131 for lowercase
> 0069 LATIN SMALL LETTER I * Turkish and Azerbaijani use 0130 for uppercase
msg374367 - Author: Philippe Ombredanne (pombredanne) * Date: 2020-07-27 08:47
Şahin Kureta you wrote:
> I know it is not finalized and released yet but are you going to
> implement Version 14.0.0 of the Unicode Standard?
> It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Thank you for the pointer! I guess this spec could likely be under consideration for Python when it becomes final (but unlikely before?).
msg374370 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-07-27 09:26
We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.
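Which Unicode version a given interpreter actually ships with can be checked at runtime rather than inferred from the release number. A minimal sketch using the standard unicodedata module (the variable names are illustrative):

```python
import unicodedata

# Version string of the Unicode database bundled with this interpreter.
version = unicodedata.unidata_version
major = int(version.split(".")[0])

print("Unicode database version:", version)

# The default lowercase mapping of U+0130, independent of any locale.
lowered = "\u0130".lower()
print([f"U+{ord(ch):04X}" for ch in lowered])
```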
msg396779 - Author: (qdinar) Date: 2021-06-30 15:01
Şahin Kureta said: "I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'." This reads as if Unicode version 14 has something new on this point. It does not; it is the same as version 13. Compare https://www.unicode.org/Public/13.0.0/ucd/SpecialCasing.txt and https://www.unicode.org/Public/14.0.0/ucd/SpecialCasing-14.0.0d8.txt (if that gives a 404, try entering from https://www.unicode.org/Public/14.0.0/ucd/ ).
History
Date User Action Args
2022-04-11 14:59:06 admin set github: 78904
2021-06-30 15:01:06 qdinar set nosy: + qdinar; messages: +
2020-07-27 09:26:01 christian.heimes set messages: +
2020-07-27 08:47:23 pombredanne set messages: +
2020-07-26 15:19:57 Şahin Kureta set nosy: + Şahin Kureta; messages: +
2020-01-07 20:34:35 pombredanne set messages: +
2020-01-07 16:39:34 christian.heimes set nosy: + christian.heimes; messages: +
2020-01-07 16:34:52 vstinner set messages: +
2020-01-07 15:40:25 pombredanne set nosy: + pombredanne; messages: +
2018-09-18 14:15:43 vstinner set status: open -> closed; resolution: not a bug; messages: +; stage: resolved
2018-09-18 14:02:16 zamsalak create