Issue 1193061: Python and Turkish Locale

Created on 2005-04-30 17:37 by caglar, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg25185 - Author: S.Çağlar Onur (caglar) Date: 2005-04-30 17:37
Regarding this thread: http://mail.python.org/pipermail/python-dev/2005-April/052968.html As described in http://www.i18nguy.com/unicode/turkish-i18n.html [ How Applications Fail With Turkish Language ], Turkish has four "i" letters in its alphabet. Without --with-wctype-functions support, Python converts these characters in a locale-independent manner in the tr_TR.UTF-8 locale, so all conversions map to "i" or "I", which is wrong in the Turkish locale. If the Python developers remove the wctype functions from Python, then there must be a locale-dependent upper/lower function to handle these characters properly.
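To make the report concrete, here is a minimal sketch of the mismatch, assuming a current Python 3 interpreter (which postdates this thread) where `str.upper()` and `str.lower()` use the locale-independent Unicode mappings. The constant names and the `default_case_pairs` helper are made up for illustration.

```python
# Sketch only: compares Python 3's locale-independent case mapping with
# the pairing that Turkish expects for its dotted and dotless "i" letters.
DOTTED_LOWER = "i"        # U+0069 LATIN SMALL LETTER I
DOTTED_UPPER = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_LOWER = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
DOTLESS_UPPER = "I"       # U+0049 LATIN CAPITAL LETTER I


def default_case_pairs():
    """Return {label: (default mapping result, Turkish expectation)}."""
    return {
        "upper(i)": (DOTTED_LOWER.upper(), DOTTED_UPPER),
        "lower(I)": (DOTLESS_UPPER.lower(), DOTLESS_LOWER),
    }


if __name__ == "__main__":
    for label, (got, expected) in default_case_pairs().items():
        print(f"{label}: default={got!r} turkish={expected!r}")
```

In both cases the default result differs from the Turkish expectation, which is the behavior the report describes.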
msg25186 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-05-02 08:00
I'm not sure I understand: are you saying that the Unicode mappings for upper and lower case are wrong in the standard?

Note that removing the wctype functions will only remove the possibility of using these functions for case mapping of Unicode characters instead of using the builtin Unicode character database. This was originally meant as an optimization to avoid having to load the Unicode database. Nowadays the database is always included, so the optimization is no longer needed. Even worse: the wctype functions sometimes behave differently from the mappings in the Unicode database (due to differences in the Unicode database version or implementations).

Now, since the string .lower() and .upper() methods are locale dependent (due to their reliance on the C functions toupper() and tolower(), not by intent), while the Unicode versions are not, we have a rather annoying situation where switching from strings to Unicode causes semantic differences.

Ideally, both string and Unicode methods should do case mapping in a locale-independent way. The support for differences in locale-dependent case mapping, collation, etc. should be moved to an external module, e.g. the locale module.
msg25187 - Author: S.Çağlar Onur (caglar) Date: 2005-05-02 08:45
No, I'm not. These rules are defined in http://www.unicode.org/Public/UNIDATA/CaseFolding.txt and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt. Note that there is a comment that says:

# T: special case for uppercase I and dotted uppercase I
# - For non-Turkic languages, this mapping is normally not used.
# - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
# Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
# See the discussions of case mapping in the Unicode Standard for more information.

So without wctype function support, Python can't convert these. This _is_ the problem.

As a side effect of this, another huge problem occurs: keywords can't be locale dependent. If Python is compiled with wctype function support, every "i".upper() turns into "İ", which is wrong for keyword comparison (like quit vs. QUİT).

So I suggest implementing two new functions like localeAwareLower()/localeAwareUpper() for Python and leaving lower()/upper() locale independent. And as you wrote, the locale module may be a perfect home for these :)
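A rough sketch of what such a pair of helpers could look like in pure Python 3 follows. The names `turkic_upper` and `turkic_lower` are hypothetical stand-ins for the proposed localeAwareUpper()/localeAwareLower(), and nothing like them exists in the standard library. Only the two dotted/dotless pairs are handled, and the built-in methods are left untouched.

```python
# Hypothetical helpers, not part of Python: apply the Turkic "T" mappings
# for the dotted/dotless i pairs, then defer to the default Unicode mapping.
# Limitation: a decomposed "I" followed by U+0307 (combining dot above) is
# not special-cased here, although SpecialCasing.txt has a rule for it.

def turkic_upper(text: str) -> str:
    """Uppercase with i -> İ (U+0130) and ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkic_lower(text: str) -> str:
    """Lowercase with İ (U+0130) -> i and I -> ı (U+0131)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


if __name__ == "__main__":
    print(turkic_upper("quit"))   # Turkic mapping
    print("quit".upper())         # default mapping, unchanged
    print(turkic_lower("QUIT"))
```

Because `"quit".upper()` still yields `"QUIT"`, keyword comparison stays locale independent, while callers that need Turkish behavior opt in explicitly.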
msg25188 - Author: Eray Ozkural (exa) Date: 2005-10-11 21:36
A better solution is to add an optional locale argument to the upper/lower functions and to other language-dependent text processing functions.
msg25189 - Author: Ömer FADIL USTA (usta) Date: 2006-09-30 15:58
http://img147.imageshack.us/img147/3717/pythonte4.jpg I think this photo summarizes the bug, which is related to upper() with the Turkish encoding.
msg55471 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-08-30 10:14
Dupe of #1528802.
History
Date                 User          Action  Args
2022-04-11 14:56:11  admin         set     github: 41929
2007-08-30 10:14:40  georg.brandl  set     status: open -> closed; resolution: duplicate; superseder: Turkish Character; messages: +; nosy: + georg.brandl
2005-04-30 17:37:22  caglar        create