Issue 4610: Unicode case mappings are incorrect
Created on 2008-12-09 14:50 by alexs, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (18) | ||
---|---|---|
msg77417 - (view) | Author: Alex Stapleton (alexs) | Date: 2008-12-09 14:50 |
Following a discussion on reddit it seems that the unicode case conversion algorithms are not being followed. $ python3.0 Python 3.0rc1 (r30rc1:66499, Oct 10 2008, 02:33:36) [GCC 4.0.1 (Apple Inc. build 5488)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> x='ß' >>> print(x, x.upper()) ß ß This conversion is correct as defined in UnicodeData.txt; however, http://unicode.org/Public/UNIDATA/SpecialCasing.txt defines a more complete set of case conversions. According to this file "ß".upper() should be "SS". Presumably Python simply isn't using this file to create its mapping database. | ||
msg77461 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2008-12-09 22:14 |
I have known this problem for years, and decided not to act; I don't consider it an important problem. Implementing it properly is complicated by the fact that some of the case mappings are conditional on the locale. If you consider it important, please submit a patch. I'd rather see efforts put into an integration of ICU, which should solve this problem and many others with Python's locale support. | ||
msg77526 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2008-12-10 09:44 |
Python uses the Unicode database for the mapping and this only contains 1-1 mappings. The special cases (mostly 1-2 mappings) are not included. It would be nice to have them available as well, but I guess we'd have to write them in code rather than invent a new mapping table for them. Furthermore, there are a few cases, e.g. the Turkish i, where case mappings depend on external context such as the language the code point is used in - those cases are difficult to get right. We may need to extend the .lower()/.upper()/.title() methods with an optional parameter that allows providing this extra context information to the methods. BTW: 'ß' is being phased out in German. The new writing rules encourage using 'ss' or 'SS' instead (which is not entirely correct, since 'ß' originated from 'sz' used some hundred or so years ago, but those are just details ;-). | ||
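To make the "optional parameter" idea concrete, here is a minimal sketch of language-sensitive casing for the Turkish/Azeri dotted and dotless i. The helper names `upper_for_language` and `lower_for_language` and the `language` argument are hypothetical; nothing like them exists on `str`. The sketch also ignores the combining-dot context rules listed in SpecialCasing.txt.

```python
# Hypothetical helpers: the `language` argument is NOT part of str.upper()/str.lower().
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def upper_for_language(text, language=None):
    """Uppercase text, mapping 'i' to dotted capital I for Turkish/Azeri."""
    if language in ("tr", "az"):
        # The default mapping already sends dotless small i to 'I'.
        text = text.replace("i", DOTTED_CAPITAL_I)
    return text.upper()


def lower_for_language(text, language=None):
    """Lowercase text, mapping 'I' to dotless small i for Turkish/Azeri."""
    if language in ("tr", "az"):
        # Handle the dotted capital first so it is not affected by the second replace.
        text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


print(upper_for_language("istanbul"))        # language-neutral mapping
print(upper_for_language("istanbul", "tr"))  # Turkish tailoring
print(lower_for_language("ISTANBUL", "tr"))
```

Passing the language explicitly keeps the string itself free of locale state, which is the direction the later comments in this thread argue for.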
msg77572 - (view) | Author: Alex Stapleton (alexs) | Date: 2008-12-10 22:28 |
I agree with loewis that ICU is probably the best way to get this functionality into Python. lemburg, yes it seems like extending those methods would be required at the very least. We would probably need to support ICU's collators as well, I think. | ||
msg78112 - (view) | Author: Alex Stapleton (alexs) | Date: 2008-12-20 16:19 |
I am trying to get a PEP together for this. Does anyone have any thoughts on how to handle comparison between unicode strings in a locale aware situation? Should __lt__ and __gt__ be specified as ignoring locale? In which case do we need to add a new method for doing locale aware comparisons? Should locale be a property of the string, an argument passed to upper/lower/isupper/islower/swapcase/capitalize/sort or global state (locale module...)? Should doing a locale aware comparison of two strings with different locales throw an exception? Should locales be represented as objects or just a string like "en_GB"? | ||
msg78116 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2008-12-20 18:41 |
> I am trying to get a PEP together for this. Does anyone have any thoughts > on how to handle comparison between unicode strings in a locale aware > situation? Implementation-wise, or specification-wise? Implementation-wise, you can either try to use the C library, or ICU. For portability, ICU is better; for maintenance, the C library. Specification-wise: it should just Do The Right Thing, and probably be exposed either through the locale module, or through locale objects (in case you want to operate on multiple different locales in a single program) - see other OO languages on how they provide locales. > Should __lt__ and __gt__ be specified as ignoring locale? Yes. > In which case do > we need to add a new method for doing locale aware comparisons? No. Collation is a feature of the locale, not of the strings. > Should locale be a property of the string, an argument passed to > upper/lower/isupper/islower/swapcase/capitalize/sort or global state > (locale module...)? Either global state, or the object *that gets the strings passed to it*. > Should doing a locale aware comparison of two strings with different > locales throw an exception? Strings should not be tied into locales. > Should locales be represented as objects or just a string like "en_GB"? If you want to have multiple of them simultaneously, you need objects. You still need to identify them by name. | ||
msg78122 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2008-12-20 19:52 |
On 2008-12-20 17:19, Alex Stapleton wrote: > Alex Stapleton <alexs@prol.etari.at> added the comment: > > I am trying to get a PEP together for this. Does anyone have any thoughts > on how to handle comparison between unicode strings in a locale aware > situation? Some thoughts: * the Unicode implementation *must* stay locale independent * we should implement the Unicode collation algorithm (TR#10, http://unicode.org/reports/tr10/) * which collation to use should be a parameter of a function or object initializer and it should be possible to use multiple collations in the same application (without switching the locale) * the terms "locale" and "collation" should not be mixed; a (default) collation is a property of a locale and there can also be more than one collation per locale The Unicode collation algorithm defines collation in terms of a key function for each collation, so that already fits nicely with the key function parameter of list.sort(). > Should __lt__ and __gt__ be specified as ignoring locale? In which case do > we need to add a new method for doing locale aware comparisons? Unicode strings should not get any locale or collation specific methods. Instead this feature should be implemented elsewhere and the strings in question passed to this new function or object. > Should locale be a property of the string, an argument passed to > upper/lower/isupper/islower/swapcase/capitalize/sort or global state > (locale module...)? No. See above. > Should doing a locale aware comparison of two strings with different > locales throw an exception? No, assigning locales to strings is not going to work and we should not go down that road. It's better to have locale aware functions for certain operations, so that you can pass your Unicode strings to these function instead of binding additional context information to the Unicode strings themselves. > Should locales be represented as objects or just a string like "en_GB"? 
I think the easiest way to get the collation algorithm implemented is by using a similar scheme as for codecs: you pass a collation name to a central function and get back a collation object that implements the collation in form of a key method and a compare method. | ||
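The codec-like scheme described above could look roughly like the following sketch. Everything here is hypothetical: `Collation`, `register_collation`, and `lookup_collation` are illustrative names, and the "toy-accent-insensitive" key is a stand-in, not the Unicode Collation Algorithm from TR#10.

```python
import unicodedata


class Collation:
    """Hypothetical collation object with a key method and a compare method."""

    def __init__(self, name, key_func):
        self.name = name
        self._key_func = key_func

    def key(self, text):
        # Suitable for list.sort(key=...) and sorted(key=...).
        return self._key_func(text)

    def compare(self, a, b):
        # Returns -1, 0 or 1, like an old-style cmp function.
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


_REGISTRY = {}


def register_collation(name, key_func):
    _REGISTRY[name] = Collation(name, key_func)


def lookup_collation(name):
    """Central lookup by name, analogous to codecs.lookup()."""
    return _REGISTRY[name]


def _accent_insensitive_key(text):
    # Toy key: strip combining marks and casefold; original text breaks ties.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), text)


register_collation("codepoint", lambda text: text)
register_collation("toy-accent-insensitive", _accent_insensitive_key)

words = ["zebra", "\u00c9clair", "apple", "eclair"]
toy = lookup_collation("toy-accent-insensitive")
by_toy = sorted(words, key=toy.key)
by_codepoint = sorted(words, key=lookup_collation("codepoint").key)
print(by_toy)
print(by_codepoint)
```

Because each collation is an ordinary object, several can be used side by side in one program without touching the process-wide locale.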
msg93936 - (view) | Author: Jeff Senn (senn) | Date: 2009-10-13 19:57 |
Has there been any action on this? A PEP? I disagree that using ICU is a good way to simply get proper unicode casing. (A heavy hammer for a small task...) I agree locales are a different issue (and would prefer optional arguments to the unicode object casing methods -- that could then be used within any future sort of locale object to handle correct casing -- but don't rely on such.) Most of the special casing rules can be accomplished by a decomposition (or recursive decomposition) on the character followed by casing the result -- so NO new table is necessary -- only marking up the characters so implicated (there are extra unused bits in the char type table that could be used for this purpose -- so no additional space needed there either). What remains are a tiny handful of cases that need to be handled in code. I have a half finished implementation of this, in case anyone is interested. | ||
msg93944 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2009-10-13 22:19 |
> I have a half finished implementation of this, in case anyone > is interested. Feel free to upload it here. I'm fairly skeptical that it is possible to implement casing "correctly" in a locale-independent way. | ||
msg94011 - (view) | Author: Jeff Senn (senn) | Date: 2009-10-14 19:00 |
> Feel free to upload it here. I'm fairly skeptical that it is > possible to implement casing "correctly" in a locale-independent > way. Ok. I will try to find time to complete it enough to be readable. Unicode (see sec 3.13) specifies the casing of unicode strings pretty completely -- i.e. it gives "Default Casing" rules to be used when no locale specific "tailoring" is available. The only dependencies on locale for the special casing rules are for Turkish, Azeri, and Lithuanian. And you only need to know that that is the language, no other details. So I'm sure that a complete implementation is possible without resorting to a lot of locale munging -- at least for .lower(), .upper() and .title(). .swapcase() is just ...err... dumb^h^h^h^h questionably useful. However .capitalize() is a bit weird; and I'm not sure it isn't incorrectly implemented now: It UPPERCASES the first character, rather than TITLECASING, which is probably wrong in the very few cases where it makes a difference: e.g. (using Croatian ligatures) >>> u'\u01c5amonjna'.title() u'\u01c4amonjna' >>> u'\u01c5amonjna'.capitalize() u'\u01c5amonjna' "Capitalization" is not precisely defined (by the Unicode standard) -- the current Python implementation doesn't even do what the docs say: "makes the first character have upper case" (it also lower-cases all other characters!), however I might argue that a more useful implementation "makes the first character have titlecase..." | ||
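The uppercase/titlecase distinction for the Croatian digraph can be inspected directly. This is a small sketch assuming Python 3, where `str.upper()` and `str.title()` apply the Unicode uppercase and titlecase mappings to a single cased character. It deliberately avoids `.capitalize()`, whose behaviour for such characters is the point in dispute here.

```python
import unicodedata

small = "\u01c6"        # LATIN SMALL LETTER DZ WITH CARON
upper = small.upper()   # uppercase mapping
title = small.title()   # titlecase mapping

for label, ch in (("upper", upper), ("title", title)):
    # Print code point, general category and character name for each form.
    print(label, f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

The two results differ: the uppercase form has category Lu while the titlecase form has category Lt, which is why uppercasing the first character is not the same as titlecasing it.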
msg94017 - (view) | Author: Jeff Senn (senn) | Date: 2009-10-14 19:25 |
Yikes! I just noticed that u''.title() is really broken! It doesn't really pay attention to word breaks -- only characters that "have case". Therefore when there are (caseless) combining characters in a word it's really broken e.g. >>> u'n\u0303on\u0303e'.title() u'N\u0303On\u0303E' That is (where '~' is combining-tilde-over) n~on~e -title-cases-to-> N~On~E | ||
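One possible workaround for the combining-character case, offered only as a sketch: normalize to NFC before title-casing, so that a base letter plus combining mark becomes a single cased code point where a precomposed form exists. The helper name `title_after_nfc` is hypothetical, and the approach does nothing for sequences that have no precomposed equivalent.

```python
import unicodedata


def title_after_nfc(text):
    """Compose base letter + combining mark pairs, then title-case."""
    return unicodedata.normalize("NFC", text).title()


decomposed = "n\u0303on\u0303e"   # n + combining tilde, o, n + combining tilde, e
result = title_after_nfc(decomposed)
print(result)
```

After normalization the word contains only cased letters, so only the first character is titlecased and the rest are lowercased.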
msg94023 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2009-10-14 20:16 |
Jeff Senn wrote: > > Jeff Senn <senn@users.sourceforge.net> added the comment: > > Yikes! I just noticed that u''.title() is really broken! > > It doesn't really pay attention to word breaks -- > only characters that "have case". > Therefore when there are (caseless) > combining characters in a word it's really broken e.g. > >>>> u'n\u0303on\u0303e'.title() > u'N\u0303On\u0303E' > > That is (where '~' is combining-tilde-over) > n~on~e -title-cases-to-> N~On~E Please have a look at http://bugs.python.org/issue6412 - that patch addresses many casing issues, at least up to the extent that we can actually fix them without breaking code relying on: len(s.upper()) == len(s) for upper/lower/title. If we add support for 1-n code point mappings, then we can only enable this support by using an option to the casing methods (perhaps not a bad idea: the parameter could be used to signal the locale to assume). | ||
msg94024 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2009-10-14 20:26 |
Jeff Senn wrote: > However .capitalize() is a bit weird; and I'm not sure it isn't > incorrectly implemented now: > > It UPPERCASES the first character, rather than TITLECASING, which is > probably wrong in the very few cases where it makes a difference: > e.g. (using Croatian ligatures) > >>>> u'\u01c5amonjna'.title() > u'\u01c4amonjna' >>>> u'\u01c5amonjna'.capitalize() > u'\u01c5amonjna' > > "Capitalization" is not precisely defined (by the Unicode standard) -- > the currently python implementation doesn't even do what the docs say: > "makes the first character have upper case" (it also lower-cases all > other characters!), however I might argue that a more useful > implementation "makes the first character have titlecase..." You don't have to worry about .capitalize() and .swapcase() :-) Those methods are defined by their implementation and don't resemble anything defined in Unicode. I agree that they are, well, not that useful. | ||
msg94026 - (view) | Author: Raymond Hettinger (rhettinger) * | Date: 2009-10-14 20:40 |
> .swapcase() is just ...err... dumb^h^h^h^h questionably useful. FWIW, it appears that the original use case (as an Emacs macro) was to correct blocks of text where touch typists had accidentally left the Caps Lock key turned on: tHE qUICK bROWN fOX jUMPED oVER tHE lAZY dOG. I agree with the rest of you that Python would be better off without swapcase(). | ||
msg123488 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2010-12-06 18:42 |
>> .swapcase() is just ...err... dumb^h^h^h^h questionably useful. > I agree with the rest of you that Python would be better-off > without swapcase(). As long as str.upper/lower are based only on UnicodeData.txt 1-to-1 mappings, existence of str.swapcase() indicates to the users that they should not expect many-to-1 mappings. Also it does seem to be occasionally used for testing. -0 on removing it. | ||
msg191738 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2013-06-23 22:52 |
There has been a relatively recent discussion of case mappings under #12753. I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong in a specialized library module, not Python core. The behavior of .title() and .capitalize() is harder to defend, so if someone can point to a Python library (PyICU?) that gets it right we can reference it in the documentation. | ||
msg191740 - (view) | Author: Alexander Belopolsky (belopolsky) * | Date: 2013-06-23 23:56 |
It looks like at least the OP issue has been fixed in #12736: >>> 'ß'.upper() 'SS' | ||
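A quick check of the consequences, assuming a Python version that includes the #12736 change (3.3 or later): once 1-to-n mappings are applied, the `len(s.upper()) == len(s)` invariant mentioned earlier in this thread no longer holds, and `swapcase()` stops being its own inverse.

```python
word = "Stra\u00dfe"               # "Straße"
upper = word.upper()

# With full case mappings the uppercase form is one character longer.
length_preserved = len(upper) == len(word)

# swapcase() twice no longer returns the original character.
round_trip = "\u00df".swapcase().swapcase()

print(upper, length_preserved, round_trip)
```

Code that preallocates buffers or indexes into the uppercased string by the original positions should therefore not assume equal lengths.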
msg191750 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2013-06-24 08:23 |
On 24.06.2013 00:52, Alexander Belopolsky wrote: > > Alexander Belopolsky added the comment: > > There has been a relatively recent discussion of case mappings under #12753. > > I personally agree with Martin: str.upper/lower should remain the way it is - a simplistic 1-to-1 mapping using UnicodeData.txt fields. More sophisticated case mapping algorithms belong to a specialized library module not python core. > > The behavior of .title() and .capitalize() is harder to defend, so if someone can point out to a python library (PyICU?) that gets it right we can reference it in the documentation. .title() and .capitalize() are 1-1 mappings as well. Python only supports "Simple Case Operations" and does not support "Full Case Operations" which require parsing context (SpecialCasing.txt). ICU does provide support for both: http://userguide.icu-project.org/transforms/casemappings PyICU wraps ICU, but it is not clear to me how you'd access those mappings (the package doesn't provide documentation on the API, instead just gives a description of how to map the C++ API to a Python one): https://pypi.python.org/pypi/PyICU |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:42 | admin | set | github: 48860 |
2013-06-24 08:23:50 | lemburg | set | messages: + |
2013-06-23 23:56:10 | belopolsky | set | status: open -> closed; superseder: Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation; resolution: out of date; messages: + |
2013-06-23 22:52:54 | belopolsky | set | messages: +; versions: + Python 3.4, - Python 2.6, Python 3.0 |
2013-06-23 22:32:03 | belopolsky | link | issue12753 superseder |
2010-12-06 18:42:16 | belopolsky | set | nosy: + belopolsky; messages: + |
2009-10-14 20:40:29 | rhettinger | set | nosy: + rhettinger; messages: + |
2009-10-14 20:26:09 | lemburg | set | messages: + |
2009-10-14 20:16:27 | lemburg | set | messages: + |
2009-10-14 19:25:28 | senn | set | messages: + |
2009-10-14 19:00:09 | senn | set | messages: + |
2009-10-13 22:19:29 | loewis | set | messages: + |
2009-10-13 19:57:02 | senn | set | nosy: + senn; messages: + |
2008-12-20 19:52:50 | lemburg | set | messages: + |
2008-12-20 18:41:27 | loewis | set | messages: + |
2008-12-20 16:24:30 | ezio.melotti | set | nosy: + ezio.melotti |
2008-12-20 16:19:13 | alexs | set | messages: + |
2008-12-10 22:28:53 | alexs | set | messages: + |
2008-12-10 09:44:10 | lemburg | set | nosy: + lemburg; messages: + |
2008-12-09 22:14:44 | loewis | set | nosy: + loewis; messages: + |
2008-12-09 14:50:29 | alexs | create |