[Python-3000] string module trimming
Jeffrey Yasskin jyasskin at gmail.com
Thu Apr 19 17:14:15 CEST 2007
On 4/18/07, Josiah Carlson <jcarlson at uci.edu> wrote:
"Jeffrey Yasskin" <jyasskin at gmail.com> wrote: > I missed the beginning of this discussion, so sorry if you've already > covered this. Are you saying that in your app, just because I've set > the enUS locale, I won't be able to type "????"? Or that those > characters won't be recognized as letters?
If I understand the conversation correctly, the discussion is about what will be in string.letters, and how str.upper(), etc., will behave when a locale is set.
string.letters should go away: I don't know of any correct uses of it, and, as you say, 40K letters is too long. Searching a list is the wrong way to decide whether a character is a letter, and case transformations don't work a character at a time (consider what happens with "ß".upper(), where "ß" is U+00DF LATIN SMALL LETTER SHARP S). http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the mappings that aren't 1-1. Some of them are locale-sensitive, but you can do a pretty good job ignoring the language, as long as you allow strings to change length.
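For concreteness, here's a quick sketch of what I mean (assuming Python 3 semantics, where str is a Unicode string; the exact spelling is my illustration, not anything decided in this thread):

    # U+00DF LATIN SMALL LETTER SHARP S uppercases to two characters,
    # per SpecialCasing.txt, so the string changes length.
    s = "\u00df"                      # "ß"
    print(s.upper())                  # SS
    print(len(s), len(s.upper()))     # 1 2

    # Ligatures behave the same way:
    print("\ufb01".upper())           # FI (from U+FB01 LATIN SMALL LIGATURE FI)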
> The Unicode character database (http://www.unicode.org/ucd/) seems
> like the obvious way to handle character properties if you want to get
> the right answers.
Certainly, but having 40k characters in string.letters seems like overkill for any locale. It seems to make sense only to include the letters for the current locale in string.letters, and to handle str.upper(), etc., as determined by the locale.
As far as I understand, "letters for the current locale" is the same as "letters" in Unicode. Can you point me to a character that is a letter in one locale but not in another? (The third column of http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the character's category, and http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values says what it means.)
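Python's unicodedata module already exposes that table, by the way. A sketch (my illustration, not a proposal from this thread):

    import unicodedata

    def is_letter(ch):
        # The letter categories are Lu, Ll, Lt, Lm, and Lo.
        return unicodedata.category(ch).startswith("L")

    print(is_letter("a"))       # True
    print(is_letter("\u00df"))  # True -- a letter regardless of locale
    print(is_letter("3"))       # False (category Nd)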
In terms of sorting, since all (Unicode) strings should be comparable to one another, using the Unicode-specified ordering would seem to make sense, unless it is something other than code point values. If it isn't code point values (which seems to be the implication), then we need to decide whether we want to consult a 128-kilobyte table (for UCS-2 builds) in order to sort strings (though cache locality may make this a moot point for most comparisons).
If you just need to store strings in an order-based data structure (which I guess is moot for Python with its hashes), then codepoint order is fine. If you intend to show users a sorted list, you have to use the real collation algorithm (the Unicode Collation Algorithm) or you'll produce the wrong answer. I don't understand the algorithm's details, but ICU has an implementation, and http://icu-project.org/charts/icu4c_footprint.html claims that the data for all languages fits in 354K.
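To make the difference concrete, here's a sketch contrasting codepoint order with locale-aware collation via the stdlib locale module (the locale name is an assumption; which names are available, and the exact collated order you get, depend on the platform's collation tables):

    import locale

    words = ["cote", "cot\u00e9", "c\u00f4te", "c\u00f4t\u00e9"]

    # Plain codepoint order:
    print(sorted(words))          # ['cote', 'coté', 'côte', 'côté']

    # Locale-aware order; a French collation may interleave the
    # accented forms differently than raw codepoints do:
    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
    print(sorted(words, key=locale.strxfrm))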
UCS-2 is an old and broken fixed-width encoding that cannot represent characters above U+FFFF. Nobody should ever use it. You probably meant UTF-16.
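An illustration of the difference (again assuming Python 3 semantics):

    # U+1D11E MUSICAL SYMBOL G CLEF is above U+FFFF.  UTF-16 encodes it
    # as a surrogate pair (d834 dd1e); UCS-2 has no representation at all.
    ch = "\U0001D11E"
    print(ch.encode("utf-16-be").hex())   # d834dd1e -- two 16-bit units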
--
Namasté,
Jeffrey Yasskin