[Python-3000] string module trimming

Josiah Carlson jcarlson at uci.edu
Thu Apr 19 19:22:13 CEST 2007


"Jeffrey Yasskin" <jyasskin at gmail.com> wrote:

On 4/18/07, Josiah Carlson <jcarlson at uci.edu> wrote:
> "Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> > I missed the beginning of this discussion, so sorry if you've already
> > covered this. Are you saying that in your app, just because I've set
> > the enUS locale, I won't be able to type "????"? Or that those
> > characters won't be recognized as letters?
>
> If I understand the conversation correctly, the discussion is what will
> be in string.letters, and what will be handled in str.upper(), etc.,
> when a locale is set.

string.letters should go away because I don't know of any correct uses of it, and as you say 40K letters is too long. Searching a list is the wrong way to decide whether a character is a letter, and case transformations don't work a character at a time (consider what happens with "ß".upper(), that is, U+00DF, German Small Sharp S). http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt defines the mappings that aren't 1-1. There are some that are locale-specific, but you can do a pretty good job ignoring the language, as long as you allow strings to change length.
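For concreteness, here is a minimal sketch of the length-changing case mapping (this assumes a modern Python 3 interpreter, where str.upper() applies the full Unicode case mappings; at the time of this thread that behaviour was still under discussion):

    # Full case mapping can change the length of a string:
    # U+00DF (LATIN SMALL LETTER SHARP S) uppercases to "SS".
    s = "stra\u00dfe"                  # "straße"
    print(s.upper())                   # -> STRASSE
    print(len(s), len(s.upper()))      # -> 6 7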

Because we aren't mutating unicode strings, this isn't an issue. I respond below regarding string.letters.

> > The Unicode character database (http://www.unicode.org/ucd/) seems
> > like the obvious way to handle character properties if you want to get
> > the right answers.
>
> Certainly, but having 40k characters in string.letters seems like a bit
> of overkill, for any locale. It seems as though it only makes sense
> to include the letters for the current locale as string.letters, and to
> handle str.upper(), etc., as determined by the locale.

As far as I understand, "letters for the current locale" is the same as "letters" in Unicode. Can you point me to a character that is a letter in one locale but not in another? (The third column of http://www.unicode.org/Public/UNIDATA/UnicodeData.txt defines the character's category, and http://www.unicode.org/Public/UNIDATA/UCD.html#GeneralCategoryValues says what it means.)
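A rough sketch of checking the "Letter" categories with the standard unicodedata module, rather than searching a 40K-character string (the helper name is mine, not anything proposed in this thread):

    import unicodedata

    def is_letter(ch):
        # The general categories Lu, Ll, Lt, Lm and Lo make up the
        # "Letter" group described in UCD.html#GeneralCategoryValues.
        return unicodedata.category(ch).startswith("L")

    print(is_letter("a"))         # True
    print(is_letter("\u00df"))    # True  (sharp s, category Ll)
    print(is_letter("3"))         # False (category Nd)
    print(is_letter("!"))         # False (category Po)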

Neither I nor, I believe, Python means 'letters' in the general sense, but rather the 'alphabet' of a particular locale: compare en_US with sv_SE, for example.
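For reference, Python 2's string.letters really is locale-dependent; a small sketch (this assumes a Python 2 interpreter and that an 8-bit Swedish locale is installed, which varies by platform; string.letters is gone entirely in Python 3):

    import locale
    import string

    print(string.letters)      # just the ASCII letters under the "C" locale

    # Locale name is an assumption; the spelling differs between platforms.
    locale.setlocale(locale.LC_ALL, "sv_SE.ISO8859-1")
    print(string.letters)      # now also includes bytes such as å, ä, ö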

> In terms of sorting, since all (unicode) strings should be comparable to
> one another, using the unicode-specified ordering would seem to make
> sense, unless it is something other than code point values. If it isn't
> code point values (which seems to be the implication), then we need to
> decide if we want to check a 128kbyte table (for UCS-2 builds) in order
> to sort strings (though cache lookup locality may make this a moot point
> for most comparisons).

If you just need to store strings in an order-based data structure (which I guess is moot for python with its hashes), then codepoint order is fine. If you intend to show users a sorted list, then you have to use the real collation algorithm or you'll produce the wrong answer. I don't understand the algorithm's details, but ICU has an implementation, and http://icu-project.org/charts/icu4cfootprint.html claims that the data for all languages fits in 354K.
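A small sketch of the difference, using the stdlib locale module as a stand-in for a real ICU-style collator (the locale name is an assumption and has to be installed on the system):

    import locale

    words = ["cote", "côte", "coté", "côté"]

    # Codepoint order: the accented forms all sort after the plain ones.
    print(sorted(words))

    # Locale-aware order via strxfrm; assumes a French UTF-8 locale exists.
    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
    print(sorted(words, key=locale.strxfrm))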

It could probably be reduced to less than 354K with two tables and a comparison function that knows how to handle surrogates.

UCS-2 is an old and broken fixed-width encoding that cannot represent characters above U+FFFF. Nobody should ever use it. You probably meant UTF-16.

You are more or less right. Earlier versions of Windows were limited to UCS-2, and I believe earlier versions of Python on Windows were also limited to UCS-2. For narrow builds we use UTF-16, with surrogate pairs and everything (though a unicode string consisting of a single surrogate pair will have length 2, not 1 as would be expected).
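A quick illustration of that length quirk, using Python 3 literal syntax (what you see depends on how the interpreter was built; a wide/UCS-4 build, or Python 3.3 and later which dropped narrow builds, reports length 1):

    # U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so a narrow
    # (UTF-16) build stores it as a surrogate pair.
    s = "\U0001D11E"
    print(len(s))                     # 2 on a narrow build, 1 on a wide build
    print([hex(ord(c)) for c in s])   # the surrogate code units on a narrow build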


