[Python-3000] string module trimming

Josiah Carlson jcarlson at uci.edu
Thu Apr 19 08:50:17 CEST 2007


"Jeffrey Yasskin" <jyasskin at gmail.com> wrote:

On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > But we already went over this. There are over 40K letters in Unicode.
> > It simply makes no sense to have a string.letters approaching that
> > size.
>
> Agreed. But there aren't 40K (alphabetic) letters in any particular
> locale. Most individual languages will have less than 100.

I missed the beginning of this discussion, so sorry if you've already covered this. Are you saying that in your app, just because I've set the en_US locale, I won't be able to type "????"? Or that those characters won't be recognized as letters?

If I understand the conversation correctly, the discussion is what will be in string.letters, and what will be handled in str.upper(), etc., when a locale is set.

The Unicode character database (http://www.unicode.org/ucd/) seems like the obvious way to handle character properties if you want to get the right answers.
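For readers who want to see what querying the character database looks like in practice, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` and the sample characters are hypothetical and chosen only for illustration; nothing here is proposed in the thread itself.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, is-letter flag) for one character.

    Hypothetical helper: a character counts as a letter here when its
    general category starts with 'L' (Lu, Ll, Lt, Lm, Lo).
    """
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "<unnamed>")
    return name, category, category.startswith("L")


# Sample characters: ASCII letter, accented Latin, Cyrillic, digit.
for ch in ("a", "\u00c9", "\u0434", "5"):
    print(repr(ch), describe(ch))
```

Note that these lookups do not depend on the current locale, which is the point of using the database for "is this a letter?" questions.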

Certainly, but having 40k characters in string.letters seems like a bit of overkill, for any locale. It seems as though it only makes sense to include the letters for the current locale as string.letters, and to handle str.upper(), etc., as determined by the locale.
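To put a number on the size concern, the following sketch counts every code point whose general category is a letter and compares it with the ASCII-only constant. It assumes Python 3, where `string.ascii_letters` exists; the function name `count_unicode_letters` is hypothetical, and the exact count depends on the Unicode version bundled with the interpreter, so no figure is quoted here.

```python
import string
import sys
import unicodedata


def count_unicode_letters():
    """Count code points whose general category is a letter (L*).

    Hypothetical helper; walks the whole code point range, so it takes
    a noticeable fraction of a second to run.
    """
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("L")
    )


print("ASCII letters:  ", len(string.ascii_letters))
print("Unicode letters:", count_unicode_letters())
```

A per-character test such as `str.isalpha()` or `unicodedata.category()` answers the same question without materializing a constant of that size.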

In terms of sorting, since all (unicode) strings should be comparable to one another, using the unicode-specified ordering would seem to make sense, unless it is something other than code point values. If it isn't code point values (which seems to be the implication), then we need to decide if we want to check a 128kbyte table (for UCS-2 builds) in order to sort strings (though cache lookup locality may make this a moot point for most comparisons).
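The difference between the two orderings can be sketched as follows. Plain `sorted()` on Python 3 strings compares code point values, while `locale.strxfrm` delegates to the C library's collation for whatever locale is active; the latter is only a stand-in for a table-driven ordering and is not the Unicode collation algorithm. The helper name `sort_by_locale` and the word list are hypothetical, and the locale-aware result varies by platform, so only the code point result is shown.

```python
import locale

words = ["zebra", "Apple", "\u00e9clair", "apple"]

# Code point order: 'A' (65) < 'a' (97) < 'z' (122) < 'é' (233).
by_code_point = sorted(words)
print(by_code_point)  # ['Apple', 'apple', 'zebra', 'éclair']


def sort_by_locale(items):
    """Sort using the collation rules of the user's default locale.

    Hypothetical helper; falls back to whatever locale is already
    active if the default one cannot be set.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, "")
    except locale.Error:
        pass
    return sorted(items, key=locale.strxfrm)


print(sort_by_locale(words))  # order depends on the platform and locale
```

Code point comparison needs no lookup table at all, which is the cost trade-off raised above.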


