[Python-3000] string module trimming (original) (raw)

Jason Orendorff jason.orendorff at gmail.com
Thu Apr 19 20:51:24 CEST 2007

Previous message: [Python-3000] string module trimming
Next message: [Python-3000] string module trimming
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

On 4/18/07, Guido van Rossum <guido at python.org> wrote: > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> > Today, string.letters works most easily with ASCII supersets, and is > > effectively limited to 8-bit encodings. Once everything is unicode, I > > don't think that 8-bit restriction should apply any more. > But we already went over this. There are over 40K letters in Unicode. > It simply makes no sense to have a string.letters approaching that > size. Agreed. But there aren't 40K (alphabetic) letters in any particular locale. Most individual languages will have less than 100.

But isn't the hip thing these days to use UTF-8 as the character set, as in "LC_ALL=en_US.UTF-8"? I picked a random Linux machine and this happened to be the default LANG. A quick C test revealed that iswalpha() thinks there are 45,974 letters in that locale.

Anyway, let's discuss your use cases.

One: UIs that involve displaying all the letters.

There are also reasons to want only "local" letters. For example, in a French interface, I might want to include the extra French letters, but not the Greek.

But POSIX locales don't claim to provide this information, as far as I can tell.

It seems like for a quick throwaway program, you're better off ignoring languages. And for even a slightly serious program, string.letters is wrong in a dozen ways.

Seriously, a table of alphabets that's saner than string.letters is pretty trivial to write:

alphabets = { 'en': list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 'es': ("A B C Ch D E F G H I J K L Ll M " + "N \u00d1 O P Q R S T U V W X Y Z").split(), ... }

...and you can go from there.

Two: Collation.

Collation can be done right: provide a function text.sort_key() that converts a str into an opaque thing that has the desired ordering. You would use it like so:

names.sort(key=text.sort_key) records.sort(key=lambda rec: text.sort_key(rec.title))

Jython can implement this using java.text.Collator. IronPython can use CultureInfo.CompareInfo.GetSortKey(). CPython can call LCMapString() on Windows and wcsxfrm() on POSIX, falling back on just returning the string itself.

-j

Previous message: [Python-3000] string module trimming
Next message: [Python-3000] string module trimming
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-3000 mailing list