[Python-3000] string module trimming (original) (raw)
Jason Orendorff jason.orendorff at gmail.com
Thu Apr 19 20:51:24 CEST 2007
- Previous message: [Python-3000] string module trimming
- Next message: [Python-3000] string module trimming
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
On 4/18/07, Guido van Rossum <guido at python.org> wrote: > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > Today, string.letters works most easily with ASCII supersets, and is > > effectively limited to 8-bit encodings. Once everything is unicode, I > > don't think that 8-bit restriction should apply any more. > But we already went over this. There are over 40K letters in Unicode. > It simply makes no sense to have a string.letters approaching that > size. Agreed. But there aren't 40K (alphabetic) letters in any particular locale. Most individual languages will have less than 100.
But isn't the hip thing these days to use UTF-8 as the character set, as in "LC_ALL=en_US.UTF-8"? I picked a random Linux machine and this happened to be the default LANG. A quick C test revealed that iswalpha() thinks there are 45,974 letters in that locale.
Anyway, let's discuss your use cases.
One: UIs that involve displaying all the letters.
There are also reasons to want only "local" letters. For example, in a French interface, I might want to include the extra French letters, but not the Greek.
But POSIX locales don't claim to provide this information, as far as I can tell.
It seems like for a quick throwaway program, you're better off ignoring languages. And for even a slightly serious program, string.letters is wrong in a dozen ways.
Seriously, a table of alphabets that's saner than string.letters is pretty trivial to write:
alphabets = { 'en': list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 'es': ("A B C Ch D E F G H I J K L Ll M " + "N \u00d1 O P Q R S T U V W X Y Z").split(), ... }
...and you can go from there.
Two: Collation.
Collation can be done right: provide a function text.sort_key() that converts a str into an opaque thing that has the desired ordering. You would use it like so:
names.sort(key=text.sort_key) records.sort(key=lambda rec: text.sort_key(rec.title))
Jython can implement this using java.text.Collator. IronPython can use CultureInfo.CompareInfo.GetSortKey(). CPython can call LCMapString() on Windows and wcsxfrm() on POSIX, falling back on just returning the string itself.
-j
- Previous message: [Python-3000] string module trimming
- Next message: [Python-3000] string module trimming
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]