[Python-3000] string module trimming

Guido van Rossum guido at python.org
Thu Apr 19 00:08:29 CEST 2007


On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 4/17/07, Guido van Rossum <guido at python.org> wrote:
> > > > The locale module doesn't deal with Unicode, only with 8-bit characters (not
> > > > multi-byte characters). You'll lose this anyway. Certainly
> > > > string.letters is not going to provide this functionality.

> > > But for languages in Latin1, 8-bit characters are sufficient --
> > > anything with more than 8 bits is by definition not a (local) letter.
>
> > Latin-1 is just another encoding (and not a very useful one given that
> > it can't encode all of Unicode). I don't want to define a feature that
> > only works for Latin-1.
>
> Today, string.letters works most easily with ASCII supersets, and is
> effectively limited to 8-bit encodings. Once everything is unicode, I
> don't think that 8-bit restriction should apply any more.

But we already went over this. There are over 40K letters in Unicode. It simply makes no sense to have a string.letters approaching that size.
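As a rough illustration of the scale involved, the sketch below counts the code points that Python's unicodedata module classifies as letters (general categories starting with "L") and compares that with the ASCII letters. It is written for Python 3, so it uses string.ascii_letters rather than string.letters; the helper name count_unicode_letters is made up for this example, and the exact total depends on the Unicode version bundled with the interpreter.

```python
import string
import sys
import unicodedata


def count_unicode_letters():
    """Count code points whose general category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("L")
    )


ascii_count = len(string.ascii_letters)   # a-z plus A-Z
unicode_count = count_unicode_letters()   # varies with the bundled Unicode database

print("ASCII letters:  ", ascii_count)
print("Unicode letters:", unicode_count)
```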

> > > I won't swear that localizations currently replace string.letters with
> > > the appropriately ordered (slight) superset, but it is a valid use
> > > case, and string* (or text*) is clearly the right place.

> > The right solution for locale-dependent collation for sure isn't
> > having a string containing all the letters in the right order. There
> > are plenty of languages where that approach doesn't even work.
>
> Theoretically, English is one of those non-working languages. (Names
> in bibliographic entries are supposed to be alphabetized according to
> language of origin.) In practice, ordered-list-of-chars works well
> enough, often enough. It often works better than sorting by code
> point, which is the only obvious alternative. Unless I missed it (and
> I may have), unicode itself sort of ducks the question about how to
> sort strings. Python really needs to provide an answer, but I'm not
> sure it is possible to provide the (single) correct answer.

The Unicode standard certainly has a solution, but it is complicated and I don't believe it is currently implemented in core Python.
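The difference between sorting by code point and the ordered-list-of-chars workaround can be sketched as follows. This is only an illustration under stated assumptions: ALPHABET is a made-up ordering, alphabet_key is a hypothetical helper, and neither is the collation algorithm defined by the Unicode standard.

```python
import unicodedata

# Hypothetical ordering for illustration: accented letters follow their base letter.
ALPHABET = unicodedata.normalize("NFC", "aàäbcçdeéèfghijklmnoöpqrstuüvwxyz")
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def alphabet_key(word):
    """Map each character to its rank in ALPHABET; unknown characters sort last."""
    folded = unicodedata.normalize("NFC", word).lower()
    return [RANK.get(ch, len(ALPHABET)) for ch in folded]


words = [unicodedata.normalize("NFC", w)
         for w in ["zebra", "Émile", "apple", "Zoe", "eclair"]]

by_code_point = sorted(words)                  # uppercase first, accented letters last
by_alphabet = sorted(words, key=alphabet_key)  # follows the hand-written ordering

print(by_code_point)
print(by_alphabet)
```

The hand-written ordering breaks down as soon as a language needs multi-character collation units or context-dependent rules, which is why it can only ever be a workaround.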

> string.letters is one workaround, and I don't think we should remove
> it until a better solution (or workaround) is available.

I disagree. The correct solution is to implement the Unicode support for locale-specific sorting.

Remember that the locale module supports only a single, global locale at a time. This renders it totally useless in many apps requiring locale support (such as web servers).
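A minimal sketch of the problem, assuming nothing beyond the standard locale and threading modules: a locale set in one thread is immediately visible in every other thread, because locale.setlocale changes process-wide state. The helper name set_collation_in_worker and the list of candidate locale names are assumptions made for this example; which names are installed varies by machine, so the code falls back to "C".

```python
import locale
import threading


def set_collation_in_worker(candidates):
    """Apply the first available locale from a worker thread; return its name."""
    applied = []

    def worker():
        for name in candidates:
            try:
                applied.append(locale.setlocale(locale.LC_COLLATE, name))
                return
            except locale.Error:
                continue  # this locale is not installed here; try the next one

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return applied[0] if applied else None


original = locale.setlocale(locale.LC_COLLATE)       # query only, no change
set_by_worker = set_collation_in_worker(["en_US.UTF-8", "C.UTF-8", "C"])
seen_by_main = locale.setlocale(locale.LC_COLLATE)   # query from the main thread
locale.setlocale(locale.LC_COLLATE, original)        # restore the previous setting

print("set in worker thread:", set_by_worker)
print("seen in main thread: ", seen_by_main)
```

In a server handling requests for users in different locales, one request's setlocale call would therefore change sorting and formatting for all requests running at the same time.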

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


