[Python-3000] string module trimming

Jim Jewett jimjjewett at gmail.com
Thu Apr 19 01:08:59 CEST 2007


On 4/18/07, Guido van Rossum <guido at python.org> wrote:

On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> Today, string.letters works most easily with ASCII supersets, and is effectively limited to 8-bit encodings. Once everything is unicode, I don't think that 8-bit restriction should apply any more.

But we already went over this. There are over 40K letters in Unicode. It simply makes no sense to have a string.letters approaching that size.

Agreed. But there aren't 40K (alphabetic) letters in any particular locale. Most individual languages will have fewer than 100.
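For a sense of scale, the size of the Unicode letter set can be estimated with the standard unicodedata module. This is only a sketch written for current Python 3: `count_letters` is a hypothetical helper name, "letter" is assumed to mean any code point whose general category starts with "L", and the exact total depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def count_letters():
    """Count code points whose general category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return sum(
        1
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("L")
    )


letter_count = count_letters()
print(letter_count)  # varies with the bundled Unicode database
```

Whatever the exact figure, it is far too large for a constant meant to be scanned or displayed, which is the objection being raised here.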

As a proxy for measuring "local" characters, I'll note that during some optimization drives for Pango (e.g., http://primates.ximian.com/~federico/news-2005-11.html#04 ) it turned out that there were only two non C-J-K languages that needed more than 256 cache positions in their character glyph tables.

> Unless I missed it (and I may have), unicode itself sort of ducks the question about how to sort strings. Python really needs to provide an answer, but I'm not sure it is possible to provide the (single) correct answer.

The Unicode standard certainly has a solution, but it is complicated and I don't believe it is currently implemented in core Python.

I guess you're right; I saw too many alternatives the last time I looked, and must have stopped reading http://unicode.org/reports/tr10/ after section 1, where it becomes obvious that there is no context-free right answer.
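To illustrate why plain string comparison is not an answer here, the sketch below contrasts code point order with a crude approximation. It assumes current Python 3; `rough_key` is a hypothetical helper, and it is not the collation algorithm from tr10, only accent stripping plus case folding, which is wrong for many languages.

```python
import unicodedata

# "\u00c4rger" is "Ärger" with a precomposed A-umlaut.
words = ["zebra", "\u00c4rger", "apple"]

# Default comparison is by code point, so U+00C4 sorts after "z".
by_code_point = sorted(words)


def rough_key(text):
    """Hypothetical helper: decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


approximate = sorted(words, key=rough_key)

print(by_code_point)  # ['apple', 'zebra', 'Ärger']
print(approximate)    # ['apple', 'Ärger', 'zebra']
```

The second ordering happens to look reasonable for German, but a language that treats an accented letter as a separate letter of its alphabet would expect something else, which is the context dependence described above.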

> string.letters is one workaround, and I don't think we should remove it until a better solution (or workaround) is available.

I disagree. The correct solution is to implement the Unicode support for locale-specific sorting.

And set-inclusion.

I'm not convinced that waiting for such a heavyweight solution is really the best choice, particularly since the spec itself warns against using the strictest forms (too inefficient).

Remember that the locale module supports only a single, global locale at a time. This renders it totally useless in many apps requiring locale support (such as web servers).

Fair enough.
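The single-global-locale limitation can be seen in a small sketch, assuming current Python 3. `temporary_collation` and `sort_for_locale` are hypothetical helpers, not part of the locale module, and the "C" locale is used only because it is always available. The locale setting is shared by the whole process, so this pattern is unsafe when several threads need different locales at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_collation(name):
    """Hypothetical helper: switch LC_COLLATE for the duration of a block."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        # Restore the previous setting; other threads saw the change meanwhile.
        locale.setlocale(locale.LC_COLLATE, saved)


def sort_for_locale(words, name):
    """Sort words using the collation rules of the named locale."""
    with temporary_collation(name):
        return sorted(words, key=locale.strxfrm)


print(sort_for_locale(["pear", "apple", "fig"], "C"))
```

A server handling requests for different locales concurrently would have to serialize every such call behind a lock, which is why a per-call collation object is a better fit than process-wide state.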

-jJ


