[Python-3000] string module trimming

Jim Jewett jimjjewett at gmail.com
Thu Apr 19 20:52:00 CEST 2007


On 4/19/07, Josiah Carlson <jcarlson at uci.edu> wrote:

"Jeffrey Yasskin" <jyasskin at gmail.com> wrote:
> On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 4/18/07, Guido van Rossum <guido at python.org> wrote:
> > > On 4/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> > Agreed. But there aren't 40K (alphabetic) letters in any particular
> > locale. Most individual languages will have less than 100.

> ... Are you saying that in your app, just because I've set
> the enUS locale, I won't be able to type "????"? Or that those
> characters won't be recognized as letters?

The latter. Some applications may reject them for that reason; for example some domain registrars have policies to prevent domain name spoofing with similar-looking characters. One way to do that is to say that a character used in a domain name (under that registrar) is limited to those letters used by the appropriate national language.

In terms of sorting, since all (unicode) strings should be comparable to one another, using the unicode-specified ordering would seem to make sense, unless it is something other than code point values.

It is definitely something other than code-point values.

In particular, see section 1.8 (common misconceptions) of http://unicode.org/reports/tr10/
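As a small illustration of why code-point order is not what people expect from sorting, here is a sketch in Python. The word list and the `rough_key` helper are made up for this example; `rough_key` is only a crude stand-in (strip accents, fold case) and is not the Unicode Collation Algorithm.

```python
import unicodedata

# Hypothetical sample data; "\u00e9" is a precomposed e-acute.
words = ["zebra", "Zulu", "apple", "\u00e9clair", "eclair"]

# Plain sorted() compares strings code point by code point, so every
# uppercase ASCII letter sorts before every lowercase one, and the
# accented letter sorts after "z".
by_code_point = sorted(words)


def rough_key(s):
    """Crude approximation only: drop combining marks and fold case.

    This is NOT the collation described in tr10; it merely shows that a
    usable order needs something other than raw code point values.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original string is kept as a tie-breaker so the order is total.
    return (base.casefold(), s)


by_rough_key = sorted(words, key=rough_key)

print(by_code_point)
print(by_rough_key)
```

The first list puts "Zulu" ahead of "apple" and the accented word after "zebra"; the second keeps the two spellings of "eclair" together, which is closer to what a reader would expect.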

The sorting isn't fully defined without locale-specific tailoring and a Unicode Collation Element Table (default 4 bytes/char, though compressible). There is a default tailoring and a Default Unicode Collation Element Table; it looks (but I haven't proven to myself) as if these defaults are sufficient for most uses, but certainly not all.

Unicode sorting (even with your own collation table) definitely requires normalization, which is something Python has been careful not to promise. (There were some arguments over whether normalization was even possible to do in a strictly correct fashion. I didn't understand them well enough to remember the summary.) Unless the "repertoire of supported character sequences" is (unnaturally) restricted, normalization is only an intermediate step; a third representation is constructed for the actual comparison. This third form can be done a few characters at a time, but then you have to redo it for the next comparison.
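To make the normalization point concrete, here is a minimal sketch using the standard `unicodedata` module. The helper name `canonically_equal` is invented for this example, and it only shows the normalization step, not the further collation-key step.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFD.

    Hypothetical helper for illustration; Python's own == does not do this.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two spellings render the same but differ as code point sequences.
print(composed == decomposed)                   # plain comparison
print(canonically_equal(composed, decomposed))  # comparison after NFD
```

The plain comparison treats these as different strings; only after normalization do they compare equal.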

As best I can easily tell about the default settings, there are distinct strings which are equal, unequal strings which are not ordered, and strings for which you must compare multiple characters at once ("x"<"y", but "xz">"yz").
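The multi-character case can be sketched with a toy collation. The example below assumes a made-up alphabet of lowercase ASCII letters in which the digraph "ch" is a single element sorting after "h", as in Czech; `ELEMENTS`, `WEIGHT` and `toy_key` are names invented here, not anything from tr10 or the standard library.

```python
# Toy alphabet: lowercase Latin letters, with "ch" treated as one
# collation element that sorts after "h" (the Czech convention).
ELEMENTS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {element: index for index, element in enumerate(ELEMENTS)}


def toy_key(s):
    """Build a list of weights, consuming "ch" as a single element.

    Only handles lowercase a-z; any other character raises KeyError.
    """
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[s[i]])
            i += 1
    return key


# "c" sorts before "d", yet "ch" sorts after "dh", because "ch" is
# weighed as a unit rather than as "c" followed by "h".
print(toy_key("c") < toy_key("d"))
print(toy_key("ch") > toy_key("dh"))
```

Here x is "c", y is "d" and z is "h": comparing the first character alone gives the wrong answer for the longer strings.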

-jJ
