[Python-Dev] PEP 393 Summer of Code Project

Terry Reedy tjreedy at udel.edu
Thu Sep 1 00:02:53 CEST 2011


On 8/31/2011 1:10 PM, Guido van Rossum wrote:

This is why I find the issue of Python, the language (and stdlib), as a whole "conforming to the Unicode standard" such a troublesome concept -- I think it is something that an application may claim, but the language should make much more modest claims, such as "the regular expression syntax supports features X, Y and Z from the Unicode recommendation XXX", or "the UTF-8 codec will never emit a sequence of bytes that is invalid according to Unicode specification YYY". (As long as the Unicode references are also versioned or dated.)

This will be a great improvement. It was both embarrassing and frustrating to have to respond to Tom C.'s (and others') issue with "Our unicode type is too vaguely documented to tell whether you are reporting a bug or making a feature request."

But if you can observe (valid) surrogate pairs it is still UTF-16. ... Ok, I dig this, to some extent. However saying it is UCS-2 is equally bad.

As I said on the tracker, our narrow builds are in-between (while moving closer to UTF-16), and both terms are deceptive, at least to some.

At the same time I think it would be useful if certain string operations like .lower() worked in such a way that if the input were valid UTF-16, then the output would also be, while if the input contained an invalid surrogate, the result would simply be something that is no worse (in particular, those are all mapped to themselves). We could even go further and have .lower() and friends look at graphemes (multi-code-point characters) if the Unicode std has a useful definition of e.g. lowercasing graphemes that differed from lowercasing code points.

An analogy is actually found in .lower() on 8-bit strings in Python 2: it assumes the string contains ASCII, and non-ASCII characters are mapped to themselves. If your string contains Latin-1 or EBCDIC or UTF-8 it will not do the right thing. But that doesn't mean strings cannot contain those encodings, it just means that the .lower() method is not useful if they do. (Why ASCII? Because that is the system encoding in Python 2.)

Good analogy.
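
As a rough illustration of that analogy in current terms (a sketch, assuming a CPython 3 interpreter where str holds code points): str.lower() maps a lone surrogate to itself, and bytes.lower() plays the role of the Python 2 8-bit string method by touching only ASCII letters. The sample strings are made up for the example.

```python
# A lone surrogate has no case mapping, so it passes through unchanged
# while the valid characters around it are lowercased.
mixed = "A\ud800B"
lowered = mixed.lower()

# bytes.lower() only knows ASCII: the Latin-1 byte 0xC9 (E-acute) is
# mapped to itself, while the ASCII letters are lowercased.
latin1_bytes = "\xc9COLE".encode("latin-1")
bytes_lowered = latin1_bytes.lower()

# The same text held as str is lowercased fully.
text_lowered = "\xc9COLE".lower()

print(ascii(lowered), bytes_lowered, ascii(text_lowered))
```

The bytes result is not wrong so much as not useful for Latin-1 data, which is the point of the analogy.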

Let's call those things graphemes (Tom C's term, I quite like leaving "character" ambiguous) -- they are sequences of multiple code points that represent a single "visual squiggle" (the kind of thing that you'd want to be swappable in vim with "xp" :-). I agree that APIs are needed to manipulate (match, generate, validate, mutilate, etc.) things at the grapheme level. I don't agree that this means a separate data type is required.

I presume by 'separate data type' you mean a base-level builtin class like int or str, and that you would allow for wrapper classes built on top of str, since such wrappers are not really 'separate'. For the grapheme level and higher, we should certainly start with wrappers, and probably with alternate versions based on different strategies.
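
One possible shape for such a wrapper strategy, as a deliberately crude sketch: group each base code point with the combining marks that follow it, using unicodedata.combining(). This is not the Unicode grapheme cluster algorithm (UAX #29), only an approximation for illustration, and the name simple_clusters is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Yield a base code point together with its trailing combining marks.

    Crude approximation only: real grapheme segmentation has many more rules.
    """
    cluster = ""
    for ch in text:
        # A non-combining code point starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


# "e" + COMBINING ACUTE ACCENT is two code points but one visual unit.
word = "e\u0301te\u0301"
clusters = list(simple_clusters(word))
print(len(word), len(clusters))
```

Because the function only reads a plain str and yields plain str pieces, no new base-level type is involved; a different strategy could be swapped in behind the same interface.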

There are ever-larger units of information encoded in text strings, with ever farther-reaching (and more vague) requirements on valid sequences. Do you want to have a data type that can represent (only valid) words in a language? Sentences? Novels? ... I think that at this point in time the best we can do is claim that Python (the language standard) uses either 16-bit code units or 21-bit code points in its string datatype, and that, thanks to PEP 393, CPython 3.3 and later will always use 21-bit code points (but Jython and IronPython may forever use their platform's native 16-bit-code-unit string type). And then we add APIs that can be used everywhere to look for code points (even if the string contains code units), graphemes, or larger constructs. I'd like those APIs to be designed using a garbage-in-garbage-out principle, where if the input conforms to some Unicode requirement, the output does too, but if the input doesn't, the output does what makes most sense. Validation is then limited to codecs, and optional calls.
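
A minimal sketch of that split between garbage-in-garbage-out operations and validation at the codec boundary, assuming CPython 3.3 or later. The helper name is_encodable is hypothetical; it simply asks the strict UTF-8 codec, which rejects any surrogate code point.

```python
def is_encodable(text):
    """Optional validation call: True if strict UTF-8 accepts the string."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# chr() happily creates a lone surrogate; no validation happens here.
garbage = "abc" + chr(0xDC80)

# Slicing and case mapping do not validate either: garbage in, garbage out.
sliced = garbage[1:]
uppered = garbage.upper()

print(is_encodable("abc"), is_encodable(garbage))
```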

If you index or slice a string, or create a string from chr() of a surrogate or from some other value that the Unicode standard considers an illegal code point, you better know what you are doing. I want chr(i) to be valid for all values of i in range(2**21),

Actually, it is range(0x110000) == range(1114112) so that UTF-8 uses at most 4 bytes per code point. 21 bits is log2(1114112), about 20.1 bits, rounded up.

so it can be used to create a lone surrogate, or (on systems with 16-bit "characters") a surrogate pair. And also ord(chr(i)) == i for all i in range(2**21).

for i in range(0x110000):  # 1114112
    if ord(chr(i)) != i:
        print(i)

prints nothing (on Windows)
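
The limits mentioned above can be checked the same way. This is a small sketch assuming CPython 3.3 or later, where sys.maxunicode is always 0x10FFFF; the variable names are only for illustration.

```python
import sys

LIMIT = 0x110000  # 1114112, one past the last code point U+10FFFF

max_code_point = sys.maxunicode
bits_needed = (LIMIT - 1).bit_length()

# UTF-8 length at the top of each encoding range: 1, 2, 3 and 4 bytes.
utf8_lengths = [len(chr(i).encode("utf-8"))
                for i in (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)]

# chr() rejects anything at or beyond the limit, even though it fits in 21 bits.
try:
    chr(LIMIT)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

print(hex(max_code_point), bits_needed, utf8_lengths, out_of_range_ok)
```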

I'm not sure about ord() on a 2-character string containing a surrogate pair on systems where strings contain 21-bit code points; I think it should be an error there, just as ord() on other strings of length != 1. But on systems with 16-bit "characters", ord() of strings of length 2 containing a valid surrogate pair should work.

And now does, thanks to whoever fixed this (within the last year, I think).
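
For reference, what ord() has to do with a valid surrogate pair on a 16-bit build is the standard UTF-16 arithmetic. The sketch below spells it out with a hypothetical helper, combine_surrogates, and assumes it runs on CPython 3.3 or later, where the two surrogates are two separate code points and ord() on the 2-character string raises TypeError.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair."""
    hi, lo = ord(high), ord(low)
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


pair = "\ud83d\ude00"  # the UTF-16 code units for U+1F600
code_point = combine_surrogates(pair[0], pair[1])
print(hex(code_point))
```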

-- Terry Jan Reedy


