[Python-Dev] PEP 393 Summer of Code Project

Guido van Rossum guido at python.org
Wed Aug 31 19:20:19 CEST 2011


On Wed, Aug 31, 2011 at 1:09 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:

The str type itself can presently be used to process other character encodings: if they are fixed width < 32-bit elements those encodings might be considered Unicode encodings, but there is no requirement that they are, and some operations on str may operate with knowledge of some Unicode semantics, so there are caveats.

Actually, the str type in Python 3 and the unicode type in Python 2 are constrained everywhere to either 16-bit or 21-bit "characters". (Except when writing C code, which can do any number of invalid things so is the equivalent of assuming 1 == 0.) In particular, on a wide build, there is no way to get a code point >= 2**21, and I don't want PEP 393 to change this. So at best we can use these types to represent arrays of 21-bit unsigned ints. But I think it is more useful to think of them as always representing "some form of Unicode", whether that is UTF-16 (on narrow builds) or 21-bit code points or perhaps some vaguely similar superset -- but for those code units/code points that are representable and valid (either code points or code units) according to (the supported version of) the Unicode standard, the meaning of those code points/units matches that of the standard.
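
The range constraint described above can be checked from Python itself. The following is a minimal sketch, assuming CPython 3.3 or later (where PEP 393 has landed and there is no narrow/wide distinction any more); the names MAX_CODE_POINT and is_valid_code_point are made up for this illustration and are not part of any API.

```python
import sys

# Assumption: CPython 3.3+ (PEP 393). On older narrow builds,
# sys.maxunicode was 0xFFFF instead.
MAX_CODE_POINT = sys.maxunicode


def is_valid_code_point(value):
    """Return True if chr() accepts value as a code point (hypothetical helper)."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        return False
    return True


# The largest code point is a single element of a str.
top = chr(MAX_CODE_POINT)

# Every value a str element can hold fits in 21 bits.
fits_in_21_bits = MAX_CODE_POINT.bit_length() <= 21
```

Under these assumptions the actual ceiling is 0x10FFFF, which is below 2**21, so anything from 0x110000 upward is rejected by chr() and cannot end up in a str from pure Python code.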

Note that this is different from the bytes type, where the meaning of a byte is entirely determined by what it means in the programmer's head.
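
To illustrate the contrast with bytes, here is a small sketch, assuming Python 3; the variable names are chosen only for this example. The same two bytes yield different results depending on the interpretation the programmer picks, whereas an element of a str has one fixed meaning as a code point.

```python
# The same two bytes, read three different ways.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")        # one character: U+00E9
as_latin1 = raw.decode("latin-1")    # two characters: U+00C3, U+00A9
as_int = int.from_bytes(raw, "big")  # a 16-bit big-endian integer

# A str element carries its meaning with it: this is always U+00E9,
# LATIN SMALL LETTER E WITH ACUTE, with no decoding step involved.
text = "\u00e9"
code_point = ord(text)
```

Nothing in the bytes object records which of these readings was intended; that knowledge lives only in the surrounding program.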

--Guido van Rossum (python.org/~guido)


