[Python-Dev] PEP 393 Summer of Code Project

Guido van Rossum guido at python.org
Thu Aug 25 04:33:51 CEST 2011


On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> Guido van Rossum writes:
>
>  > I see nothing wrong with having the language's fundamental data types
>  > (i.e., the unicode object, and even the re module) to be defined in
>  > terms of codepoints, not characters, and I see nothing wrong with
>  > len() returning the number of codepoints (as long as it is advertised
>  > as such).
>
> In fact, the Unicode Standard, Version 6, goes farther (to code units):
>
>     2.7  Unicode Strings
>
>     A Unicode string data type is simply an ordered sequence of code
>     units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
>     code units, a Unicode 16-bit string is an ordered sequence of
>     16-bit code units, and a Unicode 32-bit string is an ordered
>     sequence of 32-bit code units.
>
>     Depending on the programming environment, a Unicode string may or
>     may not be required to be in the corresponding Unicode encoding
>     form. For example, strings in Java, C#, or ECMAScript are Unicode
>     16-bit strings, but are not necessarily well-formed UTF-16
>     sequences.
>
> (p. 32).

I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.)
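
For illustration, the difference between counting code points and counting code units can be sketched roughly as follows. This is a minimal sketch, not part of the original exchange: the sample strings are made up, and it assumes Python 3.3 or later, where len() on a str counts code points regardless of build.

```python
# Illustrative sketch only; the sample strings are assumptions, not from the thread.
s = "a\U0001F600"  # 'a' followed by one non-BMP code point (U+1F600)

code_points = len(s)                            # code points (Python 3.3+)
utf8_units = len(s.encode("utf-8"))             # 8-bit code units
utf16_units = len(s.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(s.encode("utf-32-le")) // 4   # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)

# A str may hold a lone surrogate, which is a sequence of code points
# but cannot be encoded as well-formed UTF-16.
lone = "\ud800"
try:
    lone.encode("utf-16-le")
    well_formed = True
except UnicodeEncodeError:
    well_formed = False

# The "surrogatepass" error handler emits the 16-bit unit anyway,
# yielding a 16-bit string that is not a well-formed UTF-16 sequence.
raw_units = lone.encode("utf-16-le", errors="surrogatepass")
print(well_formed, raw_units)
```

The same text has a different length depending on whether one counts code points or 8-, 16-, or 32-bit code units, which is why it matters what len() is advertised as returning.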

-- 
--Guido van Rossum (python.org/~guido)


