[Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum guido at python.org
Thu Aug 25 04:33:51 CEST 2011
On Wed, Aug 24, 2011 at 5:36 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
Guido van Rossum writes:
 > I see nothing wrong with having the language's fundamental data types
 > (i.e., the unicode object, and even the re module) to be defined in
 > terms of codepoints, not characters, and I see nothing wrong with
 > len() returning the number of codepoints (as long as it is advertised
 > as such).

In fact, the Unicode Standard, Version 6, goes farther (to code units):

    2.7 Unicode Strings

    A Unicode string data type is simply an ordered sequence of code
    units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit
    code units, a Unicode 16-bit string is an ordered sequence of
    16-bit code units, and a Unicode 32-bit string is an ordered
    sequence of 32-bit code units.

    Depending on the programming environment, a Unicode string may or
    may not be required to be in the corresponding Unicode encoding
    form. For example, strings in Java, C#, or ECMAScript are Unicode
    16-bit strings, but are not necessarily well-formed UTF-16
    sequences.

(p. 32).
I am assuming that that definition only applies to use of the term "unicode string" within the standard and has no bearing on how programming languages are allowed to use the term, as that would be preposterous. (They can define what they mean by terms like well-formed and conforming etc., and I won't try to go against that. But limiting what can be called a unicode string feels like unproductive coddling.)
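
To make the terms in the quoted passage concrete, here is a minimal sketch of how code points and code units can differ for the same string. It assumes Python 3.3 or later, where len() on a str counts code points; the helper name count_units is hypothetical and exists only for this illustration.

```python
def count_units(s):
    """Return (code points, 8-bit units, 16-bit units, 32-bit units) for s.

    Hypothetical helper for illustration; assumes Python 3.3 or later.
    """
    return (
        len(s),                           # code points
        len(s.encode("utf-8")),           # 8-bit code units (UTF-8)
        len(s.encode("utf-16-le")) // 2,  # 16-bit code units (UTF-16)
        len(s.encode("utf-32-le")) // 4,  # 32-bit code units (UTF-32)
    )


bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"  # U+1F600, outside the BMP
combining = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT, one perceived character

print(count_units(bmp))        # (1, 2, 1, 1)
print(count_units(astral))     # (1, 4, 2, 1)
print(count_units(combining))  # (2, 3, 2, 2)
```

The last case shows the codepoints-versus-characters gap: two code points that a reader would perceive as a single character.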
--
--Guido van Rossum (python.org/~guido)