[Python-Dev] UCS2/UCS4 default

Guido van Rossum guido at python.org
Fri Jul 4 00:21:46 CEST 2008


On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <rhamph at gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <tjreedy at udel.edu> wrote:
>> The premise is the OP's idea that Python should switch to all UCS4 to create a more pure ('ideal') situation or the idea that len(s) should count codepoints (correct term?) for all builds as a matter of purity even though it would be time-costly on 16-bit builds as a matter of practicality.
>
> Wrong term - code units and code points are equivalent in UTF-16 and UTF-32. What you're looking for is unicode scalar values.

I don't think so. I have in my lap the Unicode 5.0 standard, which on page 102, under UTF-16, states (amongst others):

"""

In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302.
Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range D800[16]..DFFF[16] are ill-formed. """

From this I understand they distinguish carefully between code points and code units -- D800 is a code unit but not a code point, 10302 is a code point but not a (UTF-16) code unit.
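The standard's example can be reproduced with a short sketch. It assumes Python 3.3 or later, where a str is a sequence of code points regardless of build; the helper name utf16_code_units is hypothetical and not part of any library.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in the UTF-16 encoding form.

    Hypothetical helper for illustration; assumes Python 3.3+.
    """
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The code point sequence from the quoted passage.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
text = "".join(chr(cp) for cp in code_points)

units = utf16_code_units(text)
print(" ".join(f"{u:04X}" for u in units))
```

Four code points go in and five code units come out, because U+10302 lies outside the 16-bit range and needs a surrogate pair.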

OTOH outside the context of UTF-8, the surrogates are also referred to as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of Code Points").

I think the best thing we can do is to use "code points" to refer to characters and "code units" to the individual 16-bit values in the UTF-16 encoding; this seems compatible with usage elsewhere in this thread by most folks.
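With that terminology, the cost Terry mentions can be sketched as follows: on a build that stores 16-bit code units, counting code points needs a linear scan. This is only an illustration that assumes well-formed UTF-16 input; count_code_points is a hypothetical name, not an existing API.

```python
def count_code_points(units):
    """Count code points in a well-formed sequence of UTF-16 code units.

    Hypothetical helper; assumes every surrogate is properly paired.
    """
    count = 0
    for unit in units:
        # A low (trail) surrogate completes the pair started by the
        # previous unit, so it does not begin a new code point.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


units = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]
print(len(units), count_code_points(units))
```

Taking the number of code units is a constant-time length lookup, while the code point count has to visit every unit.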

Also see http://unicode.org/glossary/:

"""
Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4, Characters and Encoding.)
. . .
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
"""
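As a rough illustration of the glossary entry, the sketch below counts how many code units one code point occupies in each encoding form. It assumes Python 3.3 or later; CODE_UNIT_BYTES and code_unit_count are hypothetical names introduced here.

```python
# Bytes per code unit for each encoding form (8-, 16- and 32-bit units).
# The explicit little-endian codecs are used so no byte order mark is added.
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_unit_count(text, codec):
    """Number of code units text occupies in the given encoding form."""
    return len(text.encode(codec)) // CODE_UNIT_BYTES[codec]


ch = "\U00010302"  # the single code point U+10302
for codec in CODE_UNIT_BYTES:
    print(codec, code_unit_count(ch, codec))
```

The same code point takes a different number of code units depending on the encoding form, which is why the two terms need to be kept apart.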

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
