[Python-Dev] UCS2/UCS4 default
Guido van Rossum guido at python.org
Fri Jul 4 00:21:46 CEST 2008
- Previous message: [Python-Dev] UCS2/UCS4 default
- Next message: [Python-Dev] UCS2/UCS4 default
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <rhamph at gmail.com> wrote:
> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <tjreedy at udel.edu> wrote:
>> The premise is the OP's idea that Python should switch to all UCS4 to create a more pure ('ideal') situation or the idea that len(s) should count codepoints (correct term?) for all builds as a matter of purity even though it would be time-costly on 16-bit builds as a matter of practicality.
>
> Wrong term - code units and code points are equivalent in UTF-16 and UTF-32. What you're looking for is unicode scalar values.
I don't think so. I have in my lap the Unicode 5.0 standard, which on page 102, under UTF-16, states (amongst others):
"""
In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302.
Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range D800[16]..DFFF[16] are ill-formed. """
From this I understand they distinguish carefully between code points and code units -- D800 is a code unit but not a code point, 10302 is a code point but not a (UTF-16) code unit.
OTOH outside the context of UTF-16, the surrogates are also referred to as "reserved code points" (e.g. in Table 2-3 on page 27, "Types of Code Points").
I think the best thing we can do is to use "code points" to refer to characters and "code units" to the individual 16-bit values in the UTF-16 encoding; this seems compatible with usage elsewhere in this thread by most folks.
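To make the terminology concrete, here is a minimal sketch that reproduces the example quoted from the standard. It assumes a Python 3.3+ interpreter, where a str is indexed by code point on every build; the helper name utf16_code_units is made up for illustration and is not part of any library.

```python
import struct

# The code point sequence from the quoted example: <004D, 0430, 4E8C, 10302>.
code_points = [0x004D, 0x0430, 0x4E8C, 0x10302]
text = "".join(chr(cp) for cp in code_points)


def utf16_code_units(s):
    """Return the 16-bit code units of s in the UTF-16 encoding form.

    Hypothetical helper: encodes as big-endian UTF-16 without a BOM and
    unpacks the bytes as unsigned 16-bit integers.
    """
    data = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


units = utf16_code_units(text)

print(len(text))                      # number of code points: 4
print(len(units))                     # number of UTF-16 code units: 5
print(" ".join("%04X" % u for u in units))  # 004D 0430 4E8C D800 DF02

# An isolated surrogate is not a Unicode scalar value, so the codec
# refuses to encode it (assumption: default "strict" error handling).
try:
    "\ud800".encode("utf-16-be")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
print(lone_surrogate_rejected)        # True
```

Under these assumptions len(text) counts code points, while len(units) is what a 16-bit build would report for the same string.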
Also see http://unicode.org/glossary/:
"""
Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF[16]. (See definition D10 in Section 3.4, Characters and Encoding.)
. . .
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
"""
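As a rough illustration of the glossary's three code unit sizes, the sketch below counts code units for the same four code points in each encoding form. It assumes Python 3; count_code_units is a hypothetical helper, and the "-le" codec variants are chosen only so that no byte order mark is added to the output.

```python
def count_code_units(s, encoding, unit_bytes):
    """Number of code units of s in the given encoding form.

    Hypothetical helper: unit_bytes is the code unit width in bytes
    (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).
    """
    return len(s.encode(encoding)) // unit_bytes


# U+004D, U+0430, U+4E8C, U+10302: four code points.
s = "M\u0430\u4e8c\U00010302"

utf8_units = count_code_units(s, "utf-8", 1)
utf16_units = count_code_units(s, "utf-16-le", 2)
utf32_units = count_code_units(s, "utf-32-le", 4)

print(utf8_units)   # 10 (1 + 2 + 3 + 4 eight-bit units)
print(utf16_units)  # 5  (three single units plus one surrogate pair)
print(utf32_units)  # 4  (one 32-bit unit per code point)
```

Only in UTF-32 does the code unit count match the code point count for every string.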
--
--Guido van Rossum (home page: http://www.python.org/~guido/)