[Python-Dev] len(chr(i)) = 2?
"Martin v. Löwis" martin at v.loewis.de
Mon Nov 22 12:43:00 CET 2010
On 22.11.2010 11:48, Stephen J. Turnbull wrote:
> Raymond Hettinger writes:
>  > Neither UTF-16 nor UCS-2 is exactly correct anyway.
>
> From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear that UTF-16 was intentionally designed so that Python-style processing could be done in a UCS-2 context.
I could only find the FCD of ISO 10646:2010, where Annex H was integrated into section 10:
http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf
There they have stopped using the term UCS-2, and added a note:

    NOTE – Former editions of this standard included references to a
    two-octet BMP form called UCS-2 which would be a subset of the
    UTF-16 encoding form restricted to the BMP UCS scalar values.
    The UCS-2 form is deprecated.
I think they are now acknowledging that UCS-2 was a misleading term: it is ambiguous whether it refers to a CCS (coded character set), a CEF (character encoding form), or a CES (character encoding scheme); like "ASCII", people have been using it for all three.
Apparently, the ISO WG interprets earlier revisions as saying that UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT PYTHON DOES. In a narrow Python build, the character set is not restricted to the BMP. Instead, Unicode strings are meant to be interpreted (by applications) as UTF-16.
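To make the point concrete (this is my sketch, not part of the original exchange; it runs on any wide build, or any Python >= 3.3): a scalar value outside the BMP is represented in UTF-16 as a surrogate pair, which is why len(chr(i)) == 2 on a narrow build. The pair can be computed directly from the UTF-16 algorithm and checked against the utf-16-le codec:

```python
import struct

cp = 0x1F600                      # a scalar value outside the BMP
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate

# A narrow build stored chr(cp) as these two 16-bit code units,
# hence len() == 2. Verify against Python's UTF-16 codec:
encoded = chr(cp).encode("utf-16-le")
units = struct.unpack("<2H", encoded)
assert units == (high, low)       # (0xD83D, 0xDE00)
```

A narrow build's string is thus exactly the UTF-16 code-unit sequence, not a BMP-restricted repertoire.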
>  > For the "wide" build, the entire range of unicode is encoded at
>  > 4 bytes per character and slicing/len operate correctly since
>  > every character is the same length. This used to be called UCS-4
>  > and is now UTF-32.
> That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy the range restrictions of a UTF.
Not sure what it says in your copy; in mine, section 9.3 says:

    9.3 UTF-32 (UCS-4)

    UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS
    scalar value to a single unsigned 32-bit code unit. The terms UTF-32
    and UCS-4 can be used interchangeably to designate this encoding
    form.
so they (now) view the two as synonyms.
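The "one scalar value, one 32-bit code unit" property is easy to observe (again my own illustration, runnable on any wide build or Python >= 3.3): the UTF-32 encoding of a string is always exactly 4 bytes per code point, which is why indexing and len() are trivially correct on a wide build.

```python
import struct

s = "a\u00e9\U0001F600"           # BMP and non-BMP scalars mixed
data = s.encode("utf-32-le")

# One unsigned 32-bit code unit per scalar value:
assert len(data) == 4 * len(s)

units = struct.unpack("<%dI" % len(s), data)
assert units == (0x61, 0xE9, 0x1F600)
```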
I think that when ISO 10646 started, they were also fairly confused about these issues (as the group/plane/row/cell structure demonstrates, IMO). This is not surprising, since the notion of byte-based character sets had been ingrained for so long. It took 20 years to learn that a UCS scalar value really is not a sequence of bytes, but a natural number.
> However, I don't see how "narrow" tells us more than "UCS-2" does. If "UCS-2" is equally (or more) informative, I prefer it because it is the technically precise, already well-defined, term.
But it's not. It is a confusing term, one that the relevant standards bodies are abandoning. After reading FCD 10646:2010, I could agree to call the two implementations UTF-16 and UTF-32 (as these terms designate CEFs). Unfortunately, they also designate CESs.
> If we have to document what the terms we choose mean anyway, why not document the existing terms and reduce entropy, rather than invent new ones and increase entropy?
Because the proposed existing term is deprecated.
Regards, Martin