[Python-Dev] len(chr(i)) = 2?
Stephen J. Turnbull stephen at xemacs.org
Tue Nov 23 16:00:22 CET 2010
If you don't care about the ISO standard, but only about Python, Martin's right, I was wrong. You can stop reading now.
"Martin v. Löwis" writes:
> I could only find the FCD of 10646:2010, where annex H was integrated into section 10:
Thank you for the reference.
I referred to two older versions: 10646-1:1993 (for the annexes and Amendment, and my basic understanding) and 10646:2003 (for the detailed definition of UCS-2 in Sections 7, 8 and 13; unfortunately, I missed the most important detail, which is in Section 9). In :2003, the annex I referred to as "Annex H" is Annex J, and "Annex Q" is partly in Section 9.1 and mostly in Annex C. I don't know where the former is in the 2010 FCD; the latter is Section 9.2.
> I think they are now acknowledging that UCS-2 was a misleading term, making it ambiguous whether this refers to a CCS, a CEF, or a CES; like "ASCII", people have been using it for all three of them.
In :1993 it wasn't ambiguous; they simply didn't make those distinctions. They were not needed for ISO 10646's published versions, although they certainly are for Unicode.
Now, quite clearly, the ISO has changed the definition in every new version, progressively adding new restrictions that go beyond clarifying ambiguity. But even in :2003, in view of 4.2, 6.2, 6.3, and 13.1, UCS-2 is clearly well-defined as a CM (character map) in the sense of UTR#17, which can probably be identified with "CCS" in :2003 terminology. I.e., returning to UTR#17 terminology, it is the composition of a CES, a CEF, and a CCS, which are not defined individually. Note: the definition of "coded character" changed between :2003 and the 2010 FCD, from "character with representation" to "character with integer".
There is a NOTE indicating that 16-bit integers may be used in processing. Given that this is a non-normative note, I take it to mean that in an array of 16-bit integers, "most significant octet" is to be interpreted in the natural way for the architecture rather than by the representation in memory, which might be little-endian. IMO it's unnatural to think that that changes the definition of UCS-2 to be either a CEF or a composition of a CEF and a CCS.
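To make the octet-order point concrete, here's a tiny sketch (mine, not from the standard): a 16-bit code unit held as an integer has a perfectly well-defined most significant octet no matter how the hardware lays it out; byte order only appears once you serialize to an octet stream.

import struct

code_unit = 0x00E9   # U+00E9, a BMP code position held as a 16-bit integer

# As an integer, the "most significant octet" is just the high-order byte,
# independent of the machine's memory layout.
msb, lsb = (code_unit >> 8) & 0xFF, code_unit & 0xFF
print(hex(msb), hex(lsb))             # 0x0 0xe9

# Only the serialized octet stream exposes byte order.
print(struct.pack('>H', code_unit))   # b'\x00\xe9'  (most significant octet first)
print(struct.pack('<H', code_unit))   # b'\xe9\x00'  (least significant octet first)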
> Apparently, the ISO WG interprets earlier revisions as saying that UCS-2 is a CEF that restricted UTF-16 to the BMP.
I think that ISO 10646-1:1993 admits only one interpretation, a CM restricted to the BMP (including surrogates), and ISO 10646:2003 admits only one interpretation, a CM restricted to the BMP (not including surrogates). The note under Table 4 on p.24 of the FCD is, uh, well, a lie. Earlier versions certainly did not restrict to "scalar values"; they had no such concept.
> THIS IS NOT WHAT PYTHON DOES.
Well, no shit, Sherlock. You don't have to yell at me; I know what Python does. The question is, what does UCS-2 do? The answer is that in :1993, AFAICT, it did what Python does. In :2003, they added (last sentence, Section 9.1):
    UCS-2 cannot be used to represent any characters on the
    supplementary planes.
I assume they maintain that position in 2010, so End Of Thread.
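For concreteness, what a narrow Python build does with a supplementary-plane code point is just the standard UTF-16 surrogate-pair arithmetic, which is exactly what that sentence says UCS-2 cannot express. A minimal sketch:

def to_surrogate_pair(cp):
    """Map a supplementary-plane code point to its UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # lead (high) surrogate
    low = 0xDC00 + (cp & 0x3FF)    # trail (low) surrogate
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF -> (0xD834, 0xDD1E)
print([hex(u) for u in to_surrogate_pair(0x1D11E)])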
I apologize for missing that sentence when I was reviewing the standard earlier, but I expected restrictions on UCS-2 to be explained in Section 13.1 or perhaps Section 14. And 13.1 simply requires that characters in the BMP be represented by their defined code positions, truncated to two octets. Like earlier versions, it doesn't prohibit the use of surrogates or say that non-BMP characters can't be represented.
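Read literally, that rule is nothing more than keeping the low two octets of the code position, and nothing in its wording singles out surrogate code positions. A sketch of that literal reading (mine, not the standard's):

def ucs2_unit(code_position):
    # "represented by their defined code positions, truncated to two octets"
    return code_position & 0xFFFF

print(hex(ucs2_unit(0x000000E9)))   # 0xe9    an ordinary BMP character
print(hex(ucs2_unit(0x0000D834)))   # 0xd834  a surrogate code position; 13.1 itself doesn't exclude it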
> Not sure what it says in your copy; in mine, section 9.3 says
[snip]
Mine (:2003) says "NOTE 2 - When confined to the code positions in Planes 00 to 10, UCS-4 is also referred to as UCS Transformation Format 32 (UTF-32)." Then it references the Unicode Standard (v4.0) as the authority for UTF-32. Obviously they continued to be confused at this point in time; by the time of the draft you have, the WG had apparently decided to synchronize the whole standard, more or less completely, with a subset of Unicode. This seems pointless to me (unlike, say, the work that has been done on standardizing criteria for repertoire changes).
In particular, the :1993 definition of UCS-2 was a perfectly good standard for describing the processing Python actually does internally. The current definition of UCS-2 as identical to the BMP is useless; good riddance, and I'm perfectly happy to have them deprecate it.
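And indeed the observable behavior that started this thread falls straight out of that internal representation on a narrow build (a wide build reports length 1):

# On a narrow (16-bit) build, chr() of a supplementary-plane code point
# gives a string of two surrogate code units, hence the subject line.
s = chr(0x1D11E)                     # MUSICAL SYMBOL G CLEF, outside the BMP
print(len(s))                        # 2 on a narrow build, 1 on a wide build
print([hex(ord(c)) for c in s])      # ['0xd834', '0xdd1e'] on a narrow build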