[Python-Dev] len(chr(i)) = 2? (original) (raw)
M.-A. Lemburg mal at egenix.com
Mon Nov 22 13:47:29 CET 2010
- Previous message: [Python-Dev] len(chr(i)) = 2?
- Next message: [Python-Dev] len(chr(i)) = 2?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Martin,
it is really irrelevant whether the standards have decided to no longer use the terms UCS-2 and UCS-4 in their latest standard documents.
The definitions still stand (just like Unicode 2.0 is still a valid standard, even if it's ten years old):
UCS-2 is defined as "Universal Character Set coded in 2 octets" by ISO 10464: (see http://www.unicode.org/versions/Unicode5.2.0/appC.pdf)
UCS-4 is defined as "Universal Character Set coded in 4 octets" by ISO 10464.
Those two terms have been in use for many years. They refer to the Unicode character set as it can be represented in 2 or 4 bytes. As such they don't include any of the special meanings associated with the UTF transfer encodings. There are no invalid sequences, no invalid code points, etc. as you can find in the UTF encodings. And that's an important detail.
If you interpret them as encodings, they are 1-1 mappings of Unicode code point ordinals to integers represented using 2 or 4 bytes.
UCS-2 only supports BMP code points and can conveniently be interpreted as UTF-16, if you need to encode non-BMP code points (which we do in the UTF codecs).
UCS-4 also supports non-BMP code points directly.
Now, from a ISO or Unicode Consortium point of view, deprecating the term UCS-2 in their standard papers is only natural, since they are actively starting to assign non-BMP code points which cannot be represented in UCS-2.
However, this deprecation is only relevant for the purpose of defining the standard. The above definitions are still useful when it comes to defining code units, i.e. the used storage format, (as opposed to the transfer format).
For the purpose of describing the code units we are using in Python they are (still) the most correct terms and that's also the reason why we chose to use them when introducing the configure options in Python2.
There are no other accurate definitions we could use. The terms "narrow" and "wide" are simply too inaccurate to be used as description of UCS-2 and UCS-4 code units.
Please also note that we have used the terms UCS-2 and UCS-4 in Python2 for 9+ years now and users are just starting to learn the difference and get acquainted with the fact that Python uses these two forms.
Confronting them with "narrow" and "wide" builds is only going to cause more confusion, not less, and adding those strings to Python package files isn't going to help much either, since the terms don't convey any relationship to Unicode:
package-3.1.3.linux-x86_64-py2.6_ucs2.egg vs. package-3.1.3.linux-x86_64-py2.6_narrow.egg
I opt for switching to the following config options:
--with-unicode=ucs2 (default) --with-unicode=ucs4
and using "UCS-2" and "UCS-4" in the Python documentation when describing the two different build modes. We can add glossary entries for the two which clarify the differences.
Python2 used --enable-unicode=ucs2/ucs4, but since Python3 doesn't build without Unicode support, the above two versions appear more appropriate.
We can keep the alternative --with-wide-unicode as an alias for --with-unicode=ucs4 to maintain 3.x backwards compatibility.
Cheers,
Marc-Andre Lemburg eGenix.com
Professional Python Services directly from the Source (#1, Nov 22 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
"Martin v. Löwis" wrote:
Am 22.11.2010 11:48, schrieb Stephen J. Turnbull:
Raymond Hettinger writes:
> Neither UTF-16 nor UCS-2 is exactly correct anyway. >From a standards lawyer point of view, UCS-2 is exactly correct, as far as I can tell upon rereading ISO 10646-1, especially Annexes H ("retransmitting devices") and Q ("UTF-16"). Annex Q makes it clear that UTF-16 was intentionally designed so that Python-style processing could be done in a UCS-2 context. I could only find the FCD of 10646:2010, where annex H was integrated into section 10: http://www.itscj.ipsj.or.jp/sc2/open/02n4125/FCD10646-Main.pdf There they have stopped using the term UCS-2, and added a note # NOTE – Former editions of this standard included references to a # two-octet BMP form called UCS-2 which would be a subset # of the UTF-16 encoding form restricted to the BMP UCS scalar values. # The UCS-2 form is deprecated. I think they are now acknowledging that UCS-2 was a misleading term, making it ambiguous whether this refers to a CCS, a CEF, or a CES; like "ASCII", people have been using it for all three of them. Apparently, the ISO WG interprets earlier revisions as saying that UCS-2 is a CEF that restricted UTF-16 to the BMP. THIS IS NOT WHAT PYTHON DOES. In a narrow Python build, the character set is not restricted to the BMP. Instead, Unicode strings are meant to be interpreted (by applications) as UTF-16. > For the "wide" build, the entire range of unicode is encoded at > 4 bytes per character and slicing/len operate correctly since > every character is the same length. This used to be called UCS-4 > and is now UTF-32. That's inaccurate, I believe. UCS-4 is not a UTF, and doesn't satisfy the range restrictions of a UTF. Not sure what it says in your copy; in mine, section 9.3 says # 9.3 UTF-32 (UCS-4) # UTF-32 (or UCS-4) is the UCS encoding form that assigns each UCS # scalar value to a single unsigned 32-bit code unit. The terms UTF-32 # and UCS-4 can be used interchangeably to designate this encoding # form. so they (now) view the two as synonyms. I think that when ISO 10646 started, they were also fairly confused about these issues (as the group/plane/row/cell structure demonstrates, IMO). This is not surprising, since the notion of byte-based character sets had been ingrained for so long. It took 20 years to learn that a UCS scalar value really is not a sequence of bytes, but a natural number. However, I don't see how "narrow" tells us more than "UCS-2" does. If "UCS-2" is equally (or more) informative, I prefer it because it is the technically precise, already well-defined, term. But it's not. It is a confusing term, one that the relevant standards bodies are abandoning. After reading FCD 10646:2010, I could agree to call the two implementations UTF-16 and UTF-32 (as these terms designate CEFs). Unfortunately, they also designate CESs. If we have to document what the terms we choose mean anyway, why not document the existing terms and reduce entropy, rather than invent new ones and increase entropy? Because the proposed existing term is deprecated. Regards, Martin
- Previous message: [Python-Dev] len(chr(i)) = 2?
- Next message: [Python-Dev] len(chr(i)) = 2?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]