[Python-Dev] len(chr(i)) = 2?
M.-A. Lemburg mal at egenix.com
Fri Nov 19 23:25:03 CET 2010
Victor Stinner wrote:
> Hi,
> On Friday 19 November 2010 17:53:58 Alexander Belopolsky wrote:
>> I was recently surprised to learn that chr(i) can produce a string of length 2 in python 3.x.
> Yes, but only on a narrow build. E.g. Debian and Ubuntu compile Python 3.1 in wide mode (sys.maxunicode == 1114111).
>> I suspect that I am not alone finding this behavior non-obvious given that a mistake in Python manual stating the contrary survived several releases. [1]
> It was a documentation bug and you fixed it. Non-BMP characters are rare, so few (maybe only you?) noticed the documentation bug. I consider the behaviour as an improvement of non-BMP support of Python3. Python is unclear about non-BMP characters: narrow build was called "ucs2" for a long time, even if it is UTF-16 (each character is encoded to one or two UTF-16 words).
No, no, no :-)
UCS2 and UCS4 are more appropriate than "narrow" and "wide" or even "UTF-16" and "UTF-32".
It's rather common to confuse a transfer encoding with a storage format. UCS2 and UCS4 refer to code units (the storage format). You can use UCS2 and UCS4 code units to represent UTF-16 and UTF-32 respectively, but those are not the same things.
In UTF-16 0xD800 has a special meaning, in UCS2 it doesn't. Python uses UCS2 internally. It does not assign a special meaning to those surrogate code point ranges.
However, when it comes to codecs, we do try to make use of the fact that UCS2 can easily be used to represent a UTF-16 encoding. That's why, on UCS2 builds, you often see surrogates being created for code points that wouldn't otherwise fit into UCS2, and why, on UCS4 builds, you see those surrogates being converted back to single code units.
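To make the surrogate arithmetic concrete, here is a minimal sketch of the UTF-16 mapping between a non-BMP code point and its surrogate pair. The function names are made up for illustration; this is not how the codecs are implemented in CPython, only the arithmetic defined by UTF-16.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary (non-BMP) code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
```

On a UCS2 build the two values would be stored as two separate code units; on a UCS4 build the single code point is stored as is.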
I don't know who invented the terms "narrow" and "wide" builds for Python3. Not me, that's for sure :-) They don't have any meaning in Unicode terminology and thus cause even more confusion than UCS2 and UCS4. E.g. the import errors you get when importing extensions built for a different Unicode version (correctly) refer to UCS2 vs. UCS4, and so now give even less of a clue that they relate to a difference in Unicode builds (since these are now labeled "narrow" and "wide").
IMO, we should go back to the Python2 terms UCS2 and UCS4 which are correct and provide a clear description of what Python uses internally for code units.
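For reference, a small sketch of how a script can tell which code unit size the running interpreter uses. The function name is hypothetical. Note that from Python 3.3 on (PEP 393) sys.maxunicode is always 0x10FFFF, so the UCS2 branch only triggers on older builds.

```python
import sys


def unicode_build_label():
    """Return "UCS2" or "UCS4" depending on the interpreter's code unit size.

    Hypothetical helper: sys.maxunicode is 0xFFFF on a UCS2 ("narrow")
    build and 0x10FFFF on a UCS4 ("wide") build.
    """
    return "UCS2" if sys.maxunicode == 0xFFFF else "UCS4"


# A non-BMP character takes two code units on UCS2 and one on UCS4.
code_units = len("\U00010000")
print(unicode_build_label(), code_units)
```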
> Python2 accepts non-BMP characters with \U syntax, but not with chr(). This is inconsistent and I see this as a bug. But I don't want to touch Python2 about non-BMP characters, and the "bug" is already fixed in Python3!
>> I do believe, however, that a change like this [2] and its consequences should be better publicized.
> Change made before the release of Python 3.0. Do you want to patch the "What's new in Python 3.0?" document?
Perhaps add a section "What we forgot to mention in 3.0" or "What's not so new in 3.2" to "What's new in 3.2" :-)
>> I have not found any discussion of this change in PEPs or "What's new" documents. The closest find was a mention of a related issue #3280 in the 3.0 NEWS file. [3] Since this feature will be first documented in the Library Reference in 3.2, I wonder if it will be appropriate to mention it in "What's new in 3.2"?
> In my opinion, the question is more why it was not fixed in Python2. I suppose that the answer is something ugly like "backward compatibility" or "historical reasons" :-)
Backwards compatibility.
Python2 applications don't expect unichr(i) to return anything other than a single character. If you need this in Python2, it's easy enough to get around, though, with a little helper function.
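One possible shape for such a helper, as a rough sketch: the name wide_unichr is made up, and going through the UTF-32 codec is just one way to do it. The assumption is that the codec produces a surrogate pair for non-BMP code points on a UCS2 build and a single code unit on a UCS4 build.

```python
import struct


def wide_unichr(i):
    """Return the string for code point i, even beyond the BMP.

    Hypothetical helper: packs i as a little-endian 32-bit integer and
    lets the UTF-32 codec build the string, so the result uses whatever
    representation the running build supports.
    """
    if not 0 <= i <= 0x10FFFF:
        raise ValueError("code point out of range: %r" % (i,))
    return struct.pack("<I", i).decode("utf-32-le")


print(repr(wide_unichr(0x41)), len(wide_unichr(0x1F600)))
```

Code that calls such a helper has to be prepared for a result of length 2 on UCS2 builds, which is exactly what existing unichr() callers do not expect.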
-- Marc-Andre Lemburg eGenix.com