[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Mon Nov 22 19:53:00 CET 2010


Raymond Hettinger wrote:

Any explanation we give users needs to let them know two things:

* that we cover the entire range of unicode not just BMP
* that sometimes len(chr(i)) is one and sometimes two

The term UCS-2 is a complete communications failure in that regard. If someone looks up the term, they will immediately see something like the wikipedia entry which says, "UCS-2 cannot represent code points outside the BMP". How is that helpful?

It's very helpful, since it explains why a UCS-2 build of Python requires a surrogate pair to represent a non-BMP code point and explains why chr(i) gives you a length 2 string rather than a length 1 string.

A UCS-4 build does not need to use surrogates for this, hence you get a length 1 string from chr(i).
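As a rough illustration of the two cases, here is a minimal sketch. It assumes a current Python (3.3 or later), where len(chr(i)) is always 1, so the UCS-2 behaviour is only simulated by counting 16-bit code units. The helper name utf16_code_units is hypothetical and not part of Python.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed for s (what len() reported on a UCS-2 build)."""
    # "utf-16-le" emits no byte order mark, so every code unit is exactly 2 bytes.
    return len(s.encode("utf-16-le")) // 2


bmp = chr(0x20AC)       # EURO SIGN, inside the BMP
non_bmp = chr(0x1D11E)  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_code_units(bmp))      # 1
print(utf16_code_units(non_bmp))  # 2
```

The second result is the "length 2" a UCS-2 build would report for chr(0x1D11E).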

There are two levels we have to explain to users:

* the transfer level
* the storage level

The UTF encodings address the transfer level and are what you deal with in I/O. These provide variable length encodings of the complete Unicode code point range, regardless of whether you have a UCS-2 or a UCS-4 build.
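To make the transfer level concrete, the following sketch encodes one BMP and one non-BMP code point and reports the encoded sizes. The helper name encoded_sizes is an assumption made for this example; only str.encode and bytes.decode from the standard library are used.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8 and UTF-16."""
    sizes = {}
    for encoding in ("utf-8", "utf-16-le"):
        data = ch.encode(encoding)
        # Round trip: decoding restores the same code point on any build.
        assert data.decode(encoding) == ch
        sizes[encoding] = len(data)
    return sizes


print(encoded_sizes("A"))           # {'utf-8': 1, 'utf-16-le': 2}
print(encoded_sizes(chr(0x1D11E)))  # {'utf-8': 4, 'utf-16-le': 4}
```

Both encodings cover the non-BMP code point; they simply spend more bytes on it.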

The storage level becomes important if you want to work on strings using indexing and slicing. Here you do have to know whether you're dealing with a UCS-2 or a UCS-4 build, since the indexes will vary if you're using non-BMP code points.

Finally, to tie both together, we have to explain that UTF-16 (the transfer encoding) maps to UCS-2 in a straightforward way, so it is possible to work with a UCS-2 build of Python and still use the complete Unicode code point range - you only have to take into consideration that Python's string indexing will not necessarily point you to the n-th code point in a string, but may well give you half of a surrogate pair.
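The mapping can be sketched as follows. The arithmetic is the standard UTF-16 surrogate computation; the function names to_surrogate_pair and utf16_units are hypothetical, and the list of code units only imitates how a UCS-2 build would store the string.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000    # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def utf16_units(s):
    """Return s as a list of 16-bit code units (imitating UCS-2 storage)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


text = "a" + chr(0x1D11E) + "b"
units = utf16_units(text)

print([hex(u) for u in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']
print([hex(u) for u in units])  # ['0x61', '0xd834', '0xdd1e', '0x62']
print(hex(units[2]))            # 0xdd1e
```

Index 2 of the code unit list is the low surrogate, i.e. half of the pair, whereas the third code point of the text is "b".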

Note that while that last aspect may appear to be a good argument for UCS-4 builds, in reality it is not. UCS-4 has the same issue on a different level: the letters that get printed on the screen or printer (graphemes) may well be made up of multiple combining code points, e.g. an "e" and an "´". Those again map to two indexes in the Python string, even though they appear to be one character on output.
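A short sketch of the grapheme issue, assuming the accent is written as U+0301 (COMBINING ACUTE ACCENT) and using unicodedata.normalize from the standard library:

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))       # 2: two code points, one grapheme on output
print(len(composed))         # 1: NFC merges them into U+00E9
print(composed == "\u00e9")  # True
```

Normalization only helps where a precomposed code point exists, so indexing by code point still does not equal indexing by grapheme in general.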

Now try to explain all of the above using the terms "narrow" and "wide" (while remembering "explicit is better than implicit" and "avoid the temptation to guess") :-)

It is not really helpful to replace a correct and accurate term with a fuzzy term: either way we're stuck with the semantics.

However, the correct and accurate terms at least give you a chance to figure out and understand the reasoning behind the design. UCS-2 vs. UCS-4 is a trade-off, "narrow" and "wide" is marketing talk with an implicit emphasis on one side :-)

-- Marc-Andre Lemburg eGenix.com

