[Python-Dev] UCS2/UCS4 default

M.-A. Lemburg mal at egenix.com
Thu Jul 3 15:57:41 CEST 2008


On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:

On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:

Unicode is full of combining code points - if you break such a sequence, the output will be just as wrong, regardless of UCS2 vs. UCS4.

In my opinion you are confusing two related but quite separate things here. Combining characters have nothing to do with breaking up the encoding of a single codepoint. Sure enough, if you arbitrarily slice up sequences of codepoints that include combining characters, then your result is indeed odd looking. I never said that, nor is that the point I am making.

Please remember that lone surrogate code points are nevertheless perfectly valid Unicode code points, just as a lone combining code point is valid on its own.
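As a small illustration of that point, here is a sketch using the standard `unicodedata` module. It assumes a current Python 3 interpreter rather than the 2008-era builds discussed in this thread, and the variable names are illustrative only.

```python
import unicodedata

# A high surrogate with no trailing low surrogate.
lone_surrogate = "\ud83d"
# COMBINING ACUTE ACCENT with no base character in front of it.
lone_combining = "\u0301"

# Both are single code points that the Unicode database knows about.
surrogate_category = unicodedata.category(lone_surrogate)   # general category "Cs"
combining_category = unicodedata.category(lone_combining)   # general category "Mn"
combining_class = unicodedata.combining(lone_combining)     # non-zero for combining marks

# Slicing a base character away from its combining mark is allowed,
# even though the pieces no longer render as the intended text.
accented = "e\u0301"
base_only = accented[:1]
mark_only = accented[1:]
```

Both values are accepted as strings of length one. Whether the result is meaningful text is a separate question from whether the code point itself is valid.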

Guido points out that Python supports surrogate pairs and says that if Python is dealing wrongly with this in the core then it needs to be fixed. I am pointing out that, given the fact we allow surrogate pairs, we deal rather simplistically with them in the core. In fact, we do not consider them at all. In essence: though we may accept full 21-bit codepoints in the form of \U00000000 escape sequences and store them internally as UTF-16 (which I still need to verify), we subsequently deal with them programmatically as UCS-2, which is plain silly.
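For readers unfamiliar with the mechanics under discussion, the following sketch shows how a code point above the BMP is split into a UTF-16 surrogate pair. The helper name `to_surrogate_pair` is hypothetical and not part of Python. The code assumes a modern Python 3 interpreter and only cross-checks the arithmetic against the standard `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


G_CLEF = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = to_surrogate_pair(G_CLEF)

# Cross-check against the standard UTF-16 codec (big-endian, no BOM).
encoded = chr(G_CLEF).encode("utf-16-be")
utf16_units = len(encoded) // 2
```

The single code point occupies two 16-bit code units. Code that indexes or counts the storage unit by unit sees two items here, which is the UCS-2 style of handling being criticised above.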

Python applies conversion from non-BMP code points to surrogates for UCS2 builds in a few places, and I agree that we should probably do that in a few more places.

However, these are mainly conversion issues of encoded Unicode representations vs. the internal Unicode storage where you want to avoid exceptions in favor of finding a solution that preserves data.

To make it clear: UCS2 builds of Python do not support non-BMP code points out of the box.

A programmer will always have to use a codec to map the internal storage on these builds to the full Unicode code point range. The following codecs support surrogates on UCS2 builds:

UTF-8
UTF-16
UTF-32
unicode-escape
raw-unicode-escape
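A quick way to see these codecs handle a non-BMP code point is a round trip through each of them. This sketch assumes a current Python 3 interpreter, where the UCS2/UCS4 build distinction no longer exists, so it only demonstrates that each listed codec can carry the full code point range; it does not reproduce UCS2-build behaviour.

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, non-BMP), 'b'
text = "a\U0001D11Eb"

codecs_listed = [
    "utf-8",
    "utf-16",
    "utf-32",
    "unicode-escape",
    "raw-unicode-escape",
]

# Encode with each codec and check that decoding restores the original text.
round_trips = {}
for name in codecs_listed:
    encoded = text.encode(name)
    round_trips[name] = (encoded.decode(name) == text)
```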

You either commit yourself fully to UTF-16 and surrogate pairs or not. Not some form in-between, because that will ultimately lead to more confusion due to the difference in results when dealing with Unicode.

Programmers will have to be aware of the fact that on UCS2 builds of Python non-BMP code points will have to be treated differently than on UCS4 builds.
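As a sketch of what treating them differently can look like, the code below checks `sys.maxunicode` (which was 0xFFFF on UCS2 builds) and walks a sequence of 16-bit code units, joining valid surrogate pairs into single code points. The helper `iter_code_points` is hypothetical, written for illustration on a modern Python 3 interpreter, and models the UCS2 storage as a plain list of integers.

```python
import sys

# UCS2 (narrow) builds reported 0xFFFF here; UCS4 builds report 0x10FFFF.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    Valid high/low surrogate pairs are combined; lone surrogates are
    passed through unchanged so that no data is lost.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


# 'a', the surrogate pair for U+1D11E, then a lone high surrogate.
units = [0x61, 0xD834, 0xDD1E, 0xD800]
code_points = list(iter_code_points(units))
```

Four stored units yield three code points, so lengths and indices computed on the storage differ from those computed on code points.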

I don't see that as a problem. It is in a way similar to 32-bit vs. 64-bit builds of Python or the fact that floating point numbers work differently depending on the Python platform or compiler being used.

BTW: Have you ever run into any problems with UCS2 vs. UCS4 in practice that were not easy to solve?

-- Marc-Andre Lemburg eGenix.com

