[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 15:21:46 CEST 2008


On [20080703 15:00], M.-A. Lemburg (mal at egenix.com) wrote:

> Unicode is full of combining code points - if you break such a sequence, the output will be just as wrong, regardless of UCS2 vs. UCS4.

In my opinion you are confusing two related, but very separate things here. Combining characters have nothing to do with breaking up the encoding of a single codepoint. Sure enough, if you arbitrarily slice up sequences that contain combining characters then your result is indeed odd-looking.
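To make the distinction concrete, here is a minimal sketch, assuming a current Python 3 interpreter where str is indexed by code point (not the 2008 interpreter under discussion). The variable names are purely illustrative. It shows that cutting a combining sequence changes what is displayed, but every piece is still a complete, encodable code point.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as one visible character.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Slicing between the base letter and the accent drops the accent,
# yet the result is still a whole code point that encodes cleanly.
base_only = decomposed[:1]
encoded = base_only.encode("utf-8")

# A non-zero canonical combining class marks the second code point
# as a combining character.
accent_class = unicodedata.combining(decomposed[1])
```

The output of such a slice looks odd, but nothing about the encoding of an individual code point has been damaged.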

I never said that nor is that the point I am making.

Guido points out that Python supports surrogate pairs and says that if Python is dealing wrongly with this in the core then it needs to be fixed. I am pointing out that, given that we allow surrogate pairs, we deal with them rather simplistically in the core. In fact, we do not consider them at all. In essence: though we may accept full 21-bit codepoints in the form of \U00000000 escape sequences and store them internally as UTF-16 (which I still need to verify), we subsequently deal with them programmatically as UCS-2, which is plain silly.
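A rough sketch of what "stored as UTF-16, handled as UCS-2" amounts to, again assuming a current Python 3 interpreter and simulating the 16-bit view by encoding explicitly. The helpers utf16_code_units and is_surrogate are hypothetical names for this illustration, not anything in the core, and the internal storage of the 2008 interpreter is not verified here.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# U+10400 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
char = "\U00010400"
units = utf16_code_units(char)

# A UCS-2 style view counts code units, not code points:
# one character, but a "length" of two.
ucs2_length = len(units)

# Indexing or slicing per code unit can hand back half a pair,
# which is not a character on its own.
first_unit = units[0]
half_is_surrogate = is_surrogate(first_unit)
```

Committing fully to UTF-16 would mean length, indexing and slicing all treat the pair as one unit. The UCS-2 view above does not.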

You either commit yourself fully to UTF-16 and surrogate pairs or not. Not some form in-between, because that will ultimately lead to more confusion due to the difference in results when dealing with Unicode.

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーンラウフロックヴァンデルウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Believe in Angels...
