[Python-Dev] please consider changing --enable-unicode default to ucs4

Adam Olsen rhamph at gmail.com
Thu Oct 8 02:10:25 CEST 2009


On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog at gmail.com> wrote:

On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:

AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.

That would be an improvement!  Unfortunately we instead get mysterious misbehavior of the module, e.g.:
http://bugs.python.org/setuptools/msg309
http://allmydata.org/trac/tahoe/ticket/704#comment:5

The real issue here is getting confused because Python's option is misnamed. We support UTF-16 and UTF-32, not UCS-2 and UCS-4. This means that when decoding UTF-8, any scalar value outside the BMP will be split into a pair of surrogates on UTF-16 builds; if we were using UCS-2 that'd be an error instead (and nothing would understand surrogates).
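To make the surrogate split concrete, here is a minimal sketch of the arithmetic, written for Python 3 (an assumption; this thread predates it, and current interpreters no longer have separate UTF-16 and UTF-32 builds). The helper name to_surrogate_pair is hypothetical, and the result is cross-checked against the standard "utf-16-be" codec.

```python
import struct


def to_surrogate_pair(cp):
    """Split a scalar value outside the BMP into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only; not part of CPython.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = cp - 0x10000                # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)

# Cross-check against the standard codec: two 16-bit big-endian code units.
codec_units = struct.unpack(">HH", chr(0x1D11E).encode("utf-16-be"))
print(hex(high), hex(low), codec_units == (high, low))
```

A true UCS-2 implementation would have no way to represent this code point at all, which is the distinction drawn above.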

Yet we are getting an error here. However, if you look at the details you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding in the second link to U+6E657770. Although UTF-8 originally left open the possibility of encoding up to 31 bits (or U+7FFFFFFF), this was removed in RFC 3629 and is now strictly prohibited. The modern Unicode character set itself also imposes that restriction. There is nothing beyond U+10FFFF. Nothing should create such a high code point, and even if it happened internally an RFC 3629-conformant UTF-8 encoder must refuse to pass it through.
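As a rough illustration of the 6-byte form, the sketch below builds and parses the pre-RFC 3629 layout (1111110x followed by five 10xxxxxx continuation bytes) by hand, then shows that a strict decoder rejects it. It assumes Python 3; encode_legacy_6byte and decode_legacy_6byte are hypothetical helpers, not anything from CPython or the linked tickets, and the bytes it produces are not claimed to match the ones in those reports.

```python
def encode_legacy_6byte(cp):
    """Encode a 31-bit value in the obsolete 6-byte UTF-8 form (illustration only)."""
    if not 0x4000000 <= cp <= 0x7FFFFFFF:
        raise ValueError("the 6-byte form only covers 0x4000000..0x7FFFFFFF")
    return bytes([
        0xFC | (cp >> 30),               # lead byte 1111110x carries the top bit
        0x80 | ((cp >> 24) & 0x3F),      # five continuation bytes, 6 bits each
        0x80 | ((cp >> 18) & 0x3F),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def decode_legacy_6byte(data):
    """Recover the value from a 6-byte legacy sequence (illustration only)."""
    if len(data) != 6 or (data[0] & 0xFE) != 0xFC:
        raise ValueError("not a 6-byte legacy UTF-8 sequence")
    cp = data[0] & 0x01
    for b in data[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


raw = encode_legacy_6byte(0x6E657770)
roundtrip = decode_legacy_6byte(raw)

# A strict decoder must reject the sequence outright.
try:
    raw.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(hex(roundtrip), rejected)
```

In Python 3 the value cannot exist as a character in the first place: chr(0x6E657770) raises ValueError because it lies beyond U+10FFFF.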

Something more subtle must be going on. Possibly several bugs (such as a non-conformant encoder or garbage being misinterpreted as UTF-8).

-- Adam Olsen, aka Rhamphoryncus


