[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 29 15:14:18 CEST 2009

Baptiste Carvello writes:

> By contrast, if the new utf-8b codec would supersede the old one, \udcxx would always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false. A \udcxx code unit could have come from a literal string (similar to the text above ;-), a C extension, or a string slice (on 16-bit builds), and there may be other ways to produce one. The only way to avoid ambiguity is to change the definition of a Python string so that it must be valid Unicode (possibly with Python extensions such as PEP 383 for internal use only). But Guido has rejected that in the past; validation is the application's problem, not Python's.
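To make the ambiguity concrete, here is a minimal sketch. It assumes Python 3.1 or later, where PEP 383 eventually shipped as the "surrogateescape" error handler rather than as a separate utf-8b codec; the variable names are only illustrative. It covers the literal-string case only, since the string-slice case needs a 16-bit build.

```python
# Assumption: Python 3.1+, where PEP 383 is exposed as the
# "surrogateescape" error handler on the ordinary utf-8 codec.

# A string produced by decoding an undecodable byte.
escaped = b"\x80".decode("utf-8", "surrogateescape")

# A string typed directly as a literal, with no bytes involved.
literal = "\udc80"

# The two are indistinguishable once they are Python strings.
same = escaped == literal

# Encoding the literal therefore "restores" a byte that never existed.
round_trip = literal.encode("utf-8", "surrogateescape")

print(same, round_trip)
```

Nothing in the string itself records which of the two origins it had, so code that receives it cannot tell an escaped raw byte from a deliberately written lone surrogate.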

Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned Python strings being used to build up code point sequences for direct output. That means a UCS-4 string might nonetheless contain surrogates: they could be added to a string that is intended to be sent as UTF-16 output simply by truncating the 32-bit code units to 16 bits.
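A rough sketch of that use case, assuming Python 3.3 or later (where every build stores full code points) and using U+10080 purely as an example character. The helper name truncate_to_utf16le is hypothetical, not an existing API.

```python
import struct

# Hand-built surrogate pair for U+10080: high D800, low DC80.
# The low half falls in the \udc80..\udcff range that PEP 383 uses.
units = "\ud800\udc80"

def truncate_to_utf16le(s):
    """Hypothetical helper: emit each code point as one 16-bit unit."""
    return b"".join(struct.pack("<H", ord(ch) & 0xFFFF) for ch in s)

wire = truncate_to_utf16le(units)

# The truncated output is valid UTF-16 for the intended character.
expected = "\U00010080".encode("utf-16-le")

# Yet the second code unit alone looks like an escaped byte 0x80.
looks_like_byte = units[1].encode("utf-8", "surrogateescape")

print(wire == expected, looks_like_byte)
```

So even where surrogates are never needed to represent a character internally, a string can legitimately carry \udcxx values that have nothing to do with undecodable bytes.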
