[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 22 21:07:47 CEST 2009


> "correct" -> "corrected"

Thanks, fixed.

>> To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in Python codecs.
>
> Would this mean that real private use characters in the file name would raise an exception? How? The UTF-8 decoder doesn't pass those bytes to any error handler.

The python-escape error handler is only used (and only meaningful) if the env encoding is not UTF-8. For any other encoding, it is assumed that no character actually maps to the private-use characters.

>> The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again.
>
> Then the error callback for encoding would become specific to the target encoding.

Why would it become specific? It can work the same way for any encoding: take U+F01xx, and generate the byte xx.
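
To illustrate why the callback need not know the target encoding, here is a minimal sketch of an encode-side handler. It assumes a Python 3 interpreter in which encode error handlers may return bytes. The handler name "pua-to-byte-sketch" and the helper pua_to_byte are made up for this example and are not part of the proposal.

```python
import codecs

PUA_BASE = 0xF0100  # start of the U+F01xx range named in the draft


def pua_to_byte(exc):
    """Encode-side handler (hypothetical): turn U+F01xx into the byte xx."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        offset = ord(ch) - PUA_BASE
        if not 0 <= offset <= 0xFF:
            raise exc  # a genuinely unencodable character, not an escape
        out.append(offset)
    # Returning bytes bypasses the codec, so no encoding-specific logic is needed.
    return bytes(out), exc.end


codecs.register_error("pua-to-byte-sketch", pua_to_byte)

text = "a" + chr(PUA_BASE + 0x81) + "b"
results = {
    enc: text.encode(enc, "pua-to-byte-sketch")
    for enc in ("ascii", "cp1252", "koi8-r")
}
```

The same handler object serves all three encodings; only the characters the codec can encode on its own differ between them.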

>> If the locale's encoding is UTF-8, the file system encoding is set to a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Is this done by the codec, or the error handler? If it's done by the codec I don't see a reason for the "python-escape" error handler.

utf-8b is a new codec. However, the utf-8b codec is only used if the env encoding would otherwise be utf-8. For utf-8b, the error handler is indeed unnecessary.
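
The surrogate mapping can be tried out in a current interpreter. Note the swap: this sketch does not use a "utf-8b" codec (that is the draft's name), but the "surrogateescape" error handler that released Python 3 versions ship, which maps undecodable bytes to U+DC80..U+DCFF in the same way. The file name is invented for the example.

```python
raw = b"caf\xe9.txt"  # 0xE9 followed by "." is not valid UTF-8

# Decoding: the stray byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding: the lone surrogate is turned back into the original byte.
roundtrip = name.encode("utf-8", "surrogateescape")

# Without the handler, the lone surrogate cannot be encoded at all.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```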

>> While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only "works" if the data get converted back to bytes with the python-escape error handler also.
>
> I thought the error handler would be used for decoding.

It's used in both directions: for decoding, it converts \xXX to U+F01XX. For encoding, U+F01XX will trigger an error, which is then handled by the handler to produce \xXX.
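
A rough sketch of such a two-way handler, assuming Python 3 and using ASCII as a stand-in for a non-UTF-8 environment encoding. The registered name "pua-escape-sketch" and the function pua_escape are hypothetical. This is not the implementation from the PEP, only an illustration of the round trip described above.

```python
import codecs

PUA_BASE = 0xF0100  # start of the U+F01xx range named in the draft


def pua_escape(exc):
    """Hypothetical handler covering both directions."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: each undecodable byte xx becomes U+F01xx.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: U+F01xx triggers an error and is mapped back to byte xx.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            offset = ord(ch) - PUA_BASE
            if not 0 <= offset <= 0xFF:
                raise exc  # not one of our escapes
            out.append(offset)
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-escape-sketch", pua_escape)

raw = b"caf\xe9.txt"  # 0xE9 is not ASCII
name = raw.decode("ascii", "pua-escape-sketch")
restored = name.encode("ascii", "pua-escape-sketch")
```

The round trip only works because the same handler is named on both the decode and the encode call, which is the limitation the quoted PEP text refers to.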

> "and" -> "an"

Thanks, fixed.

Regards, Martin


