[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

glyph at divmod.com
Wed Apr 22 14:20:24 CEST 2009


On 06:50 am, martin at v.loewis.de wrote:

> I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
>
> To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in Python codecs.

-1. On UNIX, character data is not sufficient to represent paths. We must, must, must continue to have a simple bytes interface to these APIs. Covering it up in layers of obscure encoding hacks will not make the problem go away, it will just make it harder to understand.

To make matters worse, Linux and GNOME use the PUA for some printable characters. If you open up charmap on an Ubuntu system and select "view by Unicode character block", then click on "private use area", you'll see many of these. I know that Apple uses at least a few PUA code points for the Apple logo and the propeller/option icons as well.
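
To make the collision concrete, here is a rough sketch of the ambiguity. It is not the PEP's implementation: the handler name "pua-escape-sketch" is made up for illustration, and it assumes "U+F01xx" means U+F0100 plus the value of the offending byte.

```python
import codecs

# Assumption: "U+F01xx" means U+F0100 plus the value of the offending byte.
PUA_BASE = 0xF0100


def pua_escape(exc):
    """Hypothetical stand-in for the proposed "python-escape" handler."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Map every undecodable byte to one private-use character.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape-sketch", pua_escape)


def decode_name(raw: bytes) -> str:
    return raw.decode("utf-8", "pua-escape-sketch")


stray_byte = b"\xe9"                             # not valid UTF-8 on its own
real_pua = chr(PUA_BASE + 0xE9).encode("utf-8")  # valid UTF-8 for a PUA character

# Two different byte strings on disk collapse into the same str.
same = decode_name(stray_byte) == decode_name(real_pua)
print(same)  # True
```

Once both names have been decoded, nothing in the resulting string says which of the two byte sequences it came from.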

I am still -1 on any turn-non-decodable-bytes-into-text scheme, because it makes life harder for those of us trying to keep bytes and text straight, but if you absolutely must represent POSIX filenames as mojibake rather than bytes, the only workable solution is to use NUL as your escape character. That's the only code point which actually can't show up in a filename somehow. As we discussed last time, this is what Mono does with System.IO.Path. As a bonus, it's much easier to detect a NUL from random application code than to try to figure out if a string has any half-surrogates or magic PUA characters which shouldn't be interpreted according to platform PUA rules.
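
As a rough sketch of the NUL idea (this is not Mono's actual encoding, which I am not reproducing here; the handler name "nul-escape-sketch" and the NUL-plus-one-character layout are assumptions made up for illustration):

```python
import codecs


def nul_escape(exc):
    """Hypothetical handler: emit NUL followed by the raw byte's value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\0" + chr(b) for b in bad), exc.end


codecs.register_error("nul-escape-sketch", nul_escape)


def decode_name(raw: bytes) -> str:
    return raw.decode("utf-8", "nul-escape-sketch")


def has_escapes(text: str) -> bool:
    # A real POSIX filename cannot contain NUL, so this check is unambiguous.
    return "\0" in text


def encode_name(text: str) -> bytes:
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == "\0":
            # The character after NUL carries the original byte value.
            out.append(ord(next(chars)))
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
name = decode_name(raw)
print(has_escapes(name), encode_name(name) == raw)  # True True
```

The detection step is a single substring test, with no need to know which PUA ranges or surrogate conventions the platform happens to use.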


