[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Barry Scott barry at barrys-emacs.org
Thu Apr 30 21:43:24 CEST 2009


On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:

>> How do I get a printable Unicode version of these path strings if they contain non-Unicode data?
>
> Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.
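
A minimal sketch of the regular-expression approach suggested above. It assumes the behaviour that shipped in Python 3.1 and later, where the error handler is named `surrogateescape` and each undecodable byte 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF. The helper name `printable` and the sample filename are illustrative only.

```python
import re

# Assumption: undecodable bytes appear as lone surrogates U+DC80..U+DCFF,
# as produced by the "surrogateescape" error handler (Python 3.1+).
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def printable(name):
    """Return name with every escaped byte replaced by a question mark."""
    return _ESCAPED_BYTE.sub("?", name)


raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
safe = printable(name)

# safe contains no lone surrogates, so a strict UTF-8 encode succeeds.
encoded = safe.encode("utf-8")
```

The result is lossy: the original bytes cannot be recovered from `safe`, so it is suitable for display only, not for passing back to the file system.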

What I mean by printable is that the string must be valid Unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page.

I think your PEP gives me a string that will not encode to the valid UTF-8 that the world outside Python expects. Did I get this point wrong?
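
The point can be checked with a short sketch, again assuming the `surrogateescape` handler as it exists in Python 3.1 and later. The sample bytes are a hypothetical CP-1251 filename, not one taken from this thread.

```python
# Hypothetical filename: the Russian word for "file" encoded as CP-1251.
raw = b"\xf4\xe0\xe9\xeb.txt"

# Decode the way the PEP proposes for system interfaces on a UTF-8 system.
name = raw.decode("utf-8", "surrogateescape")

# A strict encode is what a UTF-8 console or web page needs; it fails
# because the string contains lone surrogates.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode("utf-8", "surrogateescape")
```

Under these assumptions `strict_ok` ends up `False` while `round_trip` equals `raw`: the string is a faithful carrier for the bytes, but it is not printable in the sense described above.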

>> I'm guessing that an app has to understand that filenames come in two forms, Unicode and bytes, if the data is not UTF-8. Why not simply return a string if it's valid UTF-8, and otherwise return bytes?
>
> That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it.

In our application we are running Fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system, the filenames are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non-UTF-8 filenames and get the users to rename them.

An algorithm that says "if it's a string, no problem; if it's bytes, deal with the exception" seems simple.

How do I do this detection with the PEP proposal? Do I end up using the bytes interface and doing the UTF-8 decode myself?
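
One possible way to do the detection while staying on the string interface, sketched under two assumptions: the file system encoding is UTF-8, and undecodable bytes are escaped as lone surrogates as in Python 3.1 and later. The function names are hypothetical, and `os.fsencode` requires Python 3.2 or later.

```python
import os


def is_valid_utf8_name(name):
    """Return True if a str filename encodes to strict UTF-8."""
    try:
        name.encode("utf-8")  # strict mode rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


def names_needing_rename(directory):
    """Return (display_name, raw_bytes) pairs for names that are not UTF-8."""
    bad = []
    for name in os.listdir(directory):
        if not is_valid_utf8_name(name):
            # Lossy but printable form, with "?" for each escaped byte.
            display = name.encode("utf-8", "replace").decode("utf-8")
            # Original bytes, recovered via the file system encoding.
            bad.append((display, os.fsencode(name)))
    return bad
```

With this approach the string itself plays the role that the string-or-bytes return type would have played: a name that fails the strict encode is one the users need to rename.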

Barry


