[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

MRAB google at mrabarnett.plus.com
Thu Apr 30 22:07:41 CEST 2009


Barry Scott wrote:

On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:

>>> How do I get a printable Unicode version of these path strings if they contain non-Unicode data?
>>
>> Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid Unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page.
>
> I think your PEP gives me a string that will not encode to the valid UTF-8 that the world outside Python likes. Did I get this point wrong?
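The regular-expression approach mentioned above could look roughly like the sketch below. It assumes the behaviour that eventually shipped in Python 3.1, where the error handler is named `surrogateescape` and each undecodable byte 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF. The helper name `printable` and the sample filename are made up for illustration.

```python
import re

# Assumption: undecodable bytes 0x80..0xFF were escaped by PEP 383 as the
# lone surrogates U+DC80..U+DCFF (the "surrogateescape" error handler).
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')

def printable(name):
    """Return a copy of name with every escaped byte replaced by '?'."""
    return _ESCAPED_BYTE.sub('?', name)

# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw = b'caf\xe9.txt'
name = raw.decode('utf-8', 'surrogateescape')
print(printable(name))
```

The result contains no surrogates, so it encodes to strict UTF-8 and can go to a console or a web page. The replacement is lossy, so the original string still has to be kept for opening the file.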

>>> I'm guessing that an app has to understand that filenames come in two forms, Unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8 and otherwise return bytes?
>>
>> That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it.
>
> In our application we are running Fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the filenames are in CP-1251(?) and not valid UTF-8.
>
> What we have to do is detect these non-UTF-8 filenames and get the users to rename them.
>
> Having an algorithm that says "if it's a string, no problem; if it's bytes, deal with the exceptions" seems simple.
>
> How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the UTF-8 decode myself?

What do you do currently?

The PEP just offers a way of reading all filenames as Unicode, if that's what you want. So what if the strings can't be encoded to normal UTF-8! The filenames aren't valid UTF-8 anyway! :-)
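
As a rough sketch of how the detection could be done without dropping to the bytes interface: a filename that contained undecodable bytes will carry lone surrogates, and a strict UTF-8 encode of it will fail. This assumes the file system encoding is UTF-8 and the `surrogateescape` handler as shipped in Python 3.1; the names `is_clean_utf8` and `names_needing_rename` are only illustrative.

```python
import os

def is_clean_utf8(name):
    """Return True if the str filename encodes to strict UTF-8.

    Names that held undecodable bytes contain lone surrogates, which the
    strict UTF-8 codec refuses to encode.
    """
    try:
        name.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True

def names_needing_rename(directory):
    """List the entries of directory whose names were not valid UTF-8.

    Assumption: the file system encoding is UTF-8.
    """
    return [n for n in os.listdir(directory) if not is_clean_utf8(n)]
```

The names returned this way are still ordinary strings, so they can be passed straight back to `os.rename()` once the user has picked a replacement.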


