[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Thu Apr 30 22:06:33 CEST 2009


How do I get a printable Unicode version of these path strings if they contain non-Unicode data?

Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.

What I mean by printable is that the string must be valid Unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to the valid UTF-8 that the world outside Python expects. Did I get this point wrong?
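The regular-expression approach suggested earlier in this exchange could look roughly like the sketch below. It assumes the non-decodable bytes appear as lone surrogates in the range U+DC80 to U+DCFF, which is the mapping the final version of the PEP uses for its "surrogateescape" error handler; the function name replace_escaped_bytes is made up for illustration.

```python
import re

# Assumption: undecodable bytes were mapped to lone surrogates
# U+DC80..U+DCFF (the "surrogateescape" convention).
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def replace_escaped_bytes(name, placeholder="?"):
    """Return a copy of name in which every escaped byte is replaced
    by placeholder, so the result encodes cleanly to UTF-8."""
    return _ESCAPED_BYTE.sub(placeholder, name)


# A file name whose raw bytes are not valid UTF-8.
raw = b"caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")
safe = replace_escaped_bytes(name)
print(safe.encode("utf-8"))
```

The result is lossy: it is suitable for display only and can no longer be used to open the file.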

You are right. However, if your only requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function:

    def printable_string(unprintable):
        return ""

This will always return a printable version of the input string...

In our application we are running Fedora with the assumption that the file names are UTF-8. When Windows systems FTP files to our system, the file names are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that.

Having an algorithm that says "if it's a string, no problem; if it's bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself?

No, you should encode using the "strict" error handler, with the locale encoding. If the file name encodes successfully, it's correct; otherwise, it's broken.
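A minimal sketch of that check follows. The helper name is_well_formed_name is hypothetical, and using sys.getfilesystemencoding() as the default stand-in for "the locale encoding" is an assumption; pass the encoding explicitly if your setup differs.

```python
import sys


def is_well_formed_name(name, encoding=None):
    """Return True if name encodes cleanly with the "strict" handler.

    Names that carry escaped, non-decodable bytes contain lone
    surrogates, which a strict encode rejects.
    """
    if encoding is None:
        # Assumption: the file system encoding matches the locale encoding.
        encoding = sys.getfilesystemencoding()
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


good = "report.txt"
broken = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
print(is_well_formed_name(good, "utf-8"), is_well_formed_name(broken, "utf-8"))
```

With this, the application can keep using the string interface and only fall back to special handling for the names the check flags as broken.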

Regards, Martin


