[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Wed Apr 29 22:28:54 CEST 2009


C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.

What does that mean? What specific interface are you referring to to obtain file names?

os.listdir("")
os.listdir(b"")

So I guess I'd better suggest that a specific, equivalent directory name be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

So what you are saying here is that Python doesn't use the "A" forms of the Windows APIs for filenames, but only the "W" forms, and uses lossy decoding (from MS) to the current code page (which can never be UTF-8 on Windows).

Actually, it does use the A form, in the second listdir example. This, in turn (inside Windows), uses the lossy CP_ACP encoding. You get back a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace the half surrogate with a question mark).
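The question-mark guess can be illustrated without Windows. As a rough analogy only, the sketch below uses Python's ASCII codec with the "replace" error handler as a stand-in for a lossy code-page conversion. It does not call CP_ACP or any Windows API, and the character a real code page substitutes may differ.

```python
# Assumption: ASCII + "replace" stands in for a lossy code-page
# conversion; no Windows API is involved in this sketch.
name = "abc\udc10"  # file name ending in a lone low surrogate

# The surrogate has no encoding, so the handler substitutes "?".
lossy = name.encode("ascii", errors="replace")
print(lossy)

# A strict codec refuses the lone surrogate instead of replacing it.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

Once the surrogate has been replaced, the original name cannot be recovered from the bytes result, which is what makes the conversion lossy.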

So where is the ambiguity here?

You are further saying that Python doesn't give the programmer control over the codec that is used to convert from W results to bytes, so that on Windows, it is impossible to obtain a bytes result containing UTF-8 from os.listdir, even though sys.setfilesystemencoding exists, and sys.getfilesystemencoding is affected by it, and the latter is documented as returning "mbcs", and as returning the codec that should be used by the application to convert str to bytes for filenames. (Python 3.0.1).

Not exactly. You can do setfilesystemencoding on Windows, but it has no effect, as the Python file system encoding is never used on Windows. For a string, it passes it to the W API as is; for bytes, it passes it to the A API as-is. Python never invokes any codec here.
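The str/bytes split described above is visible from the return types alone. A minimal sketch, assuming a temporary directory and the ASCII file name abc.txt (both chosen for illustration, not taken from the thread), so that it runs on any platform and does not depend on which Windows API is called underneath. Note that os.fsencode is a later addition to Python and was not available in 3.0.1.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name.
    open(os.path.join(tmp, "abc.txt"), "w").close()

    # str path in, str names out.
    as_str = os.listdir(tmp)

    # bytes path in, bytes names out.
    as_bytes = os.listdir(os.fsencode(tmp))

print(as_str)
print(as_bytes)
```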

While I can hear a "that is outside the scope of the PEP" coming, this documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you would study the current Python source code, it would be all very clear.

Things are a little clearer in the documentation for sys.setfilesystemencoding, which does say the encoding isn't used by Windows -- so why is it permitted to change it, if it has no effect?

As in many cases: because nobody contributed code to make it behave otherwise. It's not that the file system encoding is "mbcs" - the file system encoding is simply unused on Windows (but that wasn't always the case, in particular not when Windows 9x still had to be supported).

Regards, Martin


