[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Wed Apr 29 11:56:05 CEST 2009


On approximately 4/29/2009 12:29 AM, came the following characters from the keyboard of Martin v. Löwis:

C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.

Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is no ambiguity in the case you construct.

No Martin, the point of reviewing the PEP is to not trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards.

Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there).

You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name.

What does that mean? What specific interface are you referring to to obtain file names? Most of the time, file names are obtained by the user entering them on the keyboard. GUI applications are completely out of the scope of the PEP.

Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes.

Ok, so perhaps you might be talking about os.listdir here. Communication would be much easier if I would not need to guess what you may mean.

os.listdir("")

Also, why is U+DC10 four 16-bit codes?

It isn't.

First 16-bit code is U+0061
Second 16-bit code is U+0062
Third 16-bit code is U+0063
Fourth 16-bit code is U+DC10
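
For readers who want to check the count, here is a minimal sketch in present-day Python 3. It assumes the "surrogatepass" error handler, which postdates this message, to let the UTF-16 codec emit the lone surrogate unchanged; the variable names are illustrative only.

```python
import struct

# "abc" followed by the lone half-surrogate U+DC10.
name = "abc\udc10"

# "surrogatepass" (an assumption: not available when this was written)
# allows the codec to encode a lone surrogate instead of raising.
raw = name.encode("utf-16-le", "surrogatepass")

# Interpret the bytes as little-endian unsigned 16-bit code units.
code_units = struct.unpack("<%dH" % (len(raw) // 2), raw)
labels = ["U+%04X" % unit for unit in code_units]
print(labels)
```

The list has four entries, one per 16-bit code, matching the enumeration above.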

Then, ask for the same filename via the bytes interface, using UTF-8 encoding.

How do you do that on Windows? You cannot just pick an encoding, such as UTF-8, and pass that to the byte interface, and expect it to work. If you use the byte interface, you need to encode in the file system encoding, of course.

Also, what do you mean by "ask for"?????? WHAT INTERFACE ARE YOU USING???? Please use specific python code.

os.listdir(b"")

I find that on my Windows system, with all ASCII path file names, I get quite different results when I pass os.listdir an empty str vs an empty bytes.

Rather than keep you guessing, I get the root directory contents from the empty str, and the current directory contents from an empty bytes. That is rather unexpected.

So I guess I'd better suggest that a specific, equivalent directory name be passed in either bytes or str form.
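
A sketch of that suggestion, written against present-day Python 3 rather than 3.0.1: the same explicit directory is listed once as str and once as bytes. os.fsencode and os.fsdecode did not exist when this was written, and the helper name list_both_ways and the temporary file abc.txt are hypothetical.

```python
import os
import tempfile


def list_both_ways(directory):
    """Return (str_names, bytes_names) for one explicit directory."""
    str_names = sorted(os.listdir(directory))
    # os.fsencode converts the str path with the file system encoding,
    # so both calls refer to the same directory.
    bytes_names = sorted(os.listdir(os.fsencode(directory)))
    return str_names, bytes_names


with tempfile.TemporaryDirectory() as tmp:
    # Create one all-ASCII file so the listing is not empty.
    open(os.path.join(tmp, "abc.txt"), "w").close()
    str_names, bytes_names = list_both_ways(tmp)

print(str_names, bytes_names)
```

Passing an explicit path sidesteps the question of what an empty string means to either interface.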

The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory.

You are relying on false assumptions here, namely that the UTF-8 encoding would play any role. What would happen instead is that the "mbcs" encoding would be used. The "mbcs" encoding, by design from Microsoft, will never report an error, so the error handler will not be invoked at all.
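
The byte arithmetic behind the three half-surrogates can be reproduced with the codecs in present-day Python 3. This is only a sketch of the hypothetical UTF-8 path described above, not of what Windows does (per the reply, "mbcs" is used there). It assumes the handler names "surrogatepass" and "surrogateescape", which are the names Python later shipped and do not appear in this thread.

```python
# The name as seen through the wide (str) interface.
on_disk = "abc\udc10"

# Hypothetical step: represent the lone surrogate as its 3 UTF-8 bytes.
utf8_bytes = on_disk.encode("utf-8", "surrogatepass")

# Decode the way the PEP proposes: each undecodable byte 0xNN
# becomes the half-surrogate U+DCNN.
decoded = utf8_bytes.decode("utf-8", "surrogateescape")

print(utf8_bytes)
print(["U+%04X" % ord(ch) for ch in decoded])
```

The three bytes are ED B0 90, so the decoded name ends in U+DCED U+DCB0 U+DC90 and differs from the four-code name it started as.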

So what you are saying here is that Python doesn't use the "A" forms of the Windows APIs for filenames, but only the "W" forms, and uses lossy decoding (from MS) to the current code page (which can never be UTF-8 on Windows).

You are further saying that Python doesn't give the programmer control over the codec that is used to convert from W results to bytes, so that on Windows, it is impossible to obtain a bytes result containing UTF-8 from os.listdir, even though sys.setfilesystemencoding exists, and sys.getfilesystemencoding is affected by it, and the latter is documented as returning "mbcs", and as returning the codec that should be used by the application to convert str to bytes for filenames. (Python 3.0.1).

While I can hear a "that is outside the scope of the PEP" coming, this documentation is confusing, to say the least.
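
For comparison, a small sketch of how the same questions can be asked of a present-day interpreter. It assumes Python 3.6 or later, where sys.getfilesystemencodeerrors exists; sys.setfilesystemencoding was removed in Python 3.2, so the last entry is expected to be False there. The function name filesystem_codec_info is hypothetical.

```python
import sys


def filesystem_codec_info():
    """Report what the interpreter uses to convert file names."""
    return {
        # Codec used to convert between str and bytes file names.
        "encoding": sys.getfilesystemencoding(),
        # Error handler paired with that codec (Python 3.6+).
        "errors": sys.getfilesystemencodeerrors(),
        # Whether the setter discussed above is still present.
        "has_setter": hasattr(sys, "setfilesystemencoding"),
    }


info = filesystem_codec_info()
print(info)
```

The values printed depend on the platform and Python version, so none are asserted here.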

Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90.

Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP.

You were just making false assumptions in your reasoning, assumptions that are way beyond the scope of the PEP.
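
Whether the two names can collide on the way back to bytes can be tried out with the error handler Python eventually shipped. This sketch assumes that handler, "surrogateescape", and the UTF-8 codec, neither of which is what Windows uses according to the reply above; the helper can_escape_encode is hypothetical.

```python
def can_escape_encode(name):
    """Return True if name encodes with UTF-8 plus surrogateescape."""
    try:
        name.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# The name made of three escaped bytes maps back to those bytes.
escaped = "abc\udced\udcb0\udc90"
raw = escaped.encode("utf-8", "surrogateescape")

# U+DC10 is outside the U+DC80..U+DCFF range the handler covers,
# so this name is rejected rather than mapped onto the same bytes.
lone = "abc\udc10"

print(raw, can_escape_encode(escaped), can_escape_encode(lone))
```

Under these assumptions the two str names do not encode to the same bytes: one round-trips, the other raises.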

Absolutely correct. I was making what seemed to be reasonable assumptions about Python internals on Windows, and several of them are false, including misleading documentation for listdir (which doesn't specify that bytes and str parameters affect whether or not the current directory is honored), and sys.getfilesystemencoding (which reflects the result of sys.setfilesystemencoding, rather than returning, on Windows, the "mbcs" used by Python to create bytes forms of filenames from W forms of filenames even after sys.setfilesystemencoding is called). Things are a little clearer in the documentation for sys.setfilesystemencoding, which does say the encoding isn't used by Windows -- so why is it permitted to change it, if it has no effect?

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
