[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Wed Apr 29 08:54:21 CEST 2009

On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis:

>>>> C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.
>>>
>>> Is that an alternative to A and B?
>>
>> I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used.
>
> Your formulation is a bit too stenographic to me, but please trust me that there is no ambiguity in the case you construct.

No Martin, the point of reviewing the PEP is to not trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards.

> By "accessed via the str interface", I assume you do something like
>
>   fn = "some string"
>   open(fn)
>
> You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??????). What happens instead is that fn gets encoded with the file system encoding, and the python-escape handler. This will not produce an ambiguity.
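
As a rough illustration of the encoding step described in the quoted reply, the sketch below turns a str file name back into bytes. It assumes UTF-8 as the file system encoding and uses "surrogateescape", the name under which the PEP's handler later shipped in Python 3.1 (this thread still calls it python-escape). The file name is hypothetical.

```python
# Assumption: UTF-8 stands in for the file system encoding.
FS_ENCODING = "utf-8"

# Hypothetical on-disk name: the byte 0xE9 (Latin-1 e-acute) is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# Bytes interface -> str: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
fn = raw_name.decode(FS_ENCODING, "surrogateescape")

# open(fn) has to turn fn back into bytes; the handler restores the original 0xE9.
encoded = fn.encode(FS_ENCODING, "surrogateescape")

print(ascii(fn))            # 'caf\udce9.txt'
print(encoded == raw_name)  # True
```

The round trip is lossless as long as the str came from decoding bytes in the first place, which is the case this reply has in mind.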

You assumed, and maybe I wasn't clear in my statement.

By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes.

Then, ask for the same filename via the bytes interface, using UTF-8 encoding. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory.

Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90.

Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP.
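
The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: "surrogatepass" stands in for whatever the OS would do to hand the wide name back as UTF-8 bytes, and "surrogateescape" is the name the PEP's handler eventually shipped under (python-escape in this thread).

```python
# Name as the wide (str) interface would report it: "abc" + lone surrogate U+DC10.
name_wide = "abc\udc10"

# Assumption: the bytes interface returns the same name as UTF-8 bytes.
# "surrogatepass" is used here only to produce the three bytes ED B0 90.
name_bytes = name_wide.encode("utf-8", "surrogatepass")

# The PEP's handler maps each undecodable byte 0xNN to U+DCNN.
name_escaped = name_bytes.decode("utf-8", "surrogateescape")

# A different file whose wide name really is "abc" + U+DCED U+DCB0 U+DC90.
other_file = "abc\udced\udcb0\udc90"

print(name_bytes)                  # b'abc\xed\xb0\x90'
print(ascii(name_escaped))         # 'abc\udced\udcb0\udc90'
print(name_escaped == name_wide)   # False: one file, two in-memory names
print(name_escaped == other_file)  # True: two files, one in-memory name
```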

> If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. Of course you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that.

Yes, this would be a ridiculous interpretation of "ambiguous".

--
Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
