[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman v+python at g.nevcal.com
Fri Apr 24 11:22:14 CEST 2009
On approximately 4/24/2009 12:59 AM, came the following characters from the keyboard of Simon Cross:
On Wed, Apr 22, 2009 at 8:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited to not being able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept mish-mash of bytes and characters exactly in the way that caused endless troubles for Python 2.x.

Is the second part of this actually true? My understanding may be flawed, but surely all Unicode data can be converted to and from bytes using UTF-8? Obviously not all byte sequences are valid UTF-8, but this doesn't prevent one from creating an arbitrary Unicode string using "utf-8 bytes".decode("utf-8"). Given this, can't people who must have access to all files / environment data just use the bytes interface?

Disclosure: My gut reaction is that the solution described in the PEP is a hack, but I'm hardly a character encoding expert. My feeling is that the correct solution is to either standardise on the bytes interface as the lowest common denominator, or to add a Path type (and I guess an EnvironmentalData type) and use the new type to attempt to hide the differences.
Oh clearly it is a hack. The right solution of a Path type (and friends) was discarded in earlier discussion, because it would impact too much existing code. The use of bytes would be annoying in the context of py3, where things that you want to display are in str (Unicode). So there is no solution that allows the use of str, and the robustness of bytes, and is 100% compatible with existing practice. Hence the desire is to find a hack that is "good enough". At least, that is my understanding and synopsis.
I never saw MvL's original message with the PEP delivered to my mailbox, but some of the replies came there, so I found it and replied extensively via the Google group / usenet. My reply never showed up here, and no one has commented on it either... Should I repost via the mailing list? I think so... I'll just paste it in here, with one tweak fixed that I noticed after I sent it... (Sorry Simon, but it is still the same thread, anyway.) (Sorry to others, if my original reply was seen, and just wasn't worth replying to.)
On Apr 21, 11:50 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
Basically the scheme doesn't work. Aside from that, it is very close.
There are tons of encoding schemes that could work... they don't have to include half-surrogates or bytes. What they have to do is make sure that they are applied uniformly to all appropriate strings.
The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes.
The assumption in the 2nd Discussion paragraph may hold for a large percentage of cases, maybe even including some number of 9s, but it is not guaranteed, and cannot be enforced, therefore there are cases that could fail. Whether those failure cases are a concern or not is an open question. Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is) that is obscure, and unlikely to be used in "real" file names, might help the heuristic nature of the encoding and decoding avoid most conflicts, but provides no guarantee that data puns will not occur in practice. Today's obscure character is tomorrow's commonly used character, perhaps. Someone not on this list may be happily using that character for their own nefarious, incompatible purpose.
As I realized in the email-sig, in talking about decoding corrupted headers, there is only one way to guarantee this... to encode all character sequences, from all interfaces. Basically it requires reserving an escape character (I'll use ? in these examples -- yes, an ASCII question mark -- it happens to be illegal in Windows filenames, so all the better on that platform, but the specific character doesn't matter... avoiding / \ and . is probably good, though).
So the rules would be: when obtaining a file name from the bytes OS interface that doesn't properly decode according to UTF-8, decode it by placing a ? at the beginning; then, for each decodable UTF-8 sequence, add the corresponding Unicode character -- unless the character is ?, in which case you add ?? -- and for each non-decodable byte, place a ? and two hex digits (or a ? and a half-surrogate code, or a ? and whatever gibberish you like). Two hex digits are fine by me, and will serve for this discussion.
ALSO, when obtaining a file name from the str OS interfaces, encode it too... if it contains any ?, then place a ? at the front, and double every other ? in the name.
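To make the decode side of those two rules concrete, here is a rough sketch. The function names and structure are just illustrative (not a proposed API), and UTF-8 is assumed as the nominal file system encoding, as in the examples:

ESCAPE = '?'   # the reserved escape character from the examples above

def decode_from_bytes(name):
    """Accept a file name from a bytes OS interface; return an escaped str."""
    try:
        text = name.decode('utf-8')
    except UnicodeDecodeError:
        pass
    else:
        # Fully decodable; it only needs escaping if it contains '?'.
        return decode_from_str(text)
    out = [ESCAPE]                       # flag that the name was altered
    i = 0
    while i < len(name):
        # Find the shortest valid UTF-8 sequence starting at i, if any.
        for length in (1, 2, 3, 4):
            try:
                ch = name[i:i + length].decode('utf-8')
            except UnicodeDecodeError:
                continue
            out.append(ESCAPE * 2 if ch == ESCAPE else ch)
            i += length
            break
        else:
            # Non-decodable byte: escape as '?' plus two hex digits.
            out.append('?%02x' % name[i])
            i += 1
    return ''.join(out)

def decode_from_str(name):
    """Accept a file name from a str OS interface; return an escaped str."""
    if ESCAPE not in name:
        return name                      # the common case: unaltered
    return ESCAPE + name.replace(ESCAPE, ESCAPE * 2)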
Then you have a string that can/must be encoded to be used on either str or bytes OS interfaces... or any other interfaces that want str or bytes... but whichever they want, you can decode into that form, or determine that you can't. The encode and decode functions should be available for coders to use when they talk to external interfaces -- either OS or 3rd-party packages -- that do not use this encoding scheme. This encoding scheme would be used throughout all Python APIs (most of which would need very little change to accommodate it). However, programs would have to keep track of whether they were dealing with encoded or unencoded strings, if they use both types in their program (an example is hard-coded file names or file name parts).
The initial ? is not strictly necessary for this scheme to work, but I think it would be a good flag to the user that this name has been altered.
This scheme does not depend on assumptions about the use of file names.
This scheme would be enhanced if the file name APIs returned a subtype of str for the encoded names, but that should be considered only a hint, not a requirement.
When encoding file name strings to pass to bytes APIs, each ? followed by two hex digits would be converted to a byte. The leading ? would be dropped, and ?? would convert to ?. I don't believe failures are possible when encoding to bytes.
When encoding file name strings to pass to str APIs, the discovery of a ? followed by two hex digits would raise an exception: the file name is not acceptable to a str API. However, the leading ? would be dropped, and ?? would convert to ?, and if no ? followed by two hex digits were found, the file name would be successfully converted for use on the str API.
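The encode side is then a straightforward left-to-right scan; again a rough sketch with the same caveats (illustrative names, UTF-8 assumed, and it assumes a well-formed escaped name as produced by the decode sketch above):

def encode_to_bytes(name):
    """Re-encode an escaped file name for use with a bytes OS interface."""
    if not name.startswith(ESCAPE):
        return name.encode('utf-8')      # name was never altered
    out = bytearray()
    i = 1                                # drop the leading '?'
    while i < len(name):
        if name[i] == ESCAPE:
            if name[i + 1] == ESCAPE:    # '??' -> '?'
                out += b'?'
                i += 2
            else:                        # '?' + two hex digits -> raw byte
                out.append(int(name[i + 1:i + 3], 16))
                i += 3
        else:
            out += name[i].encode('utf-8')
            i += 1
    return bytes(out)

def encode_to_str(name):
    """Re-encode an escaped file name for use with a str OS interface."""
    if not name.startswith(ESCAPE):
        return name                      # name was never altered
    out = []
    i = 1                                # drop the leading '?'
    while i < len(name):
        if name[i] == ESCAPE:
            if name[i + 1] == ESCAPE:    # '??' -> '?'
                out.append(ESCAPE)
                i += 2
            else:
                # '?' + two hex digits marks a non-decodable byte:
                # this name has no acceptable str form.
                raise UnicodeError('no str representation: %r' % name)
        else:
            out.append(name[i])
            i += 1
    return ''.join(out)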
Note that not even on Unix/Posix is it particularly easy or useful to place a ? into file names from command lines, due to shell escapes, etc. The use of ? in file names also interferes with the ability to match them specifically in globs, etc.
Anything short of such an encoding of both types of interfaces, such that it is known that all python-manipulated filenames will be encoded, will have data puns that provide a potential for failure in edge cases.
Note that in this scheme, no file names that are fully Unicode and do not contain ? characters are altered by the decoding or the encoding process. That will probably reach quite a few 9s of likelihood that the scheme will go unnoticed by most people, programs, and filenames. But the scheme will work reliably if implemented correctly and completely, and will have no edge cases of failure, because it has no data puns.
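For example, using the illustrative helpers sketched above:

# Valid UTF-8, no '?': nothing is altered in either direction.
assert decode_from_bytes('caf\u00e9'.encode('utf-8')) == 'caf\u00e9'
assert encode_to_bytes('caf\u00e9') == 'caf\u00e9'.encode('utf-8')

# A non-decodable byte (0xe9 is latin-1 e-acute, not valid UTF-8):
raw = b'caf\xe9'
escaped = decode_from_bytes(raw)         # -> '?caf?e9'
assert escaped == '?caf?e9'
assert encode_to_bytes(escaped) == raw   # round-trips exactly

try:
    encode_to_str(escaped)               # no faithful str form exists
except UnicodeError:
    pass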
-- Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking