[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 03:15:17 CEST 2009


On approximately 4/27/2009 2:14 PM, came the following characters from the keyboard of Cameron Simpson:

On 27Apr2009 00:07, Glenn Linderman <v+python at g.nevcal.com> wrote:

On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis:

The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes.

Why is it necessary that you are able to make this distinction?

It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding.

I would say this isn't so. It's important that programs know if they're dealing with strings-for-filenames, but not that they be able to figure that out "a priori" if handed a bare string (especially since they can't :-)

So you agree they can't... that there are data puns. (OK, you may not have thought that through)

If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file.

Hmm. I had thought that legitimate unicode strings already get transcoded to bytes via the mapping specified by sys.getfilesystemencoding() (the user's locale). That already happens I believe, and Martin's scheme doesn't change this. He's just funny-encoding non-decodable byte sequences, not the decoded stuff that surrounds them.

So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.

So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.
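To make the pun concrete: the PEP's scheme eventually shipped as the "surrogateescape" error handler, so the collision I'm describing can be sketched in Python directly (the filename here is made up):

```python
# A POSIX filename that is not valid UTF-8 (0xE9 is latin-1 'é').
raw = b"caf\xe9"

# What a directory listing would hand back under the PEP's scheme:
# the undecodable byte becomes the half-surrogate U+DCE9.
from_bytes = raw.decode("utf-8", "surrogateescape")

# The same string constructed directly -- e.g. received from a str API,
# or a Windows 16-bit filename that legitimately contains U+DCE9.
from_str = "caf\udce9"

# The two are indistinguishable: a data pun.
assert from_bytes == from_str

# Encoding it back yields the raw bytes -- but which file did the
# caller actually mean?
assert from_str.encode("utf-8", "surrogateescape") == raw
```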

So it is already the case that strings get decoded to bytes by calls like open(). Martin isn't changing that.

I thought the process of converting strings to bytes is called encoding. You seem to be calling it decoding?

I suppose if your program carefully constructs a unicode string riddled with half-surrogates etc and imagines something specific should happen to them on the way to being POSIX bytes then you might have a problem...

Right. Or someone else's program does that. I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.

I think the advantage to Martin's choice of encoding-for-undecodable-bytes is that it doesn't use normal characters for the special bits. This means that all normal characters are left unmangled in both "bare" and "funny-encoded" strings.

Whether the characters used for funny decoding are normal or abnormal, unless they are prevented from also appearing in filenames when they are obtained from or passed to other APIs, there is the possibility that a file already exists in the filesystem under the funny-decoded name itself... a data pun on the name.

Whether the characters used for funny decoding are normal or abnormal, if they are not prevented from also appearing in filenames when they are obtained from or passed to other APIs, then in order to prevent data puns, all names must be passed through the decoder, and the decoder must perform a 1-to-1 reversible mapping. Martin's funny-decode process does not perform a 1-to-1 reversible mapping (unless he's changed it from the version of the PEP I found to read).

This is why some people have suggested using the null character for the decoding, because it and / can't appear in POSIX file names, but everything else can. But that makes it really hard to display the funny-decoded characters.

Because of that, I now think I'm -1 on your "use printable characters for the encoding". I think presentation of the special characters should look bogus in an app (eg little rectangles or whatever in a GUI); it's a fine flashing red light to the user.

The reason I picked an ASCII printable character is just to make it easier for humans to see the encoding. The scheme would also work with a non-ASCII non-printable character... but I fail to see how that would help a human compare the strings on a display of file names. Having a bunch of abnormal characters in a row, displayed using a single replacement glyph, just makes an annoying mess in the file open dialog.

Also, by avoiding reuse of legitimate characters in the encoding we can avoid your issue with losing track of where a string came from; legitimate characters are currently untouched by Martin's scheme, except for the normal "bytes<->string via the user's locale" translation that must already happen, and there you're aided by bytes and strings being different types.

There are abnormal characters, but there are no illegal characters.
NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters.

So whether the decoding/encoding scheme uses common characters or uncommon characters, you still have the issue of data puns, unless you use a 1-to-1 transformation that is reversible. With ASCII strings, I think no one questions that you need to escape the escape characters. C uses \ as an escape character... Everyone understands that if you want a literal \ in a C string, you have to write \\ instead... and that scheme has escaped the boundaries of C to other use cases. But it seems that you think that if we could just find one more character that no one else uses, then we wouldn't have to escape it... and that could be true, but there aren't any characters that no one else uses. So whatever character you pick (and a range makes it worse), someone else uses it.
So in order for the scheme to work, you have to escape the escape character(s), even in names that wouldn't otherwise need to be funny-decoded.
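The escape-the-escape requirement is the same one percent-encoding solves for URLs. A toy sketch (the escape character and function names here are invented for illustration, not part of any PEP):

```python
ESC = "%"  # hypothetical escape character

def encode_name(raw: bytes) -> str:
    """Bytes -> str, escaping undecodable bytes AND the escape itself."""
    out = []
    for b in raw:
        if b >= 0x80 or chr(b) == ESC:
            out.append(f"{ESC}{b:02X}")   # escape the escape, too
        else:
            out.append(chr(b))
    return "".join(out)

def decode_name(name: str) -> bytes:
    """Exact inverse of encode_name: a 1-to-1 reversible mapping."""
    out, i = bytearray(), 0
    while i < len(name):
        if name[i] == ESC:
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(name[i]))
            i += 1
    return bytes(out)

# The round trip is exact, and a literal '%' in a filename can never
# collide with an escape sequence: no data puns.
assert decode_name(encode_name(b"100%\xff.txt")) == b"100%\xff.txt"
assert encode_name(b"100%.txt") == "100%25.txt"
```

Because every name, decodable or not, goes through the same pair of functions, the mapping is a bijection and the pun disappears. The cost is exactly what I describe above: even clean names containing the escape character get rewritten.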

I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value. Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP.

Please elucidate the "second source" of strings. I'm presuming you mean strings generated from scratch rather than obtained by something like listdir().

POSIX has byte APIs for strings, that's one source, that is most under discussion. Windows has both bytes and 16-bit APIs for strings... the 16-bit APIs are generally mapped directly to UTF-16, but are not checked for UTF-16 validity, so all of Martin's funny-decoded files could be used for Windows file names on the 16-bit APIs. And yes, strings can be generated from scratch.

Given such a string with "funny invalid" stuff in it, and absent Martin's scheme, what do you expect the source of the strings to expect to happen to them if passed to open()? They still have to be converted to bytes at the POSIX layer anyway.

There is a fine encoding scheme that can take any str and encode to bytes: UTF-8.

The problem is that UTF-8 doesn't work to take any byte sequence and decode to str, and that means that special handling has to happen when such byte sequences are encountered. But there is no str that can be generated that can't be generated in other ways, which would be properly encoded to a different byte sequence. Hence there are data puns, no 1-to-1 mapping. Hence it seems obvious to me that the only complete solution is to have an escape character, and ensure that all strings are decoded and encoded. As soon as you have an escape character, then you can decode anything into displayable, standard, Unicode, and you can create the reverse encoding unambiguously.
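The asymmetry is easy to demonstrate: encoding a well-formed str never fails, while decoding arbitrary bytes can (the filenames here are made up):

```python
# Any well-formed Unicode string encodes to UTF-8 without error...
encoded = "naïve-日本語.txt".encode("utf-8")
assert encoded.decode("utf-8") == "naïve-日本語.txt"

# ...but UTF-8 decoding is partial: arbitrary POSIX byte sequences
# may not decode at all, which is the case the PEP must handle.
try:
    b"backup\xff.tar".decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
assert not decodable
```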

Without an escape character, you just have a heuristic that will work sometimes, and break sometimes. If you believe non-UTF-8-decodable byte sequences are rare, you can ignore them. That's what we do now, but people squawk. If you believe that you can invent an encoding that has data puns, and that because the character or characters involved are rare, the problems that result can be ignored, fine... but people will squawk when they hit the problem... I'm just trying to squawk now, to point out that this is complexity for complexity's sake; it adds complexity to trade one problem for a different problem, under the belief that the other problem is somehow rarer than the first. And maybe it is, today. I'd much rather have a solution that actually solves the problem.

If you don't like ? as the escape character, then pick U+10F01, and anytime a U+10F01 is encountered in a file name, double it. And anytime there is an undecodable byte sequence, emit U+10F01, and then U+0080 through U+00FF as a subsequent character for the first byte in the undecodable sequence, and restart the decoder with the next byte.
That'll work too. But use of rare, abnormal characters to take the place of undecodable bytes can never work, because of data puns, and valid use of the rare, abnormal characters.
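That doubling scheme can be sketched directly (the function name is invented, and U+10F01 stands in for whatever rare code point is chosen):

```python
ESC = "\U00010F01"  # the suggested escape code point

def funny_decode(raw: bytes) -> str:
    """Bytes -> str: double any literal ESC, escape undecodable bytes."""
    out, i = [], 0
    while i < len(raw):
        # Try to decode exactly one UTF-8 character (1-4 bytes) at i.
        for j in range(i + 1, min(len(raw), i + 4) + 1):
            try:
                ch = raw[i:j].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch + ch if ch == ESC else ch)  # double the escape
            i = j
            break
        else:
            # Undecodable byte: emit ESC plus the byte as U+0080..U+00FF,
            # then restart the decoder at the next byte.
            out.append(ESC + chr(raw[i]))
            i += 1
    return "".join(out)

# An undecodable byte is escaped...
assert funny_decode(b"caf\xe9") == "caf" + ESC + "\xe9"
# ...and a name that really contains U+10F01 is doubled, so the two
# sources can never produce the same string: no data pun.
assert funny_decode("x\U00010F01".encode("utf-8")) == "x" + ESC * 2
```

Since every possible byte sequence maps to a distinct string, the inverse encoding is unambiguous, which is the whole point.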

Someone suggested treating the byte sequences of the rare, abnormal characters as undecodable bytes, and decoding them using the same substitution rules. That would work too, if applied consistently, because then the rare, abnormal characters would each be escaped. But having 128 escape characters seems more complex than necessary, also.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


