[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

MRAB google at mrabarnett.plus.com
Sat Apr 25 19:27:47 CEST 2009


Martin v. Löwis wrote:

>> I see two main user-oriented use cases for the resulting Unicode strings
>> this PEP will produce on all systems: displaying a list of filenames for
>> the user to select from (an open file dialog), and allowing a user to
>> edit or supply a filename (a save dialog or a rename control).
>
> There are more, in particular the case "user passes a file name on the
> command line", and "web server passes URL in environment variable".
>
>> It's clear what this PEP provides for the former. On well-behaved
>> systems where a simpler filesystemencoding approach would work, the
>> results are identical; the user can select filenames that are what he
>> expects to see on both Unix and Windows. On less well-behaved systems,
>> some characters may appear as junk in the middle of the name (or would
>> they be invisible?)
>
> Depends on the rendering. Try "print u'\udc00'" in your terminal to see
> what happens; for me, it renders the glyph for "replacement character".
> In GUI applications, you often see white boxes (rectangles).
>
>> What I don't find clear is what the risks are for the latter. On the
>> less well-behaved system, a user may well attempt to use this Python
>> application to fix filenames. Can we estimate a likelihood that edits to
>> the names would result in a Unicode string that can no longer be encoded
>> with the python-escape? Will a new name fully provided by a user on his
>> keyboard (ignoring copy and paste) almost always safely encode?
>
> That very much depends on the system setup, and your impression is right
> that the PEP doesn't address it - it only deals with cases where you get
> random unsupported bytes; getting random unsupported characters from the
> user is not considered.
>
> If the user has the locale set up in a way that matches his keyboard, it
> should all work fine - and will already, even without the PEP. If the
> user enters a character that doesn't directly map to a good file name,
> you get an exception, and have to tell the user to pick a different
> filename.
>
> Notice that it may fail at several layers:
> - it may be that characters entered are not supported in what Python
>   chooses as the file system encoding.
> - it may be that the characters are not supported by the file system,
>   e.g. leading spaces in Win32.
> - it may be that the file cannot be renamed because the target name
>   already exists.
>
> In all these cases, the application has to ask the user to reconsider;
> for at least the last case, it should be prepared to do that, anyway
> (there is also the case where renaming fails because of lack of
> permissions; in that case, picking a different file name won't help).

This has made me think about what happens going the other way, i.e. when a
user-supplied Unicode string needs to be converted to UTF-8b. That should
also be reversible.

Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF should map to bytes 0x80..0xFF; all other codepoints, including the remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF should map to U+DC80..U+DCFF; all other bytes, including the encodings for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string containing half surrogates it can be encoded to bytes and then decoded back to the original string.
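
The two rules above can be sketched as a custom codec error handler layered
on the standard UTF-8 codec. This is only an illustration in Python 3, not
an implementation from the PEP: the handler name "utf8b" and the helpers
utf8b_encode() and utf8b_decode() are made up for the example. It assumes
that the UTF-8 codec reports lone surrogates and invalid bytes to the error
handler, which is how CPython 3 behaves.

```python
import codecs

# Code points U+DC80..U+DCFF stand for the raw bytes 0x80..0xFF.
ESCAPE_LO, ESCAPE_HI = 0xDC80, 0xDCFF


def _utf8b_handler(exc):
    """Hypothetical error handler implementing the two rules above."""
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if ESCAPE_LO <= cp <= ESCAPE_HI:
                # Escaped byte: emit the original byte 0x80..0xFF.
                out.append(cp - 0xDC00)
            elif 0xD800 <= cp <= 0xDFFF:
                # Any other half surrogate: plain 3-byte UTF-8 form.
                out += bytes((0xE0 | (cp >> 12),
                              0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)))
            else:
                raise exc
        return bytes(out), exc.end

    if isinstance(exc, UnicodeDecodeError):
        data = exc.object
        seq = data[exc.start:exc.start + 3]
        # Is this the 3-byte encoding of a half surrogate?
        if (len(seq) == 3 and seq[0] == 0xED
                and 0xA0 <= seq[1] <= 0xBF and 0x80 <= seq[2] <= 0xBF):
            cp = 0xD000 | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
            if not ESCAPE_LO <= cp <= ESCAPE_HI:
                # A "remaining" half surrogate: decode it normally.
                return chr(cp), exc.start + 3
        bad = data[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc
        # Undecodable bytes 0x80..0xFF map to U+DC80..U+DCFF.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end

    raise exc


codecs.register_error("utf8b", _utf8b_handler)


def utf8b_encode(text):
    return text.encode("utf-8", "utf8b")


def utf8b_decode(data):
    return data.decode("utf-8", "utf8b")


# Round trip for a string that mixes an escaped byte and another half surrogate.
name = "a\ud800b\udc80"
assert utf8b_decode(utf8b_encode(name)) == name
```

Note that the 3-byte encoding of a code point in U+DC80..U+DCFF is treated
as undecodable on input and escaped byte by byte, since under these rules
the encoder never produces it. The sketch only checks the round trip for
the example shown.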
