[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Mon Apr 27 08:39:41 CEST 2009

Previous message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Next message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

On approximately 4/25/2009 5:35 AM, came the following characters from the keyboard of Martin v. Löwis:

>> Because the encoding is not reliably reversible.
>
> Why do you say that? The encoding is completely reversible (unless we disagree on what "reversible" means).

>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a reversible encoding.
>
> Then please provide an example for a setup where it is not reversible.
>
> Regards, Martin

It is reversible if you know that a string was produced by the decoding, and you apply the encoding to it. But if you don't know whether a string was produced that way, then applying the reverse transform can convert a str that was never decoded, but that happens to match a decoded str, into a byte form that it could have taken, but never did.

The problem is that there is no guarantee that the str interface provides only strictly conforming Unicode. Decoding bytes to non-strictly conforming Unicode can therefore result in a data pun between non-strictly conforming Unicode coming from the str interface and bytes decoded to non-strictly conforming Unicode coming from the bytes interface.

Any particular program that consistently uses one or the other (bytes vs str) API under the covers might never be affected by such a data pun, but programs that use both types of interface could potentially see one.
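A minimal sketch of such a pun, assuming the "surrogateescape" error handler as it later shipped in Python 3.1 (which may differ in detail from the draft under discussion here). The names raw, from_bytes and from_str are illustrative only:

```python
# Bytes interface: a file name that is not valid UTF-8.
raw = b"caf\xff"
from_bytes = raw.decode("utf-8", "surrogateescape")  # 0xFF becomes U+DCFF

# Str interface (assumption: it hands over a lone surrogate unchanged).
from_str = "caf\udcff"

# The two strings are indistinguishable, although only one was ever bytes.
same = from_bytes == from_str

# Encoding the str-native name yields bytes that it never actually had.
punned = from_str.encode("utf-8", "surrogateescape")
print(same, punned)
```

A program that only ever goes bytes -> str -> bytes does not notice this; the ambiguity appears only when strings from both kinds of interface meet.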

If your PEP depends on consistent use of one or the other type of interface, you should say so, and if the platform only provides that type of interface, maybe all is well. Both types of interface are available on Windows. Perhaps POSIX only provides native bytes interfaces, and if the PEP is the only way to provide str interfaces there, then perhaps consistent use is required.

There are still issues regarding how Windows and POSIX programs that share cross-mounted file systems might communicate file names to each other, which is not at all clear from the PEP. If this is an insoluble or unaddressed issue, that should be stated. (It is probably insoluble, because there are multiple ways that cross-mounted file systems might translate names; but if there are such translation rules, can we learn something from the rules the mounting systems use, so as to be compatible with one of them, or not?)

Your change to avoid using PUA characters, together with the rule suggested by MRAB in another branch of this thread of treating half-surrogates as invalid byte sequences, may avoid the data puns I'm concerned about.
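As a sketch of that rule, again assuming the "surrogateescape" handler from Python 3.1 and later rather than the draft text: bytes that look like the UTF-8 form of a half-surrogate are rejected by the decoder and escaped byte by byte, so they cannot collide with the escape for a single undecodable byte. The variable names are illustrative only:

```python
# The three bytes that a lenient UTF-8 encoder would emit for U+DCFF.
lone = b"\xed\xb3\xbf"

# The strict decoder treats them as invalid, so each byte is escaped separately.
escaped = lone.decode("utf-8", "surrogateescape")

# The escape for the single undecodable byte 0xFF is a different string.
single = b"\xff".decode("utf-8", "surrogateescape")

print(ascii(escaped), ascii(single))
print(escaped.encode("utf-8", "surrogateescape") == lone)
```

This keeps decoding on the bytes side unambiguous. It does not by itself say anything about lone surrogates that arrive through a str interface.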

It is not clear how half-surrogate characters would be displayed when the user prints or displays such a file name string. It would seem that programs that display file names to users might still have issues with such names; an escaping mechanism that uses displayable characters would have an advantage there.
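To illustrate the display concern, here is a small sketch using only standard codec error handlers. Using "backslashreplace" for display is an assumption made for this example, not something the PEP prescribes:

```python
name = "caf\udcff"  # a file name carrying an escaped undecodable byte

# A strict encode, as needed for a UTF-8 terminal or log file, fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# One displayable alternative: show the half-surrogate as a visible escape.
shown = name.encode("ascii", "backslashreplace").decode("ascii")
print(strict_ok, shown)
```

The displayed form is readable, but it is for presentation only: it is a different string from the name that must be passed back to the file system.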

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
