[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Fri Apr 24 21:41:25 CEST 2009

On approximately 4/24/2009 11:40 AM, came the following characters from the keyboard of Stephen J. Turnbull:

Antoine Pitrou writes:
> Stephen J. Turnbull <stephen xemacs.org> writes:
> >
> > Well, the problem is that both parts are false. If you didn't start
> > with a valid string in a known encoding, you shouldn't treat it as
> > characters because it's not. Hand it to a careful API, and you'll get
> > an Exception raised in your face.
>
> Which "careful API" are you talking about?
>
> > OTOH, at least some of those who feel lucky and use it
> > naively are going to turn out to be wrong.
>
> Why will they turn out to be wrong?

Because the encoding is not reliably reversible. That is why I proposed one that is.

To quote the PEP:

""" While providing a uniform API to non-decodable bytes, this interface has the limitation that chosen representation only "works" if the data get converted back to bytes with the python-escape error handler also. Encoding the data with the locale's encoding and the (default) strict error handler will raise an exception, encoding them with UTF-8 will produce non-sensical data. For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. """
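The behaviour the PEP describes can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the error handler as it eventually shipped in Python 3 under the name surrogateescape, whereas the draft quoted here calls it python-escape, and the draft's details may differ. In particular, current Python raises on a strict UTF-8 encode of such a string rather than producing non-sensical data. The file name is made up.

```python
# Assumption: "surrogateescape" is the shipped form of the handler the
# draft PEP calls "python-escape"; the sample file name is invented.
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes.
round_trip = name.encode("utf-8", errors="surrogateescape")
assert round_trip == raw

# Encoding with the default strict handler raises instead.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
assert strict_failed
```

The round trip only works when both directions use the same handler, which is the limitation the quoted paragraph points out.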

And so my encoding

(1) doesn't alter the data stream for any valid Windows file name, which is where the naivest of users reside;
(2) doesn't alter the data stream for any Posix file name that was encoded as UTF-8 sequences and doesn't contain ? characters [I perceive the use of ? in file names to be rare on Posix, from experience and because of the other problems such use causes];
(3) doesn't introduce data puns within applications that are correctly coded to know the encoding occurs.

The encoding technique in the PEP not only can produce data puns, and thus is not reversible, it also provides no reliable mechanism to know that this has occurred.
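This message does not restate the proposed encoding itself, so the following is a hypothetical sketch of one escape-character scheme that has the three properties listed above, not necessarily the scheme proposed earlier in the thread. The function names escape_decode and escape_encode, the "??" spelling of a literal "?", and the two-hex-digit byte escape are all assumptions made for illustration. The sketch borrows Python's surrogateescape handler only as a convenient way to locate the undecodable bytes.

```python
# Hypothetical scheme (assumed, not taken from the thread):
#   "?"  in the byte stream -> "??"
#   undecodable byte 0xNN   -> "?NN" (two uppercase hex digits)
# Everything else is ordinary UTF-8 and passes through unchanged.

def escape_decode(raw: bytes) -> str:
    # surrogateescape is used internally only to find the bad bytes.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "?":
            out.append("??")
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append("?%02X" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

def escape_encode(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "?":
            out += ch.encode("utf-8")
            i += 1
        elif text[i + 1:i + 2] == "?":
            out += b"?"  # "??" is a literal question mark
            i += 2
        else:
            digits = text[i + 1:i + 3]
            if len(digits) != 2:
                raise ValueError("truncated escape at index %d" % i)
            out.append(int(digits, 16))  # raises ValueError if not hex
            i += 3
    return bytes(out)

raw = b"caf\xe9?.txt"
escaped = escape_decode(raw)
assert escaped == "caf?E9??.txt"
assert escape_encode(escaped) == raw
```

Because "?" is not a hex digit, "??" and "?NN" cannot be confused, and an escaped name contains only ordinary characters, so a string that was never escaped cannot be mistaken for one that was as long as the application knows which strings carry the encoding.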

But you can't know that. These are now "just strings", which could end up in pickles and other persistent objects, be passed across network interfaces (remote copy, for example), etc, etc, and there is no way to guarantee that the recipient will understand the rules, unless the application encapsulates them in some kind of representation that says "I look like a Unicode but I'm really just encoded bytes."

This could happen. Well-formed programs need to use the encoding at the boundaries. Python could encapsulate its interfaces to the file system, but cannot encapsulate other interfaces. Fortunately, something that is pickled would probably be unpickled by Python, and therefore all would be well. But any interface that expects a file name, and is not encapsulated by Python, must be encapsulated by the application.

But the whole point is to turn them into plain old strings so people don't have to bother keeping track.

And if that is the point, it isn't worth doing. If the point is to minimize the amount of existing file name manipulation code, built on string operations, that must be reworked to remain functional during a 2to3 migration, then it can be worth doing. But I think it should be done with an encoding that doesn't introduce undetectable data puns, whether mine or some different encoding with that characteristic, and not with the one presently in the PEP, because it does introduce undetectable data puns.

As I already said, this is no worse than the current situation, but it gives the impression that Python has a standard "solution". (Yes, I know Martin doesn't claim it's a solution to any of those problems. The point is user perception.)

I have to wonder whether having a standard way of not solving any problems is better than having no standard way of not solving any problems. It may be, and it probably can't hurt, which is why I'm +0.

Interesting phraseology there, Stephen!

I'm +1 on the concept, -1 on the PEP, due solely to the lack of a reversible encoding.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
