[Python-Dev] Python-3.0, unicode, and os.environ

glyph at divmod.com
Fri Dec 5 04:52:36 CET 2008


On 02:08 am, tjreedy at udel.edu wrote:

>James Y Knight wrote:
>>On Dec 4, 2008, at 6:39 PM, Martin v. Löwis wrote:
>>>I'm in favour of a different, fifth solution:
>>>
>>>5) represent all environment variables in Unicode strings, including the ones that currently fail to decode. (then do the same to file names, then drop the byte-oriented file operations again)
>>
>>FWIW, I still agree with Martin that that's the most reasonable solution.
>
>FWIW2, I have much the same feeling.

And I still disagree, but I re-read the old thread and didn't see much of a clear argument there, so at least I'm not re-treading old ground :).

The only strategy that would allow us to decode all inputs to unicode (including the invalid ones) is to abuse NUL to mean "ha ha, this isn't actually a unicode string, it's something I couldn't decode". This is nice because it allows the type of the returned value to be the same, so a Python program that expects a unicode object will be able to manipulate this object (as long as it doesn't split it up too close to a NUL).
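To make the trick concrete, a scheme like the one being discussed might look something like the sketch below. This is purely illustrative: the function names and the "NUL followed by the raw byte" convention are my invention, not any real or proposed Python API.

```python
def decode_env_value(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes to str, smuggling undecodable bytes in after a NUL marker."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        out = []
        for b in raw:
            try:
                out.append(bytes([b]).decode(encoding))
            except UnicodeDecodeError:
                # Hypothetical convention: NUL means "next char is a raw byte".
                out.append("\x00" + chr(b))
        return "".join(out)

def encode_env_value(s: str, encoding: str = "utf-8") -> bytes:
    """Reverse the convention to recover the original bytes exactly."""
    out = bytearray()
    it = iter(s)
    for ch in it:
        if ch == "\x00":
            out.append(ord(next(it)))  # raw byte smuggled after the NUL
        else:
            out.extend(ch.encode(encoding))
    return bytes(out)

raw = b"caf\xe9"                  # Latin-1 "café": not valid UTF-8
text = decode_env_value(raw)      # a str, but with a NUL marker inside
assert encode_env_value(text) == raw   # round-trips back to the same bytes
```

The round-trip works, but only for code that knows the convention; anything that manipulates the string naively, or passes it to an API that rejects NULs, breaks exactly as described below.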

It seems to me that this convenient, but clever-clever type distinction will inevitably be a bug magnet. For the most basic example, see the caveat above. But more realistically - not too much code splits filenames on anything but "." or os.sep, after all - if you pass this to an extension module which then wants to invoke a C library function which passes the file name to open() and friends, what is the right thing for the extension module to do? There would need to be a new API which could get the "right" bytes out of a unicode string which potentially has NULs in it. This can't just be an encoding, either, because you might need to get the Shift-JIS bytes (some foreign system's encoding) for some got-NULs-in-it filename even though your locale says the encoding is UTF-8. And what if those bytes happen to be valid Shift-JIS? Decoding bytes makes a lot more sense to me than transcoding strings.
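A small illustration of that hazard (the filename here is made up; the point is only that bytes valid in one encoding need not be valid, or meaningful, in another):

```python
# A filename produced on a Shift-JIS system: perfectly valid there...
sjis_name = "テスト.txt".encode("shift_jis")

# ...but not decodable under a UTF-8 locale, so it cannot become an
# ordinary str without some lossy or clever-clever convention.
try:
    sjis_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert not utf8_ok

# A lenient-but-wrong decoding silently produces mojibake: the bytes
# survive a round-trip, but the "text" is meaningless.
mojibake = sjis_name.decode("latin-1")
assert mojibake.encode("latin-1") == sjis_name
```

Working with the bytes directly sidesteps the question of which decoding was "right" in the first place.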

Filenames and environment variables would all need to be encoded or decoded according to this magic encoding. And what happens if you get some garbage data from elsewhere and pass it to a function that generates a filename? Now, you get a pleasant error message, "TypeError: file() argument 1 must be (encoded string without NULL bytes), not str". In the future, I can only assume (if you're lucky) that you'll get some weird thing out of the guts of an encoding module; or, more likely, some crazy mojibake filename containing PUA code points or whatever will silently get opened. You can make this less likely (and harder to debug in the odd cases where it does happen) by making the encoding more clever, but eventually your luck will run out: most likely on somebody's computer who doesn't speak English well enough to report the problem clearly.
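The quoted message is Python 2's; the exact wording varies by version (recent CPython raises ValueError instead), but the underlying rejection of embedded NULs, which is what the marker convention would collide with, is easy to check:

```python
# open() rejects any path containing an embedded NUL, so a str carrying
# the hypothetical NUL marker could never reach the filesystem directly.
try:
    open("innocent\x00looking.txt")
    raised = None
except (ValueError, TypeError) as e:  # ValueError in modern CPython
    raised = e

assert raised is not None
assert "null" in str(raised).lower()
```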

The scenario gets progressively more nightmarish as you start putting more libraries into the mix. You pass some environment variable into some library which knows all about unicode and happily handles it correctly, but a second library which doesn't know about this proposed tricky NUL convention gets the unicode filename and transcodes it literally, causing an error return from open(). This puts the apparent error very far away from the responsible code.

Ultimately it makes sense to expose the underlying bytes as bytes without forcing everyone to pretend that they make sense as anything but bytes, and allow different applications to make appropriately educated guesses about their character format. In any case, programmers who don't know about these kinds of issues are going to make mistakes in handling invalid filenames on UNIXy systems, and some users won't be able to open some files. If there is an easy and straightforward way to get the bytes out, it's more likely that programmers who know what they are doing will be able to get the correct behavior.

Of course, the NUL-encoding trick will make it possible to do the right thing, but our hypothetically savvy programmer now needs to learn about the bytes/unicode distinction between Windows and Mac+Linux+everything-else, and Python's special convention for invalid data, and how to mix it with encoding/decoding/transcoding, rather than just Python's distinct API for the distinct types that may represent a filename. I think this is significantly harder to document than just having two parallel APIs (environ, environb, open(str), open(bytes), listdir(str), listdir(bytes)) to reflect the very subtle, but nevertheless very real, distinction between the Windows and UNIX worlds.
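On a modern Python 3 the parallel-API style looks like this (the str/bytes duality of os.listdir and open was already in 3.0; os.environb and os.fsencode landed later, in 3.2, so their use here is anachronistic relative to this thread):

```python
import os
import tempfile

# Parallel str/bytes file APIs: the caller picks the representation,
# and the return type matches the argument type.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "hello.txt"), "w") as f:
        f.write("hi")

    assert os.listdir(d) == ["hello.txt"]                # str in, str out
    assert os.listdir(os.fsencode(d)) == [b"hello.txt"]  # bytes in, bytes out

# os.environb (POSIX only) exposes environment variables as raw bytes,
# including values that do not decode in the current locale.
if hasattr(os, "environb"):
    os.environb[b"DEMO_UNDECODABLE"] = b"\xff\xfe"
    assert os.environb[b"DEMO_UNDECODABLE"] == b"\xff\xfe"
```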

This distinct API can still provide the same illusion of "it usually works" portability that the encoding convention can: for Windows, environb can be the representation of the environment in a particular encoding; for UNIX, environ(u) can be all of the variables which correctly decode. And so on for each other API.

At least this time I think I've encapsulated pretty much my entire argument here, so if you don't buy it, we can probably just agree to disagree :).


