[Python-Dev] Python-3.0, unicode, and os.environ

Stephen J. Turnbull stephen at xemacs.org
Fri Dec 12 02:55:52 CET 2008

Steve Holden writes:

Ulrich Eckhardt writes:

What I'd just like some feedback on is the approach to return a distinct type (neither a byte string nor a Unicode string) from readdir().

This is presumably unacceptable on the grounds that it will break existing code that does something more or less useful more or less some of the time.

If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user.

Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-law like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation.

The implementation issue is why you want bytes, but I don't think it is going to overcome the tide of (semantically-oriented) pragmatism.

If you don't, then at least if you have the string in its bytes form you can re-present it to the filesystem to manipulate the file. What are we supposed to do with the "special type"?

Trivially convert it back to bytes and re-present it to the filesystem, of course.

I gather that the BDFL's line on this thread of discussion is that forcing programmers to think about encodings every time they call out to the OS is unacceptable when most programs will work acceptably almost all of the time with a rather naive approach. This means that almost all Python programs will be technically broken for the foreseeable future. Sorry, Ulrich.

And for the same pragmatic reasons, I expect these functions are going to return strings (i.e., Unicode), not bytes. Sorry, Steve.

What needs to be determined here is the best way to provide reliability to those who will go to the effort of asking for it if it's available. I don't think "just return bytes" fits the bill for the reason above.

What I would like to see is a type that is derived from string (so if you present it to an API expecting string, it is silently treated as string), but from which the original bytes can always be extracted on request. If the original bytes cannot be sensibly decoded to a string, then the string field in the object would either contain something that should normally cause an error in a string API, or some made-up string (presumably it would attempt to be a more or less faithful representation of the bytes) at the caller's option. Probably they'd also contain some metadata useful in guessing encodings (the read time locale in particular).
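To make the idea concrete, here is a minimal sketch of such a type, not a worked-out design. The class name `OSString`, its attributes, and the `errors` parameter are all hypothetical. For the "caller's option" it leans on two standard codec error handlers as stand-ins: `surrogateescape` yields a string that a strict encoder will reject, while `replace` yields a made-up but harmless string.

```python
import locale


class OSString(str):
    """Hypothetical str subclass that remembers the bytes it came from."""

    def __new__(cls, raw, encoding="utf-8", errors="surrogateescape"):
        try:
            text = raw.decode(encoding)  # strict decode first
            decodable = True
        except UnicodeDecodeError:
            # Caller's option: "surrogateescape" leaves lone surrogates that
            # strict encoders reject; "replace" substitutes U+FFFD instead.
            text = raw.decode(encoding, errors=errors)
            decodable = False
        self = super().__new__(cls, text)
        self.raw = raw                # original bytes, always recoverable
        self.encoding = encoding      # the encoding that was tried
        self.decodable = decodable    # False if the strict decode failed
        # Assumed metadata: the preferred encoding of the locale at read time.
        self.read_locale_encoding = locale.getpreferredencoding(False)
        return self


good = OSString(b"caf\xc3\xa9")                   # valid UTF-8
bad = OSString(b"caf\xe9", errors="replace")      # Latin-1 bytes, not UTF-8
strict_bad = OSString(b"caf\xe9")                 # default: surrogateescape
```

An API expecting a string sees `good`, `bad`, and `strict_bad` as ordinary `str` objects, while code that cares can hand `obj.raw` straight back to the filesystem.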

These objects probably shouldn't support string-like operations in a general way (i.e., maintaining both the string representation and the bytes "correctly"). Rather, using "proper" string operations on them would use the string content and produce plain strings. People who really want to handle mixed-encoding pathnames and the like would have to keep collections of these objects and handle them in an ad hoc way.
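As a rough illustration of that behaviour, assuming a bare-bones hypothetical `RawName` class: Python's built-in `str` methods return plain `str` even when called on a subclass instance, so the bytes are dropped as soon as a string operation is applied, and anyone who needs them has to hold on to the original objects.

```python
class RawName(str):
    """Hypothetical str subclass carrying the undecoded bytes in .raw."""

    def __new__(cls, raw, encoding="utf-8"):
        self = super().__new__(cls, raw.decode(encoding, errors="replace"))
        self.raw = raw
        return self


# A mixed bag: one name is not valid UTF-8, one is plain ASCII.
names = [RawName(b"r\xe9sum\xe9.txt"), RawName(b"notes.txt")]

# Ordinary string operations use the string content and yield plain str.
upper = names[0].upper()
joined = names[1] + ".bak"

# Ad hoc handling: sort by the string view, but keep the objects themselves
# so the original bytes are still available for calls back to the OS.
ordered = sorted(names, key=str.lower)
raw_for_os = [n.raw for n in ordered]
```

Here `upper` and `joined` are plain `str` with no `raw` attribute, while every element of `ordered` still carries its bytes.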

Unfortunately, implementing this is way beyond my skills and time availability.
