[Python-Dev] Bytes path support

R. David Murray rdmurray at bitdance.com
Sat Aug 23 15:41:22 CEST 2014


On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano <steve at pearwood.info> wrote:

> When I started this email, I originally began to say that the actual problem was with byte file names that cannot be decoded into Unicode using the system encoding (typically UTF-8 on Linux systems). But I've actually had difficulty demonstrating that it actually is a problem. I started with a byte sequence which is invalid UTF-8, namely:
>
> b'ZZ\xdb\xdf\xfa\xff'
>
> created a file with that name, and then tried listing it with os.listdir. Even in Python 3.1 it worked fine. I was able to list the directory and open the file, so I'm not entirely sure where the problem lies exactly. Can somebody demonstrate the failure mode?

The "failure" happens only when you try to cross from the domain of POSIX binary filenames into the domain of text streams (that is, a stream with a consistent encoding). If you stick with os interfaces that handle filenames, Python 3 handles POSIX bytes filenames just fine (though there may be a few corner-case rough edges yet to be fixed, and the standard streams were one of them).
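To illustrate why the os interfaces cope, here is a minimal sketch of the round trip they rely on. It assumes a UTF-8 filesystem encoding and spells out the decode/encode step by hand (on POSIX, os.fsdecode and os.fsencode apply the same "surrogateescape" handler using the actual filesystem encoding); the variable names are mine, not from the thread.

```python
# Steven's example name: not valid UTF-8.
raw_name = b'ZZ\xdb\xdf\xfa\xff'

# Assumption: the filesystem encoding is UTF-8. Each undecodable byte
# is mapped to a lone surrogate code point instead of raising an error.
text_name = raw_name.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes exactly,
# which is why listing and opening such a file keeps working.
round_tripped = text_name.encode('utf-8', 'surrogateescape')

print(ascii(text_name))
print(round_tripped == raw_name)
```

The str form is only a carrier for the original bytes; nothing is lost as long as it stays inside interfaces that encode it back the same way.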

The difficulty comes if you try to use a filename that contains undecodable bytes in a non-os-interface text context (such as writing it to a text file that you have declared to be UTF-8 encoded): there you will get an error. This is not completely unlike the old "your code works until your user uses unicode" problem we had in Python 2, but in this case it happens only in a very narrow set of circumstances: translating between one domain (POSIX binary filenames) and another domain (io streams with a consistent declared encoding). This is not a common operation, but it appears to be the one Oleg is concerned about. The old unicode-blowup errors, by contrast, would happen almost any time someone with a non-ascii language tried to use a program written by an ascii-only programmer (which was most of us).
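A small sketch of that failure mode, under the assumption that the filename was decoded as UTF-8 with surrogateescape (as in Steven's example). An in-memory io.BytesIO wrapped in io.TextIOWrapper stands in for a text file opened with encoding='utf-8'; the variable names are illustrative only.

```python
import io

# A filename as the os interfaces would hand it back (assumed UTF-8
# filesystem encoding): the bad bytes are carried as lone surrogates.
name = b'ZZ\xdb\xdf\xfa\xff'.decode('utf-8', 'surrogateescape')

# Stand-in for open(path, 'w', encoding='utf-8'): a text stream with a
# consistent declared encoding and the default strict error handler.
stream = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')

try:
    stream.write(name + '\n')
    stream.flush()
    failed = False
except UnicodeEncodeError as exc:
    # Lone surrogates cannot be encoded as strict UTF-8.
    failed = True
    print('cannot write filename:', exc.reason)
```

The error appears only at the point where the name crosses into the declared-encoding stream, not when the file is listed or opened.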

The same problem existed in Python 2 if your goal was to produce a stream with a consistent encoding, but now Python 3 treats that as an error. If you really want a stream with an inconsistent encoding, open it as binary and use the surrogateescape error handler to recover the bytes in the filenames. That is, be explicit about your intentions.
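As a rough sketch of that explicit approach: the stream is binary, ordinary text is encoded deliberately, and the filename is encoded with surrogateescape so its original bytes come back. The io.BytesIO object is a stand-in for a file opened with mode 'wb', and UTF-8 is an assumed filesystem encoding.

```python
import io

# Filename as returned by the os interfaces (assumed UTF-8 filesystem
# encoding, undecodable bytes carried as lone surrogates).
name = b'ZZ\xdb\xdf\xfa\xff'.decode('utf-8', 'surrogateescape')

# Stand-in for open(path, 'wb'): the caller decides every encoding.
sink = io.BytesIO()
sink.write('listing: '.encode('utf-8'))
# surrogateescape turns the lone surrogates back into the raw bytes.
sink.write(name.encode('utf-8', 'surrogateescape'))
sink.write(b'\n')

data = sink.getvalue()
print(data)
```

The resulting stream is not valid UTF-8 throughout, which is exactly the inconsistency the binary mode makes visible in the code.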

So yes, we've shifted a burden from those who want non-ascii text to work consistently to those who wanted inconsistently encoded text to "just work" (or rather appear to "just work"). The number of people who benefit from the improved text model greatly outweighs the number of people inconvenienced by the new strictness when the domain line (POSIX binary filenames to consistently encoded text stream) is crossed. And the result is more valid programs, and fewer unexpected errors overall, with no inconvenience unless that domain line is crossed, and even then the inconvenience is limited to the open call that creates the binary stream.

--David


