[Python-Dev] Bytes path support

Steven D'Aprano steve at pearwood.info
Sat Aug 23 13:08:29 CEST 2014


On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote:

> The point is that if you are reading a file name from the system, and then passing it back to the system, then you can treat it as just bytes -- who cares? And if you add the byte value of 47 thing, then you can even do basic path manipulations. But once you want to do other things with your file name, then you need to know the encoding. And it is very, very common for users to need to do other things with filenames, and they almost always want them as text that they can read and understand.
>
> Python3 supports this case very well. But it does indeed make it hard to work with filenames when you don't know the encoding they are in.

Just "not knowing" is not sufficient. In that case, you'll likely get a Unicode string containing mojibake:

I write a file name using UTF-8 on my system:

    filename = 'music by Наӥв.txt'.encode('utf-8')

You try to use it assuming ISO-8859-7 (Greek):

    filename.decode('iso-8859-7') => 'music by Π\x9dΠ°Σ₯Π².txt'

which, even though it looks wrong, still lets you refer to the file (provided you then encode back to bytes with ISO-8859-7 again). This won't always be the case; sometimes the encoding you guess will be wrong in a way that fails to decode or does not round-trip.
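
As a minimal sketch of the round trip described above (the variable names are illustrative and not part of the original message), a wrong guess that happens to decode can still be reversed, while a guess such as ASCII cannot decode these bytes at all:

```python
original = 'music by Наӥв.txt'
raw = original.encode('utf-8')      # the bytes as the file system stores them

# Wrong guess: every byte in this name is defined in Python's ISO-8859-7
# codec, so decoding succeeds but produces mojibake.
garbled = raw.decode('iso-8859-7')

# Encoding with the same (wrong) codec recovers the original bytes,
# so the garbled text can still be used to refer to the file.
recovered = garbled.encode('iso-8859-7')

# A guess that cannot decode the bytes at all: strict ASCII.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(garbled))
print(recovered == raw, ascii_ok)
```

The round trip works here only because the decode step loses no information; it says nothing about whether the text is readable.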

When I started this email, I originally began to say that the actual problem was with byte file names that cannot be decoded into Unicode using the system encoding (typically UTF-8 on Linux systems). But I've had difficulty demonstrating that it really is a problem. I started with a byte sequence which is invalid UTF-8, namely:

b'ZZ\xdb\xdf\xfa\xff'

created a file with that name, and then tried listing it with os.listdir. Even in Python 3.1 it worked fine. I was able to list the directory and open the file, so I'm not entirely sure where the problem lies exactly. Can somebody demonstrate the failure mode?
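
The experiment above can be reproduced roughly as follows. This is only a sketch, assuming a POSIX file system that accepts arbitrary bytes in file names; the helper name list_undecodable_name and the use of a temporary directory are illustrative choices, not taken from the original message.

```python
import os
import tempfile

RAW_NAME = b'ZZ\xdb\xdf\xfa\xff'   # byte sequence from the message; not valid UTF-8


def list_undecodable_name():
    """Create a file named RAW_NAME in a scratch directory, then list and open it.

    Returns (names_as_bytes, names_as_text), or None if the platform or
    file system refuses to create a file with that name.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(os.fsencode(tmp), RAW_NAME)
        try:
            with open(path, 'wb'):
                pass
        except (OSError, ValueError):
            return None
        names_as_bytes = os.listdir(os.fsencode(tmp))   # bytes in, bytes out
        names_as_text = os.listdir(tmp)                 # str in, str out
        # Opening the file through its text name also works.
        with open(os.path.join(tmp, names_as_text[0]), 'rb'):
            pass
        return names_as_bytes, names_as_text


result = list_undecodable_name()
if result is not None:
    print(result[0], [ascii(name) for name in result[1]])
```

On a UTF-8 locale the text form of the name is expected to contain lone surrogates produced by the surrogateescape error handler (PEP 383, available since Python 3.1), which would explain why listing and opening succeed. Trouble would be expected only when such a name is encoded strictly, for example by print() on a UTF-8 terminal.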

-- Steven


