[Python-Dev] Python-3.0, unicode, and os.environ

Toshio Kuratomi a.badger at gmail.com
Thu Dec 4 21:02:19 CET 2008


I opened up bug http://bugs.python.org/issue4006 a while ago and it was suggested in the report that it's not a bug but a feature and so I should come here to see about getting the feature changed :-)

I have a specific problem with os.environ and a somewhat less important architectural issue with the unicode/bytes handling in certain os.* modules. I'll start with the important one:

Currently in python3 there's no way to get at environment variables that are not encoded in the system default encoding. My understanding is that this isn't a problem on Windows systems but on *nix this is a huge problem. Environment variables on *nix are a sequence of non-null bytes. These bytes are almost always "characters" but they do not have to be. Further, there is nothing that requires that the characters be in the same encoding; some of the characters could be encoded in UTF-8 while others are in latin-1, shift-jis, or big-5.

These mixed encodings can occur for a variety of reasons. Here's an example that isn't too contrived :-)

Swallow is a multi-user shell server hosted at a university in Japan. The OS installed is Fedora 10, where the encoding of all filenames provided by the OS is UTF-8. The administrator of the OS has kept this convention and, among other things, has created a directory to mount an NFS directory from another computer. He calls that "ネットワーク" ("network" in Japanese). Since it's utf-8, that gets put on the filesystem as '\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'

Now the administrators of the fileserver have been maintaining it since before Unicode was invented. Furthermore, they don't want to suffer from the space loss of using utf-8 to encode Japanese so they use shift-jis everywhere. They have a directory on the nfs share for programs that are useful for people on the shell server to access. It's called "プログラム" ("programs" in Japanese). Since they're using shift-jis, the bytes on the filesystem are: '\x83v\x83\x8d\x83O\x83\x89\x83\x80'

The system administrator of the shell server adds the directory of programs to all his users' default PATH variables so then they have this:

PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80

(Note: this is python syntax. In the unix shell you'd likely have octal instead of hex.)
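To make the mixed-encoding point concrete, here is a small sketch that rebuilds the PATH value above as bytes. The helper name try_decode is made up for illustration. It shows that each component decodes under its own encoding while the value as a whole is not valid UTF-8:

```python
# Rebuild the example PATH as raw bytes, exactly as listed above.
utf8_part = b'\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf'
sjis_part = b'\x83v\x83\x8d\x83O\x83\x89\x83\x80'
path = b'/bin:/usr/bin:/usr/local/bin:/mnt/' + utf8_part + b'/' + sjis_part


def try_decode(raw, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Each directory name is fine when decoded with the encoding it was written in.
network_ok = try_decode(utf8_part, 'utf-8') is not None
programs_ok = try_decode(sjis_part, 'shift_jis') is not None

# The combined value cannot be decoded as UTF-8 in one pass.
whole_ok = try_decode(path, 'utf-8') is not None

print(network_ok, programs_ok, whole_ok)
```

No single decode call recovers both directory names, which is why the raw bytes need to stay reachable.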

Now comes the problematic part. One of the users on the system wants to write a python3 program that needs to determine if a needed program is in the user's PATH. He tries to code it like this::

    #!/usr/bin/python3.0

    import os

    # PATH is a single string; split it into its directory entries.
    for directory in os.environ['PATH'].split(os.pathsep):
        programs = os.listdir(directory)

That code raises a KeyError because python3 has silently discarded the PATH due to the shift-jis encoded path elements. Much more importantly, there's no way the programmer can handle the KeyError and actually get the PATH from within python.
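For illustration only, this is roughly what the lookup could look like if the raw bytes of PATH were reachable. The function name find_program is hypothetical, and raw_path is passed in explicitly because how to obtain it from the environment is exactly the open question here:

```python
import os


def find_program(raw_path, name):
    """Return the first directory in raw_path (bytes) that contains name.

    Both arguments are bytes, so no decoding is ever attempted.
    Returns None if no directory contains a regular file called name.
    """
    for directory in raw_path.split(os.pathsep.encode()):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return directory
    return None
```

Staying in bytes from start to finish means a shift-jis directory name is just another path component and never has to be decoded.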

In the bug report I opened, I listed four ways to fix this along with the pros and cons:

1. return mixed unicode and byte types in os.environ and os.getenv

   I think this one is a bad idea. It's the easiest for simple code to deal with but it's repeating the major problem with python2's Unicode handling: mixing unicode and byte types unpredictably.

2. return only byte types in os.environ

   This is conceptually correct but the most annoying option. Technically we're receiving bytes from the C libraries and the C libraries expect bytes in return. But in the common case we will be dealing with things in one encoding so this causes needless effort to the application programmer in the common case.

3. silently ignore non-decodable values when accessing os.environ['PATH'] as we do now but allow access to the full information via os.environ[b'PATH'] and os.getenvb().

   This mirrors the practice of os.listdir('.') vs os.listdir(b'.') and os.getcwd() vs os.getcwdb().

4. raise an exception when non-decodable values are accessed and continue as in #3. This means that os.environ wouldn't be a simple dict as it would need to decode the values when keys are accessed (although it could cache the values).

   This mirrors the practice of open() which is to decode the value for the common case but throw an exception and allow the programmer to decide what to do if all values are not decodable.
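As a rough sketch of how the last two options differ, here is a toy mapping over raw bytes. The class name SketchEnviron and its strict flag are invented for this illustration and are not a proposed implementation. With strict=False an undecodable value looks like a missing key (the third option); with strict=True the decode error propagates (the fourth option). In both modes a bytes key returns the raw value:

```python
class SketchEnviron:
    """Toy stand-in for os.environ; hypothetical, for illustration only."""

    def __init__(self, raw, encoding='utf-8', strict=False):
        self._raw = dict(raw)      # bytes keys -> bytes values
        self._encoding = encoding
        self._strict = strict

    def __getitem__(self, key):
        if isinstance(key, bytes):
            # Bytes in, bytes out: always available, never decoded.
            return self._raw[key]
        value = self._raw[key.encode(self._encoding)]
        try:
            return value.decode(self._encoding)
        except UnicodeDecodeError:
            if self._strict:
                raise                      # fourth option: surface the error
            raise KeyError(key) from None  # third option: pretend it is absent


raw = {b'HOME': b'/home/user', b'PATH': b'/bin:/mnt/\x83v'}
lenient = SketchEnviron(raw)
strict = SketchEnviron(raw, strict=True)
```

Here lenient['PATH'] raises KeyError while strict['PATH'] raises UnicodeDecodeError, and both return the raw bytes for b'PATH'.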

Either #3 or #4 will solve the major problem and both have precedent in python3's current implementation. The difference between them is whether to throw an exception when a non-decodable value is encountered. Here's why I think that's appropriate:

One of the things I enjoy about python is the informative tracebacks that make debugging easy. I think that the ease of debugging is lost when we silently ignore an error. If we look at the difference in coding and debugging for problems with files that aren't encoded in the default encoding (where a traceback is issued) and os.listdir() when filenames aren't in the default encoding (where the filenames are silently ignored), I think we'll see that::

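    #!/usr/bin/python3.0
    # Code with two unicode problems:

    import os, sys

    directory = sys.stdin.readline().strip()
    for filename in os.listdir(directory):
        # Open relative to the directory that was listed, not the cwd.
        myfile = open(os.path.join(directory, filename), 'r')
        # The % operator needs a tuple (not a list) for multiple values.
        print('%s: %s' % (os.path.join(directory, filename), myfile.readline()))
        myfile.close()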

Let's say I write the above code and test it on a directory that's all encoded in the default encoding. I release it to the world. Someone uses it on a system that has files and filenames with mixed encodings. They immediately get a traceback like this:

      File "./test.py", line 7, in <module>
        print(myfile.readline())
      [...]
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 24-26: invalid data

With that information I can diagnose that my program is failing to read a line from a file because the file is not written in the default encoding (utf8 in this case). It points out that myfile on line 7 of test.py is the file object that has issues. I quickly fix it by doing this:

    +unknown_encoded_files = []
     [...]
    +    try:
    -    print(myfile.readline())
    +        print('%s: %s' % (os.path.join(directory, filename),
    +                          myfile.readline()))
    +    except UnicodeDecodeError:
    +        unknown_encoded_files.append(filename)
         myfile.close()
    +if unknown_encoded_files:
    +    print('These files are not in the default encoding:\n  %s' %
    +          '\n  '.join(unknown_encoded_files))

Very simple. The traceback has all the information I need to fix this.
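The details shown in the traceback are also available on the exception object itself. A minimal sketch, with describe_decode_failure being a made-up helper name:

```python
def describe_decode_failure(raw, encoding='utf-8'):
    """Hypothetical helper: None if raw decodes, else (start, end, reason)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # start/end give the offending byte range; reason is the codec's message.
        return exc.start, exc.end, exc.reason
    return None


# A latin-1 encoded line that is not valid UTF-8.
failure = describe_decode_failure(b'first line caf\xe9\n')
print(failure)
```

The point is the same as with the traceback: a raised error carries the location of the bad bytes, so the program (or the programmer) can act on it.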

A little later I get another report from that user that my code is failing to list the first line of all the files in their home directory. This time there's no traceback to point out which of my files is failing, just that some files are being ignored. I ask for the list of files in the directory and get back:

é.txt ñ.txt

I create those files in a directory and they're processed fine. I tell the user that and ask if there's anything special about what's in the files or anything that makes them different. No... they're both text files on his machine. One was created there, though, and the other was copied from another machine. Hmm.. do the filenames show up mangled by any chance? Yes, one of them does but he knows it's correct since it shows up correctly on his machine at home.

Ah ha! That seems to point at an encoding problem. But where? After writing a test and perusing my code for a while, I find my os.listdir() call. directory has to be converted to bytes for this to work. So I change the code like so:

    -for filename in os.listdir(directory):
    +for filename in os.listdir(directory.encode()):
     [...]
    -        unknown_encoded_files.append(filename)
    +        unknown_encoded_files.append(str(filename, errors='replace'))

The code for the fix is simple but the debugging to find the problem is not. Raising an exception instead of silently failing is much better for getting code that works correctly.
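For comparison, here is a hedged sketch of the bytes-based listing with the undecodable names reported explicitly rather than dropped. The function names split_by_decodability and list_directory are invented for this example, and the UTF-8 default is an assumption:

```python
import os


def split_by_decodability(raw_names, encoding='utf-8'):
    """Split bytes filenames into (decoded strings, undecodable bytes)."""
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            undecodable.append(raw)
    return decoded, undecodable


def list_directory(directory):
    """List a directory via its bytes name so that no entry is skipped."""
    return split_by_decodability(os.listdir(directory.encode()))


# Two names as they might sit on disk: one UTF-8, one latin-1.
good, bad = split_by_decodability([b'\xc3\xa9.txt', b'\xf1.txt'])
# For display, the undecodable names can be shown with a replacement character.
shown = [str(name, errors='replace') for name in bad]
```

Nothing is lost this way: the caller sees both lists and decides what to do with the names that did not decode.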

The bug report I opened suggests creating a PEP to address this issue. I think that's a good idea for deciding whether os.listdir() and friends should be changed to raise an exception, but not having any way to get at some environment variables seems like it's just a bug that needs to be addressed. What do other people think on both these issues?

-Toshio
