Issue 4006: os.getenv silently discards env variables with non-UTF-8 values

Created on 2008-10-01 07:17 by a.badger, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (16)
msg74118 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-10-01 07:17
On a Linux system with a locale setting whose encoding is utf-8, if you set an environment variable to have a non-utf-8 character, that environment variable silently does not appear in os.environ::

    mkdir ñ
    convmv -f utf-8 -t latin-1 --notest ñ
    for i in * ; do export PATH=$PATH:$i ; done
    echo $PATH
    /usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/home/badger/bin:�
    python3.0
    Python 3.0rc1 (r30rc1:66499, Sep 28 2008, 08:21:09)
    [GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.environ['PATH']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.0/os.py", line 389, in __getitem__
        return self.data[self.keymap(key)]
    KeyError: 'PATH'

I'm uncertain of the impact of this. It was brought up in a discussion of sending non-ASCII data to a CGI-WSGI script where the data would be transferred via os.environ.
msg74138 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-10-01 18:28
For the moment, this case is just not supported.
msg74151 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-02 01:24
It's not a bug, it's a feature! Python3 rejects invalid byte sequences (according to the "default system encoding") from the command line or environment variables. listdir(str) will also drop invalid filenames. Yes, we need a PEP (a FAQ) about invalid byte sequences.
msg74162 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-10-02 14:32
It's not a feature, it's a bug! :-) (I hope you meant to have a smiley too ;-)

As stated in the os.listdir() related bug, on Unix filesystems filenames are a sequence of bytes. The system encoding allows the user-level tools to display the filenames as characters instead of byte sequences and allows you to manipulate the filenames using characters instead of byte sequences. But if you change your locale the user-level tools will interpret the byte sequences as different characters and allow you free access to create files in a different encoding. So in order to work correctly on Unix you must be able to accept byte sequences in place of filenames.

The sad fact of the matter is that while we can be all unicode with data and strings inside of python we will always have to be prepared to handle supposed strings as byte sequences when talking to some things outside of ourselves. Sometimes the border has a specification that tells us what encoding to expect and we can do conversion automatically. But when it doesn't we have to be prepared to 1) tell the user that the data exists but isn't the string type they expected and 2) make the byte sequence available to the user. Silently pretending that the data doesn't exist at all is a bug (maybe a minor bug depending on how often we expect the situation to arise, but still a bug).
msg74198 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-02 22:00
@a.badger: Again, dropping invalid filenames in listdir() is a (very recent) choice of the Python3 design. Please read this document, which explains the current situation of bytes vs unicode: http://wiki.python.org/moin/Python3UnicodeDecodeError

See also the long python-dev mailing list thread about filenames (started a few days ago). Guido just committed my huge patch to support bytes filenames in Python3 trunk. So using Python3 final, you will be able to list all files using os.listdir(b'.') or os.listdir(os.getcwdb()).
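A minimal sketch of the bytes-based listing described above. The helper names `to_display` and `list_names` are made up for illustration; `os.listdir()` and `os.fsencode()` are the only library calls relied on, and UTF-8 is assumed as the display encoding.

```python
import os


def to_display(raw_name, encoding="utf-8"):
    """Decode a bytes filename for display; undecodable bytes become U+FFFD."""
    return raw_name.decode(encoding, "replace")


def list_names(directory):
    """Return sorted (raw_bytes, display_text) pairs for every directory entry.

    Passing a bytes path to os.listdir() yields bytes names, so entries whose
    names are not valid in the current encoding are still returned.
    """
    raw_names = sorted(os.listdir(os.fsencode(directory)))
    return [(raw, to_display(raw)) for raw in raw_names]
```

Calling `list_names('.')` keeps the raw bytes next to a lossy display string, so the original name stays available for opening the file even when the display form contains replacement characters.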
msg74787 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-10-15 01:04
See also issue #4126 which is the opposite :-)
msg76297 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-11-24 04:34
Pardon, but when you close something as wontfix it's polite to say why.
msg76302 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-11-24 05:57
> Pardon, but when you close something as wontfix it's polite to say why.

Can you propose a reasonable way to fix this? People have thought hard, and many days, and nobody could propose a reasonable fix. As 3.0 is going to be released soon, there will be no way to fix it now.
msg76304 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-11-24 06:40
Is it a bug? If so, then it should be retargeted to 3.1 instead of closed wontfix. If it's not a bug then there should be an explanation of why it's not a bug.

As for fixing it, there are several inelegant methods that are better than silently ignoring the problem:

1) return mixed unicode and byte types in os.environ
2) return only byte types in os.environ
3) raise an exception if someone attempts to access an environment variable that cannot be decoded to unicode via the system encoding and allow the value to be accessed as a byte string via another method.
4) silently ignore the non-decodable variables when accessing os.environ the normal way but have another method of accessing it that returns all values as byte strings.

#4 is closest to what was done with os.listdir(). However, I think that approach is wrong for os.listdir() and os.environ because it leads to code that works in simple testing but can start failing mysteriously when it becomes used in more environments. The os.listdir() method will lead to lots of people having to write code that uses the byte methods on Unix and does its own conversion because it's the only thing guaranteed to work on Unix, and the unicode methods on Windows because it's the only thing guaranteed to work there. It degenerates to case #2 except harder to debug and requiring more platform-specific knowledge of the programmer.

#3 seems like the best choice to me as it provides a way for the programmer to discover what's wrong and provide a fix, but people seem to have learned the wrong lessons from the python2 UnicodeEncode/Decode problems so that might not have a large following other than me....

#2 is conceptually correct since environment variables are a point where you're receiving bytes from a non-python environment. However, it's very annoying for the common case where everything in the environment has a single encoding.

#1 is the easiest for simplistic code to deal with but seems to violate the python3 philosophy the most. I don't like it as it takes us to one of the real failings of python2's unicode handling: Not knowing what type of data you're going to get back from a method and therefore not knowing if you have to convert it before passing it on. Please don't do this one as it's two steps forward and one step backwards from where we are now.
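To make option #3 concrete, here is a rough sketch under stated assumptions: the environment is modelled as a plain dict of bytes names to bytes values, and `StrictEnviron`, `UndecodableValueError` and `getbytes` are hypothetical names invented for this illustration, not anything Python provides.

```python
class UndecodableValueError(ValueError):
    """The variable exists, but its value is not valid in the expected encoding."""


class StrictEnviron:
    """Read-only text view over a bytes environment (hypothetical, option #3)."""

    def __init__(self, raw_env, encoding="utf-8"):
        self._raw = dict(raw_env)  # bytes name -> bytes value
        self._encoding = encoding

    def __getitem__(self, name):
        # A genuinely missing variable still raises KeyError here.
        raw_value = self._raw[name.encode(self._encoding)]
        try:
            return raw_value.decode(self._encoding)
        except UnicodeDecodeError:
            # Present but undecodable: report it loudly instead of hiding it.
            raise UndecodableValueError(
                f"{name!r} is set but is not valid {self._encoding}"
            ) from None

    def getbytes(self, name):
        """Return the untouched byte string for callers that handle bytes."""
        return self._raw[name.encode(self._encoding)]
```

The point of the distinct exception type is that "unset" and "set but undecodable" become two different failures, so the programmer can fall back to `getbytes()` only in the second case.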
msg76305 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-11-24 06:52
> Is it a bug?

It's not a bug; see my original reply. This case is just not supported. It may be supported in future versions, but (if it were up to me) not without a PEP first.
msg76308 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-11-24 07:07
I'm sorry but "For the moment, this case is just not supported." is not an explanation of why this is not a bug. It is a statement that the interpreter cannot handle a situation that has arisen. If you said, "We don't believe that any computer has mixed encodings that can show up in environment variables" that would be an explanation of why this is not a bug and I could then give counter-examples of computers that have mixed encodings in their environment variables. So what's the reason this is not a bug?
msg76309 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-11-24 07:39
Toshio Kuratomi wrote:
> So what's the reason this is not a bug?

It's a bug only if the implementation deviates from the specification. In this case, it does not. The behavior is intentional: python deliberately drops environment variables it cannot represent as a string. We know that such environment variables can happen in real life - that's why they get dropped (rather than raising an exception at startup).
msg76315 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-11-24 10:05
@a.badger: The behaviour (drop non-encodable strings) is not really a problem if you configure your program and computer correctly. E.g. you spoke about CGI-WSGI: if your website also speaks UTF-8, you will be able to read all environment variables. So this issue is not important; it only appears when your website/OS is not well configured. I mean the problem is not in Python but outside Python. The PATH variable contains directory names; if you have only names encodable in your filesystem encoding (UTF-8 most of the time), you will be able to use the PATH variable. If a directory has a non-decodable name, rename the directory but don't try to fix Python!
msg76316 - (view)	Author: STINNER Victor (vstinner) *	Date: 2008-11-24 10:19
The bug tracker is maybe not the right place to discuss a new Python3 feature.

> 1) return mixed unicode and byte types in os.environ

One goal of Python3 was to avoid mixing bytes and characters (bytes/str).

> 2) return only byte types in os.environ

os.environ contains text (characters) and so should be decoded as unicode.

> 3) raise an exception if someone attempts to access an environment
> variable that cannot be decoded to unicode via the system encoding and
> allow the value to be accessed as a byte string via another method.
> 4) silently ignore the non-decodable variables when accessing os.environ
> the normal way but have another method of accessing it that returns all
> values as byte strings.

Why not for (3). But what would be the "another method" (4) to access byte strings? The problem of having two methods is that you need consistent objects. Imagine that you have os.environ (unicode) and os.environb (bytes).

Example 1:
    os.environb['PATH'] = b'\xff\xff\xff\xff'
What is the value in os.environ['PATH']?

Example 2:
    os.environb['PATH'] = b'têst'
What is the value in os.environ['PATH']?

Example 3:
    os.environ['PATH'] = 'têst'
What is the value in os.environb['PATH']?

Example 4:
Should I use os.environ['PATH'] or os.environb['PATH'] to get the current PATH?

It introduces many new cases (bugs?) that have to be prepared and tested. If you are motivated, you can contribute a patch to test your ideas ;-) I'm interested in os.environb, but as I wrote, I expect new complex problems :-/
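One possible way to keep the two objects consistent, sketched here only as an assumption: a single bytes mapping is the one store, and the text accessors are a decoding layer over it. `DualEnviron` and its methods are hypothetical names, UTF-8 is assumed as the system encoding, and an undecodable value is treated as a missing key, mirroring what listdir(str) does.

```python
class DualEnviron:
    """Text and bytes access to one shared bytes store (illustrative only)."""

    def __init__(self, encoding="utf-8"):
        self.raw = {}  # single source of truth: bytes name -> bytes value
        self.encoding = encoding

    def set_bytes(self, name, value):
        self.raw[name] = value

    def set_text(self, name, value):
        # Text writes are encoded once and land in the same store.
        self.raw[name.encode(self.encoding)] = value.encode(self.encoding)

    def get_bytes(self, name):
        return self.raw[name]

    def get_text(self, name):
        raw_value = self.raw[name.encode(self.encoding)]
        try:
            return raw_value.decode(self.encoding)
        except UnicodeDecodeError:
            # An undecodable value is hidden from the text view.
            raise KeyError(name) from None
```

Under these assumptions, Example 1 ends in a KeyError from `get_text('PATH')`, and Example 3 stores the encoded form of the text, which `get_bytes(b'PATH')` returns unchanged.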
msg76330 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-11-24 15:51
'''
@a.badger: The behaviour (drop non-encodable strings) is not really a problem if you configure your program and computer correctly. E.g. you spoke about CGI-WSGI: if your website also speaks UTF-8, you will be able to read all environment variables. So this issue is not important; it only appears when your website/OS is not well configured. I mean the problem is not in Python but outside Python. The PATH variable contains directory names; if you have only names encodable in your filesystem encoding (UTF-8 most of the time), you will be able to use the PATH variable. If a directory has a non-decodable name, rename the directory but don't try to fix Python!
'''

The idea that having mixed encodings on a system is a misconfiguration is a fallacy.

1) In a multiuser setup, each user has a choice of what encoding to use. So mixed encodings are both possible and valid.
2) In a legacy system, your operating system may have all utf-8 naming for the core OS but all of the old data files are being mounted with another encoding that the legacy programs on the host expect.
3) On an nfs mount, data may come from users on different machines from widely separated areas using different system encodings.
4) The same thing as 1-3 but applied to any of the data a site may be passing via an environment variable rather than just file and directory names.
5) An application may have to deal with different encodings from the system default due to limitations of another program. Since one of python's many uses is as a glue language, it needs to be able to deal with these quirks.
6) The application you're interfacing with may just be using bytes rather than text in the environment variables.

Let me put it this way: If I write a file in a latin-1 encoding and put it on my system that has a utf-8 system encoding, what does python-3 do?

1) If I try to open it as a text file: "open('filename', 'r')" it throws a UnicodeDecodeError when I attempt to read some non-utf-8 characters from it.
2) As a programmer I then know to open it as binary "open('filename', 'rb')" and do my own decoding of the data now that I've been made aware that I must take this corner case into account.

Some notes:

1) This seems to be the right general procedure to take when handling things that are usually text but can contain arbitrary bytes.
2) This makes use of python's exception infrastructure to tell the programmer plainly what's going wrong instead of silently ignoring values that the programmer may not have encountered in their test data but could exist in the real world. Would you rather get a bug report from a user that says: "FooApp gives me a UnicodeDecodeError traceback pointing at line 345" (how open() works) or "FooApp never authenticates me" (which you then have to track down to the fact that the credentials on the user's system are being passed in an env var and are not in the system encoding)?
3) This learns the correct lesson from python-2's unicode problems: Stop the mixture of bytes and unicode at the border so the programmer can be explicit about how to deal with the odd-ball data there. It does not become squeamish about throwing a Unicode Exception, which is the wrong lesson to learn from python-2.
4) It also doesn't refuse to acknowledge that the world outside python is not as simple and elegant as the world inside python, and allows the programmer to write an interface to that world instead of forcing them to go outside of python to deal with it.
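The open() behaviour described above can be sketched as follows. This is only an illustration: the helper names `read_text_with_fallback` and `demo` are invented here, the encoding is passed explicitly as utf-8 rather than taken from the locale, and latin-1 is assumed as the fallback encoding.

```python
import os
import tempfile


def read_text_with_fallback(path, encoding="utf-8", fallback="latin-1"):
    """Try the expected encoding first; on failure, read bytes and decode explicitly."""
    try:
        with open(path, "r", encoding=encoding) as handle:
            return handle.read()
    except UnicodeDecodeError:
        # The error made the mismatch visible, so now we handle the raw bytes.
        with open(path, "rb") as handle:
            return handle.read().decode(fallback)


def demo():
    """Write a latin-1 file, show the strict read failing, then recover the text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write("ma\u00f1ana".encode("latin-1"))
        try:
            with open(path, "r", encoding="utf-8") as handle:
                handle.read()
            strict_failed = False
        except UnicodeDecodeError:
            strict_failed = True
        return strict_failed, read_text_with_fallback(path)
    finally:
        os.remove(path)
```

The text-mode read fails loudly, which is what tells the caller to switch to binary mode; nothing is silently dropped along the way.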
msg76337 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2008-11-24 16:49
> The bug tracker is maybe not the right place to discuss a new Python3 feature.

It's a bug! But if you guys want it to be a feature, then what mailing list do I need to join? Is there one devoted to Unicode or is python-dev where I need to go?

>> 1) return mixed unicode and byte types in os.environ
> One goal of Python3 was to avoid mixing bytes and characters (bytes/str).

As stated in my evaluation of the four options, +1 to this: option #1 takes us back to the problems encountered in python-2.

>> 2) return only byte types in os.environ
> os.environ contains text (characters) and so should decoded as unicode.

This is correct but is not accurate :-) os.environ, the python variable, contains only unicode because that's the way it's coded. However, the Unix environment which os.environ attempts to give access to contains bytes which are almost always representable as characters. The two caveats are:

1) There's nothing that constrains it to characters -- putting byte sequences that do not include null in the environment is valid.
2) The characters in the environment may be mixed encodings, sometimes due to things outside of the user's control.

>> 3) raise an exception if someone attempts to access an environment
>> variable that cannot be decoded to unicode via the system encoding and
>> allow the value to be accessed as a byte string via another method.
>> 4) silently ignore the non-decodable variables when accessing os.environ
>> the normal way but have another method of accessing it that returns all
>> values as byte strings.
>
> Why not for (3).

Do you mean, "I support 3"? Or did you not finish a thought here?

> But what would be the "another method" (4) to access byte
> string? The problem of having two methods is that you need consistent
> objects.

This is exactly the problem I was talking about in my analysis of #4 in the previous comment. This problem plagues the new os.listdir() method as well by introducing a construct that programmers can use that doesn't give all the information (os.listdir('.')) but also doesn't warn the programmer when the information is not being shown.

> Imagine that you have os.environ (unicode) and os.environb (bytes).
>
> Example 1:
> os.environb['PATH'] = b'\xff\xff\xff\xff'
> What is the value in os.environ['PATH']?

Since option 4 mimics the os.listdir() method, accessing os.environ['PATH'] would give you a KeyError, i.e., the value was silently dropped just as os.listdir('.') does.

> Example 2:
> os.environb['PATH'] = b'têst'
> What is the value in os.environ['PATH']?

This doesn't work in python3 since bytes literals can only contain ASCII characters.

> Example 3:
> os.environ['PATH'] = 'têst'
> What is the value in os.environb['PATH']?

Dependent on the default system encoding. Assuming utf-8 encoding, os.environb['PATH'] == b't\xc3\xaast'

> Example 4:
> should I use os.environ['PATH'] or os.environb['PATH'] to get the current
> PATH?

Should you use os.listdir('.') or os.listdir(b'.') to get the list of files in the current directory? This is where treating pathnames, environment variables, etc. as strings instead of bytes becomes non-simple. Now you have to decide what you really want to know (and possibly keep two slightly different values if you want to know two things).

If you want to keep the path in order to look up commands that the user can run you want os.environb['PATH'] since this is exactly what the shell will use when the user types a command at the command line.

If you want to display the elements of the PATH for the user, you probably want this::

    import os
    import sys

    DEFAULT_PATH = ['/usr/bin', '/bin']  # illustrative fallback value

    try:
        # Normal case: PATH decodes cleanly as text.
        path = os.environ['PATH'].split(':')
    except KeyError:
        try:
            # os.environb is the proposed bytes view of the environment.
            temp_path = os.environb['PATH'].split(b':')
        except KeyError:
            # PATH is not set at all.
            path = DEFAULT_PATH
        else:
            # Decode for display only, replacing undecodable bytes.
            path = []
            for directory in temp_path:
                path.append(str(directory, sys.getdefaultencoding(), 'replace'))

> It introduces many new cases (bugs?) that have to be prepared and tested.

Those bugs are already present. Without taking one of the four options, there's simply no way to code a solution. Take the above code and imagine that there's no way to access the user's PATH variable when a non-default-encoding character is present in the PATH. That means that you're always stuck with the value of DEFAULT_PATH instead of being able to display something reasonable to the user.

(Note, these examples are pretty much the same for option #3 or option #4. The value of option #3 becomes apparent when you use os.getenv('PATH') instead of os.environ['PATH'])

History
Date	User	Action	Args
2022-04-11 14:56:39	admin	set	github: 48256
2008-12-04 20:55:04	Rhamphoryncus	set	nosy: + Rhamphoryncus
2008-11-24 16:49:16	a.badger	set	messages: +
2008-11-24 15:51:59	a.badger	set	messages: +
2008-11-24 10:19:38	vstinner	set	messages: +
2008-11-24 10:05:43	vstinner	set	messages: +
2008-11-24 07:39:46	loewis	set	messages: +
2008-11-24 07:07:11	a.badger	set	messages: +
2008-11-24 06:52:45	loewis	set	messages: +
2008-11-24 06:40:38	a.badger	set	messages: +
2008-11-24 05:57:09	loewis	set	messages: +
2008-11-24 04:34:49	a.badger	set	messages: +
2008-11-23 23:43:06	vstinner	set	status: open -> closed, resolution: wont fix
2008-10-15 01:04:40	vstinner	set	messages: +
2008-10-02 22:00:01	vstinner	set	messages: +
2008-10-02 14:32:22	a.badger	set	messages: +
2008-10-02 01:24:11	vstinner	set	nosy: + vstinner, messages: +
2008-10-01 18:28:28	loewis	set	nosy: + loewis, messages: +, versions: + Python 3.0
2008-10-01 07:17:20	a.badger	create