[Python-Dev] Python-3.0, unicode, and os.environ

Toshio Kuratomi a.badger at gmail.com
Thu Dec 4 23:13:35 CET 2008


Adam Olsen wrote:

On Thu, Dec 4, 2008 at 2:09 PM, André Malo <nd at perlig.de> wrote:

* Adam Olsen wrote:

On Thu, Dec 4, 2008 at 1:02 PM, Toshio Kuratomi <a.badger at gmail.com> wrote:

I opened up bug http://bugs.python.org/issue4006 a while ago and it was suggested in the report that it's not a bug but a feature and so I should come here to see about getting the feature changed :-)

I have a specific problem with os.environ and a somewhat less important architectural issue with the unicode/bytes handling in certain os.* modules. I'll start with the important one: Currently in python3 there's no way to get at environment variables that are not encoded in the system default encoding. My understanding is that this isn't a problem on Windows systems but on *nix this is a huge problem. Environment variables on *nix are a sequence of non-null bytes. These bytes are almost always "characters" but they do not have to be. Further, there is nothing that requires that the characters be in the same encoding; some of the characters could be in the UTF-8 character set while others are in latin-1, shift-jis, or big-5.

Multiple encoding environments are best described as "batshit insane". It's impossible to handle any of it correctly as text, which is why UTF-8 is becoming a universal standard. For everybody's sanity python should continue to push it.

Here's an example which will become popular soon, I guess: CGI scripts and, of course, WSGI applications. All those get their environment in an unknown encoding. In the worst case one can blow up the application by simply sending strange header lines over the wire. But there's more: consider running the server in the C locale, then probably even a single 8-bit char might break something (?).

I think that's an argument that the framework should reencode all input text into the correct system encoding before passing it on to the CGI script or WSGI app. If the framework doesn't have a clear way to determine the client's encoding then it's all just gibberish anyway. An HTTP 400 or 500 range error code is appropriate here.

The framework can't always encode input bytes into the system encoding for text. Sometimes the framework can be dealing with actual bytes. For instance, if the framework is being asked to reference an actual file on a *nix filesystem the bytes have to match up with the bytes in the filename whether or not those bytes agree with the system encoding.
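To make the filename point concrete, here is a minimal sketch. It assumes a POSIX filesystem that accepts arbitrary non-null bytes in file names (typical on Linux, not guaranteed elsewhere), and the helper name roundtrip_bytes_filename is made up for illustration. It creates a file whose name is not valid UTF-8 and finds it again purely by bytes:

```python
import os
import tempfile


def roundtrip_bytes_filename(raw_name: bytes, payload: bytes):
    """Create a file named by raw bytes, then find and read it back by bytes.

    Hypothetical helper for illustration. Assumes the filesystem stores
    file names as opaque bytes, as most *nix filesystems do.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)  # bytes form of the directory path
        with open(os.path.join(tmp_b, raw_name), "wb") as f:
            f.write(payload)
        # A bytes argument makes os.listdir() return bytes names, so no
        # decoding step is involved at any point.
        names = os.listdir(tmp_b)
        with open(os.path.join(tmp_b, names[0]), "rb") as f:
            data = f.read()
    return names, data


# b"caf\xe9.txt" is "café.txt" in latin-1, which is not valid UTF-8.
names, data = roundtrip_bytes_filename(b"caf\xe9.txt", b"hello")
```

The name that comes back is byte-for-byte the name that went in, whatever the locale says the system encoding is.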

However, some pragmatism is also possible. Many uses of PATH may allow it to be treated as black-box bytes, rather than text. The minimal solution I see is to make os.getenv() and os.putenv() switch to byte modes when given byte arguments, as os.listdir() does. This use case doesn't require the ability to iterate over all environment variables, as os.environb would allow.
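A rough sketch of what such a byte-mode lookup could look like. Note the assumptions: it relies on os.environb and os.supports_bytes_environ, which did not exist when this thread was written and only appeared in later Python 3 releases (3.2 and up, *nix only), and getenv_any is a hypothetical name, not a stdlib function:

```python
import os


def getenv_any(key, default=None):
    """Return bytes for a bytes key and str for a str key.

    Hypothetical helper mirroring how os.listdir() picks its return type
    from its argument type. Assumes Python 3.2+ for os.environb.
    """
    if isinstance(key, bytes):
        if not os.supports_bytes_environ:
            # e.g. Windows, where the native environment is Unicode
            raise NotImplementedError("no bytes environment on this platform")
        return os.environb.get(key, default)
    return os.environ.get(key, default)


if os.supports_bytes_environ:
    # DEMO_RAW_PATH is a made-up variable holding bytes that are not UTF-8.
    os.environb[b"DEMO_RAW_PATH"] = b"/opt/\xff\xfe/bin"
    raw = getenv_any(b"DEMO_RAW_PATH")   # the exact bytes, untouched
    text = getenv_any("DEMO_RAW_PATH")   # a str view of the same variable
```

This only covers single lookups by name, which matches the PATH-as-black-box use case; it does not address iterating over the whole environment.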

I do wonder if controlling the environment given to a subprocess requires os.environb, but it may be too obscure to really matter.

IMHO, environment variables are not text. They are bytes by definition and should be treated as such. I know, there's Windows having unicode-enabled env vars on demand, but there's only trouble with those over there in apache's httpd (when passing them to CGI scripts, oh well...).

Environment variables have textual names, are set via text, frequently contain textual file names or paths, and my shell (bash in gnome-terminal on ubuntu) lets me put unicode text in just fine. The underlying APIs may use bytes, but they're intended to be encoded text.

The example I've started using recently is this: text files on my system contain character data and I expect them to be read into a string type when I open them in python3. However, if a text file contains text that is not encoded in the system default encoding I should still be able to get at the data and perform my own conversion. So I agree with the default of treating environment variables as text. We just need to be able to treat them as bytes when these corner cases come up.
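The text file analogy can be sketched like this. The function names and the choice of UTF-8 as the first attempt are assumptions made for the example, not anything python3 does on its own:

```python
def decode_with_fallback(raw: bytes, fallback_encoding: str) -> str:
    """Decode as UTF-8 if possible, else use a caller-chosen encoding.

    Hypothetical helper: the point is only that the caller can reach the
    raw bytes and apply a conversion of their own choosing.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding)


def read_text_file(path, fallback_encoding: str) -> str:
    """Read a file in binary mode and do the conversion ourselves."""
    with open(path, "rb") as f:  # "rb" gives bytes, no implicit decoding
        return decode_with_fallback(f.read(), fallback_encoding)


# b"caf\xe9" is "café" in latin-1 and is not valid UTF-8.
legacy = decode_with_fallback(b"caf\xe9", "latin-1")
```

Files already allow this because binary mode is always available. The request here is for environment variables to offer the same escape hatch.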

-Toshio
