[Python-Dev] Python-3.0, unicode, and os.environ

André Malo nd at perlig.de
Thu Dec 4 23:47:52 CET 2008


On Thu, Dec 4, 2008 at 2:09 PM, André Malo <nd at perlig.de> wrote:

> Here's an example which will become popular soon, I guess: CGI scripts
> and, of course WSGI applications. All those get their environment in an
> unknown encoding. In the worst case one can blow up the application by
> simply sending strange header lines over the wire. But there's more:
> consider running the server in C locale, then probably even a single 8
> bit char might break something (?).

I think that's an argument that the framework should reencode all input text into the correct system encoding before passing it on to the CGI script or WSGI app. If the framework doesn't have a clear way to determine the client's encoding then it's all just gibberish anyway. A HTTP 400 or 500 range error code is appropriate here.

Duh. See, you're already mixing different encodings and creating issues here! You're talking about client encoding (whatever that is) and correct system encoding (whatever that is, too) in the same paragraph, and assuming they are the same or compatible.
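
To make the worst case concrete, here is a minimal sketch in Python 3. The header bytes and the helper name try_decode are made up for illustration, not taken from any real request or framework. It shows that the same bytes fail under one assumed encoding and "succeed" under another, so neither side can be treated as the correct one without extra knowledge:

```python
# Hypothetical header value as it might arrive over the wire. The bytes are
# "café" encoded as Latin-1, but the server has no way of knowing that.
raw_header = b"caf\xe9"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# C locale (ASCII): a single 8-bit char is enough to make strict decoding fail.
as_ascii = try_decode(raw_header, "ascii")

# The same bytes are not valid UTF-8 either (0xE9 starts an incomplete sequence).
as_utf8 = try_decode(raw_header, "utf-8")

# Latin-1 maps every byte to a code point, so it never fails, whether or not
# it is what the client meant.
as_latin1 = try_decode(raw_header, "latin-1")
```

Under these assumptions only the Latin-1 attempt returns text; whether that text is what the sender intended is a separate question.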

There are several points here:

>> However, some pragmatism is also possible. Many uses of PATH may
>> allow it to be treated as black-box bytes, rather than text. The
>> minimal solution I see is to make os.getenv() and os.putenv() switch
>> to byte modes when given byte arguments, as os.listdir() does. This
>> use case doesn't require the ability to iterate over all environment
>> variables, as os.environb would allow.
>>
>> I do wonder if controlling the environment given to a subprocess
>> requires os.environb, but it may be too obscure to really matter.
>
> IMHO, environment variables are no text. They are bytes by definition
> and should be treated as such.
> I know, there's windows having unicode enabled env vars on demand, but
> there's only trouble with those over there in apache's httpd (when
> passing them to CGI scripts, oh well...).

Environment variables have textual names, are set via text, frequently

Well, think about my example again. The friendly way to maintain them is not the issue. The problems arise at least when the variables are set by an attacker.

contain textual file names or paths, and my shell (bash in gnome-terminal on ubuntu) lets me put unicode text in just fine. The underlying APIs may use bytes, but they're intended to be encoded text.

Yes, encoded text == bytes. No, they're intended to be c-strings. And well, even if we assume that they should contain text (as in encoded unicode), their meaning is application specific and so is the encoding (even if it's mixed).

What I'm saying is: I don't see much use for unicode APIs for the environment at all, because I don't know what's in there before inspecting them. And apparently the only reliable way to inspect them is via a byte oriented API.
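
A minimal sketch of what byte-oriented inspection could look like. It assumes a Unix platform and a Python version that provides os.environb and os.supports_bytes_environ (both were added in Python 3.2, after this thread). The function name inspect_env_bytes and the list of candidate encodings are hypothetical, application-specific choices:

```python
import os


def inspect_env_bytes(name, candidates=("utf-8", "latin-1")):
    """Return (raw_bytes, decoded_text_or_None, encoding_or_None) for one variable.

    `name` must be bytes. The candidate encodings are an application-specific
    guess, tried in order; nothing in the environment itself says which is right.
    """
    if not os.supports_bytes_environ:
        raise RuntimeError("no bytes environment on this platform")
    raw = os.environb.get(name)
    if raw is None:
        # Variable is not set at all.
        return None, None, None
    for enc in candidates:
        try:
            return raw, raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # No candidate fits: hand back the bytes and let the caller decide.
    return raw, None, None
```

The point of the design is that the raw bytes are always returned, so the application can still act on a value even when none of its guesses about the encoding hold.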

nd


