[Python-Dev] Python-3.0, unicode, and os.environ

Adam Olsen rhamph at gmail.com
Fri Dec 5 00:15:47 CET 2008


On Thu, Dec 4, 2008 at 3:47 PM, André Malo <nd at perlig.de> wrote:

> * Adam Olsen wrote:
>
>> On Thu, Dec 4, 2008 at 2:09 PM, André Malo <nd at perlig.de> wrote:
>> > Here's an example which will become popular soon, I guess: CGI scripts
>> > and, of course WSGI applications. All those get their environment in an
>> > unknown encoding. In the worst case one can blow up the application by
>> > simply sending strange header lines over the wire. But there's more:
>> > consider running the server in C locale, then probably even a single 8
>> > bit char might break something (?).
>>
>> I think that's an argument that the framework should reencode all input
>> text into the correct system encoding before passing it on to the CGI
>> script or WSGI app. If the framework doesn't have a clear way to
>> determine the client's encoding then it's all just gibberish anyway. A
>> HTTP 400 or 500 range error code is appropriate here.
>
> Duh. See, you're already mixing different encodings and creating issues
> here! You're talking about client encoding (whatever that is) with correct
> system encoding (whatever that is, too) in the same paragraph and assume
> they are the same or compatible.

Mixing can work so long as the encoding is clearly specified and unambiguous. It limits your character set to a common subset of both encodings, but that's a lesser problem.
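
As a minimal sketch of the "common subset" point, assume the client encoding is actually declared (here Latin-1) and the system encoding is known. The helper name `reencode` is hypothetical and not from any framework; it decodes strictly and re-encodes strictly, so anything outside the shared character set fails loudly instead of being passed along as garbage.

```python
def reencode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Re-encode bytes from a declared source encoding into a target encoding.

    Hypothetical helper. Raises UnicodeError if the bytes are invalid in the
    source encoding or hold characters the target encoding cannot represent.
    """
    text = raw.decode(source_encoding)    # strict: invalid input raises
    return text.encode(target_encoding)   # strict: unrepresentable chars raise


# A name as a Latin-1 client would send it, re-encoded for a UTF-8 system.
latin1_name = "André".encode("latin-1")
utf8_name = reencode(latin1_name, "latin-1", "utf-8")

# The same name does not fit the common subset of Latin-1 and ASCII.
try:
    reencode(latin1_name, "latin-1", "ascii")
    survived = True
except UnicodeEncodeError:
    survived = False
```

With a UTF-8 target the name passes through intact; with an ASCII target the common subset is too small and the conversion is refused.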

> There are several points here:
>
> - there is no clear way to get a single client encoding for the whole HTTP
>   transaction (headers + body), because there is none. If the whole header
>   set matches the same encoding, it's more or less luck.

If there is no way, via official standards or de facto standards, you should assume ascii and blow up if anything else is given. At that point it's meaningless garbage anyway.
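
A rough sketch of the "assume ascii and blow up" policy, as a framework might apply it to a single header value. Both `BadHeader` and `decode_header_value` are hypothetical names made up for illustration; mapping the exception to an HTTP 400 response is left to the surrounding framework.

```python
class BadHeader(ValueError):
    """Hypothetical error that a framework could map to an HTTP 400 response."""


def decode_header_value(raw: bytes) -> str:
    """Decode a header value as ASCII, rejecting any other octet."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first octet that is not ASCII.
        raise BadHeader(f"non-ASCII octet at offset {exc.start}") from exc


# Plain ASCII passes through unchanged.
content_type = decode_header_value(b"text/html")
```

Under this policy a Latin-1 encoded `b"Andr\xe9"` raises `BadHeader`, which is exactly the behaviour the thread disagrees about.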

> - there is no correct system encoding either. As said, I prefer running my
>   servers in C locale, so it's all ascii. In fact, it shouldn't matter. The
>   locale should not have anything to do with an application called over the
>   network.

I half agree: the network should be unaffected by the C locale. However, using a C locale should limit you to ascii file names and environment variables.

> - A 400 or 500 response for a header containing something like my name is
>   not appropriate.
>
> - Octets in HTTP headers are allowed. And they are what they are - octets.
>   The interpretation has to be left to the application, not the framework.

If there is no clear interpretation then they're garbage. If there is a clear interpretation it could be done just as well in the framework, which also lets all the apps benefit from a single implementation, rather than trying to reimplement it for each one.

>> However, some pragmatism is also possible. Many uses of PATH may
>> allow it to be treated as black-box bytes, rather than text. The
>> minimal solution I see is to make os.getenv() and os.putenv() switch
>> to byte modes when given byte arguments, as os.listdir() does. This
>> use case doesn't require the ability to iterate over all environment
>> variables, as os.environb would allow.
>>
>> I do wonder if controlling the environment given to a subprocess
>> requires os.environb, but it may be too obscure to really matter.
>
> IMHO, environment variables are no text. They are bytes by definition
> and should be treated as such.
> I know, there's windows having unicode enabled env vars on demand, but
> there's only trouble with those over there in apache's httpd (when
> passing them to CGI scripts, oh well...).

>> Environment variables have textual names, are set via text, frequently
>
> Well, think about my example again. The friendly way to maintain them is
> not the issue. The problems arise at least when the variables are set by
> an attacker.

Maintaining them IS the issue. The whole reason they're text in the first place is to display them to and receive them back from the user. Otherwise we'd use meaningless serial numbers for directories or something.

It may not seem to matter in this use case, but that's only because they're communicated to/from the user on another system.

>> contain textual file names or paths, and my shell (bash in
>> gnome-terminal on ubuntu) lets me put unicode text in just fine. The
>> underlying APIs may use bytes, but they're intended to be encoded text.
>
> Yes, encoded text == bytes. No, they're intended to be c-strings. And
> well, even if we assume that they should contain text (as in encoded
> unicode), their meaning is application specific and so is the encoding
> (even if it's mixed). What I'm saying is: I don't see much use for unicode
> APIs for the environment at all, because I don't know what's in there
> before inspecting them. And apparently the only reliable way to inspect
> them is via a byte oriented API.

If you don't think your paths should contain text then please alter your other systems to stop using Japanese names. Standardize on ascii serial numbers or something equally meaningless.

Treating it as bytes is a bodge. It's worth getting your use case to "just work", but in the end it is text, and the only broad solution to text is unicode.
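
For illustration, here is how the bytes and text views sit side by side in current Python 3. This is a sketch and not part of the proposal in this thread: it assumes a release that provides `os.environb`, `os.supports_bytes_environ`, `os.fsencode()` and `os.fsdecode()`, all of which postdate this message. The variable name `GR_DEMO_VAR` is made up for the example.

```python
import os

# os.listdir() mirrors the type of its argument: str in, str out;
# bytes in, bytes out.
text_names = os.listdir(".")
byte_names = os.listdir(b".")

# os.fsdecode() turns the black-box bytes back into text using the
# filesystem encoding.
decoded_names = [os.fsdecode(name) for name in byte_names]

# The bytes view of the environment only exists on platforms whose native
# environment is bytes, so check before touching os.environb.
os.environ["GR_DEMO_VAR"] = "value"
if os.supports_bytes_environ:
    raw_value = os.environb[b"GR_DEMO_VAR"]
else:
    raw_value = os.fsencode(os.environ["GR_DEMO_VAR"])
```

The text API stays the default, and the byte API is the escape hatch for values that cannot be decoded, which is roughly the "just work" compromise described above.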

-- Adam Olsen, aka Rhamphoryncus


