[Python-Dev] Can the cgi module be made Unicode-aware?

Martin v. Loewis martin@v.loewis.de
11 Apr 2002 18:19:47 +0200


Guido van Rossum <guido@python.org> writes:

> "Content-Type:
> application/x-www-form-urlencoded". Is utf-8 implied for the data
> once the url encoding has been reversed?

I very much doubt it. You probably received that UTF-8 data from a non-standard-conforming browser.

That's partially a bug in HTTP forms, partially a bug in the browsers, and partially a bug in many CGI scripts. The original URL encoding of form parameters (in the URL itself, using GET) does not allow a specification of the encoding; that's the bug in HTTP.

To work around this, all browsers (by silent convention) send form parameters in the encoding that the document was in. So if the document containing the form is in UTF-8, they will send the form parameters in UTF-8. Of course, unless you know what encoding the original document had, there is no way of telling that it is UTF-8.
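A minimal sketch of what this means for a script receiving the data, written for a current Python 3 with urllib.parse (not the cgi module discussed in this thread). The helper decode_form and the sample field value are hypothetical; the point is only that the same percent-encoded bytes decode correctly or incorrectly depending on an encoding the script has to know out of band:

```python
from urllib.parse import parse_qs


def decode_form(raw_query, page_encoding):
    """Decode urlencoded form data, assuming the encoding of the page
    that contained the form is known to the script (hypothetical helper)."""
    return parse_qs(raw_query, encoding=page_encoding, errors="strict")


# The same name, "L\u00f6wis", as a browser would send it from a UTF-8 page
# and from an ISO-8859-1 page. Nothing in the data says which is which.
utf8_query = "name=L%C3%B6wis"
latin1_query = "name=L%F6wis"

right = decode_form(utf8_query, "utf-8")        # correct guess
wrong = decode_form(utf8_query, "iso-8859-1")   # wrong guess: silent mojibake

try:
    decode_form(latin1_query, "utf-8")          # wrong guess: invalid UTF-8
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Guessing ISO-8859-1 never raises, because every byte is a valid ISO-8859-1 character, so a wrong guess in that direction goes unnoticed. Guessing UTF-8 at least fails loudly on most non-UTF-8 input.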

The RFC specifies that, if multipart/form-data is used, text fields should have a Content-Type field, with a charset argument. The bug in the browsers is that they omit the Content-Type declaration for individual fields.
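Per-field Content-Type headers exist only in multipart/form-data bodies, where each field is a MIME part of its own. As a rough sketch, assuming a current Python 3 and the standard email package (not the cgi module), a receiver could honour a declared charset and fall back to a guess when it is omitted. The function decode_parts, its fallback parameter, the boundary XX, and the sample fields are all made up for illustration:

```python
from email import message_from_bytes


def decode_parts(body, fallback):
    """Return {field name: text} for a multipart/form-data body.

    Uses each part's declared charset if present, else the
    caller-supplied fallback (hypothetical helper)."""
    fields = {}
    for part in message_from_bytes(body).get_payload():
        name = part.get_param("name", header="content-disposition")
        charset = part.get_content_charset()   # None if not declared
        raw = part.get_payload(decode=True)    # raw bytes of the field
        fields[name] = raw.decode(charset or fallback)
    return fields


body = (
    b"Content-Type: multipart/form-data; boundary=XX\r\n"
    b"\r\n"
    b"--XX\r\n"
    b'Content-Disposition: form-data; name="name"\r\n'
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"\r\n"
    b"L\xf6wis\r\n"
    b"--XX\r\n"
    b'Content-Disposition: form-data; name="city"\r\n'
    b"\r\n"
    b"K\xc3\xb6ln\r\n"
    b"--XX--\r\n"
)

fields = decode_parts(body, fallback="utf-8")
```

The first field declares its charset and decodes reliably; the second omits it, as browsers do in practice, so the result depends entirely on the fallback being right.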

I've reported this bug for MSIE, Mozilla, and Opera. Some Mozilla author told me that they tried sending a charset= parameter, and that many Web sites broke when this was done - this is the bug in many CGI scripts.

Regards, Martin