[Python-Dev] urllib.quote and unquote - Unicode issues

André Malo nd at perlig.de
Sun Jul 13 20:54:52 CEST 2008

Matt Giuca wrote:

> > This POV is way too browser-centric...
>
> This is but one example. Note that I found web forms to be the least clear-cut example of choosing an encoding. Most of the time applications seem to be using UTF-8, and all the standards I have read are moving towards specifying UTF-8 (from being unspecified). I've never seen a standard specify or even recommend Latin-1.

Ahem. The HTTP standard does ;-)

> Where web forms are concerned, basically setting the form accept-charset or the page charset is the maximum amount of control you have over the encoding. As you say, it can be encoded by another page or the user can override their settings. Then what can you do as the server? Nothing ...

Guessing works pretty well in most of the cases.
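The message does not say which guessing strategy is meant. One common heuristic, shown here purely as an assumption, is to try strict UTF-8 first and fall back to Latin-1 when the octets are not valid UTF-8. The helper name `guess_unquote` is hypothetical; `unquote_to_bytes` is the function from Python 3's `urllib.parse`.

```python
from urllib.parse import unquote_to_bytes


def guess_unquote(quoted: str) -> str:
    """Percent-decode to octets, then guess the character encoding.

    Hypothetical helper: strict UTF-8 first, Latin-1 as the fallback.
    """
    raw = unquote_to_bytes(quoted)
    try:
        # Valid UTF-8 is unlikely to occur by accident in non-ASCII text.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 accepts every octet, so this branch cannot fail.
        return raw.decode("latin-1")
```

Both `guess_unquote("caf%C3%A9")` and `guess_unquote("caf%E9")` come back as `"café"`. The guess can still be wrong for other legacy encodings, which would be decoded as Latin-1 without any error.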

> Exactly. This is exactly my point - Latin-1 is arbitrary from a standards point of view. It's just one of the many legacy encodings we'd like to forget. The UTFs are the only options which support all languages, and UTF-8 is the only ASCII-compatible (and therefore URI-compatible) encoding. So we should aim to support that as the default.

Latin-1 is not exactly arbitrary. Besides being a charset, it maps one-to-one to octet values; hence it's commonly used to encode octets and is therefore a better fallback than any other encoding.
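The one-to-one mapping can be checked directly. The sketch below is an illustration, not part of the original message: it decodes all 256 octet values as Latin-1, confirms that each lands on the code point with the same number, and shows that the same octets are not valid UTF-8. The helper name `decodes_as_utf8` is hypothetical.

```python
# Every possible octet value, 0x00 through 0xFF.
all_octets = bytes(range(256))

# Latin-1 maps octet N to code point N, so decoding never fails.
text = all_octets.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding again returns exactly the original octets.
round_trip = text.encode("latin-1")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here `code_points == list(range(256))` and `round_trip == all_octets` both hold, while `decodes_as_utf8(all_octets)` is false. That lossless round trip is what makes Latin-1 usable as a carrier for arbitrary octets.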

> I agree. However if there was a proper standard we wouldn't have to argue! "Most proper" and "should do" is the most confident we can be when dealing with this standard, as there is no correct encoding.

Well, the standard says there are octets to be encoded. I find that proper enough.

> Does anyone have a suggestion which will be more compatible with the rest of the world than allowing the user to select an encoding, and defaulting to "utf-8"?

Default to latin-1 for decoding and utf-8 for encoding. This might be confusing though, so maybe you've asked the wrong question ;)
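A minimal sketch of what such asymmetric defaults would look like, and why they might confuse. It uses the `encoding` parameters of `quote` and `unquote` in Python 3's `urllib.parse`; the wrapper names `quote_default` and `unquote_default` are hypothetical and only illustrate the suggestion, not any actual library default.

```python
from urllib.parse import quote, unquote


def quote_default(text: str) -> str:
    """Percent-encode using UTF-8 (the suggested encoding default)."""
    return quote(text, encoding="utf-8")


def unquote_default(quoted: str) -> str:
    """Percent-decode using Latin-1 (the suggested decoding default)."""
    return unquote(quoted, encoding="latin-1")


encoded = quote_default("\u00e9")      # U+00E9 becomes "%C3%A9"
decoded = unquote_default(encoded)     # two Latin-1 characters, not one
```

With these defaults `unquote_default("%E9")` gives the single character U+00E9, but `decoded` is the two characters U+00C3 U+00A9, so quoting and then unquoting is not an identity for non-ASCII text. The decoded string still preserves the original octets and can be recovered with `decoded.encode("latin-1").decode("utf-8")`.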

nd

Real programmers confuse Christmas and Halloween because DEC 25 = OCT 31. -- Unknown

                                  (found in ssl_engine_mutex.c)
