[Python-Dev] urllib.quote and unquote - Unicode issues

Matt Giuca matt.giuca at gmail.com
Mon Jul 14 05:22:23 CEST 2008


On Mon, Jul 14, 2008 at 4:54 AM, André Malo <nd at perlig.de> wrote:

> Ahem. The HTTP standard does ;-)

Really? Can you include a quotation please? The HTTP standard talks a lot about ISO-8859-1 (Latin-1) in terms of actually raw encoded bytes, but not in terms of URI percent-encoding (a different issue) as far as I can tell.

> > Where web forms are concerned, basically setting the form accept-charset or the page charset is the maximum amount of control you have over the encoding. As you say, it can be encoded by another page or the user can override their settings. Then what can you do as the server? Nothing ...
>
> Guessing works pretty well in most of the cases.

Are you suggesting that urllib.unquote guess the encoding? It could do that but it would make things rather unpredictable. I think if this was an application (such as a web browser), then guessing is OK. But this is a library function. Library functions should not make arbitrary decisions; they should be well-specified.

> Latin-1 is not exactly arbitrary. Besides being a charset - it maps one-to-one to octet values, hence it's commonly used to encode octets and is therefore a better fallback than every other encoding.

True. So the only advantage I see to the current implementation is that if you really want to, you can take the Latin-1-decoded URI (from unquote) and explicitly encode it as Latin-1 and then decode it again as whatever encoding you want. But that would be a hack, would it not? I'd prefer if the library didn't require a hack just to get the extremely common use case (UTF-8).
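To make that round trip concrete, here is a minimal sketch of the workaround. It uses only the built-in str and bytes methods; the variable names are illustrative and not taken from any patch.

```python
# Latin-1 maps each code point 0-255 to the octet with the same value,
# so decoding arbitrary octets as Latin-1 and re-encoding loses nothing.
all_octets = bytes(range(256))
assert all_octets.decode('latin-1').encode('latin-1') == all_octets

# What a Latin-1-decoding unquote would hand back for "h%C3%BCllo"
# (the octets C3 BC are the UTF-8 encoding of U+00FC).
mis_decoded = b'h\xc3\xbcllo'.decode('latin-1')

# The hack: recover the octets, then decode with the encoding you wanted.
recovered = mis_decoded.encode('latin-1').decode('utf-8')

print(repr(mis_decoded))
print(repr(recovered))
```

The hack works only because Latin-1 never fails or loses information on any octet; the same trick with any other intermediate encoding would not be reliable.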

> > I agree. However if there was a proper standard we wouldn't have to argue! "Most proper" and "should do" is the most confident we can be when dealing with this standard, as there is no correct encoding.
>
> Well, the standard says, there are octets to be encoded. I find that proper enough.

Yes but unfortunately we aren't talking about octets any more in Python 3, but characters. If we're going to follow the standard and encode octets, then we should be accepting (for quote) and returning (for unquote) bytes objects, not strings. But as that's going to break most existing code and be extremely confusing, I think it's best we try and solve this problem for Unicode strings.

> > Does anyone have a suggestion which will be more compatible with the rest of the world than allowing the user to select an encoding, and defaulting to "utf-8"?
>
> Default to latin-1 for decoding and utf-8 for encoding. This might be confusing though, so maybe you've asked the wrong question ;)

:o that would break so so much existing code, not to mention being horribly inconsistent and confusing. Having said that, that's almost what the current behaviour is (quote uses Latin-1 for characters < 256, and UTF-8 for characters above; unquote uses Latin-1).

Again I bring up the http server example. If you go to a directory, create a file with a name such as '漢字', and then run this code in Python 3.0 from that directory:

    import http.server
    s = http.server.HTTPServer(('',8000), http.server.SimpleHTTPRequestHandler)
    s.serve_forever()

You'll see the file in the directory listing - its HTML will be <a href="%E6%BC%A2%E5%AD%97">漢字</a>. But if you click it, you get a 404 because the server will look for the file named unquote("%E6%BC%A2%E5%AD%97") = 'æ¼¢å\xad\x97'.

If you apply my patch (patch5) everything just works.
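The mismatch can be reproduced without running a server. The sketch below assumes the `encoding` parameter that `urllib.parse.unquote` accepts in current Python 3 releases, which is not the pre-release 3.0 behaviour described above; passing 'latin-1' explicitly stands in for the old default.

```python
from urllib.parse import quote, unquote

name = '漢字'

# The link in the directory listing: percent-encoded UTF-8 octets.
href = quote(name)

# Decoding the octets as Latin-1 yields a name that does not exist on disk.
as_latin1 = unquote(href, encoding='latin-1')

# Decoding them as UTF-8 recovers the original file name.
as_utf8 = unquote(href, encoding='utf-8')

print(href)
print(repr(as_latin1))
print(as_utf8 == name)
```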

On Mon, Jul 14, 2008 at 6:36 AM, Bill Janssen <janssen at parc.com> wrote:

> > Ah there may be some confusion here. We're only dealing with str->str transformations (which in Python 3 means Unicode strings). You can't put a bytes in or get a bytes out of either of these functions. I suggested a "quoteraw" and "unquoteraw" function which would let you do this.
>
> Ah, well, that's a problem. Clearly the unquote is str->bytes, while the quote is (bytes OR str)->str.

OK so for quote, you're suggesting that we accept either a bytes or a str object. That sounds quite reasonable (though neither the unpatched nor the patched version accepts a bytes at the moment). I'd simply change the code in quote (from patch5) to do this:

    if isinstance(s, str):
        s = s.encode(encoding, errors)
    ....
    res = map(quoter, s)

Now you get this behaviour by default (which may appear confusing but I'd argue correct given the different semantics of 'h\xfcllo' and b'h\xfcllo'):

    >>> urllib.parse.quote(b'h\xfcllo')
    'h%FCllo'       # Directly-encoded octets
    >>> urllib.parse.quote('h\xfcllo')
    'h%C3%BCllo'    # UTF-8 encoded string, then encoded octets
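As a rough, self-contained illustration of the "(bytes OR str)->str" idea, the sketch below encodes a str to octets first and then percent-encodes every octet outside a safe set. The function name `quote_sketch`, its defaults, and the contents of the safe set are assumptions made for this example; this is not the code from patch5.

```python
def quote_sketch(s, safe='/', encoding='utf-8', errors='strict'):
    """Percent-encode s, which may be str or bytes (hypothetical helper)."""
    if isinstance(s, str):
        # A character string is first turned into octets.
        s = s.encode(encoding, errors)
    # Octets left unescaped; this particular set is an assumption.
    always_safe = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                   b'abcdefghijklmnopqrstuvwxyz'
                   b'0123456789_.-')
    keep = always_safe + safe.encode('ascii')
    # Iterating over a bytes object yields ints in Python 3.
    return ''.join(chr(b) if b in keep else '%{:02X}'.format(b) for b in s)


print(quote_sketch(b'h\xfcllo'))  # bytes in: octets are escaped as given
print(quote_sketch('h\xfcllo'))   # str in: encoded to UTF-8 first
```

Because bytes input skips the encoding step entirely, the caller who already has octets keeps full control, while the str path applies a single, explicit encoding.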

> Clearly the unquote is str->bytes. You can't pass a Unicode string back as the result of unquote without passing in an encoding specifier, because the character set is application-specific.

So for unquote you're suggesting that it always return a bytes object UNLESS an encoding is specified? As in:

    >>> urllib.parse.unquote('h%C3%BCllo')
    b'h\xc3\xbcllo'

I would object to that on two grounds. Firstly, I wouldn't expect or desire a bytes object. The vast majority of uses for unquote will be to get a character string out, not bytes. Secondly, there is a mountain of code (including about 12 modules in the standard library) which calls unquote and doesn't give the user the encoding option, so it's best if we pick a default that is what the majority of users will expect. I argue that that's UTF-8.

I'd prefer having a separate unquote_raw function which is str->bytes, while the unquote function performs the same role as it always has, which is str->str. But I agree on quote, I think it can be (bytes OR str)->str.
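For comparison, the split between a str->bytes function and a str->str function can be tried with the `urllib.parse` module in current Python 3 releases, where the bytes-returning function is called `unquote_to_bytes` rather than the `unquote_raw` name proposed here. The sketch assumes that modern API and says nothing about what patch5 itself contained.

```python
from urllib.parse import unquote, unquote_to_bytes

# str -> bytes: no character set is assumed, the caller gets raw octets.
raw = unquote_to_bytes('h%C3%BCllo')

# str -> str: the octets are decoded, with UTF-8 as the default.
text = unquote('h%C3%BCllo')

# Callers who know the data is Latin-1 can say so explicitly.
latin = unquote('h%FCllo', encoding='latin-1')

print(raw)
print(text)
print(latin)
```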

Matt