[Python-Dev] urllib.quote and unquote - Unicode issues

Matt Giuca matt.giuca at gmail.com
Thu Jul 31 05:49:17 CEST 2008


Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From RFC 3986, section 1.2.1 <http://tools.ietf.org/html/rfc3986#section-1.2.1>:

Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.

So the string->string proposal is actually correct behaviour. I'm all in favour of a bytes->string version as well, just not with the names "quote" and "unquote".
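To make the RFC's point concrete, here is a small sketch showing that the same character produces different percent-encoded octets depending on the character encoding chosen. It uses urllib.parse.quote as it exists in current Python 3, which may differ in detail from the patch discussed in this thread; the variable names are only illustrative.

```python
from urllib.parse import quote

# One character, U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
ch = "\u00e9"

# The character is first mapped to octets, then each octet is percent-encoded.
utf8_form = quote(ch, encoding="utf-8")      # two octets: C3 A9
latin1_form = quote(ch, encoding="latin-1")  # one octet: E9

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
```

The percent-encoded form alone does not say which mapping was used, which is why the RFC asks the scheme or protocol to specify it.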

I'll prepare a new patch shortly which has bytes->string and string->bytes versions of the functions as well. (quote will accept either type, while unquote will output a str; there will be a new function, unquote_to_bytes, which outputs bytes. Is everyone happy with that?)
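As a rough illustration of the proposed split, the sketch below uses the functions of these names in current Python 3's urllib.parse. Treat it as an approximation of the proposal rather than the patch itself; the sample string is made up.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

text = "caf\u00e9/menu"  # hypothetical sample path

# quote accepts either str or bytes and always returns str.
quoted_from_str = quote(text)
quoted_from_bytes = quote(text.encode("utf-8"))

# unquote returns str; unquote_to_bytes returns the raw octets.
as_text = unquote(quoted_from_str)
as_bytes = unquote_to_bytes(quoted_from_str)

print(quoted_from_str)  # caf%C3%A9/menu
print(type(as_text).__name__, type(as_bytes).__name__)  # str bytes
```

Keeping the bytes-returning variant under a separate name means callers of unquote never have to check which type they got back.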

Guido says:

Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text.

Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets and did their own encoding/decoding (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.

Guido says:

I think the only change is to remove the encoding arguments and ...

You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions? It seems like we may as well have the optional encoding argument, as it does no harm and could be of significant benefit. I'll post a patch with the unquote_to_bytes function, but leave the encoding arguments in until this point is clarified.
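For a sense of what the optional argument buys, here is a minimal sketch of decoding a URL that was percent-encoded from Latin-1 rather than UTF-8. It relies on the encoding= parameter and the default errors='replace' behaviour of unquote in current Python 3, which is an assumption about how the final API turned out, not something stated in this thread; the URL is hypothetical.

```python
from urllib.parse import unquote

# Hypothetical path produced by a server that used Latin-1 before quoting.
latin1_url = "/caf%E9"

# With the encoding argument the caller can recover the intended text.
with_encoding = unquote(latin1_url, encoding="latin-1")

# With the UTF-8 default, the lone 0xE9 octet is not valid UTF-8 and is
# replaced by U+FFFD (the default error handler is 'replace').
with_default = unquote(latin1_url)

print(with_encoding)        # /café
print(repr(with_default))   # '/caf\ufffd'
```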

Matt


