[Python-Dev] urllib.quote and unquote - Unicode issues

Guido van Rossum guido at python.org
Thu Jul 31 06:01:56 CEST 2008


On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca <matt.giuca at gmail.com> wrote:

Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From RFC 3986, section 1.2.1:

"Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI."

So the string->string proposal is actually correct behaviour. I'm all in favour of a bytes->string version as well, just not with the names "quote" and "unquote".

I'll prepare a new patch shortly which has bytes->string and string->bytes versions of the functions as well. (quote will accept either type, while unquote will output a str, there will be a new function unquote_to_bytes which outputs a bytes - is everyone happy with that?)
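As an illustration of the RFC passage quoted above, here is a minimal sketch of the two-step mapping: characters are first encoded to octets, and only then are the octets percent-encoded. The helper `percent_encode_octets` is hypothetical and exists only for this example; `quote` is the function from `urllib.parse` as it exists in current Python 3, which may differ in detail from the patch under discussion.

```python
from urllib.parse import quote


def percent_encode_octets(octets: bytes) -> str:
    """Hypothetical helper: percent-encode every octet (no 'safe' set)."""
    return "".join(f"%{b:02X}" for b in octets)


text = "é"

# Step 1: map the character to octets with a chosen character encoding.
# Step 2: percent-encode each resulting octet.
utf8_form = percent_encode_octets(text.encode("utf-8"))
latin1_form = percent_encode_octets(text.encode("latin-1"))

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9

# quote() performs both steps; UTF-8 is its default encoding.
print(quote(text) == utf8_form)                         # True
print(quote(text, encoding="latin-1") == latin1_form)   # True
```

The same character yields different percent-encoded forms depending on the encoding, which is why the RFC asks the scheme or protocol element to specify one.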

I'd rather have two pairs of functions, so that those who want to give the readers of their code a clue can do so. I'm not opposed to having redundant functions that accept either string or bytes though, unless others prefer not to.
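A rough sketch of what two explicitly typed pairs can look like, using the names found in `urllib.parse` in current Python 3 (`quote`/`unquote` for text, `quote_from_bytes`/`unquote_to_bytes` for octets). These names are taken from the shipped library and are an assumption with respect to the patch being discussed here; the sample values are made up.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

# Pair 1: text in, text out (UTF-8 by default).
quoted_text = quote("naïve path")
round_text = unquote(quoted_text)
print(quoted_text)  # na%C3%AFve%20path
print(round_text)   # naïve path

# Pair 2: octets in, octets out; the names tell the reader which type is meant.
raw = b"\xff\x00 data"
quoted_raw = quote_from_bytes(raw)
round_raw = unquote_to_bytes(quoted_raw)
print(quoted_raw)   # %FF%00%20data
print(round_raw == raw)  # True

# The redundant form: quote() also accepts bytes and gives the same result.
print(quote(raw) == quoted_raw)  # True
```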

Guido says:

Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text.

Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets (and did their own encoding/decoding) (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.

Sorry, I have yet to look at the tracker (only so many minutes in a day...).

Guido says:

I think the only change is to remove the encoding arguments and ...

You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions? It seems like we may as well have the optional encoding argument, as it does no harm and could be of significant benefit. I'll post a patch with the unquote_to_bytes function, but leave the encoding arguments in until this point is clarified.

I don't mind an encoding argument, as long as it isn't used to change the return type (as Bill was proposing).
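To make that constraint concrete, here is a small sketch using `urllib.parse` from current Python 3 (an assumption relative to the patch in this thread): the `encoding` argument only selects how the octets are decoded, and the return type stays `str` either way. Callers who want octets use a separate function. The sample URI is invented for the example.

```python
from urllib.parse import unquote, unquote_to_bytes

latin1_uri = "caf%E9"  # "café" percent-encoded from its Latin-1 octets

# The encoding argument changes the decoding, never the return type.
as_latin1 = unquote(latin1_uri, encoding="latin-1")
as_utf8 = unquote(latin1_uri)  # default: UTF-8 with errors="replace"

print(type(as_latin1).__name__, type(as_utf8).__name__)  # str str
print(as_latin1)  # café

# 0xE9 alone is not valid UTF-8, so the default decode substitutes U+FFFD.
print(as_utf8 == "caf\ufffd")  # True

# Bytes output comes from a differently named function, not from a flag.
print(unquote_to_bytes(latin1_uri))  # b'caf\xe9'
```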

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


