[Python-Dev] urllib.quote and unquote - Unicode issues

Bill Janssen janssen at parc.com
Sat Jul 12 23:07:09 CEST 2008

Basically, urllib.quote and unquote seem not to have been updated since Python 2.5, and because of this they implicitly perform Latin-1 encoding and decoding of percent-encoded characters. I think they should default to UTF-8 for a number of reasons, including that this is what other software, such as web browsers, uses.
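
To make the decoding side of that concrete, here is a small sketch of how the same percent-encoded octets read under the two encodings. It uses urllib.parse.unquote_to_bytes from the Python 3 standard library purely for illustration; that is an assumption on my part and not the urllib.quote/unquote API under discussion here.

```python
from urllib.parse import unquote_to_bytes

# "%C3%A9" is the percent-encoded form of the two UTF-8 octets for U+00E9.
raw = unquote_to_bytes("%C3%A9")

# Decoding the octets as UTF-8 recovers the single intended character.
as_utf8 = raw.decode("utf-8")

# Decoding the same octets as Latin-1 yields two unrelated characters.
as_latin1 = raw.decode("latin-1")

print(repr(raw), repr(as_utf8), repr(as_latin1))
```

The octets are identical in both cases; only the implicit choice of encoding changes what the caller sees.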

The standard here is RFC 3986, from Jan 2005, which says,

``When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.''

The "unreserved set" consists of the following ASCII characters:

``Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" ''
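
The two quoted rules can be sketched in a few lines of Python 3. This is only a minimal illustration of the RFC text as I read it, not the urllib implementation; the names UNRESERVED and percent_encode are hypothetical, and it ignores the reserved characters and the other wrinkles of section 2.5.

```python
import string

# RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
# Stored as a set of octet values so it can be tested against encoded bytes.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def percent_encode(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved octets."""
    pieces = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            # Unreserved octets are copied through unchanged.
            pieces.append(chr(octet))
        else:
            # Everything else becomes "%" plus two uppercase hex digits.
            pieces.append("%{:02X}".format(octet))
    return "".join(pieces)


print(percent_encode("caf\u00e9"))  # caf%C3%A9
print(percent_encode("a b/c"))      # a%20b%2Fc
```

For comparison, encoding U+00E9 as Latin-1 first would give the single escape %E9, which is the behaviour being objected to.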

There are a few other wrinkles; it's worth reading section 2.5 carefully.

I'd say: treat the incoming data as Unicode if it's a Unicode string, or as some unknown superset of ASCII (which covers both Latin-1 and UTF-8) if it's a byte-string and therefore in an unknown encoding, and apply the appropriate transformation in each case.
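
One possible reading of that suggestion, sketched in Python 3 terms (str for a Unicode string, bytes for a byte-string). The function name quote_any is hypothetical, and escaping the octets of a byte-string as-is is my assumption about what "the appropriate transformation" would be.

```python
import string

# RFC 3986 unreserved set, as octet values.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)


def quote_any(data):
    """Hypothetical sketch: percent-encode either a str or a bytes value."""
    if isinstance(data, str):
        # Unicode string: the encoding is ours to choose, so use UTF-8.
        octets = data.encode("utf-8")
    else:
        # Byte-string: encoding unknown, so leave the octets untouched
        # and rely only on the ASCII range being interpreted as ASCII.
        octets = bytes(data)
    return "".join(
        chr(b) if b in UNRESERVED else "%{:02X}".format(b) for b in octets
    )


print(quote_any("caf\u00e9"))       # caf%C3%A9
print(quote_any(b"caf\xe9"))        # caf%E9
print(quote_any(b"caf\xc3\xa9"))    # caf%C3%A9
```

Under this reading the function never guesses at a byte-string's encoding; a caller who already has Latin-1 or UTF-8 bytes gets exactly those bytes escaped.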

Bill
