[Python-Dev] urllib.quote and unquote - Unicode issues
Bill Janssen janssen at parc.com
Sat Jul 12 23:07:09 CEST 2008
Basically, urllib.quote and unquote seem not to have been updated since Python 2.5, and as a result they implicitly perform Latin-1 encoding and decoding of percent-encoded characters. I think they should default to UTF-8 for a number of reasons, one being that it is what other software, such as web browsers, uses.
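To make the difference concrete, here is a small sketch of what happens when the same percent-encoded octets are decoded under each assumption. It uses unquote_to_bytes from the present-day Python 3 urllib.parse module, which is an assumption of this example and not the code under discussion in this thread.

```python
from urllib.parse import unquote_to_bytes

# "%C3%A9" is the percent-encoded UTF-8 form of "é" (U+00E9).
raw = unquote_to_bytes("%C3%A9")

# Latin-1 maps every octet to exactly one character, so two octets
# come back as two characters.
as_latin1 = raw.decode("latin-1")

# UTF-8 reads the two octets as a single character.
as_utf8 = raw.decode("utf-8")

print(repr(as_latin1), repr(as_utf8))
```

Under the Latin-1 reading, a URL produced by a UTF-8 sender silently turns into different text instead of raising an error, which is why the choice of default matters.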
The standard here is RFC 3986, from Jan 2005, which says,
``When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded.''
The "unreserved set" consists of the following ASCII characters:
``Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" ''
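The procedure quoted above (encode as UTF-8 octets, then percent-encode every octet outside the unreserved set) can be sketched roughly as follows. The name percent_encode is hypothetical and chosen only for illustration; it is not a function in urllib, and the sketch ignores reserved characters and the other wrinkles in the RFC.

```python
import string

# Unreserved set from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~",
# stored as octet values so they can be compared against encoded bytes.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)

def percent_encode(text):
    """Hypothetical helper: UTF-8 encode, then escape non-unreserved octets."""
    pieces = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            pieces.append(chr(octet))              # keep unreserved octets as-is
        else:
            pieces.append("%{:02X}".format(octet))  # escape everything else
    return "".join(pieces)

print(percent_encode("caf\u00e9 menu~1"))
```

Note that a non-ASCII character such as U+00E9 produces two escaped octets (%C3%A9) under this scheme, where a Latin-1 based encoder would produce one (%E9).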
There are a few other wrinkles; it's worth reading section 2.5 carefully.
I'd say: treat the incoming data as Unicode if it's a Unicode string; if it's a byte string, and thus in some unknown encoding, treat it as some unknown superset of ASCII (which covers both Latin-1 and UTF-8). Then apply the appropriate transformation in each case.
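One possible reading of that suggestion, written as a sketch in Python 3 terms (str for Unicode strings, bytes for byte strings). The function name quote_any is hypothetical, and the handling of byte strings is an assumption: ASCII unreserved octets pass through, and every other octet is escaped without guessing at the encoding.

```python
import string

# Octet values of the RFC 3986 unreserved characters.
UNRESERVED = frozenset(
    (string.ascii_letters + string.digits + "-._~").encode("ascii")
)

def quote_any(data):
    """Hypothetical sketch: dispatch on input type before percent-encoding."""
    if isinstance(data, str):
        # Unicode input: the encoding is ours to choose, so use UTF-8.
        octets = data.encode("utf-8")
    elif isinstance(data, bytes):
        # Byte input: encoding unknown, so only the ASCII subset is trusted
        # and the remaining octets are escaped exactly as given.
        octets = data
    else:
        raise TypeError("expected str or bytes")
    return "".join(
        chr(o) if o in UNRESERVED else "%{:02X}".format(o) for o in octets
    )

print(quote_any("\u00e9"), quote_any(b"\xe9"))
```

With this approach the same character can yield different output depending on how the caller supplied it: the Unicode string gives %C3%A9, while a Latin-1 byte string gives %E9, since the bytes are passed through untouched.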
Bill