[Python-Dev] urllib.quote and unicode bug resuscitation attempt
John J Lee jjl at pobox.com
Tue Jul 11 20:43:22 CEST 2006
On Tue, 11 Jul 2006, Stefan Rank wrote:
urllib.quote fails on unicode strings and in an unhelpful way:: [...]

    >>> urllib.quote(u'a\xf1a')
    Traceback (most recent call last):
      File "", line 1, in ?
      File "C:\Python24\lib\urllib.py", line 1117, in quote
        res = map(safe_map.__getitem__, s)
    KeyError: u'\xf1'

More helpful than silently producing the wrong answer.
[...]
I suggest to add (after 2.5 I assume) one of the following to the beginning of urllib.quote to either fail early and consistently on unicode arguments and improve the error message::

    if isinstance(s, unicode):
        raise TypeError("quote needs a byte string argument, not unicode,"
                        " use argument.encode('utf-8') first.")

Won't this break existing code that catches the KeyError, for no big benefit? If nobody is yet sure what the Right Thing is (see below), I think we should not change this yet.
or to do The Right Thing (tm), which is utf-8 encoding::

    if isinstance(s, unicode):
        s = s.encode('utf-8')

as suggested in http://www.w3.org/International/O-URL-code.html and rfc3986.
You seem quite confident of that. You may be correct, but have you read all of the following? (not trying to claim superior knowledge by asking that, I just dunno what the right thing is yet: I haven't yet read RFC 2617 or got my head around what the unicode issues are or how they should apply to the Python stdlib)
http://www.ietf.org/rfc/rfc2617.txt
http://www.ietf.org/rfc/rfc2616.txt
http://en.wikipedia.org/wiki/Percent-encoding
http://mail.python.org/pipermail/python-dev/2004-September/048944.html
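
For reference, the two options quoted above can be sketched roughly as follows. This is only an illustration, not the stdlib code: it is written against Python 3's urllib.parse.quote (where str plays the role of unicode and bytes the role of the byte string), and the wrapper names quote_strict and quote_utf8 are hypothetical. The last line shows why the choice of encoding matters: the same character percent-encodes differently under Latin-1 and UTF-8.

```python
from urllib.parse import quote


def quote_strict(s, safe="/"):
    """Option 1 (hypothetical name): reject text, the caller must encode first."""
    if isinstance(s, str):
        raise TypeError("quote needs a byte string argument, not unicode,"
                        " use argument.encode('utf-8') first.")
    return quote(s, safe=safe)


def quote_utf8(s, safe="/"):
    """Option 2 (hypothetical name): encode text as UTF-8, then percent-encode."""
    if isinstance(s, str):
        s = s.encode("utf-8")
    return quote(s, safe=safe)


print(quote_utf8("a\xf1a"))                # a%C3%B1a
print(quote("a\xf1a".encode("latin-1")))   # a%F1a
```

Whether option 2 is correct depends on the receiving end expecting UTF-8, which is the open question in this thread.
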
Also note the recent discussions here about a module named "uriparse" or "urischemes", which fits into this somewhere. It would be good to make all the following changes in a single Python release (2.6, with luck):

- extend / modify urllib and urllib2 to handle unicode input
- address the urllib.quote issue you raise above (+ consider the other utility functions in that module)
- add the urischemes module

In summary, I agree that your suggested fix (and all of the rest I refer to above) should wait for 2.6, unless somebody (Martin?) who understands all these issues is quite confident your suggested change is OK. Presumably the release managers wouldn't allow it in 2.5 anyway.
John