[Python-Dev] urllib unicode handling

Tom Pinckney thomaspinckney3 at gmail.com
Wed May 7 18:11:34 CEST 2008


Maybe I didn't understand the RFC quite right, but it seemed like how
to handle hostnames was left as a choice between IDNA-encoding the
hostname and replacing the non-ASCII characters with dashes? I guess in
practice IDNA is the right decision.

Another part I wasn't clear on is whether urllib.quote() knows if
it's working on URIs, URLs, arbitrary strings, or what. From the
documentation, it looks like it expects to work on just the path
component of URLs. If so, then it doesn't need to understand what to
do if the IRI contains a hostname.
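
To illustrate the path-only reading, here is a small sketch. It assumes
Python 3, where urllib.quote() lives on as urllib.parse.quote() and
encodes non-ASCII text as UTF-8 by default; the 2008 function did not
do that. The sample path and URL are made up for illustration.

```python
from urllib.parse import quote

# Python 3 spelling of urllib.quote(); '/' is the only extra "safe" character by default.
path = "/caf\u00e9/menu du jour.html"

# Non-ASCII text is encoded as UTF-8 and every resulting byte is percent-escaped.
quoted_path = quote(path)

# Passing a whole URL escapes the ':' after the scheme as well, which
# suggests the function is meant for a single component, not a full URL.
quoted_url = quote("http://example.com/a b")

print(quoted_path)
print(quoted_url)
```

Since the function has no notion of a host part, any IDNA handling
would have to happen somewhere else.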

Seems like the other somewhat under-specified part of all of this is
how urllib.unquote() should work. If after percent-decoding it sees
non-ASCII octets, should it try to decode them as UTF-8 and, if that
fails, leave them as is?
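
A rough sketch of that fallback, assuming Python 3 and its
urllib.parse.unquote_to_bytes(). The helper name unquote_prefer_utf8 is
hypothetical, and returning raw bytes is only one possible reading of
"leave them as is".

```python
from urllib.parse import unquote_to_bytes


def unquote_prefer_utf8(s):
    """Hypothetical helper: percent-decode, then decode as UTF-8 if possible."""
    raw = unquote_to_bytes(s)  # percent-decoding always yields octets
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. Latin-1 escapes): hand back the octets untouched.
        return raw


print(repr(unquote_prefer_utf8("caf%C3%A9")))  # valid UTF-8 escape sequence
print(repr(unquote_prefer_utf8("caf%E9")))     # Latin-1 escape, not valid UTF-8
```

One drawback of this design is that the return type depends on the
input, so callers have to be ready for either text or octets.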

On May 7, 2008, at 11:55 AM, Robert Brewer wrote:

"Martin v. Löwis" wrote:

The proper way to implement this would be IRIs (RFC 3987), in particular section 3.1. This is not as simple as just encoding it as UTF-8, as you might have to apply IDNA to the host part.

Code doing so just hasn't been contributed yet. But if someone wanted to do so, it's pretty simple:

    >>> u'www.\u212bngstr\xf6m.com'.encode("idna")
    'www.xn--ngstrm-hua5l.com'

Robert Brewer
fumanchu at aminus.org
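
Putting the two pieces together, a minimal sketch of the IRI-to-URI
mapping from RFC 3987 section 3.1 might look like the following. It
assumes Python 3; the helper name iri_to_uri and the choice of "safe"
characters are illustrative assumptions, userinfo is dropped, IPv6
hosts are not handled, and this is not a complete implementation of
the RFC.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in this sketch; '%' is kept so that
# already-escaped input is not escaped a second time.
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri):
    """Hypothetical helper: IDNA for the host, UTF-8 percent-escapes elsewhere."""
    parts = urlsplit(iri)
    if parts.hostname is None:
        netloc = parts.netloc  # no host part, nothing to IDNA-encode
    else:
        # IDNA applies to the host only (userinfo is ignored in this sketch).
        netloc = parts.hostname.encode("idna").decode("ascii")
        if parts.port is not None:
            netloc = "%s:%d" % (netloc, parts.port)
    # Remaining components: encode as UTF-8, then percent-escape each byte.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://www.\u212bngstr\xf6m.com/caf\xe9"))
```

Splitting the IRI first is what lets the host and the path be encoded
differently, which is the reason plain UTF-8 encoding of the whole
string is not enough.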


