[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

"Martin v. Löwis" martin at v.loewis.de
Wed Sep 15 23:40:01 CEST 2004


Mike Brown wrote:

1. urlopen() cannot reliably process unicode unless there are no percent-encoded octets above %7F and no characters above \u007f (I think that's the gist of it, at least).

And that feature is by design. URLs are conceptually byte strings, not character strings, so passing Unicode strings is mostly a meaningless operation. Mostly - because if the Unicode string is pure ASCII, it probably matches most implementations and user expectations to convert it to pure ASCII first, and then treat it as a URL.
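The ASCII-only conversion described above can be sketched as follows. This is only an illustration, written for a current Python 3 rather than the Python of this thread; the helper name `url_to_bytes` is hypothetical and is not part of urllib.

```python
def url_to_bytes(url):
    """Return the URL as a byte string, accepting text only if it is pure ASCII.

    Hypothetical helper: it mirrors the rule that a Unicode string is
    acceptable as a URL only when it can be converted to ASCII losslessly.
    """
    if isinstance(url, bytes):
        # Already a byte string; nothing to convert.
        return url
    try:
        return url.encode("ascii")
    except UnicodeEncodeError as exc:
        # Non-ASCII characters have no defined meaning in a plain URL.
        raise ValueError("URL contains non-ASCII characters: %r" % url) from exc
```

A pure-ASCII string passes through unchanged as bytes; anything else is rejected rather than guessed at.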

IETF is working on resolving the issue, by introducing IRIs. It appears that draft-duerst-iri-09.txt is what will become the relevant RFC. Once the RFC is published, urllib and urllib2 should be updated to support IRIs; contributions are welcome.

I don't think this is necessarily a bug, as a proper URI will never contain non-ASCII characters. However since urlopen()'s API is unfortunately such that it accepts OS-specific filesystem paths, which nowadays may be unicode, it may be time to tighten up the API and say that the url argument must be a URI, and that if unicode is given, it will be converted to str and thus must not contain non-ASCII characters.

No. I'd rather specify that if it is a Unicode string, it must be an IRI, and is converted to a URI according to the IRI spec.
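A rough sketch of what such an IRI-to-URI conversion might look like, assuming a current Python 3: the host goes through the standard `idna` codec, and the other components are percent-encoded as UTF-8. The function name `iri_to_uri` and the choice of safe characters are assumptions made for illustration; this is not the algorithm from the IRI draft verbatim, and it ignores userinfo and IPv6 literals.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; '%' is kept so that
# escapes already present in the input are not encoded a second time.
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri):
    """Convert a Unicode IRI to an ASCII-only URI (illustrative sketch)."""
    parts = urlsplit(iri)

    # Host: convert each label to its ASCII-compatible (IDNA) form.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Path, query, fragment: percent-encode the UTF-8 bytes of non-ASCII text.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

With this sketch, `iri_to_uri("http://bücher.example/größe")` yields an ASCII-only string whose host is in IDNA form and whose path holds percent-encoded UTF-8 octets, while a URL that is already pure ASCII comes back unchanged.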

2. urlopen() (the URI scheme-specific openers it uses, actually) does not percent-decode the host portion of a URL before doing a DNS lookup.

This wasn't really a problem until IDNs came along; no one was using non-ASCII in their hostnames. But now we have to deal with URLs where the host component is a string of percent-encoded UTF-8 octets.

Hmm. I think there is no backing in any standard for doing that. Applications that put URL-escaped UTF-8 bytes into host names deserve to lose. There are two valid ways of putting non-ASCII characters into the hostname part of a URL: use Unicode strings, or use IDNA. It may be that IRIs add another way (I haven't checked this aspect specifically), but unless there is some RFC supporting such a protocol, any response by urllib is fine, exceptions preferred.

Even though IDNs are the main application for percent-encoded octets in the host component, percent-decoding the host is necessary in simpler cases as well, like

'http://www.w%33.org' which would need to be interpreted as 'http://www.w3.org'

We would have to check: this might be valid usage, but I somewhat doubt it.
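For illustration only, the interpretation proposed for 'http://www.w%33.org' could be sketched like this in a current Python 3. Whether percent-escapes in the host are valid at all is exactly the open question here, so this shows what the proposed behaviour would look like, not what urllib does. The helper name `host_for_lookup` is hypothetical.

```python
from urllib.parse import unquote, urlsplit


def host_for_lookup(url):
    """Return the host of `url` in the ASCII form a DNS lookup would need.

    Hypothetical helper: percent-decode the host as UTF-8, then apply IDNA.
    Invalid UTF-8 in the escapes raises UnicodeDecodeError instead of
    being silently replaced.
    """
    host = urlsplit(url).hostname or ""
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The idna codec leaves plain ASCII labels as they are and converts
    # non-ASCII labels to their "xn--" form.
    return decoded.encode("idna").decode("ascii")
```

Under these assumptions, `host_for_lookup("http://www.w%33.org")` gives `www.w3.org`, and a host made of percent-encoded UTF-8 octets ends up in its IDNA form.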

urllib's urlopeners were not updated accordingly. This should be changed.

The change was deliberately deferred until the IRI RFC is published.

3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character, whereas ':' is just as, if not more, common in 'file' URIs.

I gave up trying to understand this issue long ago. I'm happy to change this back and forth once or twice a year, until somebody comes up with a clear and definitive story, backed up by standards and product documentation, so that we might get a stable implementation some day. Feel free to write patches.
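As a starting point for such a patch, here is a minimal sketch that accepts both '|' and ':' as the drive separator in a 'file' URL. It assumes a current Python 3, handles only local drive-letter paths (no UNC hosts), and the helper name `file_url_to_windows_path` is made up for this example; it is not urllib's own implementation.

```python
import re
from urllib.parse import unquote, urlsplit

# Matches a path such as "/C|/dir/file" or "/C:/dir/file".
_DRIVE_PATH = re.compile(r"^/([A-Za-z])[|:](/.*)?$")


def file_url_to_windows_path(url):
    """Convert a 'file' URL with a drive letter to a Windows path (sketch)."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL: %r" % url)

    path = unquote(parts.path)
    match = _DRIVE_PATH.match(path)
    if match is None:
        raise ValueError("no drive specification in %r" % url)

    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    # Windows paths use backslashes as separators.
    return drive + ":" + rest.replace("/", "\\")
```

Both `file:///C|/temp/a.txt` and `file:///C:/temp/a.txt` map to the same path with this helper, which is the behaviour point 3 asks for.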

Regards, Martin


