[Python-Dev] urllib.urlopen() vs IDNs, percent-encoded hosts, ':'

"Martin v. Löwis" martin at v.loewis.de
Thu Sep 16 08:37:39 CEST 2004


Mike Brown wrote:

No. The intent is actually that a URI is (not conceptually, just is) a string of characters

You are right: URIs are meant to be written on paper. However, RFC 2396 also acknowledges that the issue of non-ASCII characters is unresolved. It suggests (in 2.1) that the URI scheme should specify the interpretation of byte values.

This was actually clear in RFC 2396 sections 1.5 and 2, but has been explained somewhat better in the rephrased section 2 of rfc2396bis, which is in Last Call.

This suggests that new URI schemes should mandate UTF-8 in the components, but is silent on the issue of existing schemes.

The question is, does the url argument to urlopen() purport to be or is it assumed to be a URL? The function is quite lenient about what it accepts as a URL -- it accepts pretty much anything you give it, be it unicode or str, with or without a scheme component, relative to some unknown base, and loaded with illegal characters, and it tries to deal with it as best it can -- yet it still rejects or inconsistently handles some valid URIs, and this is what I want to see changed.

If something passed to it is clearly a valid URL, and there is a clear definition of how a computer should process it, and urllib doesn't, then this is certainly a bug and should be fixed. Can you give an example of such a URL?

Perhaps I should rephrase part of the issue this way: If the argument to urlopen() is assumed to be a URI, then %FF in the argument should not be interpreted any differently when the argument is a str vs when it is unicode.

Certainly. Indeed, urllib makes no difference, AFAICT. "http://localhost/%FF" and u"http://localhost/%FF" are processed in the same way.
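A minimal sketch of this point, assuming a modern Python 3 interpreter, where the closest analogue of the str/unicode pair is bytes/str and the parsing helpers live in urllib.parse (neither assumption comes from this thread). It only illustrates that the escape %FF denotes the same octet whichever string type carries the URL; it does not exercise urlopen() itself.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Assumption: Python 3, where bytes/str stand in for the old str/unicode pair.
text_url = "http://localhost/%FF"
bytes_url = b"http://localhost/%FF"

# Splitting leaves the escape untouched in both cases.
text_path = urlsplit(text_url).path
bytes_path = urlsplit(bytes_url).path

# Unquoting to bytes yields the same single octet 0xFF either way.
raw_from_text = unquote_to_bytes(text_path)
raw_from_bytes = unquote_to_bytes(bytes_path)

print(text_path, bytes_path, raw_from_text == raw_from_bytes)
```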

RFC 2396 left it ambiguous as to what characters are represented by %80-%FF, so an implementation thereof may make such interpretations as it pleases. The current implementation doesn't do this in a consistent manner.

No. RFC 2396 defers the specification to the specific scheme.
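To make the ambiguity concrete, here is a small sketch, again assuming Python 3's urllib.parse.unquote (which takes an explicit encoding argument; that API postdates this thread). The same escaped octets give different characters depending on which charset the scheme, or the implementation, decides to apply.

```python
from urllib.parse import unquote

# Two escaped octets, 0xC3 0xA9, with no charset information attached.
escaped = "%C3%A9"

# Read as UTF-8, the two octets form one character.
as_utf8 = unquote(escaped, encoding="utf-8")

# Read as Latin-1, each octet is its own character.
as_latin1 = unquote(escaped, encoding="latin-1")

print(repr(as_utf8), repr(as_latin1))
```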

Applications that put URL-escaped UTF-8 bytes into host names deserve to lose.

Come February or whenever rfc2396bis and the IRI draft become RFCs, that will no longer be a position you can maintain.

I see. I think I could accept a patch in this direction for Python 2.4 even if RFC2396bis isn't published, assuming the patch arrives before 2.4b1.
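For illustration only, a sketch of what such handling might look like, assuming Python 3 and its built-in "idna" codec. The helper name host_to_ascii is hypothetical, and this is not the patch under discussion: it merely shows a percent-encoded UTF-8 host name being unescaped and then converted to its ASCII form for DNS lookup.

```python
from urllib.parse import unquote

def host_to_ascii(host):
    """Hypothetical helper: unescape a host, then apply IDNA ToASCII."""
    # Treat the escapes as UTF-8 octets; fail loudly on invalid sequences.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The built-in "idna" codec converts each label to its ASCII form.
    return decoded.encode("idna").decode("ascii")

idn_host = host_to_ascii("b%C3%BCcher.example")
plain_host = host_to_ascii("localhost")
print(idn_host, plain_host)
```

A plain ASCII host passes through unchanged, so the conversion can be applied unconditionally.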

Let me be clear though - I am not suggesting getting rid of support for '|'. I am merely saying that there is no reason ':' should, on Windows, fail to be treated the same as '|' for the purpose of representing the ':' in a drivespec.

I know that I personally won't touch this code, except for applying patches. So if you have a clear vision of what needs to be changed and how, submit a patch.
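As a rough sketch of the behaviour being requested, assuming Python 3 and a hypothetical helper named drive_path_from_url_path (this is not urllib's actual code, which I have not reproduced here): a single regular expression accepts either ':' or '|' after the drive letter and treats them identically.

```python
import re
from urllib.parse import unquote

# Optional leading slashes, a drive letter, then ':' or '|', then the rest.
_DRIVE_RE = re.compile(r"^/*([A-Za-z])[:|](/.*)?$")

def drive_path_from_url_path(url_path):
    """Hypothetical helper: map '/C|/a/b' or '/C:/a/b' to 'C:\\a\\b'."""
    match = _DRIVE_RE.match(url_path)
    if match is None:
        raise ValueError("no drive specifier in %r" % (url_path,))
    drive = match.group(1).upper()
    rest = match.group(2) or "/"
    # Unescape the remainder and switch to Windows path separators.
    return drive + ":" + unquote(rest).replace("/", "\\")

with_pipe = drive_path_from_url_path("/C|/foo/bar")
with_colon = drive_path_from_url_path("/C:/foo/bar")
print(with_pipe, with_colon)
```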

As for using regular expressions in the standard library: It seems you believe this is discouraged. I don't know why you think so - I've never heard of such a constraint before (in general - in specific cases, submitters may have been told that alternatives are more efficient).

Regards, Martin


