[Python-Dev] URL processing conformance and principles (was Re: urllib.urlopen...)

Mike Brown mike at skew.org
Thu Sep 16 17:50:24 CEST 2004


"Martin v. Löwis" wrote:

> You are right: URIs are meant to be written on paper. However, RFC 2396 also acknowledges that the issue of non-ASCII characters is unresolved. It suggests (in 2.1) that the URI scheme should specify the interpretation of byte values.

Right. This part of the thread was just about how the argument to urllib.urlopen() should be handled when given as unicode vs str. You seemed to be saying it should be str because a URI is fundamentally bytes and should be analyzed as such, whereas I'm saying no, a URI is fundamentally characters and should be analyzed as such. I mentioned %-encoding and the quirk of the BNF just because those are aspects of the syntax that are byte-oriented and are the source of much confusion, and because they may have influenced your assertion.
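
To illustrate why %-encoding is the byte-oriented part of the syntax, here is a minimal sketch. It assumes the modern Python 3 `urllib.parse` module (which postdates this thread), and the sample character is chosen only for illustration. The URI stays a string of ASCII characters either way, but which octets a %-escape stands for depends on the character encoding chosen before escaping.

```python
from urllib.parse import quote, unquote

# Sample character (an assumption for illustration): U+00E9, e with acute accent.
ch = "\u00e9"

# The same character yields different %-escapes depending on which
# character encoding turns it into bytes before escaping.
as_utf8 = quote(ch, encoding="utf-8")      # escapes the two UTF-8 octets
as_latin1 = quote(ch, encoding="latin-1")  # escapes the single Latin-1 octet

print(as_utf8)    # %C3%A9
print(as_latin1)  # %E9

# Decoding only round-trips when the same encoding is assumed on both sides.
assert unquote(as_utf8, encoding="utf-8") == ch
assert unquote(as_latin1, encoding="latin-1") == ch
```

Both escaped forms are plain ASCII character sequences; only the escapes themselves carry byte values.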

Are we in agreement on these points?

If even these principles can be agreed upon, then I can submit a documentation patch, at the very least.

Furthermore, what about this principle?

And how about these?

> As for using regular expressions in the standard library: It seems you believe this is discouraged. I don't know why you think so - I've never heard of such a constraint before (in general - in specific cases, submitters may have been told that alternatives are more efficient).

I was just surprised to find that regular expressions are not used much in urllib, urllib2, and urlparse. The implementations seem to be going to a lot of trouble to process URLs using find() and string slices. I thought perhaps there was a good reason for this.
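
For comparison, here is a rough sketch of the regular-expression approach, using the pattern given in RFC 2396, Appendix B, for splitting a URI reference into its components. The function name `split_uri` is hypothetical and this is not how urlparse is implemented; it only shows what the alternative to find() and slicing might look like.

```python
import re

# Pattern from RFC 2396, Appendix B ("Parsing a URI Reference with a
# Regular Expression"). Groups 2, 4, 5, 7 and 9 hold the scheme,
# authority, path, query and fragment respectively.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Hypothetical helper: split a URI reference into five components.

    Components absent from the input come back as None; the path is
    always a string, possibly empty.
    """
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


# The example URI used in the RFC's own appendix.
print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

The pattern does no validation; it only separates the components, so any string matches and a relative reference simply yields None for the scheme and authority.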

I must attend to other things right now; will comment on the other issues later.

-Mike


