[Python-Dev] Some more comments re new uriparse module, patch 1462525 (original) (raw)

Paul Jimenez pj at place.org
Sun Jun 4 03:09:38 CEST 2006


On Friday, Jun 2, 2006, John J Lee writes:

[Not sure whether this kind of thing is best posted as tracker comments (but then the tracker gets terribly long and is mailed out every time a change happens) or posted here. Feel free to tell me I'm posting in the wrong place...]

I think this is a fine place - more googleable, still archived, etc.

Some comments on this patch (a new module, submitted by Paul Jimenez, implementing the rules set out in RFC 3986 for URI parsing, joining URI references with a base URI etc.)

http://python.org/sf/1462525

Note that like many opensource authors, I wrote this to 'scratch an itch' that I had... and am submitting it in hopes of saving someone else somewhere some essentially identical work. I'm not married to it; I just want something like it to end up in the stdlib so that I can use it.

Sorry for the pause, Paul. I finally read RFC 3986 -- which I must say is probably the best-written RFC I've read (and there was much rejoicing).

No worries. Yeah, the RFC is pretty clear (for once) :)

I still haven't read 3987 and got to grips with the unicode issues (whatever they are), but I have just implemented the same stuff you did, so have some comments on non-unicode aspects of your implementation (the version labelled "v23" on the tracker):

Your urljoin implementation seems to pass the tests (the tests taken from the RFC), but I have to I admit I don't understand it :-) It doesn't seem to take account of the distinction between undefined and empty URI components. For example, the authority of the URI reference may be empty but still defined. Anyway, if you're taking advantage of some subtle identity that implies that you can get away with truth-testing in place of "is None" tests, please don't ;-) It's slower than "is [not] None" tests both for the computer and (especially!) the reader.

First of all, I must say that urljoin is my least favorite part of this module; I include it only so as not to break backward compatibility - I don't have any personal use-cases for such. That said, some of the 'join' semantics are indeed a bit subtle; it took a bit of tinkering to make all the tests work. I was indeed using 'if foo:' instead of 'if foo is not None:', but that can be easily fixed; I didn't know there was a performance issue there. Stylistically I find them about the same clarity-wise.

I don't like the use of module posixpath to implement the algorithm labelled "removedotsegments". URIs are not POSIX filesystem paths, and shouldn't depend on code meant to implement the latter. But my own implementation is exceedingly ugly ATM, so I'm in no position to grumble too much :-)

While URIs themselves are not, of course, POSIX filesystem paths, I believe there's a strong case that their path components are semantically identical in this usage. I see no need to duplicate code that I know can be fairly tricky to get right; better to let someone else worry about the corner cases and take advantage of their work when I can.

Normalisation of the base URI is optional, and your urljoin function never normalises. Instead, it parses the base and reference, then follows the algorithm of section 5.2 of the RFC. Parsing is required before normalisation takes place. So urljoin forces people who need to normalise the URI before to parse it twice, which is annoying. There should be some way to parse 5-tuples in instead of URIs. E.g., from my implementation:

def urljoin(baseuri, urireference): return urlunsplit(urljoinparts(urlsplit(baseuri), urlsplit(urireference)))

It would certainly be easy to add a version which took tuples instead of strings, but I was attempting, as previously stated, to conform to the extant urlparse.urljoin API for backward compatability. Also as I previously stated, I have no personal use-cases for urljoin so the issue of having to double-parse if you do normalization never came to my attention.

It would be nice to have a 5-tuple-like class (I guess implemented as a subclass of tuple) that also exposes attributes (.authority, .path, etc.) -- the same way module time does it.

That starts to edge over into a 'generic URI' class, which I'm uncomfortable with due to the possibility of opaque URIs that don't conform to that spec. The fallback of putting everthing other than the scheme into 'path' doesn't appeal to me.

The path component is required, though may be empty. Your parser returns None (meaning "undefined") where it should return an empty string.

Indeed. Fixed now; a fresh look at the code showed me where the mistakes that made that seem necessary lay.

Nit: Your tests involving ports contain non-digit characters in the port (viz. "port"), which is not valid by section 3.2.3 of the RFC.

Indeed. Nit fixed.

Smaller nit: the userinfo component was never allowed in http URLs, but you use them in your tests. This issue is outside of RFC 3986, of course.

Actually it was allowed in http URLs as an alternate method of specifying HTTP AUTH, but is indeed now deprecated; I suspect, due to the prevalence of semantic attacks.

Particularly because the userinfo component is deprecated, I'd rather that userinfo-splitting and joining were separate functions, with the other functions dealing only with the standard RFC 3986 5-tuples.

It's only deprecated for http; it's quite the convenience for other protocols.

DefaultSchemes should be a class attribute of URIParser

Got any good reasoning behind that? I don't have any real reason to keep it a module variable other than it vaguely feels too 'heavy' for a class attribute to me. Maybe we can get a second opinion?

The naming of URLParser / URIParser is still insane :-) I suggest naming them URIParser and URIParser respectively.

I dunno aobut insane; as I say in the code:

URI generally refers to generic URIs and URL refers to to URIs that match scheme://user:password@host:port/path?query#fragment.

I suppose another way to say it is that I consider a URI to be more opaque than a URL - my interpretation of section 1.1.3 of rfc3986 which says:

A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").

I guess there should be no mention of "URL" anywhere in the module -- only "URI" (even though I hate "URI", as a mostly-worthless distinction from "URL", consistency inside the module is more important, and URI is technically correct and fits with all the terminology used in the RFC). I'm still heavily -1 on calling it "uriparse" though, because of the highly misleading comparison with the name "urlparse" (the difference between the modules isn't the difference between URIs and URLs).

uriparse is more generic and extensible than urlparse is; just as uris are to urls. Honestly I don't care, I was just trying to come up with a name that would be distinct enough to not cause confusion, wasn't too inelegant (eg. urlparse2), and was appropriate enough that a future use would look at it. I'm open to suggestions.

Re your comment on "mailto:" in the tracker: sure, I understand it's not meant to be public, but the interface is! .parse() will return a 4-tuple for mailto: URLs. For everything else, it will return a 7-tuple. That's silly.

That's URIs: the length of the tuple returned is scheme-dependent. The alternative is to provide less functionality by lumping everything after the : into a single 'path' component that the programmer will then have to parse again anyway. This way they don't have to do that extra work.

The documentation should explain that the function of URIParser is hiding scheme-dependent URI normalisation.

Its function is twofold: to allow hiding and to provide an extensible framework with a uniform interface. But yes, the documentation should be better.

Method names and locals are still likeThis, contrary to PEP 8.

Fixed most places, I think.

docstrings and other whitespace are still non-standard -- follow PEP 8 (and PEP 257, which PEP 8 references) Doesn't have to be totally rigid of course -- e.g. lining up the ":" characters in the tests is fine.

I've given the whole thing a once-over to try and come up to spec, but I'm sure I missed something, so feel free to point out anything I missed.

Standard stdlib form documentation is still missing. I'll be told off if I don't read you your rights: you don't have to submit in LaTeX markup -- apparently there are hordes of eager LaTeX markers-up lurking ready to pounce on poorly-formatted documentation

Test suite still needs tweaking to put it in standard stdlib form

Excuse my ignorance, but is there a PEP somewhere for 'standard stdlib form' ? What do I need to do to bring it up to snuff?

--pj



More information about the Python-Dev mailing list