[Python-Dev] Some more comments re new uriparse module, patch 1462525 (original) (raw)

John J Lee jjl at pobox.com
Sat Jun 3 01:19:01 CEST 2006


[Not sure whether this kind of thing is best posted as tracker comments (but then the tracker gets terribly long and is mailed out every time a change happens) or posted here. Feel free to tell me I'm posting in the wrong place...]

Some comments on this patch (a new module, submitted by Paul Jimenez, implementing the rules set out in RFC 3986 for URI parsing, joining URI references with a base URI etc.)

http://python.org/sf/1462525

Sorry for the pause, Paul. I finally read RFC 3986 -- which I must say is probably the best-written RFC I've read (and there was much rejoicing).

I still haven't read 3987 and got to grips with the unicode issues (whatever they are), but I have just implemented the same stuff you did, so have some comments on non-unicode aspects of your implementation (the version labelled "v23" on the tracker):

Your urljoin implementation seems to pass the tests (the tests taken from the RFC), but I have to I admit I don't understand it :-) It doesn't seem to take account of the distinction between undefined and empty URI components. For example, the authority of the URI reference may be empty but still defined. Anyway, if you're taking advantage of some subtle identity that implies that you can get away with truth-testing in place of "is None" tests, please don't ;-) It's slower than "is [not] None" tests both for the computer and (especially!) the reader.

I don't like the use of module posixpath to implement the algorithm labelled "remove_dot_segments". URIs are not POSIX filesystem paths, and shouldn't depend on code meant to implement the latter. But my own implementation is exceedingly ugly ATM, so I'm in no position to grumble too much :-)

Normalisation of the base URI is optional, and your urljoin function never normalises. Instead, it parses the base and reference, then follows the algorithm of section 5.2 of the RFC. Parsing is required before normalisation takes place. So urljoin forces people who need to normalise the URI before to parse it twice, which is annoying. There should be some way to parse 5-tuples in instead of URIs. E.g., from my implementation:

def urljoin(base_uri, uri_reference): return urlunsplit(urljoin_parts(urlsplit(base_uri), urlsplit(uri_reference)))

It would be nice to have a 5-tuple-like class (I guess implemented as a subclass of tuple) that also exposes attributes (.authority, .path, etc.) -- the same way module time does it.

The path component is required, though may be empty. Your parser returns None (meaning "undefined") where it should return an empty string.

Nit: Your tests involving ports contain non-digit characters in the port (viz. "port"), which is not valid by section 3.2.3 of the RFC.

Smaller nit: the userinfo component was never allowed in http URLs, but you use them in your tests. This issue is outside of RFC 3986, of course.

Particularly because the userinfo component is deprecated, I'd rather that userinfo-splitting and joining were separate functions, with the other functions dealing only with the standard RFC 3986 5-tuples.

DefaultSchemes should be a class attribute of URIParser

The naming of URLParser / URIParser is still insane :-) I suggest naming them _URIParser and URIParser respectively.

I guess there should be no mention of "URL" anywhere in the module -- only "URI" (even though I hate "URI", as a mostly-worthless distinction from "URL", consistency inside the module is more important, and URI is technically correct and fits with all the terminology used in the RFC). I'm still heavily -1 on calling it "uriparse" though, because of the highly misleading comparison with the name "urlparse" (the difference between the modules isn't the difference between URIs and URLs).

Re your comment on "mailto:" in the tracker: sure, I understand it's not meant to be public, but the interface is! .parse() will return a 4-tuple for mailto: URLs. For everything else, it will return a 7-tuple. That's silly.

The documentation should explain that the function of URIParser is hiding scheme-dependent URI normalisation.

Method names and locals are still likeThis, contrary to PEP 8.

docstrings and other whitespace are still non-standard -- follow PEP 8 (and PEP 257, which PEP 8 references) Doesn't have to be totally rigid of course -- e.g. lining up the ":" characters in the tests is fine.

Standard stdlib form documentation is still missing. I'll be told off if I don't read you your rights: you don't have to submit in LaTeX markup -- apparently there are hordes of eager LaTeX markers-up lurking ready to pounce on poorly-formatted documentation

Test suite still needs tweaking to put it in standard stdlib form

John



More information about the Python-Dev mailing list