[Python-Dev] urlparse.urlunsplit should be smarter about +

Stephen J. Turnbull stephen at xemacs.org
Sun May 9 14:15:38 CEST 2010


John Arbash Meinel writes:

Stephen J. Turnbull wrote:

David Abrahams writes:

This is a bug report. bugs.python.org seems to be down.

>>> from urlparse import *
>>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
git+file:/foo/bar/baz

Note the dropped slashes after the colon.

That's clearly wrong, but what does "+" have to do with it? AFAIK, the only thing special about + in scheme names is that it's not allowed as the first character.

Don't you need to register the "git+file:///" url for urlparse to properly split it?

if protocol not in urlparse.uses_netloc:
    urlparse.uses_netloc.append(protocol)
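
For readers who want to try the registration workaround, here is a minimal sketch. The helper name register_netloc_scheme is hypothetical, and the Python 3 import fallback is an assumption for modern interpreters; the thread itself only discusses the Python 2 urlparse module.

```python
try:
    import urlparse  # Python 2, the module discussed in this thread
except ImportError:
    # Assumption: on Python 3 the same functions live in urllib.parse.
    import urllib.parse as urlparse


def register_netloc_scheme(protocol):
    """Hypothetical helper: tell urlparse that `protocol` takes a netloc."""
    if protocol not in urlparse.uses_netloc:
        urlparse.uses_netloc.append(protocol)


register_netloc_scheme('git+file')

url = 'git+file:///foo/bar/baz'
# With the scheme registered, splitting and rejoining should keep the "//".
roundtrip = urlparse.urlunsplit(urlparse.urlsplit(url))
print(roundtrip)
```

Note that this mutates a module-level list, so the registration affects every caller in the process.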

I don't know about the urlparse implementation, but from the point of view of the RFC I think not. Either BCP 35 or RFC 3986 (or maybe both) makes it plain that if the scheme name is followed by "://", the scheme is a hierarchical one. So that URL should parse with an empty authority, and be recomposed the same. I would do this by parsing 'git+file:///foo/bar/baz' to ('git+file', '', '/foo/bar/baz') or something like that, and 'git+file:/foo/bar/baz' to ('git+file', None, '/foo/bar/baz').
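
To make the empty-versus-missing distinction concrete, here is a rough sketch that does not use urlparse at all. It relies on the regular expression given in RFC 3986, Appendix B; the function names split_uri and unsplit_uri are made up for illustration, and this is not a proposal for the actual urlparse code.

```python
import re

# Regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri(uri):
    """Split into (scheme, authority, path, query, fragment).

    A component that is absent is None; a component that is present
    but empty (such as the authority in 'x:///p') is ''.
    """
    m = _URI_RE.match(uri)  # every string matches; all groups are optional
    return m.group(2, 4, 5, 7, 9)


def unsplit_uri(parts):
    """Recompose a URI, emitting '//' only when the authority is defined."""
    scheme, authority, path, query, fragment = parts
    out = []
    if scheme is not None:
        out.append(scheme + ':')
    if authority is not None:  # '' still counts as defined
        out.append('//' + authority)
    out.append(path)
    if query is not None:
        out.append('?' + query)
    if fragment is not None:
        out.append('#' + fragment)
    return ''.join(out)


with_empty = split_uri('git+file:///foo/bar/baz')
without = split_uri('git+file:/foo/bar/baz')
```

Because the authority is kept as '' in one case and None in the other, both spellings recompose to exactly the string that was parsed, and no per-scheme registration is needed.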

I don't see any reason why implementations should abbreviate the empty authority by removing the double slashes, unless specified in the scheme definition. Although my reading of RFC 3986 is that a missing authority (no "//") should be dereferenced in the same way as an empty one:

If the URI scheme defines a default for host, then that default
applies when the host subcomponent is undefined or when the
registered name is empty (zero length).  (Sec. 3.2.2)

I don't see why urlparse should try to enforce that by converting from one to the other.


