Issue 639311: urllib.basejoin() mishandles '' (original) (raw)

Created on 2002-11-16 12:34 by mike_j_brown, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (8)
msg13311 - (view)	Author: Mike Brown (mike_j_brown)	Date: 2002-11-16 12:34
It's not entirely clear whether urllib.basejoin() intends to implement RFC 2396's "resolution of relative URI references to absolute form" faithfully, but it seems to behave improperly when given an empty string as the relative URI to make absolute. >>> from urllib import basejoin >>> basejoin('http://host/foo/bar.xml','') 'http://host/foo/' I believe it should return the base as-is, because the empty string is a reference to the document that contains that reference... and presumably the document's URI is what you're passing in as the base.
msg13312 - (view)	Author: Mike Brown (mike_j_brown)	Date: 2002-11-26 10:41
Logged In: YES user_id=371366 I was partly mistaken; the document's URI is not necessarily the base. A reference with an empty path (e.g., an empty string or just a fragment identifier) is a reference to the current document, regardless of the base URI you are resolving against. A base URI is only for resolving relative URIs that are not referencing the current document. See some discussion at http://lists.w3.org/Archives/Public/uri/2002Jan/0015.html So neither urllib.basejoin() nor urlparse.urljoin() fully implement the RFC 2396 "resolution to absolute form", since there would need to be a way to indicate "current document" other than returning the base. Nevertheless, basejoin()'s behavior differs from urlparse.urljoin ()'s when presented with the empty string, and it's not clear whether that is intentional.
msg13313 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2003-05-22 03:14
Logged In: YES user_id=357491 Perhaps urllib.basejoin (which is not documented) should just become a wrapper for urlparse.urljoin ? It won't solve this bug but it would cut back on unneeded code.
msg13314 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2004-03-23 21:37
Logged In: YES user_id=357491 I disagree with what you are expecting. For instance, if you run ``urllib.basejoin("http://python.org/index.html", "/doc")`` it returns "http://python.org/dev", which makes sense. So changing its behavior based on it being an empty string would not strictly match how the function works when compared to being given any string. And on top of things the function is not even documented, so you really shouldn't be expecting any specific behavior. Closing this as invalid.
msg13315 - (view)	Author: Mike Brown (mike_j_brown)	Date: 2004-03-23 23:15
Logged In: YES user_id=371366 Well, it is documented as "Utility to combine a URL with a base URL to form a new URL". The notion of 'base URL' is described in the RFCs mentioned in urllib.__doc__ (all of which predate RFC 2396, I notice, though it doesn't matter) and is exclusively applicable to relative URL/URI reference resolution, a process in which an empty path denotes a same-document reference / no traversal from the origin. So I don't think it was unreasonable of me to have an expectation that basejoin() would be at least somewhat conformant to the specs that its parent module purports to implement. And it actually is conformant when the relative URL consists of just a query or fragment; the empty path doesn't result in the tail of the base URL being chopped off before the query or fragment is added/replaced. So this lends further credence to the notion that the intent was to be conformant. Furthermore, in the one and only place where it is used in core python libs at all (in urllib.FancyURLOpener), basejoin() is used in a way that would give a wrong result, if left as-is: to derive an absolute URL so a redirect can be followed, when given a relative URL in the Location header of an HTTP 302 response. Granted, the odds of this header being empty are slim, but still. It's a bug in FancyURLOpener.redirect_internal, at least. Nevertheless, I am now of the opinion that anyone serious about conformant URL handling will avoid urllib. In 4Suite I have implemented wrappers and replacements for the functions we need. See also bug #649962.
msg13316 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2004-03-25 15:47
Logged In: YES user_id=357491 There is a misunderstanding about what I meant by "documented". I meant it is not a specific part of the public API since it is not documented at http://www.python.org/dev/doc/devel/lib/module-urllib.html which is the official docs for urllib. Having a docstring does not mean the code is considered part of the public API. Sorry I didn't clarify that. Now you obviously know a lot about this Mike, so can you give me your opinion on whether replacing urllib.basejoin with urlparse.urljoin is a reasonable move?
msg13317 - (view)	Author: Mike Brown (mike_j_brown)	Date: 2004-03-26 09:59
Logged In: YES user_id=371366 Given that it's not part of the public API and isn't used anywhere other than urllib.FancyURLOpener, I think that should be safe. As for whether I recommend it, well, I don't know. Cons are that urlparse.urljoin() is currently slower, makes the call stack a bit deeper, and has some minor issues related to the fact that the entire module was coded to implement facets of RFC 1808, which has been obsolete since Aug 1998. RFC 1808 has various issues mainly relating to how it handles relative paths, path params, paths containing '.' or '..' segments, and the finer points of merging query and fragment parts. None of these are showstoppers, but urljoin doesn't even get 1808 completely right, though -- it accepts relative base URLs, for example, which is a contradiction in terms that has never been allowed by the specs. (basejoin does this, too) Another possible con is that importing a function from urlparse into urllib might not be desirable. I don't know what your policy is on introducing dependencies between core libs. Alternatively, what I think might be a better option is to just change urllib.FancyURLOpener.redirect_internal() to use urlparse.urljoin() instead of urllib.basejoin(). Then, you could just remove urllib.basejoin() altogether. For the record, the main differences in behavior that you can expect to see when using urljoin instead of basejoin will be that it will better handle the case where the new URL in the redirect is an empty string or just '..'. There will also be some weird behavior, in some cases, with URLs that use 'file' or unrecognized URL schemes, but basejoin() wouldn't have done any better.
msg13318 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2004-03-26 14:05
Logged In: YES user_id=357491 I actually just did ``from urlparse import urljoin as basejoin`` and deleted the code for basejoin. So I have already implemented your suggestion in Python 2.4 . Thanks for the help, Mike.

History
Date	User	Action	Args
2022-04-10 16:05:54	admin	set	github: 37488
2002-11-16 12:34:53	mike_j_brown	create