Issue 1457264: urllib.splithost parses incorrectly
urllib.splithost(url) requires that the URL passed in be of the form '//host[:port]/path'. However, I've run across some URLs of the form '//host[:port]?querystring'. For these, splithost returns everything after the '//' as the host and an empty string as the path.
Section 3.2 of rfc2396 (Uniform Resource Identifiers: Generic Syntax) states that 'The authority component is preceded by a double slash "//" and is terminated by the next slash "/", question-mark "?", or by the end of the URI.'
Also, this is how it defines a URI:
    absoluteURI = scheme ":" ( hier_part | opaque_part )
    hier_part   = ( net_path | abs_path ) [ "?" query ]
    net_path    = "//" authority [ abs_path ]
    abs_path    = "/" path_segments

Based on the above, 'http://authority?query' is certainly a valid URL.
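As a quick cross-check, Appendix B of RFC 2396 gives a regular expression for breaking a URI reference into its components. The sketch below applies that expression to 'http://authority?query'. The helper name split_uri and the dictionary keys are my own choices for illustration; they are not part of urllib or the RFC.

```python
import re

# Regular expression from Appendix B of RFC 2396 for splitting a URI reference.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Hypothetical helper: return the RFC 2396 components of uri as a dict."""
    m = URI_RE.match(uri)
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

parts = split_uri('http://authority?query')
print(parts['authority'])  # authority
print(parts['query'])      # query
```

The authority ends at the "?" and the path component is empty, which matches the reading of Section 3.2 quoted above.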
In python2.3 you would just need to change line 939 in urllib.py from:
_hostprog = re.compile('^//([^/]*)(.*)$')
to:
_hostprog = re.compile('^//([^/?]*)(.*)$')
This appears to affect all Python versions; I just happened to be using 2.3.
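The difference between the two patterns can be checked without touching urllib.py. The following is a minimal sketch: splithost_with is a stand-in I wrote to mirror what splithost does with its compiled pattern, and the example URL is made up.

```python
import re

OLD_HOSTPROG = re.compile('^//([^/]*)(.*)$')    # current pattern
NEW_HOSTPROG = re.compile('^//([^/?]*)(.*)$')   # proposed pattern

def splithost_with(pattern, url):
    """Stand-in for splithost: return (host, rest), or (None, url) if no match."""
    match = pattern.match(url)
    if match:
        return match.group(1, 2)
    return None, url

url = '//www.example.com:8080?name=value'
print(splithost_with(OLD_HOSTPROG, url))  # ('www.example.com:8080?name=value', '')
print(splithost_with(NEW_HOSTPROG, url))  # ('www.example.com:8080', '?name=value')

# URLs that already contain a path are split the same way by both patterns.
with_path = '//www.example.com:8080/index.html?name=value'
print(splithost_with(OLD_HOSTPROG, with_path) == splithost_with(NEW_HOSTPROG, with_path))  # True
```

Note that with the proposed pattern the second element of the result starts with "?" rather than "/" for such URLs, so any caller that assumes a leading slash in the path would need to be checked.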