Issue 29651: Inconsistent/undocumented urlsplit/urlparse behavior on invalid inputs
There is a problem with the standard library's urlsplit and urlparse functions, in Python 2.7 (module urlparse) and 3.2+ (module urllib.parse).
The documentation for these functions [1] does not explain how they behave when given an invalid URL.
One could try invoking them manually and conclude that they tolerate anything thrown at them:
>>> import os
>>> from urllib.parse import urlparse
>>> urlparse('http:////::\\!!::!!++///')
ParseResult(scheme='http', netloc='', path='//::\\!!::!!++///', params='', query='', fragment='')
>>> urlparse(os.urandom(32).decode('latin-1'))
ParseResult(scheme='', netloc='', path='\x7f¼â1gdä»6\x82', params='', query='', fragment='\n\xadJ\x18+fli\x9cÛ\x9ak*ÄÅ\x02³F\x85Ç\x18')
Without studying the source code, it is impossible to know that there is a very narrow class of inputs on which they raise ValueError [2]:
>>> urlparse('http://[')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
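For callers who cannot rely on the documentation, the practical workaround is to treat ValueError as the one documented-by-source-code failure mode and guard against it explicitly. A minimal sketch (the helper name `safe_urlsplit` is hypothetical, not part of the stdlib):

```python
from urllib.parse import urlsplit


def safe_urlsplit(url):
    """Hypothetical helper: urlsplit, but with rejection made explicit.

    As of this writing, ValueError (the unmatched-bracket check cited
    above) is the only exception urlsplit raises on a malformed URL.
    """
    try:
        return urlsplit(url)
    except ValueError:
        return None


print(safe_urlsplit('http://['))        # None: unmatched bracket rejected
print(safe_urlsplit('http://!!::!!/'))  # other garbage parses without complaint
```

This does not fix the inconsistency, it only contains it: garbage netlocs still come back as "successful" parses.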
This could be viewed as a documentation issue. But it could also be viewed as an implementation issue. Instead of raising ValueError on those square brackets, urlsplit could simply consider them invalid parts of an RFC 3986 reg-name, and lump them into netloc, as it already does with other invalid characters:
>>> urlparse('http://\0\0æí\n/')
ParseResult(scheme='http', netloc='\x00\x00æí\n', path='/', params='', query='', fragment='')
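The lumping behavior proposed above can be sketched with the non-validating pattern from RFC 3986, Appendix B, which splits any string into the five URI components and never raises; stray brackets simply land in the authority. The function name `lenient_urlsplit` is hypothetical, shown only to illustrate the alternative:

```python
import re

# RFC 3986, Appendix B: a non-validating regular expression that breaks
# a URI reference into scheme, authority, path, query, and fragment.
_RFC3986_RE = re.compile(
    r'^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$')


def lenient_urlsplit(url):
    """Hypothetical sketch: split like urlsplit, but lump '[' and ']'
    into netloc instead of raising ValueError."""
    scheme, netloc, path, query, fragment = _RFC3986_RE.match(url).groups()
    return (scheme or '', netloc or '', path, query or '', fragment or '')


print(lenient_urlsplit('http://['))
# ('http', '[', '', '', '')
```

Under this scheme the bracket input above would be handled the same way as any other invalid reg-name character, removing the special case entirely.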
Note that the raising behavior was introduced in Python 2.7/3.2.
See also issue 8721 [3].
[1] https://docs.python.org/3/library/urllib.parse.html
[2] https://github.com/python/cpython/blob/e32ec93/Lib/urllib/parse.py#L406-L408
[3] http://bugs.python.org/issue8721