Issue 35377: urlparse doesn't validate the scheme (original) (raw)

Created on 2018-12-02 15:27 by devkral, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 10862 closed python-dev,2018-12-03 10:50
Messages (8)
msg330884 - (view) Author: (devkral) Date: 2018-12-02 15:27
the scheme argument of urlsplit/urlparse is completely broken. here two examples: urlunsplit(urlsplit("httpbin.org", scheme="https://")) 'https://:httpbin.org' urlunsplit(urlsplit("httpbin.org", scheme="https")) 'https:///httpbin.org' Fix: change urlsplit logic like this: ... url, scheme, _coerce_result = _coerce_args(url, scheme) scheme = scheme.rstrip("://") # this removes :// ... i = url.find('://') # harden against arbitrary : if i > 0: ... elif scheme: netloc, url = _splitnetloc(url, 0) # if scheme is specified, netloc is implied sry too lazy to create a patch from this. Most probably are all python versions affected but I checked only 2.7 and 3.7 .
msg330886 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2018-12-02 15:40
While it might seem broken, it behaves according to documentation, it seems to me.
msg330888 - (view) Author: (devkral) Date: 2018-12-02 16:00
first a correction; there is one protocol where it make sense: file I would change: ... elif scheme: ... to ... elif scheme and scheme != "file": ... Second: no it is not a correct behaviour. Urlunsplit is supposed to create a valid url from a tuple created by urlsplit. :/// is not a valid url (except for the file protocol). Third: rstrip could be simplified to rstrip(":/")
msg330904 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 00:26
I don't think this is broken, but I do think it could be documented better. You have to read the documentation for `urlparse` to see this: [Quote] Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse so the function is correct. You're making two errors: - providing a relative URL "httpbin.org" and expecting it to be treated as an absolute URL; - specifying scheme+delimiter instead of the scheme alone. So I don't think this is a bug. urlsplit rightly accepts any arbitrary string as a scheme (it used to have a whitelist of permitted schemes, and that was a problem), and we can't assume that :/// is ONLY valid for file protocols. Unless you come up with a convincing argument for why this is a bug, I'm going to change it to a documentation issue. The docs could do with some improvement to make it more clear, although the examples are good.
msg330927 - (view) Author: (devkral) Date: 2018-12-03 10:24
Ok, then the problem is in unsplit. Because: :/// is not a valid url. Next try: in urlunsplit: if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'): if url and url[:1] != '/' and scheme in ("file", ""): # my change url = '/' + url
msg330938 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 11:39
You haven't given a convincing reason that there is a problem that needs fixing, or if there is, that your patch is the right way to fix it. You have already pointed out that there is at least one scheme where :/// is part of a valid URL: "file:///". Where there is one, there could be others, if not today, then in the future: spam:///eggs.cheese There are well over 200 registered schemes: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml I don't think it will be practical to put in a whitelist or a blacklist of schemes that are, or aren't, permitted to include :/// after the scheme. It is the caller's responsibility to use the correct scheme, without adding extra characters to the end. I asked you to justify why this should be considered a bug in the library, rather than a bug in your code. I'm not an expert on URLs, but the functions look correct to me. If you can't justify why this is a bug in the library that needs fixing, rather than user-error, we should close this or change it to a request for better documentation.
msg330976 - (view) Author: (devkral) Date: 2018-12-03 19:48
ah, I get the idea behind urlunsplit. But still: for e.g. http, http:/// is invalid. These urls are invalid for e.g. requests. May I ask: urllib.parse is for web protocol urls? If yes, then there should be an option which defaults to my version.
msg330993 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 22:51
I'm changing the name to better describe the problem, and suggest a better solution. The urlparse.urlsplit and .urlunsplit functions currently don't validate the scheme argument, if given. According to the RFC: Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http"). https://www.ietf.org/rfc/rfc1738.txt If the scheme is specified, I suggest it should be normalised to lowercase and validated, something like this: # untested if scheme: # scheme_chars already defined in module badchars = set(scheme) - set(scheme_chars) if badchars: raise ValueError('"%c" is invalid in URL schemes' % badchars.pop()) scheme = scheme.lower() This will help avoid errors such as passing 'http://' as the scheme.
History
Date User Action Args
2022-04-11 14:59:08 admin set github: 79558
2018-12-03 22:51:57 steven.daprano set versions: + Python 3.8, - Python 2.7, Python 3.7title: urlsplit scheme argument broken -> urlparse doesn't validate the schememessages: + keywords: - patchstage: patch review ->
2018-12-03 19:48:46 devkral set messages: +
2018-12-03 11:39:41 steven.daprano set messages: +
2018-12-03 10:50:08 python-dev set keywords: + patchstage: patch reviewpull_requests: + <pull%5Frequest10097>
2018-12-03 10:24:15 devkral set messages: +
2018-12-03 00:26:25 steven.daprano set nosy: + steven.dapranomessages: +
2018-12-02 16:00:09 devkral set messages: +
2018-12-02 15:40:47 SilentGhost set nosy: + SilentGhost, orsenthiltype: behaviormessages: +
2018-12-02 15:27:37 devkral create