Issue 28841: urlparse.urlparse() parses invalid URI without generating an error (examples provided) (original) (raw)

Issue28841

Created on 2016-11-30 15:57 by amrith, last changed 2022-04-11 14:58 by admin.

Messages (9)
msg282086 - (view) Author: Amrith Kumar (amrith) Date: 2016-11-30 15:57
The urlparse library incorrectly accepts a URI which specifies an invalid host identifier. Example: http://www.example.com:/abc Looking at the URI specifications, this is an invalid URI. By definition, you are supposed to specify a "hostport" and a hostport is defined as: https://www.w3.org/Addressing/URL/uri-spec.html hostport host [ : port ] The BNF indicates that : is only valid if a port is also specified. See current behavior; I submit to you that this should generate an exception. https://gist.github.com/anonymous/8504f160ff90649890b5a2a82f8028b0
msg282116 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-11-30 23:37
Can you please paste your code into a comment on this issue? Gist content has a habit of vanishing. Thanks!
msg282117 - (view) Author: Amrith Kumar (amrith) Date: 2016-11-30 23:47
As requested ... >>> urlparse.urlparse('http://www.google.com:/abc') ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='') I submit to you that this should generate an error.
msg282119 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-01 00:31
The Python documentation refers to RFC 3986, which allows an empty port string introduced by a colon, although it recommends against it: <https://tools.ietf.org/html/rfc3986#section-3.2.3> The “port” subcomponent of “authority” is designated by an optional port number in decimal following the host and delimited from it by a single colon. . . . URI producers and normalizers should omit the port component and its “:” delimiter if “port” is empty . . . What problem are you trying to solve by raising an error? See also Issue 20059, where accessing the “port” field now raises ValueError for out-of-range values, and Issue 20271, discussing invalid URL handling more generally.
msg282120 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-12-01 00:33
Also, this is in direct contradiction to Issue 20270.
msg282122 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-12-01 00:41
And note that there are tests that explicitly check that the colon with no port works (via issue 20270). Given that this behavior has been around for a while, and is explicitly allowed, I would recommend against changing this to an error.
msg282129 - (view) Author: Amrith Kumar (amrith) Date: 2016-12-01 02:33
Eric, Martin, I'm sure this is an interpretation (and I'll be fair and give you both) that a URL of the form: http://hostname.domain.tld:/path should be held invalid. The rationale is per the section of the spec you quoted. The other side of the argument is that the hostname and port are defined as: hostname [ : port ] where port is defined as *DIGIT This implies that 0 digits is also valid. I submit to you that the ambiguity would be removed if they: - removed the paragraph telling people not to emit a : if there was no port, or - changing the port definition to 1*DIGIT Absent that, I believe that the paragraph implies that the intent was 1*DIGIT. And therefore a : with no following number should generate an error. I raise this because the behavior of urlparse() (the way it does things now) is being cited as the reason why other routines in OpenStack should handle this when the reason we were getting URL's with a : and no port was in fact an oversight. So, in hearing the rationalization as being that "urlparse() does it, so ..." I figured I'd come to the source and see if we could make urlparse() do things unambiguously. Now, if the reason it does this is because someone who came before me made the argument that a : with no port should be accepted, I guess I'm out of luck.
msg282135 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-12-01 04:40
Yes, I'm afraid so. But the rationale is presumably the same as it is in many RFCs: you should accept broken stuff if it can be interpreted unambiguously, but you should not *produce* the broken stuff. So I'd say the RFC is correct as quoted. It is then up to openstack whether or not it wants to accept such URLs in a given interface, and it if were me I'd base that decision on whether it was an 'internal' interface or an 'external' one. But you can also argue that even an external interface could be strict, depending on the application domain of the interface. urllib, on the other hand, needs to accept it, and we can't change it at this point for backward compatibility reasons if nothing else. Based on the RFC, one could argue that urlunsplit should omit the colon. That could break backward compatibility, too, though will a lot less likelyhood. So we could still fix it in 3.7 if we decide we should.
msg408219 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-12-10 16:22
Reproduced on 3.11: >>> urllib.parse.urlparse('http://www.google.com:/abc') ParseResult(scheme='http', netloc='www.google.com:', path='/abc', params='', query='', fragment='') The discussion above was mostly leaning towards won't fix, except for the suggestion to > Based on the RFC, one could argue that urlunsplit should omit the colon.
History
Date User Action Args
2022-04-11 14:58:40 admin set github: 73027
2021-12-10 16:22:55 iritkatriel set nosy: + iritkatrielmessages: + versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.7, Python 3.5, Python 3.6, Python 3.7
2016-12-01 04:40:56 r.david.murray set nosy: + r.david.murraymessages: +
2016-12-01 02:33:50 amrith set messages: +
2016-12-01 00:41:28 eric.smith set messages: +
2016-12-01 00:33:37 martin.panter set nosy: + serhiy.storchakamessages: +
2016-12-01 00:31:11 martin.panter set nosy: + martin.pantermessages: +
2016-11-30 23:47:12 amrith set messages: +
2016-11-30 23:37:26 eric.smith set nosy: + eric.smithmessages: +
2016-11-30 17:31:20 SilentGhost set nosy: + orsenthilversions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.4
2016-11-30 15:57:05 amrith create