When urllib.FancyURLopener encounters 302 redirection to a URL with fragments, it sends wrong URL to servers. For example, if we run: urllib.urlopen('http://example.com/foo') and the server responds like following. HTTP/1.1 302 Found Location: /bar#test Then urllib tries next to fetch http://example.com/bar#test, instead of http://example.com/bar. Correctly, urllib should strip fragment part of URL before issuing requests.
This is the correct behavior GET http://example.com/foo with a response containing 302 and Location: /bar#test must trigger http://example.com/bar#test Location is defined in http://tools.ietf.org/html/rfc7231#section-7.1.2 7.1.2. Location The "Location" header field is used in some responses to refer to a specific resource in relation to the response. The type of relationship is defined by the combination of request method and status code semantics. Location = URI-reference The field value consists of a single URI-reference. When it has the form of a relative reference ([RFC3986], Section 4.2), the final value is computed by resolving it against the effective request URI ([RFC3986], Section 5). A bit after in the spec. If the Location value provided in a 3xx (Redirection) response does not have a fragment component, a user agent MUST process the redirection as if the value inherits the fragment component of the URI reference used to generate the request target (i.e., the redirection inherits the original reference's fragment, if any). For example, a GET request generated for the URI reference "http://www.example.org/~tim" might result in a 303 (See Other) response containing the header field: Location: /People.html#tim which suggests that the user agent redirect to "http://www.example.org/People.html#tim" Likewise, a GET request generated for the URI reference "http://www.example.org/index.html#larry" might result in a 301 (Moved Permanently) response containing the header field: Location: http://www.example.net/index.html which suggests that the user agent redirect to "http://www.example.net/index.html#larry", preserving the original fragment identifier.
Hi karl, Of course it is correct that the user agent is redirected to http://example.com/bar#test when it got such response. However, it never means UA can send an HTTP request containing fragment part. In RFC7230 section 3.1.1, HTTP request line is defined as: request-line = method SP request-target SP HTTP-version CRLF where request-target is defined in section 5.3. In usual case, it is origin-form: origin-form = absolute-path [ "?" query ] which never includes fragment part. This means current urllib behavior violates HTTP/1.1 spec.
Takahashi-san, Ah sorry misunderstood which part your were talking about. I assume wrongly you were talking about navigation. Yes for the request which is sent to the server it should be http://tools.ietf.org/html/rfc7230#section-5.3.1 So refactoring your example. 1st request: GET /foo HTTP/1.1 Accept: text/html Host: example.com server response HTTP/1.1 302 Found Location: /bar#test second request must be GET /bar HTTP/1.1 Accept: text/html Host: example.com As for the navigation context is indeed part of the piece of code taking in charge the document after being parsed and not the one doing the HTTP request. (putting it here just that people understand) (to be tested) For server side receiving invalid Request-line http://tools.ietf.org/html/rfc7230#section-3.1.1 Recipients of an invalid request-line SHOULD respond with either a 400 (Bad Request) error or a 301 (Moved Permanently) redirect with the request-target properly encoded. A recipient SHOULD NOT attempt to autocorrect and then process the request without a redirect, since the invalid request-line might be deliberately crafted to bypass security filters along the request chain.
OK I fixed the code. The issue is here https://hg.python.org/cpython/file/1e1c6e306eb4/Lib/urllib/request.py#l656 newurl = urlunparse(urlparts) Basically it reinjects the fragment in the new url. The fix is easy. if urlparts.fragment: urlparts = list(urlparts) urlparts[5] = "" newurl = urlunparse(urlparts) I was trying to make a test for it, but failed. Could someone help me for the test so I can complete the patch? Added the code patch only.