gh-87389: avoid treating path as URI with netloc by nascheme · Pull Request #93894 · python/cpython
There seem to be three behaviour changes proposed:
- Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'
Another option is to prefix the result with a double slash representing an empty host, e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer the four-slash option, because it also fixes some urlsplit → urlunsplit round-trip cases.
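A minimal sketch of why the leading double slash is ambiguous, using only urlsplit and urlunsplit from urllib.parse. The helper name survives_round_trip is hypothetical, and no urlunsplit output is shown because it depends on the Python version and on which proposal is applied.

```python
from urllib.parse import urlsplit, urlunsplit


def survives_round_trip(parts):
    """Hypothetical helper: True if unsplitting then splitting returns the same 5 components."""
    return tuple(urlsplit(urlunsplit(parts))) == tuple(parts)


# A URL string that starts with '//' is read as having an authority,
# so the text after the two slashes becomes the netloc, not the path.
ambiguous = urlsplit('//path')
print(repr(ambiguous.netloc), repr(ambiguous.path))

# With four slashes the authority is empty and the path keeps its '//'.
four_slash = urlsplit('////path')
print(repr(four_slash.netloc), repr(four_slash.path))

# The case under discussion; the answer depends on the urlunsplit in use.
print(survives_round_trip(('', '', '//path', '', '')))
```

If urlunsplit emitted '////path' for this input, the second urlsplit call above suggests the components would come back unchanged.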
- Changing urlunsplit(('', '', 'colon:path', '', '')) to return './colon:path' instead of 'colon:path'
This seems a reasonable change, and RFC 3986 (section 4.2) more or less suggests it. (Another option might be to encode the first colon, and return 'colon%3Apath'.)
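To illustrate the two options, here is a rough sketch. Both helper names are hypothetical and are not part of urllib.parse; only urlsplit is a real library call.

```python
from urllib.parse import urlsplit


def guard_colon_path(path):
    """Hypothetical helper: prefix './' when the first segment of a relative path contains ':'."""
    if path.startswith('/'):
        return path  # an absolute path cannot be mistaken for a scheme
    first_segment = path.split('/', 1)[0]
    return './' + path if ':' in first_segment else path


def encode_first_colon(path):
    """Hypothetical alternative: percent-encode the first colon, wherever it occurs."""
    return path.replace(':', '%3A', 1)


# Unprotected, the text before the colon is parsed as a scheme.
print(repr(urlsplit('colon:path').scheme))

# Either form keeps the whole string in the path component.
print(urlsplit(guard_colon_path('colon:path')).path)
print(urlsplit(encode_first_colon('colon:path')).path)
```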
- SimpleHTTPRequestHandler’s handling of GET https://example.net/dir
This is a legal HTTP 1.0 and 1.1 request, but this form of request target is mainly meant for proxy servers, and SimpleHTTPRequestHandler is not a proxy. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in the spirit of HTTP), and decides to redirect with a trailing slash.
Currently it looks like the code sends Location: https://example.net/dir/. I don’t think there is anything really wrong with that.
The proposed changes would send Location: ./https://example.net/dir/. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir, so the redirect would resolve to https://example.net/https://example.net/dir/, which is not intended.
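The resolution problem can be checked roughly with urljoin, as a sketch. Python's urljoin may normalize empty path segments, so the exact number of slashes in its output can differ from the URL written above. The point is only that the proposed value no longer resolves to the intended directory.

```python
from urllib.parse import urljoin

# Base URL: the original request target.
base = 'https://example.net/dir'

# Current Location value: an absolute URL, so the base is irrelevant.
current = urljoin(base, 'https://example.net/dir/')

# Proposed Location value: a path-relative reference, resolved against the base.
proposed = urljoin(base, './https://example.net/dir/')

print(current)
print(proposed)
```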
If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.
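A rough sketch of what "just look up the path component" could mean. The helper name is hypothetical and this is not how http.server is implemented.

```python
from urllib.parse import urlsplit


def path_of_request_target(target):
    """Hypothetical helper: drop scheme and authority, keep only the path."""
    # Caveat: a target starting with '//' is still parsed as having an
    # authority, so this sketch does not handle that case.
    return urlsplit(target).path or '/'


print(path_of_request_target('https://example.net/dir'))
print(path_of_request_target('/dir'))
```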
If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, together with either an opt-in way to get the new behaviour or a plan to change the warning to an exception in the future?