gh-87389: avoid treating path as URI with netloc by nascheme · Pull Request #93894 · python/cpython

There seem to be three behaviour changes proposed:

  1. Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'

Another option is to prefix the path with a double slash representing an empty host, e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer that four-slash option, because it also fixes some urlsplit → urlunsplit round-trip cases.
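
To illustrate the round-trip concern, here is a minimal sketch using only urlsplit. The result of urlunsplit for this input differs between Python versions, so it is deliberately not shown; the variable names are just for illustration.

```python
from urllib.parse import urlsplit

# '//path' is parsed as an authority, so the original path is lost.
two = urlsplit('//path')
print(repr(two.netloc), repr(two.path))    # 'path' ''

# '////path' is parsed as an empty authority followed by the path '//path'.
four = urlsplit('////path')
print(repr(four.netloc), repr(four.path))  # '' '//path'
```

So the four-slash form is the one that splits back into an empty netloc and the path '//path'.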

  2. Changing urlunsplit(('', '', 'colon:path', '', '')) to return './colon:path'

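This seems a reasonable change, and RFC 3986 more or less suggests it: Section 4.2 says a first path segment containing a colon should be preceded by a dot-segment, e.g. './this:that'. (Another option might be to encode the first colon, and return 'colon%3Apath'.)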
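
A small sketch of why the bare form is ambiguous, again using only urlsplit (the variable names are illustrative):

```python
from urllib.parse import urlsplit

# Without a prefix, the text before the colon is taken as a scheme.
plain = urlsplit('colon:path')
print(repr(plain.scheme), repr(plain.path))      # 'colon' 'path'

# A leading './' keeps the whole string in the path component.
dotted = urlsplit('./colon:path')
print(repr(dotted.scheme), repr(dotted.path))    # '' './colon:path'

# Percent-encoding the colon also avoids the scheme interpretation.
encoded = urlsplit('colon%3Apath')
print(repr(encoded.scheme), repr(encoded.path))  # '' 'colon%3Apath'
```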

  3. SimpleHTTPRequestHandler’s handling of GET https://example.net/dir

This is a legal HTTP 1.0 and 1.1 request, but it is mainly meant for proxy servers, and SimpleHTTPRequestHandler is not a proxy. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in the spirit of HTTP), and decides to redirect with a trailing slash.

Currently it looks like the code sends Location: https://example.net/dir/. I don’t think there is anything really wrong with that.

The proposed changes would send Location: ./https://example.net/dir/. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir, so the redirect would resolve to https://example.net/https://example.net/dir/, which is not intended.
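
A rough way to check this resolution is urljoin from urllib.parse. This is only a sketch: urljoin may collapse the doubled slash inside the path, so its exact output can differ slightly from the string above, but the result still lands under the original host rather than at the intended directory.

```python
from urllib.parse import urljoin

base = 'https://example.net/dir'          # original request target
location = './https://example.net/dir/'   # proposed Location header value

# Resolve the path-relative redirect against the base URL.
resolved = urljoin(base, location)
print(resolved)
```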

If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.
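
As a sketch of that idea only (path_only is a hypothetical helper, not something in http.server), the path component could be extracted like this:

```python
from urllib.parse import urlsplit

def path_only(target: str) -> str:
    """Hypothetical helper: drop scheme and authority from a request target."""
    # Fall back to '/' when an absolute-form target has no path at all.
    return urlsplit(target).path or '/'

print(path_only('https://example.net/dir'))  # /dir
print(path_only('/dir'))                     # /dir
```

Note that a target beginning with '//' would still be parsed as an authority here, so a real fix would need to treat that case separately.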

If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, combined with either an opt-in way to get the new behaviour or a plan to change the warning to an exception in the future?