Issue 40996: urllib should fsdecode percent-encoded parts of file URIs on Unix (original) (raw)
On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used.
In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph.
Pathlib fsencodes the path when making a file URI, roundtripping the bytes e.g. passed as an argument: % python3 -c 'import pathlib, sys; print(pathlib.Path(sys.argv[1]).as_uri())' /tmp/a$(echo -e '\xE4') file:///tmp/a%E4
Example with curl using this URL: % echo 'Hello, World!' > /tmp/a$(echo -e '\xE4') % curl file:///tmp/a%E4 Hello, World!
Python 2’s urllib works the same: % python2 -c 'from urllib import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' 'Hello, World!\n'
However, Python 3’s urllib fails: % python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' Traceback (most recent call last): File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file stats = os.stat(localfile) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�'
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "", line 1, in File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python3.8/urllib/request.py", line 525, in open response = self._open(req, data) File "/usr/lib/python3.8/urllib/request.py", line 542, in _open result = self._call_chain(self.handle_open, protocol, protocol + File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain result = func(*args) File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open return self.open_local_file(req) File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file raise URLError(exp) urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'>
urllib.request.url2pathname() is the function converting the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler).
I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is.
On Windows, I couldn’t do experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t a have a strong opinion on this.