Issue 23904: pathlib.PurePath does not accept bytes components

At https://docs.python.org/3/library/pathlib.html#pure-paths one can read

Each element of pathsegments can be either a string or bytes object representing a path segment; it can also be another path object:

which is a lie:

>>> pathlib.PurePath(b"/foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bru/code/cpython/Lib/pathlib.py", line 609, in __new__
    return cls._from_parts(args)
  File "/home/bru/code/cpython/Lib/pathlib.py", line 638, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/home/bru/code/cpython/Lib/pathlib.py", line 630, in _parse_args
    % type(a))
TypeError: argument should be a path or str object, not <class 'bytes'>

So either (1) the doc is wrong, or (2) pathlib's path handling is at fault: it should decode bytes parts with os.fsdecode(). In doubt, I tagged both components. I'll be happy to provide a fix once you decide which solution is right.

I take this opportunity to share an itch: the filesystem encoding on Unix cannot be reliably determined. sys.getfilesystemencoding() is only a preference, and there is no guarantee that an arbitrary file name will respect it. This is extensively discussed in the following thread: https://mail.python.org/pipermail/python-dev/2014-August/135873.html

What is the right way to deal with such names? If I use "surrogateescape" (see PEP 383), how can I display the fake-unicode path to the user? print() seems to use strict encoding. Should I encode it with "surrogateescape" or "ignore" myself beforehand?

Interesting. The doc is wrong here: pathlib was designed so that it only accepts text strings.
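For readers who start from a bytes path, here is a minimal sketch of the workaround implied above: decode with os.fsdecode() before handing the value to pathlib. The helper name pure_path_from_bytes is hypothetical and not part of pathlib; PurePosixPath is used only so the result does not depend on the platform.

```python
import os
import pathlib


def pure_path_from_bytes(raw: bytes) -> pathlib.PurePosixPath:
    """Hypothetical helper: decode a bytes path, then build a pure path.

    os.fsdecode() uses the filesystem encoding; on POSIX, undecodable
    bytes are kept as lone surrogates (PEP 383) instead of raising.
    """
    return pathlib.PurePosixPath(os.fsdecode(raw))


path = pure_path_from_bytes(b"/foo/bar")

# Passing bytes directly is rejected, as reported in this issue.
try:
    pathlib.PurePosixPath(b"/foo/bar")
    rejected = False
except TypeError:
    rejected = True
```

os.fsencode(str(path)) goes back the other way when the original bytes are needed for a system call.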

> If I use "surrogateescape" (see PEP 383), how can I display the fake-unicode path to the user? print() seems to use strict encoding. Should I encode it with "surrogateescape" or "ignore" myself beforehand?

Yes, you should probably encode it yourself. If you are sure your terminal can eat the original bytestring, then use "surrogateescape". Otherwise, "replace" sounds better so that the user knows there are some undecodable characters out there.
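The two options can be sketched as follows. This is only an illustration, assuming a UTF-8 terminal; the helper names original_bytes and printable are made up for the example, and the sample name contains a lone surrogate such as os.fsdecode() would produce on POSIX for an undecodable byte.

```python
def original_bytes(path_text: str, encoding: str = "utf-8") -> bytes:
    """Recover the raw bytes: lone surrogates turn back into the
    undecodable bytes they stand for (PEP 383)."""
    return path_text.encode(encoding, "surrogateescape")


def printable(path_text: str, encoding: str = "utf-8") -> str:
    """Return a string that is safe to print: each character that
    cannot be encoded (including lone surrogates) becomes '?'."""
    return path_text.encode(encoding, "replace").decode(encoding)


# A name as it would look after decoding b"/tmp/caf\xe9" with
# UTF-8 and the "surrogateescape" error handler.
name = "/tmp/caf\udce9"

raw = original_bytes(name)
shown = printable(name)
```

print(shown) is then safe under a strict stdout encoding, while sys.stdout.buffer.write(raw + b"\n") sends the original bytes to a terminal that can handle them.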