[Python-Dev] When should pathlib stop being provisional?

Steven D'Aprano steve at pearwood.info
Tue Apr 5 22:51:55 EDT 2016


On Wed, Apr 06, 2016 at 10:02:30AM +1000, Chris Angelico wrote:

> My personal view on the text/bytes debate is that a path is fundamentally a human concept, and consists therefore of text. The fact that some file systems store (at the low level) bytes and some store (I think) UTF-16 code units should be immaterial; path components exist for people. We can smuggle unrecognized bytes around, but ultimately, those bytes came from characters at some point - we just don't know the encoding. So a Path object has no relationship with bytes, only with str.

That might usually be true in practice, but it is incorrect in principle. Paths in POSIX systems like Linux are fundamentally byte-strings with only two restrictions: a file name may not contain the bytes \0 (NUL) or \x2f (the "/" separator).

The fact that paths in Linux mostly happen to look like English words (often heavily abbreviated) is a historical accident. The file system itself supported paths containing (say) \xff even back in the days when text was pure US-ASCII and bytes over \x7f had no textual meaning, and these days paths still support sequences of bytes that have no human meaning in any encoding.
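To make the point concrete, here is a minimal sketch of a file name that is legal under the two POSIX restrictions but is not valid UTF-8. The name `report-\xff.txt` and the helper `is_valid_posix_component` are hypothetical, invented for illustration. The `surrogateescape` error handler (PEP 383) is presumably the kind of "smuggling" referred to above; the sketch applies it explicitly with UTF-8 rather than relying on the platform's file system encoding.

```python
# Hypothetical file name containing a byte (0xFF) that never occurs in UTF-8.
raw_name = b"report-\xff.txt"


def is_valid_posix_component(name: bytes) -> bool:
    """Check only the two byte-level restrictions: no NUL and no '/'."""
    return len(name) > 0 and b"\0" not in name and b"/" not in name


# A strict decode fails: the bytes are not text in this encoding.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xFF becomes U+DCFF), so the original bytes can be recovered later.
smuggled = raw_name.decode("utf-8", "surrogateescape")
round_tripped = smuggled.encode("utf-8", "surrogateescape")
```

The resulting `str` round-trips losslessly, but it contains a lone surrogate and is therefore not well-formed Unicode text.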

I don't know if this makes the tiniest lick of difference for pathlib. I would be perfectly content if we stuck with the design decision that pathlib can only represent paths representable as Unicode strings, and left weird POSIX filenames to the legacy byte-string interface.
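As a rough illustration of that design decision, the sketch below shows how pathlib in current CPython treats a byte-string argument, and how a non-UTF-8 name can still be carried as a `str` via `surrogateescape`. The file name and the `/tmp` directory are hypothetical, and the behaviour shown is an assumption about present-day CPython rather than anything decided in this thread.

```python
from pathlib import PurePosixPath

# Hypothetical non-UTF-8 file name, as raw bytes.
raw_name = b"report-\xff.txt"

# pathlib refuses byte-string arguments outright.
try:
    PurePosixPath(raw_name)
    accepted_bytes = True
except TypeError:
    accepted_bytes = False

# The same name, decoded with surrogateescape, is an ordinary str
# as far as pathlib is concerned.
text_name = raw_name.decode("utf-8", "surrogateescape")
path = PurePosixPath("/tmp") / text_name

# The original bytes are still recoverable from the final component.
recovered = path.name.encode("utf-8", "surrogateescape")
```

Pure paths never touch the file system, so this runs on any platform. Code that needs the exact on-disk bytes can still use the byte-string forms of the `os` functions.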

-- Steve


