[Python-Dev] Bytes path support

Cameron Simpson cs at zip.com.au
Thu Aug 21 06:52:19 CEST 2014


On 20Aug2014 16:04, Chris Barker - NOAA Federal <chris.barker at noaa.gov> wrote:

>> but disallowing them in higher level explicitly cross platform abstractions like pathlib.
>
> I think the trick here is that posix-using folks claim that filenames are just bytes, and indeed they can be passed around with a char*, so they seem to be. but you can't possible do anything other than pass them around if you REALLY think they are just bytes. So really, people treat them as "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and maybe a couple others)-is-ascii-compatible"

As someone who fought long and hard in the surrogate-escape listdir() wars, and was won over once the scheme was thoroughly explained to me, I take issue with these assertions: they are bogus or misleading.

Firstly, POSIX filenames are just byte strings. The only forbidden character is the NUL byte, which terminates a C string, and the only special character is the slash, which separates pathname components.

Second, a bare low-level program cannot do much more than pass them around, but it certainly can do things like compute their basename, or perform other path-related operations.
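To make that concrete, here is a minimal sketch (my own illustration, with a made-up filename) of path operations on raw bytes. It assumes only that Python's posixpath functions accept bytes, which they do; the name contains bytes that are not valid UTF-8, and nothing here needs to decode it.

```python
import posixpath

# Hypothetical filename: 0xff and 0xfe never occur in valid UTF-8.
raw = b"/tmp/caf\xff/report\xfe.txt"

# The only byte these operations must recognise is the slash (code 47).
slash_code = ord("/")

base = posixpath.basename(raw)    # last path component, still bytes
parent = posixpath.dirname(raw)   # everything before the last slash
parts = raw.split(b"/")           # plain byte-level splitting also works

print(slash_code, base, parent, parts)
```

No encoding is consulted at any point; the slash is the only byte given meaning.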

The "bytes in some arbitrary encoding where at least the slash character (and maybe a couple others) is ascii compatible" notion is completely bogus. There's only one special byte, the slash (code 47). There's no OS-level need that it or anything else be ASCII compatible. I think characterisations such as the one quoted are actively misleading.

The way you get UTF-8 (or some other encoding, fortunately getting less and less common) is by convention: you decide in your environment to work in some encoding (say UTF-8) via the locale variables, and all your user-facing text is encoded as UTF-8 when turned into bytes for the filename calls, because your text<->bytes methods say to do so.
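A small sketch of what "by convention" means in practice (the filename is hypothetical, and the two encodings are just examples): the same user-facing text becomes different byte strings depending on which encoding the environment has chosen, and the kernel would accept either as a filename.

```python
# Hypothetical user-facing name containing one non-ASCII character (e-acute).
name = "caf\u00e9.txt"

as_utf8 = name.encode("utf-8")       # what a UTF-8 locale would hand to open()
as_latin1 = name.encode("latin-1")   # what an ISO 8859-1 locale would hand over

# Both byte strings are legal POSIX filenames, but they are different names.
same_bytes = as_utf8 == as_latin1

# Reading the Latin-1 bytes back under a UTF-8 convention fails outright.
try:
    as_latin1.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(as_utf8, as_latin1, same_bytes, decodes_as_utf8)
```

The mismatch only appears when two programs (or two users) disagree about the convention; the filesystem itself never notices.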

I think we'd all agree it is nice to have a system where filenames are all Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore the reality for such systems. I certainly think the Windows-side Babel of code pages and multiple code systems is far far worse. (Disclaimer: not a Windows programmer, just based on hearing them complain.)

I'm +1000 on systems where the filesystem enforces Unicode (e.g. Plan 9 or Mac OS X, which forces a specific UTF-8 encoding in the bytes POSIX APIs; the underlying filesystems reject invalid byte sequences).

[...]

Antoine Pitrou wrote:

>> To elaborate specifically about pathlib, it doesn't handle bytes paths but allows you to generate them if desired: https://docs.python.org/3/library/pathlib.html#operators
>
> but that uses os.fsencode: "Encode filename to the filesystem encoding". As I understand it, the whole problem with some posix systems is that there is NO filesystem encoding -- i.e. you can't know for sure what encoding a filename is in. So you need to be able to pass the bytes through as they are.

Yes and no. I made that argument too.

There's no external "filesystem encoding" in the sense of something recorded in the filesystem that anyone can inspect. But there are the expressed locale settings, available at runtime to any program that cares to pay attention. It is a workable situation.
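As a rough sketch of why it is workable: Python derives its filesystem encoding from the runtime environment, and on POSIX os.fsdecode() and os.fsencode() use the surrogateescape error handler so undecodable bytes survive a round trip. The example below assumes a UTF-8 convention and spells the codec out explicitly so it behaves the same on any platform; the filename is made up.

```python
import sys
from pathlib import PurePosixPath

# Whatever encoding Python picked up from the environment; the value varies.
print(sys.getfilesystemencoding())

# Hypothetical filename containing a byte that is invalid in UTF-8.
raw = b"report\xff.txt"

# Undecodable bytes become lone surrogates (0xff -> U+DCFF) instead of errors.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")

# pathlib works in str, but bytes() on a path yields the encoded form.
path_bytes = bytes(PurePosixPath("/tmp") / "notes.txt")

print(repr(text), round_trip, path_bytes)
```

So a program can present names as text under the locale's convention and still hand the exact original bytes back to the OS.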

Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly internally consistent. It just doesn't match what he wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)

Cheers, Cameron Simpson <cs at zip.com.au>

God is real, unless declared integer. - Johan Montald, johan at ingres.com
