[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Sat Apr 25 14:07:44 CEST 2009


Cameron Simpson wrote:

On 22Apr2009 08:50, Martin v. Löwis <martin at v.loewis.de> wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

For values to be portable across XSI-conformant systems, the value
must be composed of characters from the portable character set (except
NUL and as indicated below).

Environment variable names used by the utilities in the XCU
specification consist solely of upper-case letters, digits and the "_"
(underscore) from the characters defined in Portable Character Set.
Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

The arguments represented by arg0, ... are pointers to null-terminated
character strings

where a character string is "A contiguous sequence of characters terminated by and including the first null byte.", and a character is

A sequence of one or more bytes representing a single graphic symbol
or control code. This term corresponds to the ISO C standard term
multibyte character (multi-byte character), where a single-byte
character is a special case of a multi-byte character. Unlike the
usage in the ISO C standard, character here has no necessary
relationship with storage space, and byte is used when storage space
is discussed.

So you're proposing that all POSIX OS interfaces (which use byte strings) interpret those byte strings into Python3 str objects, with a codec that will accept arbitrary byte sequences losslessly and is totally reversible, yes?

Correct.

And, I hope, that the os.* interfaces silently use it by default.

Correct.

| Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1 This last sentence kills the idea for me, unless I'm missing something. Which I may be, of course. POSIX filesystems do not have a file system encoding.

Why is that a problem for the PEP?

If I'm writing a general purpose UNIX tool like chmod or find, I expect it to work reliably on any UNIX pathname. It must be totally encoding blind. If I speak to the os.* interface to open a file, I expect to hand it bytes and have it behave.

See the other messages. If you want to do that, you can continue to.
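As a hedged illustration of "you can continue to": in Python 3 the os functions accept bytes path arguments and, when given bytes, return bytes. The sketch below lists a directory without decoding anything. The helper name list_raw and the temporary directory are only for illustration and are not part of the PEP.

```python
import os
import tempfile


def list_raw(directory: bytes) -> list:
    """Return directory entries as raw bytes; no decoding is applied."""
    # Passing a bytes path makes os.listdir() return bytes entries.
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)  # work with a bytes path from here on
    # Create a file through the bytes interface as well.
    open(os.path.join(tmp_b, b"plain.txt"), "wb").close()
    names = list_raw(tmp_b)

print(names)
```

A tool written this way never asks what encoding the names are in, which is the "encoding blind" behaviour asked for above.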

I'm very much in favour of being able to work in strings for most purposes, but if I use the os.* interfaces on a UNIX system it is necessary to be able to work in bytes, because UNIX file pathnames are bytes.

Please re-read the PEP. It provides a way of being able to access any POSIX file name correctly, and still pass strings.
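A minimal sketch of the round trip the PEP describes. Two assumptions: the error handler is spelled "surrogateescape", which is the name it carries in released Python 3 (this thread still uses the draft name "python-escape"), and UTF-8 stands in for the file system encoding.

```python
# A file name containing a byte that is not valid UTF-8
# (0xE9 is "é" in Latin-1).
raw = b"caf\xe9.txt"

# Decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9
# instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte exactly.
back = name.encode("utf-8", errors="surrogateescape")

print(repr(name))   # 'caf\udce9.txt'
print(back == raw)  # True
```

The string can be passed around and handed back to the os interfaces, and the original bytes are recovered on the way out.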

If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow writing low level UNIX tools.

Finally, I have a small python program whose whole purpose in life is to transcode UNIX filenames before transfer to a MacOSX HFS directory, because of HFS's enforced particular encoding. What approach should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.
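One way this could look in practice, as a sketch under stated assumptions rather than a recommended implementation: the names arrive as str decoded with UTF-8 plus the escape handler (spelled "surrogateescape" in released Python 3), any name that contains escaped bytes is assumed to be Latin-1, and the target wants decomposed (NFD) Unicode. The function name transcode_name and the Latin-1 fallback are hypothetical.

```python
import unicodedata


def transcode_name(name: str, legacy_encoding: str = "latin-1") -> str:
    """Turn a name as returned by the os interfaces into clean text.

    Assumes the name was decoded with UTF-8 and the surrogateescape
    handler; legacy_encoding is a guess for names that were not UTF-8.
    """
    try:
        # A strict encode fails if the name carries escaped bytes.
        name.encode("utf-8")
    except UnicodeEncodeError:
        # Recover the original bytes, then decode with the assumed
        # legacy encoding instead.
        raw = name.encode("utf-8", errors="surrogateescape")
        name = raw.decode(legacy_encoding)
    # Assumption: the destination expects decomposed characters.
    return unicodedata.normalize("NFD", name)


print(ascii(transcode_name("caf\udce9.txt")))  # 'cafe\u0301.txt'
print(transcode_name("plain.txt"))             # plain.txt
```

The application still has to decide what the undecodable names really are; the PEP only guarantees that the original bytes remain recoverable from the string.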

Regards, Martin


