[Python-3000] Unicode and OS strings

Guido van Rossum guido at python.org
Wed Sep 19 01:00:09 CEST 2007


On 9/18/07, James Y Knight <foom at fuhm.net> wrote:

On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote:
> If they contain non-ASCII bytes I am currently in favor of doing a best-effort decoding using the default locale encoding, replacing errors with '?' rather than throwing an exception.

One of the more common things to do with command line arguments is open them. So, it'd really be nice if:

python -c 'import sys; open(sys.argv[1])' [some filename]

I'd like this too, but it isn't easy.

would always work, regardless of the current system encoding and what characters make up the filename. Note that filenames are essentially random binary gunk in most Unix systems; the encoding is unspecified, and there can in fact be multiple encodings, even for different directories making up a single file's path.

I'd like to propose that Python simply assume the external world is likely to be UTF-8, and always decode command-line arguments (and environment vars), and encode for filesystem operations, using the round-trippable UTF-8b. Even if the system says its encoding is iso-2022 or some other abomination.

This has upsides (simple, doesn't trample on PUA codepoints, only needs one new codec, never throws an exception in the above example, and really is correct much of the time), and downsides (if the system locale is iso-2022, and all the filenames you're dealing with really are also properly encoded in iso-2022, it might be nice if they decoded into the sensible Unicode string, instead of a nonsensical (but still round-trippable) one). I think the advantages outweigh the disadvantages, but in the world I live in, using anything other than UTF-8 or ASCII is grounds for entry into an insane asylum. ;)

You seem to be contradicting yourself. The world isn't using UTF-8(b) predominantly yet, so assuming UTF-8(b) everywhere will break your first requirement.
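To illustrate the round-trip property at issue here, a minimal sketch follows. It assumes a Python version that provides the `surrogateescape` error handler, used here only as a stand-in for UTF-8b; the helper names and the sample bytes are made up for the example and are not part of the proposal.

```python
# Hypothetical helpers; "surrogateescape" is assumed as a stand-in for UTF-8b.

def decode_roundtrip(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_roundtrip(text: str) -> bytes:
    """Inverse of decode_roundtrip: lone surrogates become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# "cafe" with an accented e, encoded as Latin-1: not valid UTF-8.
latin1_name = b"caf\xe9"
decoded = decode_roundtrip(latin1_name)

# The decoded text is not the "sensible" string, but no exception is raised
# and the original bytes survive the round trip unchanged.
assert decoded != "caf\u00e9"
assert encode_roundtrip(decoded) == latin1_name

# Input that really is UTF-8 decodes to the expected text.
assert decode_roundtrip("caf\u00e9".encode("utf-8")) == "caf\u00e9"
```

The sketch shows both sides of the trade-off: the call never fails and the bytes can be handed back to the OS intact, but a name in another encoding does not come out as readable text.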

Two encodings are more likely (though not guaranteed) to produce success: the locale encoding or the filesystem encoding. I'm thinking that the locale encoding is probably the one to use for argv and environ, since at least the user can change it in order to make things work.
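For comparison, the two candidate encodings and the lossy best-effort decoding can be sketched as below. This is only an illustration under assumptions: `best_effort_decode` is a hypothetical helper, and Python's built-in `replace` error handler substitutes U+FFFD when decoding rather than a literal '?'.

```python
import locale
import sys


def best_effort_decode(raw: bytes, encoding: str) -> str:
    """Decode without raising; undecodable bytes become U+FFFD (lossy)."""
    return raw.decode(encoding, errors="replace")


# The two candidate encodings, as reported by the running system.
locale_encoding = locale.getpreferredencoding(False)
filesystem_encoding = sys.getfilesystemencoding()
print("locale encoding:    ", locale_encoding)
print("filesystem encoding:", filesystem_encoding)

# A fixed encoding is used here so the result does not depend on the machine:
# a Latin-1 byte decoded as ASCII is replaced instead of raising.
print(ascii(best_effort_decode(b"caf\xe9", "ascii")))
```

Unlike a round-trippable scheme, the replaced byte cannot be recovered afterwards, so a string decoded this way may no longer name the original file.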

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

