[Python-3000] Unicode and OS strings

James Y Knight foom at fuhm.net
Wed Sep 19 00:52:18 CEST 2007


On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote:

> If they contain non-ASCII bytes I am currently in favor of doing a
> best-effort decoding using the default locale encoding, replacing
> errors with '?' rather than throwing an exception.

One of the more common things to do with command line arguments is
open them. So, it'd really be nice if:

python -c 'import sys; open(sys.argv[1])' [some filename]

would always work, regardless of the current system encoding and what
characters make up the filename. Note that filenames are essentially
random binary gunk in most Unix systems; the encoding is unspecified,
and there can in fact be multiple encodings, even for different
directories making up a single file's path.
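To illustrate the point above: POSIX allows any byte except NUL and '/' in a filename, so a name that is perfectly valid on disk can fail to decode under the locale's codec. A small sketch (the filename bytes here are hypothetical, a Latin-1 name seen through a UTF-8 locale):

```python
# A legal POSIX filename that is not valid UTF-8:
# 0xE9 is 'é' in Latin-1, but an invalid byte sequence in UTF-8.
name = b"caf\xe9.txt"

try:
    name.decode("utf-8")
except UnicodeDecodeError as e:
    # Strict decoding throws, which is exactly why
    # open(sys.argv[1]) can break on such names.
    print("strict UTF-8 decode fails:", e.reason)
```
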

I'd like to propose that python simply assume the external world is
likely to be UTF-8, and always decode command-line arguments (and
environment vars), and encode for filesystem operations using the
roundtrip-able UTF-8b. Even if the system says its encoding is
iso-2022 or some other abomination. This has upsides (simple, doesn't
trample on PUA codepoints, only needs one new codec, never throws an
exception in the above example, and really is correct much of the
time) and downsides (if the system locale is iso-2022, and the
filenames you're dealing with really are properly encoded in
iso-2022, it might be nice if they decoded into the sensible Unicode
string instead of a non-sensical, but still round-trippable, one).
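The round-trippable UTF-8b scheme proposed here is essentially what later shipped in Python 3.1 as the "surrogateescape" error handler (PEP 383): undecodable bytes become lone surrogates, and encoding with the same handler restores the original bytes exactly. A minimal sketch of the round trip, using a hypothetical non-UTF-8 filename:

```python
# A filename that is not valid UTF-8 (0xE9 is Latin-1 'é').
raw = b"caf\xe9.txt"

# Decoding never fails: the bad byte 0xE9 is smuggled through
# as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler reproduces the original bytes,
# so filesystem operations can round-trip any argv/environ value.
assert text.encode("utf-8", "surrogateescape") == raw
```
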

I think the advantages outweigh the disadvantages, but then, in the
world I live in, using anything other than UTF-8 or ASCII is grounds
for entry into an insane asylum. ;)

James
