[Python-3000] Unicode and OS strings

"Martin v. Löwis" martin at v.loewis.de
Fri Sep 14 14:32:59 CEST 2007


Are you sure that "strings in an unknown encoding" are conceptually strings and not rather bytes?

For file names, most definitely. For command line arguments, I am fairly sure: the argc/argv calling convention does not allow for arbitrary bytes.

And what if we skillfully conserve unknown bytes in a private use or surrogate area and the application author actually knows the encoding and wants correctly decoded strings?

They can easily roundtrip that then to the encoding that it should have:

good_string = (sys.argv[bad_string_index]
               .encode(sys.argv_encoding, "pua-replace")
               .decode(real_encoding))
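Note that sys.argv_encoding and a "pua-replace" error handler are proposals in this thread, not existing APIs. As a minimal runnable sketch of the same roundtrip, the following substitutes the "surrogateescape" error handler (PEP 383, available from Python 3.1), which keeps undecodable bytes as lone surrogates rather than private-use characters. The helper name redecode and the sample encodings are assumptions made for illustration only.

```python
# Sketch only: "surrogateescape" stands in for the proposed "pua-replace"
# handler, and the encodings are passed explicitly because
# sys.argv_encoding is a hypothetical attribute.

def redecode(bad_string, assumed_encoding, real_encoding):
    """Recover the original bytes, then decode them with the right encoding."""
    raw = bad_string.encode(assumed_encoding, "surrogateescape")
    return raw.decode(real_encoding)

# Simulate an argument whose bytes were Latin-1 but were decoded as UTF-8.
original_bytes = "L\u00f6wis".encode("latin-1")
bad_string = original_bytes.decode("utf-8", "surrogateescape")

# The application author knows the real encoding and asks for a proper string.
good_string = redecode(bad_string, "utf-8", "latin-1")
```

The roundtrip is lossless only because the wrong decode preserved every undecodable byte. A handler such as "replace" would discard the original byte values, and the correct string could no longer be recovered.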

However, we are talking about borderline cases here - in most cases, Python will just do the right thing. Special cases aren't special enough to break the rules.

Regards, Martin


