[Python-3000] Unicode and OS strings

Victor Stinner victor.stinner at haypocalc.com
Wed Sep 19 12:40:33 CEST 2007


Hi,

On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:

> What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?

On Linux, filenames are byte strings, not character strings. I always have this problem with Python 2.x: I converted filenames (argv[x]) to Unicode to be able to format error messages in full Unicode... but that is not always possible. Linux allows invalid UTF-8 filenames even on a fully UTF-8 installation (Ubuntu); see Marcin's examples.
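To make the problem concrete, here is a minimal sketch (written for Python 3, not taken from the original message) that picks out byte-string names which are not valid in a given charset. The function name undecodable_names and the sample listing are hypothetical; the listing stands in for what os.listdir(b".") could return on Linux.

```python
def undecodable_names(names, charset="utf-8"):
    """Return the byte-string names that are not valid in `charset`."""
    bad = []
    for raw in names:
        try:
            raw.decode(charset)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


# Hypothetical directory listing, kept as bytes
sample = [b"report.txt", b"caf\xc3\xa9.txt", b"a\x80"]
print(undecodable_names(sample))
```

Only the last name is reported: 0x80 is a UTF-8 continuation byte and cannot start a character.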

So I propose to keep sys.argv as an array of byte strings. If you try to create Unicode strings instead, you will be unable to write a program that converts a filesystem with "broken" filenames (see the convmv program, for example) or that opens a file with a "broken" filename (broken: an invalid byte sequence for the UTF/JIS/Big5/... charset).
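One way to see why a lossy conversion breaks such programs: once an invalid byte has been replaced during decoding, re-encoding no longer gives back the original name, so the file cannot be reopened from the Unicode string. The sketch below is Python 3 and only illustrative; survives_round_trip is a hypothetical helper, and the "replace" error handler is assumed as the lossy strategy.

```python
def survives_round_trip(raw, charset="utf-8", errors="replace"):
    """True if decoding then re-encoding gives back the same bytes."""
    text = raw.decode(charset, errors=errors)
    # Re-encode; characters the charset cannot represent become '?'
    return text.encode(charset, errors="replace") == raw


print(survives_round_trip(b"abc"))      # valid name
print(survives_round_trip(b"a\x80"))    # "broken" name
```

The valid name survives, while b"a\x80" comes back as different bytes (the replacement character U+FFFD encoded in UTF-8), which would name a different file.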


For Python 2.x, my solution is to keep byte strings for I/O and use Unicode strings for error messages. Here is a function to convert any byte string (filename string) to Unicode:

    def unicodeFilename(filename, charset=None):
        if not charset:
            charset = getTerminalCharset()
        try:
            return unicode(filename, charset)
        except UnicodeDecodeError:
            return makePrintable(filename, charset, to_unicode=True)

makePrintable() replaces invalid byte sequences with escape strings. Example:

    >>> from hachoir_core.tools import makePrintable
    >>> makePrintable("a\x80", "utf8", to_unicode=True)
    u'a\\x80'
    >>> print makePrintable("a\x80", "utf8", to_unicode=True)
    a\x80

Source code of function makePrintable: http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225

Source code of function getTerminalCharset(): http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23
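For readers without hachoir at hand, the escaping shown in the makePrintable() example can be approximated in Python 3.5 or later with the standard "backslashreplace" error handler. This is a sketch under that assumption, not the hachoir implementation; display_name and describe_error are hypothetical names.

```python
def display_name(raw, charset="utf-8"):
    """Decode `raw` for display; invalid bytes are shown as hex escapes."""
    return raw.decode(charset, errors="backslashreplace")


def describe_error(raw, charset="utf-8"):
    """Build a Unicode error message; `raw` itself stays untouched for I/O."""
    return "cannot open %s" % display_name(raw, charset)


print(describe_error(b"a\x80"))
```

The byte string is still what gets passed to open() or os.rename(); only the message uses the decoded form.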

Victor Stinner http://hachoir.org/


