[Python-3000] Unicode and OS strings
Victor Stinner victor.stinner at haypocalc.com
Wed Sep 19 12:40:33 CEST 2007
Hi,
On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?
On Linux, filenames are byte strings, not character strings. I always have this problem with Python 2.x: I converted filenames (argv[x]) to Unicode so that I could format error messages in full Unicode, but that is not always possible. Linux allows invalid UTF-8 filenames even on a fully UTF-8 installation (Ubuntu); see Marcin's examples.
So I propose keeping sys.argv as an array of byte strings. If Python creates Unicode strings instead, you will be unable to write a program that converts a filesystem with "broken" filenames (see the convmv program for example) or that opens a file with a "broken" filename (broken: an invalid byte sequence for the UTF/JIS/Big5/... charset).
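To make the problem concrete, here is a minimal sketch in Python 3 syntax (not the Python 2.x code used elsewhere in this message). The filename bytes and the helper name try_strict_decode are made up for illustration. It shows that a strict decode of such a name fails, and that a lossy decode cannot be turned back into the original bytes, so the file could no longer be opened by name:

```python
# Hypothetical filename bytes: 0xE9 is "é" in Latin-1 but invalid here as UTF-8.
raw = b"report-\xe9.txt"


def try_strict_decode(data, charset="utf-8"):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# Strict decoding fails, so there is no exact Unicode form of this name.
strict = try_strict_decode(raw)

# Lossy decoding succeeds, but the invalid byte becomes U+FFFD ...
lossy = raw.decode("utf-8", "replace")

# ... and encoding it again does not give back the original bytes.
round_trip = lossy.encode("utf-8")

print(strict)
print(round_trip == raw)
```

Keeping the original byte string around for I/O avoids this loss; the decoded form is then only needed for display.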
For Python 2.x, my solution is to keep byte strings for I/O and use Unicode strings for error messages. Here is a function that converts any byte string (filename) to Unicode:
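    from hachoir_core.i18n import getTerminalCharset
    from hachoir_core.tools import makePrintable

    def unicodeFilename(filename, charset=None):
        # Default to the charset of the terminal
        if not charset:
            charset = getTerminalCharset()
        try:
            return unicode(filename, charset)
        except UnicodeDecodeError:
            # Invalid byte sequence: fall back to an escaped, printable form
            return makePrintable(filename, charset, to_unicode=True)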
makePrintable() replaces invalid byte sequences with escape strings. Example:
    >>> from hachoir_core.tools import makePrintable
    >>> makePrintable("a\x80", "utf8", to_unicode=True)
    u'a\\x80'
    >>> print makePrintable("a\x80", "utf8", to_unicode=True)
    a\x80
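For readers without hachoir installed, the escaping behaviour shown above can be approximated with a rough sketch in Python 3 syntax. The name printable_filename is hypothetical and this is not the makePrintable() implementation: it only covers the invalid-byte case from the example, and it relies on the standard "backslashreplace" error handler, which is accepted for decoding in Python 3.5 and later.

```python
def printable_filename(raw, charset="utf-8"):
    """Decode bytes for display; undecodable bytes are shown as \\xNN escapes."""
    # Hypothetical stand-in, not hachoir's makePrintable().
    # Valid sequences decode normally; each invalid byte becomes a
    # literal backslash escape such as \x80 in the resulting text.
    return raw.decode(charset, errors="backslashreplace")


print(printable_filename(b"a\x80"))
```

The result is meant for error messages only; the original byte string is still what gets passed to open() and similar calls.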
Source code of function makePrintable: http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225
Source code of function getTerminalCharset(): http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23
Victor Stinner http://hachoir.org/