[Python-3000] Unicode and OS strings
Guido van Rossum guido at python.org
Thu Sep 13 18:48:47 CEST 2007
Yes, I have noticed this too. Environment variables, command line arguments, locale properties, TZ names, and so on, are often given as 8-bit strings in who knows what encoding. I'm not sure what the solution is, but we need one. I'm guessing one thing we need to do is research how various systems decide what encoding to use. Even on OSX, I managed to create an environment variable containing non-ASCII non-UTF-8 bytes.
I believe Tcl/Tk used to have some kind of heuristic where they would try UTF-8 first and, if that failed, use Latin-1 for the bytes that aren't valid UTF-8, but I'm not at all sure that that's the right solution in places where Latin-1 is not spoken.
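A minimal sketch of the kind of heuristic described above, written for present-day Python 3. It is not Tcl/Tk's actual code; the handler name `latin1_fallback` and the helper `decode_os_string` are made up for illustration. Valid UTF-8 sequences decode normally, and each byte that cannot be part of a valid sequence is mapped to the Latin-1 character with the same value.

```python
import codecs


def _latin1_fallback(exc):
    """Error handler: reinterpret undecodable bytes as Latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    # so this step can never fail.
    return bad.decode("latin-1"), exc.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_os_string(raw):
    """Try UTF-8 first; fall back to Latin-1 for the invalid bytes only."""
    return raw.decode("utf-8", errors="latin1_fallback")


if __name__ == "__main__":
    print(repr(decode_os_string(b"caf\xc3\xa9")))  # valid UTF-8 input
    print(repr(decode_os_string(b"caf\xe9")))      # Latin-1 input
```

Note that both inputs in the usage lines decode to the same string, so the original bytes cannot be recovered afterwards, and the guess is only plausible where Latin-1 is the likely legacy encoding.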
--Guido
On 9/13/07, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?
>
> This is an unfortunate side effect of switching to Unicode. It's unfortunate because often the data is only passed back to another function, and thus lack of round trip is a pure loss caused by choosing a Unicode string as the representation of such data. I opt for Unicode strings nevertheless, Python did a right step.
>
> I once checked what other languages with Unicode strings do, and the results were not enlightening: inconsistency, weird errors, damaged or truncated data.
>
> Python 3.0a1 mostly fails with weird errors, and fails a bit too early:
>
> [qrczak ~]$ echo $LANG
> pl_PL.UTF-8
> [qrczak ~]$ python3.0 - $(printf '\x80')
> Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Fatal Python error: no mem for sys.argv
> zsh: abort python3.0 - $(printf '\x80')
>
> [qrczak ~]$ FOO=$(printf '\x80') python3.0
> Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> object : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte')
> type : UnicodeDecodeError
> refcount: 4
> address : 0xb7a5142c
> lost sys.stderr
> >>>
>
> [qrczak ~]$ mkdir $(printf '\x80')
> [qrczak ~]$ cd $(printf '\x80')
> [qrczak ~/\M-^@]$ python3.0
> Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
> [GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> object : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte')
> type : UnicodeDecodeError
> refcount: 4
> address : 0xb7a1242c
> lost sys.stderr
> >>>
>
> os.listdir returns undecodable filenames as str8.
>
> I don't know what it should do. Choices:
>
> 1. Fail in a controlled way (without losing sys.stderr), and no earlier than necessary, i.e. fail when the given string is requested, not when a module is imported.
>
> 1a. Guarantee that choosing a different encoding and retrying works, for a rare case when the programmer wishes to handle such strings by explicitly trying latin1.
>
> 2. Return undecodable information as bytes, and accept bytes when it is passed back to similar functions in the other direction.
>
> 3. Have an option to use a modified UTF-8 in these places, where undecodable bytes are e.g. escaped as U+0000 U+00xx.
>
> I will not advocate any choice other than 1, but perhaps someone has another idea.
>
> My language Kogut uses 1a (even for things like sys.argv which look like variables), experimentally with 3 as an option to be requested either by choosing such encoding by the program or with an environment variable.
>
> --
> Marcin Kowalczyk
> qrczak at knm.org.pl
> http://qrnik.knm.org.pl/~qrczak/
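For illustration, choice 3 from the quoted message can be sketched in present-day Python 3 as follows. This is only one reading of "escaped as U+0000 U+00xx", not code from Python or Kogut; the handler name `nul_escape` and the helpers `decode_os` and `encode_os` are hypothetical. The sketch assumes the incoming bytes never contain a real NUL, which holds for C-style argv, environment, and path strings.

```python
import codecs


def _nul_escape(exc):
    """Error handler: escape each undecodable byte as U+0000 U+00xx."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00" + chr(b) for b in bad), exc.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("nul_escape", _nul_escape)


def decode_os(raw):
    """Decode OS bytes as UTF-8, escaping whatever does not decode."""
    return raw.decode("utf-8", errors="nul_escape")


def encode_os(text):
    """Inverse of decode_os: turn U+0000 U+00xx back into the raw byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\x00":
            # An escape must be followed by a code point in 0x00-0xFF.
            if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
                raise ValueError("dangling or invalid U+0000 escape")
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    raw = b"/home/users/qrczak/\x80"
    text = decode_os(raw)
    print(repr(text))
    print(encode_os(text) == raw)
```

The point of the escape is that data which is only passed back to another OS function survives the round trip, at the cost of strings that contain U+0000 and need care when displayed or compared.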
--
--Guido van Rossum (home page: http://www.python.org/~guido/)