[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Thu Sep 13 18:22:12 CEST 2007


What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?

This is an unfortunate side effect of switching to Unicode. It's unfortunate because often the data is only passed back to another function, and thus the lack of a round trip is a pure loss caused by choosing a Unicode string as the representation of such data. I opt for Unicode strings nevertheless; Python took the right step.
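A minimal sketch of that round-trip loss, written against the bytes/str API of later Python 3 releases rather than 3.0a1. The helper name lossy_round_trip is hypothetical, and "replace" is only one of several ways a decoder could cope with a bad byte:

```python
raw = b"\x80"  # a lone continuation byte, not valid UTF-8 on its own


def lossy_round_trip(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode with replacement characters, then encode again (hypothetical helper)."""
    return data.decode(encoding, errors="replace").encode(encoding)


# Strict decoding refuses the byte outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding succeeds, but the original byte cannot be recovered.
print(strict_ok)
print(lossy_round_trip(raw) == raw)
```

Either way the program cannot hand the original bytes back to the OS once they have gone through a str.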

I once checked what other languages with Unicode strings do, and the results were not enlightening: inconsistency, weird errors, damaged or truncated data.

Python 3.0a1 mostly fails with weird errors, and fails a bit too early:

[qrczak ~]$ echo $LANG
pl_PL.UTF-8

[qrczak ~]$ python3.0 - $(printf '\x80')
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Fatal Python error: no mem for sys.argv
zsh: abort python3.0 - $(printf '\x80')

[qrczak ~]$ FOO=$(printf '\x80') python3.0
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object : UnicodeDecodeError('utf8', b'\x80', 0, 1, 'unexpected code byte')
type : UnicodeDecodeError
refcount: 4
address : 0xb7a5142c
lost sys.stderr

[qrczak ~]$ mkdir $(printf '\x80')

[qrczak ~]$ cd $(printf '\x80')

[qrczak ~/\M-^@]$ python3.0
Python 3.0a1 (py3k, Sep 8 2007, 15:57:56)
[GCC 4.2.1 20070719 (release) (PLD-Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
object : UnicodeDecodeError('utf8', b'/home/users/qrczak/\x80', 19, 20, 'unexpected code byte')
type : UnicodeDecodeError
refcount: 4
address : 0xb7a1242c
lost sys.stderr

os.listdir returns undecodable filenames as str8.

I don't know what it should do. Choices:

  1. Fail in a controlled way (without losing sys.stderr), and no earlier than necessary, i.e. fail when the given string is requested, not when a module is imported.

  1a. Guarantee that choosing a different encoding and retrying works, for a rare case when the programmer wishes to handle such strings by explicitly trying latin1.

  2. Return undecodable information as bytes, and accept bytes when it is passed back to similar functions in the other direction.

  3. Have an option to use a modified UTF-8 in these places, where undecodable bytes are e.g. escaped as U+0000 U+00xx.
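The escaping choice (undecodable bytes written as U+0000 U+00xx) could look roughly like the sketch below. This is only an illustration under stated assumptions, not an existing codec: the names decode_os and encode_os are hypothetical, the underlying encoding is assumed to be UTF-8, and U+0000 is assumed to be free as an escape marker because Unix arguments, environment values and filenames cannot contain a NUL byte.

```python
def decode_os(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, escaping each undecodable byte 0xNN as U+0000 U+00NN."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice raw[pos:].
            out.append(raw[pos:pos + exc.start].decode(encoding))
            for byte in raw[pos + exc.start:pos + exc.end]:
                out.append("\x00" + chr(byte))
            pos += exc.end
    return "".join(out)


def encode_os(text: str, encoding: str = "utf-8") -> bytes:
    """Inverse of decode_os: turn each U+0000 U+00NN pair back into byte 0xNN."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\x00":
            if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
                raise ValueError("dangling or invalid escape at index %d" % i)
            out.append(ord(text[i + 1]))
            i += 2
        else:
            # Encode the run of ordinary characters up to the next escape marker.
            j = text.find("\x00", i)
            if j == -1:
                j = len(text)
            out += text[i:j].encode(encoding)
            i = j
    return bytes(out)


path = b"/home/users/qrczak/\x80"
escaped = decode_os(path)
print(repr(escaped))
print(encode_os(escaped) == path)
```

The point of the design is that the str form is always obtainable and converts back to exactly the original bytes, at the cost of strings that contain U+0000 and are awkward to display.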

I will not advocate any choice other than 1, but perhaps someone has another idea.

My language Kogut uses 1a (even for things like sys.argv which look like variables), and experimentally offers 3 as an option, requested either by the program choosing such an encoding or through an environment variable.
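For illustration only, here is one way the "fail when the string is requested, and allow a retry with another encoding" behaviour might be sketched in Python. It is not how Kogut or Python 3.0a1 is implemented; the class LazyOsString and the helper text_or_latin1 are hypothetical names, and UTF-8 stands in for the system encoding.

```python
class LazyOsString:
    """Hypothetical wrapper: keeps the raw bytes and decodes only on request."""

    def __init__(self, raw: bytes) -> None:
        # Nothing is decoded here, so merely obtaining the value cannot fail.
        self.raw = raw

    def text(self, encoding: str = "utf-8") -> str:
        # A UnicodeDecodeError surfaces here, at the point of use.
        return self.raw.decode(encoding)


def text_or_latin1(value: LazyOsString, encoding: str = "utf-8") -> str:
    """Try the given encoding first, then retry explicitly with latin-1."""
    try:
        return value.text(encoding)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this retry cannot fail.
        return value.text("latin-1")


arg = LazyOsString(b"\x80")
print(repr(text_or_latin1(arg)))
```

Because the raw bytes are retained, a failed decode loses nothing: the caller can catch the error and try again with whatever encoding it prefers.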

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


