[Python-3000] Unicode and OS strings
James Y Knight foom at fuhm.net
Fri Sep 14 05:41:12 CEST 2007
On Sep 13, 2007, at 12:22 PM, Marcin 'Qrczak' Kowalczyk wrote:
What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)?
Here's a suggestion I made on the SBCL dev list a while back, in
response to the same issues. In the message quoted below I am
responding to myself; my first suggestion had been to keep all the
environmental gunk in byte-arrays rather than strings. That is still
a very nice and simple possibility.
My second inclination was to use a variant of utf8 which can handle
all bytestrings, instead of utf8 itself: utf-8b. This obviously works
best when the system encoding is actually utf8.
On Aug 2, 2007, at 4:55 PM, James Y Knight wrote:
Yeah -- it's pretty clear the environment isn't actually in the default encoding. It's just binary junk which often but not always contains some text encoded in some arbitrary superset of ASCII. Just like command line arguments (and filenames on linux).
The hard part is that users expect command line arguments, filenames, and environment values to be strings (because they normally do contain text-like things), when strictly they cannot be, because there is no reliable encoding. A good alternative to this is for SBCL to use the utf-8b encoding to decode unix environment gunk (filenames, env vars, command line args), which is probably in utf8 but might not be.
utf-8b has the nice property that any arbitrary bytestring can be decoded into unicode and then round-tripped back to the same bytes. Valid utf8 sequences turn into the same unicode characters as with the utf8 codec. Invalid utf8 sequences turn into invalid (unpaired) surrogate sequences in the unicode string. Thus, SBCL can return strings and never throw an error. If you actually want the random binary, you can losslessly convert the unicode string back to binary. Win win.
Some references:
Original mail: http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
Blog entry: http://bsittler.livejournal.com/10381.html
Python implementation: http://hyperreal.org/~est/libutf8b/
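To make the round-trip property concrete, here is a minimal sketch in Python. It does not use the libutf8b implementation linked above; it assumes a later Python 3 release, where the standard "surrogateescape" error handler maps each undecodable byte to a lone surrogate, which is the same idea. The helper names decode_os_bytes and encode_os_string are hypothetical, chosen only for this illustration.

```python
def decode_os_bytes(raw: bytes) -> str:
    """Decode bytes that are probably utf8 but might not be.

    Valid utf8 decodes as usual. Each undecodable byte 0xNN becomes the
    lone surrogate U+DCNN instead of raising an error (this relies on
    the stdlib "surrogateescape" handler, an assumption of this sketch).
    """
    return raw.decode("utf-8", errors="surrogateescape")


def encode_os_string(text: str) -> bytes:
    """Convert the string back to the exact original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# A made-up "filename": valid utf8 for "café-", then two junk bytes.
raw = b"caf\xc3\xa9-\xff\xfe"
text = decode_os_bytes(raw)

# The valid part decodes exactly as the plain utf8 codec would.
assert text[:5] == "caf\u00e9-"
# The junk bytes became unpaired surrogates, one per byte.
assert text[5:] == "\udcff\udcfe"
# The original bytes are recovered losslessly.
assert encode_os_string(text) == raw
```

Note that such a string still cannot be written out with a strict utf8 encoder, so code that wants the raw bytes has to encode with the same handler it decoded with.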
James