[Python-Dev] transform() and untransform() methods, and the codec registry

Alexander Belopolsky alexander.belopolsky at gmail.com
Tue Dec 7 05:46:54 CET 2010


On Sun, Dec 5, 2010 at 5:25 PM, Victor Stinner <victor.stinner at haypocalc.com> wrote:

> On Saturday 04 December 2010 09:31:04 you wrote:
>> Alexander Belopolsky writes:
>>> In fact, once the language moratorium is over, I will argue that str.encode() and bytes.decode() should deprecate the encoding argument and just do UTF-8 encoding/decoding. Hopefully by that time most people will forget that other encodings exist. (I can dream, right?)
>>
>> It's just a dream. There's a pile of archival material, often on R/O media, out there that won't be transcoded any more quickly than the inscriptions on Tutankhamun's tomb.
>
> Not only that, many libraries expect bytes arguments encoded in a specific encoding (e.g. the locale encoding). Said differently, only a few libraries written in C accept wchar_t* strings.

My proposal has nothing to do with the C API. It only concerns the Python API of the builtin str type.

> The Linux kernel (or many, or all, UNIX/BSD kernels) only manipulates byte strings. The libc only accepts wide characters for a few operations. I don't know how to open a file with a unicode path with the Linux libc: you have to encode it...

Yes, but hopefully the encoding used by the filesystem will be UTF-8. For Python users, however, encoding details will hopefully be hidden by the open() call. Yes, I am aware of the many problems with divining the filesystem encoding, but instructing application developers to supply their own fsencoding in open(filepath.encode(fsencoding)) calls is not very helpful.
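To illustrate how open() can hide the filesystem encoding, here is a minimal sketch. It passes a str path straight to open() and calls os.fsencode() only to show the bytes the OS would receive. The helper name and file name are made up for the example, and it assumes the filesystem encoding can represent the name (as UTF-8 can).

```python
import os
import tempfile

def write_with_str_path(directory, name, text):
    """Create a file from a str path; open() encodes the path itself."""
    path = os.path.join(directory, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name with non-ASCII characters.
    path = write_with_str_path(tmp, "код.txt", "hello\n")
    # os.fsencode() shows the byte string handed to the OS; the
    # application never had to choose an encoding for the path.
    raw_path = os.fsencode(path)
    round_trip_ok = os.fsdecode(raw_path) == path
    listed = os.listdir(tmp)
```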

> Alexander: you should first patch all UNIX/BSD kernels to use unicode everywhere, then patch all libc implementations, and then all libraries (written in C). After that, you can have a break.

As Martin explained later in this thread with respect to the transform() method, removing the codec argument from the str.encode() method does not imply removing the codecs themselves. If I need a method to encode strings to, say, the koi8_r encoding, I can easily access it directly:

>>> from encodings import koi8_r
>>> to_koi8_r = koi8_r.Codec().encode
>>> to_koi8_r('код')
(b'\xcb\xcf\xc4', 3)

More likely, however, I will only need en/decoding to read/write legacy files, and rather than encoding the strings explicitly before writing them to a file, I will just open that file with the correct encoding.
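A minimal sketch of that approach, assuming a legacy file in koi8_r. The helper names and the file name are invented for the example; only the encoding argument of open() is doing the work.

```python
import os
import tempfile

def write_legacy(path, text, encoding="koi8_r"):
    # The file object encodes on write; no explicit str.encode() call.
    with open(path, "w", encoding=encoding) as f:
        f.write(text)

def read_legacy(path, encoding="koi8_r"):
    # The file object decodes on read and hands back str.
    with open(path, "r", encoding=encoding) as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")  # hypothetical file name
    write_legacy(path, "код")
    # Read the raw bytes back to see what actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()
    text = read_legacy(path)
```

The program only ever handles str; the bytes on disk are the same ones the explicit codec call produces.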

Having all encodings accessible in a str method only promotes a programming style where bytes objects can contain differently encoded strings in different parts of the program. Instead, well-written programs should decode bytes on input, do all processing with the str type, and encode on output. When strings need to be passed to char* C APIs, they should be encoded in UTF-8. Many C APIs originally designed for ASCII actually produce meaningful results when given UTF-8 bytes. (Supporting such usage was one of the design goals of UTF-8.)
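As a rough sketch of both points: process() below decodes at the input boundary, works on str, and encodes to UTF-8 at the output boundary. The second helper checks the property that makes UTF-8 friendly to byte-oriented ASCII code, namely that bytes below 0x80 never occur inside a multi-byte sequence. The function names and sample strings are assumptions made up for illustration.

```python
def process(raw, src_encoding):
    """Decode on input, work on str, encode to UTF-8 on output."""
    text = raw.decode(src_encoding)   # bytes -> str at the boundary
    text = text.upper()               # all processing happens on str
    return text.encode("utf-8")       # str -> bytes at the boundary

def ascii_bytes_only_in_ascii_chars(text):
    """True if no multi-byte UTF-8 sequence in text contains a byte < 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True

out = process(b"\xcb\xcf\xc4", "koi8_r")

# Splitting on an ASCII byte cannot cut a multi-byte character in half,
# so byte-level code that only looks for b"/" still yields valid names.
parts = "каталог/файл.txt".encode("utf-8").split(b"/")
names = [p.decode("utf-8") for p in parts]
```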


