[Python-Dev] Divorcing str and unicode (no more implicit conversions).

"Martin v. Löwis" martin at v.loewis.de
Tue Oct 25 00:59:27 CEST 2005


Antoine Pitrou wrote:

There are many design alternatives:

Wouldn't it be simpler to use:
- one-byte representation if every character <= 0xFF
- two-byte representation if every character <= 0xFFFF
- four-byte representation otherwise

As I said: there are many alternatives. This one has the disadvantage of requiring a copy every time you pass the string to a Win32 function (which expects UTF-16).

Whether or not this is a significant disadvantage, I don't know.
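To make the trade-off concrete, here is a minimal Python sketch of the selection rule quoted above, next to the copy a Win32 call would need. The function names storage_width and to_win32_buffer are hypothetical and chosen only for illustration; this models the idea at the Python level and is not how any CPython version necessarily stores strings.

```python
def storage_width(text: str) -> int:
    """Bytes per character under the quoted scheme: 1, 2, or 4."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def to_win32_buffer(text: str) -> bytes:
    """Build a fresh NUL-terminated UTF-16 (little-endian) buffer.

    This is the copy in question: the result is a new bytes object,
    independent of however the string is stored internally.
    """
    return text.encode("utf-16-le") + b"\x00\x00"


for sample in ("plain ascii", "L\u00f6wis", "\u20ac100", "\U0001d11e"):
    print(storage_width(sample), len(to_win32_buffer(sample)))
```

In this model a pure Latin-1 string is stored at one byte per character but has to be expanded to two bytes per character before the call, which is the extra copy described above.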

In any case, a multi-representation implementation has the disadvantage of making the C API more difficult to use, in particular for writing codecs. On encoding, it is difficult to fetch the individual characters which you need for the lookup table; on decoding, it is difficult to know in advance which representation to use (unless you know there is an upper bound on the decoded character ordinals).
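The decoding problem can be sketched in Python as follows. This is only a model under stated assumptions: a list of code points stands in for the C buffer, and decode_with_widening, width_for, and TABLE are hypothetical names. The table is Latin-1 with byte 0x80 remapped to the euro sign, loosely in the style of cp1252.

```python
def width_for(code_point: int) -> int:
    """Smallest of the three representations that can hold code_point."""
    if code_point <= 0xFF:
        return 1
    if code_point <= 0xFFFF:
        return 2
    return 4


def decode_with_widening(data: bytes, table: dict) -> tuple:
    """Decode via a lookup table, starting with the narrowest representation.

    Returns (text, final_width, bytes_recopied). bytes_recopied counts the
    bytes of already-decoded output that had to be converted each time a
    wider representation turned out to be necessary.
    """
    width = 1
    chars = []  # stands in for the output buffer
    bytes_recopied = 0
    for byte in data:
        code_point = table[byte]
        needed = width_for(code_point)
        if needed > width:
            # Everything decoded so far must be rewritten in the wider form.
            bytes_recopied += len(chars) * width
            width = needed
        chars.append(code_point)
    return "".join(map(chr, chars)), width, bytes_recopied


# Hypothetical charmap: identity for Latin-1, except 0x80 -> EURO SIGN.
TABLE = {b: b for b in range(256)}
TABLE[0x80] = 0x20AC

print(decode_with_widening(b"price: \x80", TABLE))

# If an upper bound is known in advance, the width can be fixed up front.
print(width_for(max(TABLE.values())))
```

Here the seven characters already decoded at one byte each have to be converted once the euro sign appears. Knowing the upper bound of the table beforehand avoids that, at the cost of using the wider form even for input that never needs it.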

Regards,
Martin


