[Python-Dev] Re: [I18n-sig] Unicode strings: an alternative

Guido van Rossum guido@python.org
Wed, 03 May 2000 17:22:59 -0400


Today I had a relatively simple idea that unites wide strings and narrow strings in a way that is more backward compatible at the C level. It's quite possible this has already been considered and rejected for reasons that are not yet obvious to me, but I'll give it a shot anyway.

The main concept is not to provide a new string type but to extend the existing string object like so:

- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the physical length of the data; if the string is wide, len(s) will return ob_size/2, and all other string operations will have to do similar things.
- there can possibly be an encoding attribute which may specify the encoding used, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need a narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO big advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows for __str__() & str() and friends to work with unicode strings without any change.
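To make the layout above concrete, here is a rough Python model of the idea. It is only a sketch: FlaggedString and make_wide are hypothetical names invented for illustration, not an existing or proposed API, and the little-endian two-byte storage and the Latin-1 reading of narrow data are assumptions the proposal does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlaggedString:
    """Hypothetical model of the proposed object (not a real CPython type)."""
    data: bytes                     # raw storage; len(data) plays the role of ob_size
    wide: bool = False              # the narrow/wide flag
    encoding: Optional[str] = None  # optional encoding attribute, if known

    def __len__(self) -> int:
        # Logical length: physical length halved for wide strings.
        return len(self.data) // 2 if self.wide else len(self.data)

    def char_at(self, i: int) -> str:
        # Every string operation needs the same narrow/wide switch.
        if self.wide:
            return self.data[2 * i:2 * i + 2].decode("utf-16-le")
        # Assumption: narrow bytes are read as Latin-1 code points.
        return chr(self.data[i])


def make_wide(text: str) -> FlaggedString:
    """Build a wide string, two bytes per character (assumed little-endian)."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("this sketch only handles characters that fit in two bytes")
    return FlaggedString(text.encode("utf-16-le"), wide=True)


narrow = FlaggedString(b"hello")
wide = make_wide("h\u00e9llo")
print(len(narrow.data), len(narrow))  # physical and logical length agree
print(len(wide.data), len(wide))      # physical length is twice the logical one
```

Code that only hands narrow.data or wide.data along never needs to look at the flag, which is the advantage claimed above; anything that computes a length or indexes a character has to branch on it.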

This seems to have some nice properties, but I think it would cause problems for existing C code that tries to interpret the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes).
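A small Python simulation of that failure mode, as a sketch only: c_string_view is a hypothetical stand-in for old C code that treats the buffer as a NUL-terminated char*, and the wide storage is assumed to be two bytes per character, little-endian.

```python
def c_string_view(buffer: bytes) -> bytes:
    """Hypothetical stand-in for legacy C code: read bytes up to the first NUL,
    the way a char* is consumed, without checking any "wide" flag."""
    return buffer.split(b"\x00", 1)[0]


narrow_name = b"data.txt"
# Assumed wide storage: two bytes per character, little-endian.
wide_name = "data.txt".encode("utf-16-le")

print(c_string_view(narrow_name))  # the full name, as the old code expects
print(c_string_view(wide_name))    # stops at the zero high byte of the first character
```

Under these assumptions the flag-unaware code sees only b"d" for the wide file name, so something like open() would silently act on the wrong name. With big-endian storage the first byte would already be zero and the view would be empty.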

--Guido van Rossum (home page: http://www.python.org/~guido/)