[Python-Dev] String encoding

Fred L. Drake fdrake@acm.org
Tue, 23 May 2000 08:13:59 -0700 (PDT)


On Tue, 23 May 2000, M.-A. Lemburg wrote:

The problem is that "s" and "t" return C pointers to some internal data structure of the object. That data must be guaranteed to remain intact at least as long as the object itself exists.

AFAIK, this cannot be fixed without creating a memory leak.

The "es" parser marker uses a different strategy, BTW: the data is copied into a buffer, thus detaching the object from the data.

C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.
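As a rough illustration of the copy-based strategy described for "es", here is a Python-level analogy rather than the C API itself: the text is encoded with an explicitly named encoding and the result is copied into a separately allocated, NUL-terminated buffer. The helper name encode_to_buffer and the default encoding are assumptions made for this sketch; ctypes.create_string_buffer is the standard-library call that makes the copy.

```python
import ctypes


def encode_to_buffer(text, encoding="utf-8"):
    """Hypothetical helper: return a detached, NUL-terminated copy of text.

    The caller names the encoding explicitly, and the returned buffer owns
    its own memory instead of pointing into the original string object.
    """
    encoded = text.encode(encoding)
    # create_string_buffer copies the bytes and appends a trailing NUL.
    return ctypes.create_string_buffer(encoded)


buf = encode_to_buffer("h\u00e9llo", "latin-1")
print(buf.value)  # the encoded bytes, without the trailing NUL
print(len(buf))   # number of encoded bytes plus one for the NUL
```

Because the buffer is a copy, its lifetime is independent of the original object, which is the property attributed to "es" above; the cost is the extra allocation and copy.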

For convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose.

Or can we assume that all wide system APIs use Unicode all the time?

At least in all references I've seen (e.g. ODBC, wchar_t implementations, etc.) "wide" refers to Unicode.

On Linux, wchar_t is 4 bytes; that's not just Unicode. Doesn't ISO 10646 require a 32-bit space? I recall a fair bit of discussion about wchar_t when it was introduced to ANSI C, and the character set and encoding were specifically not made part of the specification. Making a requirement that wchar_t be Unicode doesn't make a lot of sense, and opens up potential portability issues.
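A quick way to see the size difference on a given machine is to ask ctypes, whose c_wchar type maps to the platform's C wchar_t. This is only a sketch: the helper fits_in_wchar is a hypothetical name, and it treats wchar_t as an unsigned integer of the reported size, which is an assumption (signedness varies by platform). It says nothing about which character set a platform stores in wchar_t, which is exactly the part the C standard leaves open.

```python
import ctypes
import sys

# ctypes.c_wchar corresponds to the platform's C wchar_t.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print(f"{sys.platform}: sizeof(wchar_t) = {wchar_size} bytes")


def fits_in_wchar(code_point, size=wchar_size):
    """Hypothetical check: does code_point fit in one wchar_t of `size` bytes?

    Assumes wchar_t is treated as an unsigned integer of that width.
    """
    return 0 <= code_point < (1 << (8 * size))


# A code point outside the 16-bit range fits in a 4-byte wchar_t
# but not in a 2-byte one.
print(fits_in_wchar(0x1F600, size=4))
print(fits_in_wchar(0x1F600, size=2))
```

The reported size typically differs between platforms (for example 4 bytes on Linux and 2 bytes on Windows), so code that assumes one width, or one encoding, for wchar_t is not portable.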

-1 on any assumption that wchar_t is usefully portable.

-Fred

-- Fred L. Drake, Jr.