[Python-3000] Unicode and OS strings

Guido van Rossum guido at python.org
Wed Sep 19 00:29:24 CEST 2007


On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:

> On 9/18/07, Guido van Rossum <guido at python.org> wrote:
> > On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > > On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>
> > > > There's no UTF-8 in Python's internal string encoding.
>
> > > (At least as of a few days ago)
>
> > > In Python 3 there is; strings are unicode. A PyUnicodeObject object
> > > has two encodings that you can grab from a pointer (which means
> > > they have to be there; you don't have time to generate them like
> > > you would with a function pointer).
>
> > Incorrect. The pointer can be NULL.
>
> I had missed that comment, but I do see it now; thank you.
>
> > The API for getting the UTF-8 encoding is a function
>
> Thank you.
>
> But given that defenc is now always UTF-8, won't exposing it in the
> public typedef then just be an attractive nuisance?

ALL fields of the struct def are strictly internal.

> > (moreover a function whose name starts with _Py).
>
> That I still don't see.

I am talking about _PyUnicode_AsDefaultEncoding(). (Which you shouldn't be calling. :-)
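The point above (the cached encoding pointer can be NULL, so callers should go through a function rather than read the field) can be sketched as a standalone C example. This is only an illustration under stated assumptions: demo_str, demo_str_encoded and demo_usage are hypothetical names, the layout is not CPython's, and the "encoding" step is a plain byte copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object; NOT CPython's real layout. */
typedef struct {
    const char *text;  /* raw data (plain ASCII in this sketch) */
    char *cached;      /* encoded copy, or NULL until first requested */
} demo_str;

/* Accessor: builds the cached copy on first use, then reuses it.
 * Returns NULL only if allocation fails. */
const char *demo_str_encoded(demo_str *s)
{
    if (s->cached == NULL) {
        size_t n = strlen(s->text) + 1;
        s->cached = malloc(n);
        if (s->cached == NULL)
            return NULL;
        memcpy(s->cached, s->text, n);  /* "encoding" is a plain copy here */
    }
    return s->cached;
}

/* Usage: the field starts out NULL, so only the accessor is safe to rely on.
 * Returns 1 if the accessor behaved as expected, 0 otherwise. */
int demo_usage(void)
{
    demo_str s = { "abc", NULL };
    assert(s.cached == NULL);  /* nothing cached yet */

    const char *first = demo_str_encoded(&s);
    const char *second = demo_str_encoded(&s);  /* served from the cache */
    int ok = first != NULL && first == second && strcmp(first, "abc") == 0;

    free(s.cached);
    return ok;
}
```

Code that read s.cached directly before the first accessor call would see NULL, which is the hazard of treating an internal field as if it were API.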

> http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup
>
> PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
>     PyObject *unicode           /* Unicode object */
>     );
>
> PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8(
>     const Py_UNICODE *data,     /* Unicode char buffer */
>     Py_ssize_t length,          /* number of Py_UNICODE chars to encode */
>     const char *errors          /* error handling */
>     );
>
> Later, the same file shows me:
>
> /* --- Unicode Type ------------------------------------------------------- */
>
> typedef struct {
>     PyObject_HEAD
>     Py_ssize_t length;          /* Length of raw Unicode data in buffer */
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are not counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } PyUnicodeObject;
>
> I would be happier with:
>
> typedef struct {
>     PyObject_VAR_HEAD           /* Length in code points, not chars */
> } PyUnicodeObject;
>
> And, in unicodeobject.c (not in a public header)
>
> typedef struct {
>     PyUnicodeObject ob_unicode_head;
>     Py_UNICODE *str;            /* Raw Unicode buffer */
>     long hash;                  /* Hash value; -1 if not set */
>     int state;                  /* != 0 if interned. In this case the two
>                                  * references from the dictionary to this object
>                                  * are not counted in ob_refcnt. */
>     PyObject *defenc;           /* (Default) Encoded version as Python
>                                    string, or NULL; this is used for
>                                    implementing the buffer protocol */
> } PyDefaultUnicodeObject;
>
> As this would allow 3rd parties to create implementations specialized
> for (and saving space on) smaller alphabets, without breaking C
> extensions that stick to the public header files.
>
> (Moving hash or even state to the public header might be OK too, but
> they seemed to get ignored for subclasses anyhow.)

That is not a supported use case.
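For readers unfamiliar with the layout proposed in the quoted text, it relies on the ordinary C idiom of embedding a small common header as the first member of a larger private struct. The sketch below shows only that C mechanism; demo_head, demo_default, demo_length and demo_layout_check are hypothetical names, and nothing here implies that CPython supports alternative Unicode implementations (per the reply above, it does not).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "public" part: only the length is visible to callers. */
typedef struct {
    size_t length;
} demo_head;

/* Hypothetical private implementation: the header must be the first member
 * so that a pointer to the whole struct can be viewed as a demo_head. */
typedef struct {
    demo_head head;
    const char *buf;  /* storage details hidden from header-only callers */
} demo_default;

/* Code written against the public part sees only demo_head. */
size_t demo_length(const demo_head *h)
{
    return h->length;
}

/* Returns 1 if the first-member view works as described, 0 otherwise. */
int demo_layout_check(void)
{
    demo_default d = { { 3 }, "abc" };

    /* C guarantees a struct pointer converts to a pointer to its first member. */
    const demo_head *h = (const demo_head *)&d;

    assert(offsetof(demo_default, head) == 0);
    return demo_length(h) == 3;
}
```

The idiom only holds as long as every caller stays behind functions like demo_length; any code that reaches past the header into fields such as buf depends on one specific layout.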

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
