[Python-3000] Unicode and OS strings

Jim Jewett jimjjewett at gmail.com
Wed Sep 19 00:23:18 CEST 2007


On 9/18/07, Guido van Rossum <guido at python.org> wrote:

> On 9/18/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> > On 9/18/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

> > > There's no UTF-8 in Python's internal string encoding.

> > (At least as of a few days ago)

> > In Python 3 there is; strings are unicode.  A PyUnicodeObject object
> > has two encodings that you can grab from a pointer (which means
> > they have to be there; you don't have time to generate them like
> > you would with a function pointer).

> Incorrect. The pointer can be NULL.

I had missed that comment, but I do see it now; thank you.

> The API for getting the UTF-8 encoding is a function

Thank you. But given that defenc is now always UTF-8, won't exposing it in the public typedef then just be an attractive nuisance?
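To make the worry concrete, here is a minimal standalone sketch, not CPython code. ToyUnicode, toy_encode_cp and toy_as_utf8 are hypothetical names, and the sketch assumes the input holds only valid code points. It shows a cache pointer that stays NULL until an accessor function fills it, which is why reading the field straight out of the struct is a trap:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a unicode object with a lazily filled cache. */
typedef struct {
    size_t length;        /* number of code points in str */
    const uint32_t *str;  /* raw code points (borrowed, not owned) */
    char *defenc;         /* cached UTF-8 bytes, or NULL until first requested */
} ToyUnicode;

/* Encode one code point as UTF-8; returns the number of bytes written (1-4).
 * Assumes cp is a valid code point; no surrogate or range checks. */
static size_t toy_encode_cp(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Accessor: fills the cache on first use and returns it afterwards.
 * Returns NULL only if the allocation fails. */
const char *toy_as_utf8(ToyUnicode *u)
{
    if (u->defenc == NULL) {
        /* Worst case is 4 bytes per code point, plus a terminating NUL. */
        char *buf = malloc(u->length * 4 + 1);
        size_t n = 0;
        size_t i;
        if (buf == NULL)
            return NULL;
        for (i = 0; i < u->length; i++)
            n += toy_encode_cp(u->str[i], buf + n);
        buf[n] = '\0';
        u->defenc = buf;
    }
    return u->defenc;
}
```

Code that reads u->defenc directly works only if something else happened to call the accessor first; code that calls toy_as_utf8() always works. That asymmetry is what I mean by an attractive nuisance.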

> (moreover a function whose name starts with Py).

That I still don't see.

http://svn.python.org/view/python/branches/py3k/Include/unicodeobject.h?rev=57656&view=markup

    PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
        PyObject *unicode           /* Unicode object */
        );

    PyAPI_FUNC(PyObject*) PyUnicode_EncodeUTF8(
        const Py_UNICODE *data,     /* Unicode char buffer */
        Py_ssize_t length,          /* number of Py_UNICODE chars to encode */
        const char *errors          /* error handling */
        );

Later, the same file shows me:

/* --- Unicode Type ------------------------------------------------------- */

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;      /* Length of raw Unicode data in buffer */
        Py_UNICODE *str;        /* Raw Unicode buffer */
        long hash;              /* Hash value; -1 if not set */
        int state;              /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are not counted in ob_refcnt. */
        PyObject *defenc;       /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
    } PyUnicodeObject;

I would be happier with:

    typedef struct {
        PyObject_VAR_HEAD       /* Length in code points, not chars */
    } PyUnicodeObject;

And, in unicodeobject.c (not in a public header):

    typedef struct {
        PyUnicodeObject ob_unicodehead;
        Py_UNICODE *str;        /* Raw Unicode buffer */
        long hash;              /* Hash value; -1 if not set */
        int state;              /* != 0 if interned. In this case the two
                                 * references from the dictionary to this object
                                 * are not counted in ob_refcnt. */
        PyObject *defenc;       /* (Default) Encoded version as Python
                                   string, or NULL; this is used for
                                   implementing the buffer protocol */
    } _PyDefaultUnicodeObject;

This would allow 3rd parties to create implementations specialized for (and saving space on) smaller alphabets, without breaking C extensions that stick to the public header files. (Moving hash or even state to the public header might be OK too, but they seemed to get ignored for subclasses anyhow.)
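A rough standalone sketch of the layout idea, again not CPython code: every name here (ToyStr, ToyWide, ToyLatin1, toy_get_cp, toy_count and friends) is hypothetical, and the get_cp function pointer is only an assumed stand-in for the dispatch that the type pointer in the real object head would provide. The public part carries just the length; each implementation embeds it as its first member and keeps its buffer private:

```c
#include <stddef.h>
#include <stdint.h>

/* "Public header": all that extension code is allowed to rely on. */
typedef struct ToyStr ToyStr;
typedef uint32_t (*toy_getcp_fn)(const ToyStr *s, size_t i);

struct ToyStr {
    size_t length;        /* length in code points, not storage units */
    toy_getcp_fn get_cp;  /* assumed stand-in for per-type dispatch */
};

/* Public API: never touches the storage directly. */
size_t toy_length(const ToyStr *s) { return s->length; }
uint32_t toy_get_cp(const ToyStr *s, size_t i) { return s->get_cp(s, i); }

/* "Private" default implementation: 4 bytes per code point. */
typedef struct {
    ToyStr head;          /* must come first so ToyWide* is usable as ToyStr* */
    const uint32_t *str;
} ToyWide;

static uint32_t wide_get(const ToyStr *s, size_t i)
{
    return ((const ToyWide *)s)->str[i];
}

void toy_wide_init(ToyWide *w, const uint32_t *data, size_t n)
{
    w->head.length = n;
    w->head.get_cp = wide_get;
    w->str = data;
}

/* Third-party compact implementation: 1 byte per code point (Latin-1 only). */
typedef struct {
    ToyStr head;          /* same public head, different private storage */
    const uint8_t *str;
} ToyLatin1;

static uint32_t latin1_get(const ToyStr *s, size_t i)
{
    return ((const ToyLatin1 *)s)->str[i];
}

void toy_latin1_init(ToyLatin1 *l, const uint8_t *data, size_t n)
{
    l->head.length = n;
    l->head.get_cp = latin1_get;
    l->str = data;
}

/* "Extension" code written only against the public header:
 * counts occurrences of cp, whatever the storage width is. */
size_t toy_count(const ToyStr *s, uint32_t cp)
{
    size_t hits = 0;
    size_t i;
    for (i = 0; i < toy_length(s); i++)
        if (toy_get_cp(s, i) == cp)
            hits++;
    return hits;
}
```

toy_count() works unchanged on both a ToyWide and a ToyLatin1, because it never looks past the public head. An extension that had reached into str directly would break as soon as it met the compact variant.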

-jJ


