[Python-Dev] PEP 393 review

Victor Stinner victor.stinner at haypocalc.com
Thu Aug 25 00:29:19 CEST 2011


> With this PEP, the unicode object overhead grows to 10 pointer-sized
> words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine.
> Does it have any adverse effects?

For pure ASCII, it might be possible to use a shorter struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        /* no more utf8_length, utf8, str */
        /* followed by ascii data */
    } _PyASCIIObject;

(-2 pointer -1 ssize_t: 56 bytes)

=> "a" is 58 bytes (with utf8 for free, without wchar_t)

For objects allocated with the new API, we can use a shorter struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        Py_ssize_t utf8_length;
        char *utf8;
        /* no more str pointer */
        /* followed by latin1/ucs2/ucs4 data */
    } _PyNewUnicodeObject;

(-1 pointer: 72 bytes)

=> "é" is 74 bytes (without utf8 / wchar_t)

For the legacy API:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        Py_ssize_t utf8_length;
        char *utf8;
        void *str;
    } _PyLegacyUnicodeObject;

(same size: 80 bytes)

=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)

The current struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_UNICODE *str;
        Py_hash_t hash;
        int state;
        PyObject *defenc;
    } PyUnicodeObject;

=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is wchar_t)
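
The sizes above can be checked with a small standalone program. This is only a sketch: to build without Python.h it replaces Py_ssize_t, Py_hash_t and PyObject_HEAD with stand-ins (sk_ssize_t, sk_hash_t, SK_OBJECT_HEAD), and every Sk* and sk_* name is hypothetical, not part of CPython. It assumes the object header is two pointer-sized words.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Stand-ins so the sketch builds without Python.h. These are assumptions
   for illustration, not the real CPython definitions. */
typedef ptrdiff_t sk_ssize_t;   /* plays the role of Py_ssize_t */
typedef sk_ssize_t sk_hash_t;   /* plays the role of Py_hash_t */
#define SK_OBJECT_HEAD sk_ssize_t ob_refcnt; void *ob_type;

/* Pure-ASCII variant: no utf8_length, utf8 or str fields. */
typedef struct {
    SK_OBJECT_HEAD
    sk_ssize_t length;
    sk_hash_t hash;
    int state;
    sk_ssize_t wstr_length;
    wchar_t *wstr;
} SkASCIIObject;

/* New-API variant: no str pointer, character data follows the struct. */
typedef struct {
    SK_OBJECT_HEAD
    sk_ssize_t length;
    sk_hash_t hash;
    int state;
    sk_ssize_t wstr_length;
    wchar_t *wstr;
    sk_ssize_t utf8_length;
    char *utf8;
} SkNewUnicodeObject;

/* Legacy-API variant: keeps the separately allocated str pointer. */
typedef struct {
    SK_OBJECT_HEAD
    sk_ssize_t length;
    sk_hash_t hash;
    int state;
    sk_ssize_t wstr_length;
    wchar_t *wstr;
    sk_ssize_t utf8_length;
    char *utf8;
    void *str;
} SkLegacyUnicodeObject;

/* Layout of the struct in use before the PEP. Only the pointer size of
   str matters here, so wchar_t * stands in for Py_UNICODE *. */
typedef struct {
    SK_OBJECT_HEAD
    sk_ssize_t length;
    wchar_t *str;
    sk_hash_t hash;
    int state;
    void *defenc;
} SkCurrentUnicodeObject;

/* Header plus nchars characters of char_width bytes and a terminating NUL. */
size_t sk_total_bytes(size_t header, size_t nchars, size_t char_width)
{
    return header + (nchars + 1) * char_width;
}

/* Checks the size differences quoted in the message; they are expected to
   hold on common 32-bit and 64-bit ABIs. */
void sk_check_deltas(void)
{
    assert(sizeof(SkLegacyUnicodeObject) - sizeof(SkNewUnicodeObject)
           == sizeof(void *));
    assert(sizeof(SkLegacyUnicodeObject) - sizeof(SkASCIIObject)
           == 2 * sizeof(void *) + sizeof(sk_ssize_t));
}
```

On a typical 64-bit build, sizeof should give 56, 72, 80 and 56 bytes for the four structs, and sk_total_bytes(sizeof(SkASCIIObject), 1, 1) should give the 58 bytes quoted for "a".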

... but the code (maybe only the macros?) and debugging will be more complex.

> Will the format codes returning a Py_UNICODE pointer with
> PyArg_ParseTuple be deprecated?

Because Python 2.x is still dominant and it's already hard enough to port C modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).

> Do you think the wstr representation could be removed in some future
> version of Python?

Conversion to wchar_t* is common, especially on Windows. But I don't know if we have to cache the result. Is it cached, by the way? Or is wstr only used when a string is created from Py_UNICODE?

Victor


