[Python-Dev] PEP 393 review
Victor Stinner victor.stinner at haypocalc.com
Thu Aug 25 00:29:19 CEST 2011
> With this PEP, the unicode object overhead grows to 10 pointer-sized words (including PyObject_HEAD), that's 80 bytes on a 64-bit machine. Does it have any adverse effects?
For pure ASCII, it might be possible to use a shorter struct:
    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        /* no more utf8_length, utf8, str */
        /* followed by ascii data */
    } _PyASCIIObject;

(-2 pointer -1 ssize_t: 56 bytes)
=> "a" is 58 bytes (with utf8 for free, without wchar_t)
For objects allocated with the new API, we can use a shorter struct:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        Py_ssize_t utf8_length;
        char *utf8;
        /* no more str pointer */
        /* followed by latin1/ucs2/ucs4 data */
    } _PyNewUnicodeObject;

(-1 pointer: 72 bytes)
=> "é" is 74 bytes (without utf8 / wchar_t)
For the legacy API:
    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_hash_t hash;
        int state;
        Py_ssize_t wstr_length;
        wchar_t *wstr;
        Py_ssize_t utf8_length;
        char *utf8;
        void *str;
    } _PyLegacyUnicodeObject;

(same size: 80 bytes)
=> "a" is 80+2 (2 malloc) bytes (without utf8 / wchar_t)
The current struct:
    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        Py_UNICODE *str;
        Py_hash_t hash;
        int state;
        PyObject *defenc;
    } PyUnicodeObject;
=> "a" is 56+2 (2 malloc) bytes (without utf8, with wchar_t if Py_UNICODE is wchar_t)
... but the code (maybe only the macros?) and debugging will be more complex.
> Will the format codes returning a Py_UNICODE pointer with PyArg_ParseTuple be deprecated?
Because Python 2.x is still dominant and it's already hard enough to port C modules, it's not the best moment to deprecate the legacy API (Py_UNICODE*).
> Do you think the wstr representation could be removed in some future version of Python?
Conversion to wchar_t* is common, especially on Windows. But I don't know if we have to cache the result. Is it cached, by the way? Or is wstr only used when a string is created from Py_UNICODE?
Victor