[Python-Dev] The future of the wchar_t cache

Serhiy Storchaka storchaka at gmail.com
Sat Oct 20 07:06:49 EDT 2018


Currently the PyUnicode object contains two caches: one for the UTF-8 representation and one for the wchar_t representation. They are needed not for optimization but for supporting C API functions which return borrowed references to these representations.

The UTF-8 cache has always been present in unicode objects (though in Python 2 it was not a UTF-8 cache but an 8-bit representation cache). Initially it was needed for compatibility with the 8-bit str, for implementing the "s" and "z" format units in PyArg_Parse(). Now it is also used for PyUnicode_AsUTF8() and PyUnicode_AsUTF8AndSize().
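For illustration, a minimal extension function relying on the UTF-8 cache could look like the sketch below (the function names are only examples, not existing code). Both the "s" format unit and PyUnicode_AsUTF8AndSize() hand out a borrowed pointer into the cache, so the caller never frees it and the cached copy lives as long as the str object does:

    #include <Python.h>

    /* Illustrative example: "s" fills the UTF-8 cache of the argument and
       returns a borrowed pointer into it; the pointer stays valid as long
       as the str object is alive. */
    static PyObject *
    print_utf8(PyObject *self, PyObject *args)
    {
        const char *s;
        if (!PyArg_ParseTuple(args, "s", &s)) {
            return NULL;
        }
        printf("%s\n", s);
        Py_RETURN_NONE;
    }

    /* PyUnicode_AsUTF8AndSize() also returns a borrowed pointer into the
       cached UTF-8 representation; no free() is needed (or allowed). */
    static PyObject *
    utf8_size(PyObject *self, PyObject *arg)
    {
        Py_ssize_t size;
        const char *s = PyUnicode_AsUTF8AndSize(arg, &size);
        if (s == NULL) {
            return NULL;
        }
        return PyLong_FromSsize_t(size);
    }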

The wchar_t cache was added with PEP 393 in 3.3 as a replacement for the former Py_UNICODE representation. Now Py_UNICODE is defined as an alias of wchar_t, and the C API which returned a pointer to the Py_UNICODE content now returns a pointer to the cached wchar_t representation. This covers the "u" and "Z" format units in PyArg_Parse(), PyUnicode_AsUnicode(), PyUnicode_AsUnicodeAndSize(), PyUnicode_GET_SIZE(), PyUnicode_GET_DATA_SIZE(), PyUnicode_AS_UNICODE(), and PyUnicode_AS_DATA().
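As a rough sketch (the helper name is made up), this is the kind of code that causes the wchar_t cache to come into existence:

    #include <Python.h>

    /* Illustrative helper: the first call to PyUnicode_AsUnicodeAndSize()
       (deprecated) converts the canonical PEP 393 representation into a
       wchar_t copy, stores it in the object, and returns a borrowed
       pointer to it; the copy stays attached to the str object until the
       object is destroyed. */
    static void
    dump_code_units(PyObject *str)
    {
        Py_ssize_t size;
        Py_UNICODE *u = PyUnicode_AsUnicodeAndSize(str, &size);
        if (u == NULL) {
            return;
        }
        for (Py_ssize_t i = 0; i < size; i++) {
            printf("U+%04X\n", (unsigned int)u[i]);
        }
    }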

All this increases the size of the unicode object. It includes the constant overhead of additional pointer and size fields, and the overhead of the cached representation, which is proportional to the string length. The following table shows the number of bytes per character for different kinds, with and without filling the specified caches.

         raw  +utf8     +wchar_t       +utf8+wchar_t
                      Windows  Linux   Windows  Linux

ASCII     1     1        3       5        3       5
UCS1      1    2-3       3       5       4-5     6-7
UCS2      2    3-5       2       6       3-5     7-9
UCS4      4    5-8      6-8      4       7-12    5-8

There is also a new C API, added in 3.3, for getting the wchar_t representation without using the cache: PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). Currently they use the cache, which has both benefits and disadvantages.
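For comparison, a sketch of the copying API (the function name and buffer size are only illustrative): both calls below produce memory owned by the caller, independent of any cache kept on the str object, so the object itself does not have to grow:

    #include <Python.h>

    /* Illustrative example of the 3.3+ API that returns a caller-owned copy. */
    static int
    call_wide_api(PyObject *path)
    {
        /* Variant 1: CPython allocates the buffer; free it with PyMem_Free(). */
        wchar_t *wpath = PyUnicode_AsWideCharString(path, NULL);
        if (wpath == NULL) {
            return -1;
        }
        /* ... pass wpath to a wchar_t based OS function here ... */
        PyMem_Free(wpath);

        /* Variant 2: copy into a fixed caller-provided buffer. */
        wchar_t buf[256];
        Py_ssize_t n = PyUnicode_AsWideChar(path, buf, 255);
        if (n < 0) {
            return -1;
        }
        buf[n] = L'\0';  /* null termination is the caller's responsibility */
        return 0;
    }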

The old Py_UNICODE based API is deprecated and will be removed eventually. I want to ask about the future of the wchar_t cache. Is the benefit of caching the wchar_t representation larger than the disadvantage of spending more memory? The wchar_t representation is as natural for the Windows API as the UTF-8 representation is for the POSIX API. But in all other cases it is just a waste of memory. Are there reasons to keep the wchar_t cache after removing the deprecated API?

I have rewritten PyUnicode_AsWideChar() and PyUnicode_AsWideCharString(). They were implemented via the old Py_UNICODE based API, and now they don't use deprecated functions. They still use the wchar_t cache if it was created by previous use of the deprecated API, but they don't create it themselves. Is this the correct decision?

https://bugs.python.org/issue30863


