This issue is a branch from . Below is a summary of the discussion:

Antoine Pitrou wrote:
> It seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
> up being more than 4 if the native int type is itself larger than 32
> bits; although the latter is probably quite rare (64-bit platforms are
> usually either LP64 or LLP64).

Marc-Andre Lemburg wrote:
> AFAIK, only Crays have this problem, but apart from that: I'd consider
> it a bug if sizeof(Py_UCS4) != 4.

Antoine Pitrou wrote:
> Perhaps a #error can be added to that effect?
> Something like (untested):
>
> #if SIZEOF_INT == 4
> typedef unsigned int Py_UCS4;
> #elif SIZEOF_LONG == 4
> typedef unsigned long Py_UCS4;
> #else
> #error Could not find a 4-byte integer type for Py_UCS4, aborting
> #endif

Marc-Andre Lemburg wrote:
> Sounds good !
>
> Python should really try to use uint32_t as fallback solution for
> UCS4 where available (and uint16_t for UCS2).
>
> We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to
> configure:
>
> http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types
>
> and could then use
>
> typedef uint32_t Py_UCS4
>
> and
>
> typedef uint16_t Py_UCS2
>
> Note that the code for supporting UCS2/UCS4 is not really all that
> clean. It was a quick sprint between Martin and Fredrik and appears
> to be only half-done... e.g. there currently is no Py_UCS2.
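For illustration only, here is a self-contained variant of the #error idea quoted above. It is a sketch, not the patch under discussion: it selects the type from the <limits.h> value ranges instead of configure's SIZEOF_INT / SIZEOF_LONG, and the names my_ucs4, my_ucs4_is_32_bits and my_ucs4_self_check are hypothetical, chosen so they cannot clash with Python's headers.

```c
#include <assert.h>
#include <limits.h>

/* Pick an unsigned type whose maximum is exactly 2**32 - 1.
   my_ucs4 is a hypothetical stand-in for Py_UCS4. */
#if UINT_MAX == 0xFFFFFFFF
typedef unsigned int my_ucs4;
#elif ULONG_MAX == 0xFFFFFFFF
typedef unsigned long my_ucs4;
#else
#error Could not find a 32-bit unsigned integer type for my_ucs4, aborting
#endif

/* Pre-C11 compile-time check: the array size becomes -1 (a compile
   error) if the selected type does not occupy exactly 32 bits. */
typedef char my_ucs4_is_32_bits[(sizeof(my_ucs4) * CHAR_BIT == 32) ? 1 : -1];

/* Runtime counterpart: unsigned wrap-around must land on 0xFFFFFFFF. */
void my_ucs4_self_check(void)
{
    assert((my_ucs4)(-1) == 0xFFFFFFFFu);
}
```

Checking the value range rather than sizeof alone also rejects a type that has 4 bytes but padding bits, at the cost of being stricter than the original proposal.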
I like the idea of using uint16_t and uint32_t. Unicode 5.1 (released in April 2008) defines about 1.1 million code points, of which roughly 100,000 are assigned characters, so 21 bits are already enough to cover the full standard. Using more than 32 bits for a Unicode character wastes memory.
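The 21-bit figure can be checked with a few lines of C. This is only a sketch: bits_needed, MAX_CODE_POINT and check_code_space are hypothetical helper names, and 0x10FFFF is the highest Unicode code point.

```c
#include <assert.h>

/* Highest code point in the Unicode code space. */
#define MAX_CODE_POINT 0x10FFFFUL

/* Number of bits needed to represent v (0 for v == 0). */
unsigned bits_needed(unsigned long v)
{
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* The code space holds 0x10FFFF + 1 = 1,114,112 code points,
   and its highest value fits in 21 bits. */
void check_code_space(void)
{
    assert(MAX_CODE_POINT + 1 == 1114112UL);
    assert(bits_needed(MAX_CODE_POINT) == 21);
    assert(bits_needed(0xFFFFUL) == 16);
}
```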
> We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to
> configure:

AC_TYPE_INT32_T should already be there. See also the code in pyport.h that #defines HAVE_INT32_T and PY_INT32_T, and the corresponding bits of PC/pyconfig.h.

It was recently pointed out that there are some issues with these definitions when using a C++ compiler instead of a C compiler, since then INT32_MAX is undefined. (See the footnote to 7.18.2, para. 1 of C99.)
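To make the C++ problem concrete: the C99 footnote says that C++ implementations should define the limit macros only when __STDC_LIMIT_MACROS is defined before <stdint.h> is included. The sketch below shows one way a header could guard against that. It is not the pyport.h code; MY_HAVE_INT32_T, my_int32 and my_have_int32 are hypothetical names.

```c
/* Must come before the first inclusion of <stdint.h>; harmless in C,
   needed by older C++ compilers that follow the C99 footnote. */
#ifndef __STDC_LIMIT_MACROS
#define __STDC_LIMIT_MACROS 1
#endif

#include <assert.h>
#include <stdint.h>

/* Only claim int32_t support when the limit macro is visible. */
#ifdef INT32_MAX
#define MY_HAVE_INT32_T 1
typedef int32_t my_int32;
#else
#define MY_HAVE_INT32_T 0
#endif

/* Report whether the exact-width type was detected. */
int my_have_int32(void)
{
#if MY_HAVE_INT32_T
    assert(INT32_MAX == 2147483647);
#endif
    return MY_HAVE_INT32_T;
}
```

If another header has already pulled in <stdint.h> without the macro, defining it afterwards has no effect, which is why detection based on INT32_MAX alone is fragile under C++.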
PEP 393 has been accepted: strings are now stored as Py_UCS1*, Py_UCS2* or Py_UCS4*. The Py_UNICODE type still exists but is deprecated, and is only used by the legacy API. Py_UNICODE is now always the wchar_t type; it cannot be unsigned int anymore. I hope that no platform chose a wchar_t larger than 32 bits. Let's close this issue.
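If someone wants to turn that hope about wchar_t into a build-time guarantee, something along these lines would do. It is a sketch only and not part of CPython; my_wchar_fits_32_bits, my_wchar_bits and my_wchar_self_check are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>   /* wchar_t */

/* Compile-time check: the array size becomes -1 (a compile error)
   on a platform whose wchar_t is wider than 32 bits. */
typedef char my_wchar_fits_32_bits[(sizeof(wchar_t) * CHAR_BIT <= 32) ? 1 : -1];

/* Width of wchar_t in bits on the current platform. */
unsigned my_wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Runtime counterpart of the compile-time check above. */
void my_wchar_self_check(void)
{
    assert(my_wchar_bits() <= 32);
}
```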