[Python-Dev] PEP 393: Special-casing ASCII-only strings

Guido van Rossum guido at python.org
Thu Sep 15 21:48:01 CEST 2011


On Thu, Sep 15, 2011 at 8:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

In reviewing memory usage, I found potential for saving more memory for ASCII-only strings. Both Victor and Guido commented that something like this should be done; Antoine had asked whether there was anything that could be done. Here is the idea:

In an ASCII-only string, the UTF-8 representation is shared with the canonical one-byte representation. This would allow dropping the UTF-8 pointer and the UTF-8 length field; instead, a flag in the state would indicate that these fields are not there.

Likewise, the wchar_t/Py_UNICODE length can be shared (even though the data cannot), since an ASCII-only string won't contain any surrogate pairs.

To comply with the C aliasing rules, the structures would look like this:

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;
        union {
            void *any;
            Py_UCS1 *latin1;
            Py_UCS2 *ucs2;
            Py_UCS4 *ucs4;
        } data;
        Py_hash_t hash;
        int state;     /* may include SSTATE_SHORT_ASCII flag */
        wchar_t *wstr;
    } PyASCIIObject;

    typedef struct {
        PyASCIIObject _base;
        Py_ssize_t utf8_length;
        char *utf8;
        Py_ssize_t wstr_length;
    } PyUnicodeObject;

Code that directly accesses the structures would become more complex; code that uses the accessor macros wouldn't notice.

As a result, ASCII-only strings would lose three pointers, and shrink to their 3.2 structure size. Since they also save in the individual characters, strings with more than 3 characters (16-bit Py_UNICODE) or more than one character (32-bit Py_UNICODE) would see a total size reduction compared to 3.2.

Objects created through the legacy API (PyUnicode_FromUnicode) that are only later found to be ASCII-only (in PyUnicode_Ready) would still have the UTF-8 pointer shared with the data pointer, but keep including separate fields for pointer & size.

What do you think?

Regards,
Martin

P.S. There are similar reductions that could be applied to the wstr_length in general: on 32-bit wchar_t systems, it could always be dropped; on 16-bit wchar_t systems, it could be dropped for UCS-2 strings. However, I'm not proposing these, as I think the increase in complexity is not worth the savings.
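The point about accessor macros hiding the two layouts can be illustrated with a small sketch. It is not CPython code: every `sketch_` name, the stand-in object header, and the flag value are assumptions made so it compiles without Python.h. It also assumes the flag is set only on objects allocated with the short layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PyObject_HEAD (refcount + type pointer); an assumption. */
typedef struct {
    ptrdiff_t refcnt;
    void *type;
} sketch_head;

/* Hypothetical flag value; the proposal only names the flag. */
#define SKETCH_SHORT_ASCII 0x1

/* Short layout: the only fields an ASCII-only string would carry. */
typedef struct {
    sketch_head head;
    ptrdiff_t length;
    union {
        void *any;
        uint8_t *latin1;
        uint16_t *ucs2;
        uint32_t *ucs4;
    } data;
    ptrdiff_t hash;
    int state;              /* may include SKETCH_SHORT_ASCII */
    wchar_t *wstr;
} sketch_ascii;

/* Full layout: the short layout first, then the extra fields. Because the
   short struct is the first member, a pointer to the full object may be
   read through a pointer to the short struct without breaking the C
   aliasing rules. */
typedef struct {
    sketch_ascii base;
    ptrdiff_t utf8_length;
    char *utf8;
    ptrdiff_t wstr_length;
} sketch_unicode;

/* Accessor: callers never learn which layout they were handed. */
static const char *sketch_utf8(const void *op)
{
    assert(op != NULL);
    const sketch_ascii *a = op;
    if (a->state & SKETCH_SHORT_ASCII)
        return (const char *)a->data.latin1;  /* UTF-8 == one-byte data */
    return ((const sketch_unicode *)op)->utf8;
}

static ptrdiff_t sketch_utf8_length(const void *op)
{
    assert(op != NULL);
    const sketch_ascii *a = op;
    if (a->state & SKETCH_SHORT_ASCII)
        return a->length;                     /* one byte per character */
    return ((const sketch_unicode *)op)->utf8_length;
}

/* Bytes saved per ASCII-only object by omitting the extended fields. */
static size_t sketch_bytes_saved(void)
{
    return sizeof(sketch_unicode) - sizeof(sketch_ascii);
}
```

For an object with the flag set, both accessors read only the short struct, so the extended fields never need to be allocated. On common platforms `sketch_bytes_saved()` comes to three pointer-sized words, which corresponds to the three fields the proposal drops.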

This sounds like a good plan.

--Guido van Rossum (python.org/~guido)


