[Python-Dev] PyUnicodeObject / PyASCIIObject questions

Jim Jewett jimjjewett at gmail.com
Tue Dec 13 08:09:02 CET 2011


(see http://www.python.org/dev/peps/pep-0393/ and http://hg.python.org/cpython/file/6f097ff9ac04/Include/unicodeobject.h )

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;   /* now 3 in implementation */
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
  PyCompactUnicodeObject _base;
  union {
      void *any;
      Py_UCS1 *latin1;
      Py_UCS2 *ucs2;
      Py_UCS4 *ucs4;
  } data;
} PyUnicodeObject;

(1) Why is PyObject_HEAD used instead of PyObject_VAR_HEAD? Is it because of the names (.length vs .size), or a holdover from when unicode (as opposed to str) did not expect to be compact, or is there a deeper reason?
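To make question (1) concrete, here is a minimal sketch comparing the two header layouts. The Mock* types are hypothetical stand-ins, not CPython's definitions; they assume a non-debug build in which PyObject_HEAD contributes a reference count and a type pointer, and PyObject_VAR_HEAD adds one Py_ssize_t item count (ob_size) after them.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, ptrdiff_t */

/* Hypothetical stand-in for Py_ssize_t. */
typedef ptrdiff_t mock_ssize_t;

/* Assumed contribution of PyObject_HEAD in a non-debug build. */
typedef struct {
    mock_ssize_t ob_refcnt;
    void *ob_type;
} MockObject;

/* Assumed contribution of PyObject_VAR_HEAD: the head plus ob_size. */
typedef struct {
    MockObject ob_base;
    mock_ssize_t ob_size;
} MockVarObject;

/* Start of PyASCIIObject as quoted above: plain head, then length. */
typedef struct {
    MockObject ob_base;
    mock_ssize_t length;
} HeadPlusLength;

/* Under these assumptions the count field lands in the same place. */
static_assert(sizeof(HeadPlusLength) == sizeof(MockVarObject),
              "both headers have the same size");
static_assert(offsetof(HeadPlusLength, length) ==
              offsetof(MockVarObject, ob_size),
              "length and ob_size share an offset");

/* Run-time version of the same check; returns 1 when the layouts agree. */
int layouts_match(void)
{
    return sizeof(HeadPlusLength) == sizeof(MockVarObject) &&
           offsetof(HeadPlusLength, length) ==
           offsetof(MockVarObject, ob_size);
}
```

If the sketch matches the real macros, the two choices differ only in the field name and in which generic size macros apply, which is why the question asks whether there is a deeper reason.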

(2) Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length? As best I can tell from the PEP or header file, wstr is only meaningful when either:

(2a)  wstr is shared with (and redundant to) the canonical representation -- which will therefore not be ASCII.  So wstr (and wstr_length) shouldn't need to be represented explicitly, and certainly not in the PyASCIIObject base.

or

(2b)  The string is a "Legacy String" (and PyUnicode_READY has not been called).  Because it is a Legacy String, the object header must already be a full PyUnicodeObject, and the wstr fields could at least be stored there.

    I'm also not sure why wstr can't be stored in the existing .data member -- once PyUnicode_READY is called, it will either be there (shared) or be discarded.

    Are there other times when the wstr will be explicitly re-filled and cached?
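As a rough illustration of what is at stake in question (2), the sketch below measures how much the wstr pointer adds to the smallest header. Both structs are hypothetical mock-ups: the bitfields are collapsed into a single unsigned int, the head is assumed to be a reference count plus a type pointer, and AsciiHeaderNoWstr is an imagined variant, not anything in the header file.

```c
#include <stddef.h>   /* size_t, ptrdiff_t */
#include <wchar.h>    /* wchar_t */

/* Hypothetical stand-in for PyObject_HEAD in a non-debug build. */
typedef struct {
    ptrdiff_t ob_refcnt;
    void *ob_type;
} MockHead;

/* Mock of PyASCIIObject as quoted above (state bits collapsed). */
typedef struct {
    MockHead head;
    ptrdiff_t length;
    ptrdiff_t hash;
    unsigned int state;
    wchar_t *wstr;
} AsciiHeader;

/* Imagined variant with wstr moved out of the base struct. */
typedef struct {
    MockHead head;
    ptrdiff_t length;
    ptrdiff_t hash;
    unsigned int state;
} AsciiHeaderNoWstr;

/* Bytes that every compact ASCII string pays for the wstr slot
   in this mock layout. */
size_t wstr_cost_per_ascii_string(void)
{
    return sizeof(AsciiHeader) - sizeof(AsciiHeaderNoWstr);
}
```

On common ABIs I would expect the difference to be one pointer per string, paid even by strings that never use a wchar_t representation. The real figure depends on the actual struct padding.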

(3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. (Better still would be to allow other values, and to have the macros delegate to some attribute on the (sub) type object.)
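A minimal sketch of the "raise an error on reserved values" idea from (3) follows. The names sketch_kind and sketch_read are hypothetical, and the numeric kind values (0, 1, 2, 4) are my assumption about the implementation; the point is only that an unrecognized kind is rejected rather than silently read as one of the known widths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed kind values; 3, 5, 6 and 7 fit in a 3-bit field
   but are unassigned in this sketch. */
enum sketch_kind {
    SKETCH_WCHAR_KIND = 0,   /* legacy string, not yet ready */
    SKETCH_1BYTE_KIND = 1,
    SKETCH_2BYTE_KIND = 2,
    SKETCH_4BYTE_KIND = 4
};

/* Reads the code point at index into *out.
   Returns 0 on success, -1 if kind is not a ready, known kind. */
int sketch_read(unsigned int kind, const void *data, size_t index,
                uint32_t *out)
{
    assert(data != NULL && out != NULL);
    switch (kind) {
    case SKETCH_1BYTE_KIND:
        *out = ((const uint8_t *)data)[index];
        return 0;
    case SKETCH_2BYTE_KIND:
        *out = ((const uint16_t *)data)[index];
        return 0;
    case SKETCH_4BYTE_KIND:
        *out = ((const uint32_t *)data)[index];
        return 0;
    default:
        /* Legacy (not ready) and reserved kinds are both refused. */
        return -1;
    }
}
```

The "better still" variant would replace the default branch with a call through a function pointer stored on the (sub)type, so that a new kind could supply its own reader.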

Discussion on py-ideas strongly suggested that people should not be rolling their own string representations, and that it won't really save as much as people think it will, etc ... but I'm not sure that saying "do it without inheritance" is the best solution -- and that is what treating kind as an exhaustive list does.

-jJ


