[Python-Dev] PyUnicodeObject / PyASCIIObject questions

"Martin v. Löwis" martin at v.loewis.de
Tue Dec 13 08:55:02 CET 2011


> (1) Why is PyObject_HEAD used instead of PyObject_VAR_HEAD? Is it because of the names (.length vs .size), or a holdover from when unicode (as opposed to str) did not expect to be compact, or is there a deeper reason?

The unicode object is not a var object. In a var object, tp_itemsize gives the element size, which is not possible for unicode objects, since the item size may vary by instance. In addition, not all instances store their items directly after the base object, and the base-object size given in tp_basicsize is not always correct either.
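The per-instance item size can be observed from Python. This is only a rough sketch, assuming CPython 3.3 or later (where PEP 393 is implemented) and assuming that sys.getsizeof reflects the character buffer; the helper name bytes_per_char is made up for illustration.

```python
import sys

def bytes_per_char(ch, n=100):
    """Extra bytes reported when a string of n copies of ch grows by one."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# One sample character per storage class described in PEP 393.
samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",
    "BMP": "\u20ac",
    "non-BMP": "\U0001f600",
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(f"{name}: {width} byte(s) per code point")
```

Since the width differs between instances of the same type, no single tp_itemsize value could describe all of them.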

> (2) Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length? As best I can tell from the PEP or header file, wstr is only meaningful when either:

No. wstr is relevant above all when someone calls PyUnicode_AsUnicode or PyUnicode_AsUnicodeAndSize; any unicode object might get the wstr pointer filled out at some point. The wstr buffer can be shared with the canonical data only if sizeof(Py_UNICODE) matches the canonical width of the string.

wstr_length is only relevant if wstr is not NULL. For a pure ASCII string (and also for Latin-1 and other BMP strings), the wstr length will always equal the canonical length (number of code points). The optimization of dropping wstr_length from the representation was made only for ASCII objects.
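The length relationship can be modelled without touching wstr itself. The sketch below assumes a platform where Py_UNICODE is 16 bits wide and uses the UTF-16 code unit count as a stand-in for the wstr length; it does not read the real wstr field, and the helper name utf16_units is made up for illustration.

```python
def utf16_units(s):
    """Number of 16-bit code units needed for s (stand-in for a 16-bit wstr length)."""
    return len(s.encode("utf-16-le")) // 2

# (code points, 16-bit units) for ASCII, Latin-1, BMP and non-BMP samples.
for s in ["abc", "caf\u00e9", "\u20ac100", "a\U0001f600"]:
    print(ascii(s), len(s), utf16_units(s))
```

Only the last sample, which contains a non-BMP character, needs more 16-bit units than it has code points. With a 32-bit Py_UNICODE the two lengths would be equal for every string.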

> I'm also not sure why wstr can't be stored in the existing .data member -- once PyUnicode_READY is called, it will either be there (shared) or be discarded.

Most objects won't have the .data member. For those that do, .data holds the canonical representation (and only after PyUnicode_READY has been called).

> (3) I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. (Better still would be to allow other values, and to have the macros delegate to some attribute on the (sub) type object.)

> Discussion on py-ideas strongly suggested that people should not be rolling their own string representations, and that it won't really save as much as people think it will, etc ... but I'm not sure that saying "do it without inheritance" is the best solution -- and that is what treating kind as an exhaustive list does.

If people use C, they can construct all kinds of "illegal" representations, for any object (e.g. lists where the stored length differs from the actual length, dictionaries where key and value are switched, and so on). If they do that, they likely get crashes and other failures, so they quickly stop doing it. In the specific case of kind values: many places will either work incorrectly, or already have an assertion in debug mode if an unexpected kind is encountered. I don't mind adding such checks to more places, but I also don't see a need to explicitly care about this specific class of bugs where people would have to deliberately try to "cheat".
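To make the kind check concrete, here is a small Python model of a kind-dispatching read. It is not the CPython macro, only a sketch: read_code_point and VALID_KINDS are hypothetical names, and the only facts taken from PEP 393 are that the canonical kinds are 1, 2 and 4 and that the kind equals the byte width of one code point.

```python
import sys

# Canonical kinds in PEP 393: the kind equals the byte width of one code point.
VALID_KINDS = (1, 2, 4)

def read_code_point(data, kind, index):
    """Return the code point at index in a buffer of fixed-width units."""
    if kind not in VALID_KINDS:
        # Plays the role of the debug-mode assertion on unexpected kinds.
        raise ValueError(f"unexpected kind: {kind!r}")
    start = index * kind
    if index < 0 or start + kind > len(data):
        raise IndexError("index out of range")
    return int.from_bytes(data[start:start + kind], sys.byteorder)

# A two-byte-per-character buffer in native byte order.
codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
buf = "h\u20ac".encode(codec)

print(hex(read_code_point(buf, 2, 1)))

try:
    read_code_point(buf, 3, 0)
    rejected = False
except ValueError:
    rejected = True
```

A reader that skipped the check and received kind 3 would simply compute wrong offsets, which is the "work incorrectly" case.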

Regards, Martin
