[Python-Dev] PyUnicodeObject / PyASCIIObject questions

Jim Jewett jimjjewett at gmail.com
Tue Dec 13 22:17:13 CET 2011


On Tue, Dec 13, 2011 at 2:55 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

(1)  Why is PyObject_HEAD used instead of PyObject_VAR_HEAD?

The unicode object is not a var object. In a var object, tp_itemsize gives the element size, which is not possible for unicode objects, since the itemsize may vary by instance. In addition, not all instances have the items after the base object (plus the size of the base object in tp_basicsize is also not always correct).

That makes perfect sense.

Any chance of adding the rationale to the code? Either inline, such as changing unicodeobject.h line 291 from

PyObject_HEAD

to something like:

PyObject_HEAD /* Not VAR_HEAD, because tp_itemsize varies, and data may be elsewhere. */

or in the large comments around line 288:

Note that Strings use PyObject_HEAD and a length field instead of PyObject_VAR_HEAD, because the tp_itemsize varies by instance, and the actual data is not always immediately after the PyASCIIObject header.
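To make the rationale concrete, here is a minimal sketch in plain C. The `sketch_str` type and both helpers are hypothetical, simplified stand-ins and are not the real CPython structs or macros. It only illustrates the two points above: the per-character size is a property of the instance rather than of the type, and the character data may or may not sit directly after the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in; NOT the real PyASCIIObject layout. */
typedef struct {
    size_t refcnt;     /* stands in for what PyObject_HEAD provides */
    size_t length;     /* number of code points, kept as an ordinary field */
    unsigned kind;     /* bytes per code point for THIS instance: 1, 2 or 4 */
    unsigned compact;  /* 1: data follows the header; 0: data lives elsewhere */
    void *data;        /* consulted only when compact == 0 */
} sketch_str;

/* Bytes of character storage. This depends on the instance, which is
   why a single per-type item size cannot describe it. */
static size_t sketch_payload_bytes(const sketch_str *s)
{
    assert(s->kind == 1 || s->kind == 2 || s->kind == 4);
    return s->length * s->kind;
}

/* Where the characters live: immediately after the header,
   or in a separately allocated buffer. */
static const void *sketch_data(const sketch_str *s)
{
    return s->compact ? (const void *)(s + 1) : (const void *)s->data;
}
```

Two strings with the same `length` but different `kind` report different payload sizes, which is exactly the situation a var object with one fixed item size does not model.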

(2)  Why does PyASCIIObject have a wstr member, and why does PyCompactUnicodeObject have wstr_length?  As best I can tell from the PEP or header file, wstr is only meaningful when either:

No. wstr is most of all relevant if someone calls PyUnicode_AsUnicode(AndSize); any unicode object might get the wstr pointer filled out at some point.

I am willing to believe that requests for a wchar_t (or utf-8 or System Locale charset) representation are common enough to justify caching the data after the first request.

But then why throw it away in the first place? Wouldn't programs that create unicode from wchar_t data also be the most likely to request wchar_t data back?

wstr_length is only relevant if wstr is not NULL. For a pure ASCII string (and also for Latin-1 and other BMP strings), the wstr length will always equal the canonical length (number of code points).

wstr_length != length exactly when:

2==sizeof(wchar_t) &&
PyUnicode_4BYTE_KIND == PyUnicode_KIND( str )

which can sometimes be eliminated at compile-time, and always by string creation time.

In all other cases, (wstr_length == length), and wstr can be generated by widening the data without having to inspect it. Is it worth eliminating wstr_length (or even wstr) in those cases, or is that too much complexity?
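The condition can be sketched as a small standalone C function. Everything here is hypothetical: `sketch_wstr_length` is not a CPython function, the code points are passed as a `uint32_t` array regardless of kind to keep the example short, and `wchar_size` is a parameter (rather than `sizeof(wchar_t)`) so that both platform cases can be tried on one machine. It assumes `kind` is the number of bytes per code point in the canonical form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of wchar_t units needed to represent the string.
   kind:       bytes per code point in the canonical form (1, 2 or 4)
   wchar_size: sizeof(wchar_t) on the target platform (2 or 4) */
static size_t sketch_wstr_length(unsigned kind, size_t wchar_size,
                                 const uint32_t *code_points, size_t length)
{
    assert(kind == 1 || kind == 2 || kind == 4);

    /* Every case except "2-byte wchar_t with 4-byte kind" is a plain
       widening copy, so the length is known without inspecting data. */
    if (!(wchar_size == 2 && kind == 4))
        return length;

    /* Only here must the data be scanned: each non-BMP code point
       needs a surrogate pair, i.e. one extra 16-bit unit. */
    size_t units = length;
    for (size_t i = 0; i < length; i++) {
        if (code_points[i] > 0xFFFF)
            units++;
    }
    return units;
}
```

With this shape, a build where `sizeof(wchar_t)` is 4 never reaches the scanning loop, which is the compile-time elimination mentioned above.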

(3)  I would feel much less nervous if the remaining 4 values of PyUnicode_Kind were explicitly reserved, and the macros raised an error when they showed up. ...

If people use C, they can construct all kinds of "illegal" ... kind values: many places will either work incorrectly, or have an assertion in debug mode already if an unexpected kind is encountered.

What I'm asking is that

(1) The other values be documented as reserved, rather than as illegal.

(2) The macros produce an error rather than silently corrupting data.

This allows at least the possibility of a later change such that

(3) The macros handle the new values correctly, if only by delegating back to type-supplied functions.
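As an illustration of request (2), here is a minimal sketch of a checked accessor in plain C. The names `sketch_kind` and `sketch_read` are hypothetical and are not part of the CPython API; the real access is done by macros, and this only shows the shape of "report an error on a reserved kind" as opposed to guessing a width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kind values: bytes per code point. Any other value is
   treated as reserved, not as something to interpret. */
enum sketch_kind { SKETCH_1BYTE = 1, SKETCH_2BYTE = 2, SKETCH_4BYTE = 4 };

/* Reads the code point at index into *out.
   Returns 0 on success, -1 if kind is a reserved value; on failure
   *out is left untouched. */
static int sketch_read(unsigned kind, const void *data, size_t index,
                       uint32_t *out)
{
    assert(data != NULL && out != NULL);
    switch (kind) {
    case SKETCH_1BYTE:
        *out = ((const uint8_t *)data)[index];
        return 0;
    case SKETCH_2BYTE:
        *out = ((const uint16_t *)data)[index];
        return 0;
    case SKETCH_4BYTE:
        *out = ((const uint32_t *)data)[index];
        return 0;
    default:
        return -1;  /* reserved kind: report it, do not guess a width */
    }
}
```

The `default` branch is also the natural place where a later version could delegate to a type-supplied function, as in point (3), without changing callers.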

-jJ


