[Python-Dev] PEP 393 close to pronouncement
"Martin v. Löwis" martin at v.loewis.de
Wed Sep 28 19:47:22 CEST 2011
> Codecs use resizing a lot. Given that PyCompactUnicodeObject does not support resizing, most decoders will have to use PyUnicodeObject and thus not benefit from the memory footprint advantages of e.g. PyASCIIObject.
No, codecs have been rewritten to not use resizing.
> PyASCIIObject has a wchar_t *wstr pointer - I guess this should be a char *str pointer, otherwise, where's the memory footprint advantage (esp. on Linux where sizeof(wchar_t) == 4)?
That's the Py_UNICODE representation for backwards compatibility. It's normally NULL.
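For illustration, the PyASCIIObject layout sketched in the PEP looks roughly like this (field details follow the PEP draft and may still change before the final version):

    typedef struct {
        PyObject_HEAD
        Py_ssize_t length;          /* number of code points in the string */
        Py_hash_t hash;             /* cached hash, or -1 if not computed yet */
        struct {
            unsigned int interned:2;
            unsigned int kind:2;    /* 00=wstr only, 01=1 byte, 10=2 bytes, 11=4 bytes */
            unsigned int compact:1; /* character data follows the structure inline */
            unsigned int ascii:1;   /* all characters are < 128 */
            unsigned int ready:1;   /* canonical (compact) representation available */
        } state;
        wchar_t *wstr;              /* legacy Py_UNICODE form, normally NULL */
    } PyASCIIObject;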
> I also don't see a reason to limit the UCS1 storage version to ASCII. Accordingly, the object should be called PyLatin1Object or PyUCS1Object.
No, in the ASCII case, the UTF-8 length can be shared with the regular string length - not so for Latin-1 characters above 127.
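A small illustration of why only the ASCII case can share the two lengths (ucs1_utf8_length is just a hypothetical helper, not part of the proposed API):

    #include <stdio.h>

    /* UTF-8 length of a UCS1 (Latin-1) buffer: code points below 128 need
       one UTF-8 byte, code points 128..255 need two. */
    static size_t ucs1_utf8_length(const unsigned char *s, size_t n)
    {
        size_t utf8 = 0;
        for (size_t i = 0; i < n; i++)
            utf8 += (s[i] < 0x80) ? 1 : 2;
        return utf8;
    }

    int main(void)
    {
        const unsigned char ascii[3]  = { 'a', 'b', 'c' };   /* 3 code points, 3 UTF-8 bytes */
        const unsigned char latin1[3] = { 'a', 0xE9, 'c' };  /* "a\xe9c": 3 code points, 4 UTF-8 bytes */
        printf("%zu %zu\n", ucs1_utf8_length(ascii, 3), ucs1_utf8_length(latin1, 3));
        return 0;
    }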
> Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing code will cause problems on some systems where wchar_t is a signed type.
> Python assumes that Py_UNICODE is unsigned and thus doesn't check for negative values or take these into account when doing range checks or code point arithmetic. On platforms where wchar_t is signed, it is safer to typedef Py_UNICODE to an unsigned wchar_t.
No. Py_UNICODE values must be in the range 0..17*2**16. Values larger than 17*2**16 are just as bad as negative values, so having Py_UNICODE unsigned doesn't improve anything.
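In other words, the range check needed is the same whether the type is signed or unsigned; a minimal sketch:

    #include <stdio.h>

    #define MAX_CODE_POINTS (17 * 65536)   /* 17 planes of 2**16 code points */

    int main(void)
    {
        long ch = -1;   /* e.g. a bogus value coming from a signed wchar_t */
        if (ch < 0 || ch >= MAX_CODE_POINTS)
            printf("invalid code point\n");   /* negative and too-large values are both rejected */
        return 0;
    }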
> Py_UNICODE access to the objects assumes that len(obj) == length of the Py_UNICODE buffer. The PEP suggests that length should not take surrogates into account on UCS2 platforms such as Windows. This causes len(obj) to not match len(wstr).
Correct.
> As a result, Py_UNICODE access to the Unicode objects breaks when surrogate code points are present in the Unicode object on UCS2 platforms.
Incorrect. What specifically do you think would break?
> The PEP also does not explain how lone surrogates will be handled with respect to the length information.
Just as any other code point. Python does not special-case surrogate code points anymore.
> Furthermore, determining len(obj) will require a loop over the data, checking for surrogate code points. A simple memcpy() is no longer enough.
No, it won't. The length of the Unicode object is stored in the length field.
> I suggest to drop the idea of having len(obj) not count wstr surrogate code points to maintain backwards compatibility and allow for working with lone surrogates.
Backwards-compatibility is fully preserved by PyUnicode_GET_SIZE returning the size of the Py_UNICODE buffer. PyUnicode_GET_LENGTH returns the true length of the Unicode object.
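A short sketch of the difference, assuming the API from the PEP and a platform with a 16-bit wchar_t (on wide platforms both calls return 1 here):

    #include <Python.h>

    void show_lengths(void)
    {
        PyObject *s = PyUnicode_FromOrdinal(0x10000);   /* one non-BMP code point */
        Py_ssize_t size   = PyUnicode_GET_SIZE(s);      /* 2: counts the UTF-16 surrogate pair in wstr */
        Py_ssize_t length = PyUnicode_GET_LENGTH(s);    /* 1: true number of code points */
        printf("%zd %zd\n", size, length);
        Py_DECREF(s);
    }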
> Note that the whole surrogate debate does not have much to do with this PEP, since it's mainly about memory footprint savings. I'd also urge to do a reality check with respect to surrogates and non-BMP code points: in practice you only very rarely see any non-BMP code points in your data. Making all Python users pay for the needs of a tiny fraction is not really fair. Remember: practicality beats purity.
That's the whole point of the PEP. You only pay for what you actually need, and in most cases, it's ASCII.
> For best performance, each algorithm will have to be implemented for all three storage types.
This will be a trade-off. I think most developers will be happy with a single version covering all three cases, especially as it's much more maintainable.
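For example, a single version of a simple algorithm can be written against the PyUnicode_KIND/PyUnicode_DATA/PyUnicode_READ accessors from the PEP instead of being triplicated for UCS1, UCS2 and UCS4 (count_char below is only an illustrative helper, not proposed API):

    #include <Python.h>

    static Py_ssize_t count_char(PyObject *str, Py_UCS4 target)
    {
        if (PyUnicode_READY(str) < 0)       /* make sure the compact representation exists */
            return -1;
        int kind = PyUnicode_KIND(str);     /* 1, 2 or 4 bytes per character */
        void *data = PyUnicode_DATA(str);
        Py_ssize_t n = PyUnicode_GET_LENGTH(str);
        Py_ssize_t count = 0;
        for (Py_ssize_t i = 0; i < n; i++)
            if (PyUnicode_READ(kind, data, i) == target)
                count++;
        return count;
    }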
Kind regards, Martin