[Python-Dev] PEP 393 close to pronouncement
M.-A. Lemburg mal at egenix.com
Wed Sep 28 18:44:23 CEST 2011
Guido van Rossum wrote:
> Given the feedback so far, I am happy to pronounce PEP 393 as accepted. Martin, congratulations! Go ahead and mark it as Accepted. (But please do fix up the small nits that Victor reported in his earlier message.)
I've been working on feedback for the last few days, but I guess it's too late. Here goes anyway...
I've only read the PEP and not followed the discussion due to lack of time, so if any of this is no longer valid, that's probably because the PEP wasn't updated :-)
Resizing
Codecs use resizing a lot. Given that PyCompactUnicodeObject does not support resizing, most decoders will have to use PyUnicodeObject and thus not benefit from the memory footprint advantages of e.g. PyASCIIObject.
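To illustrate what I mean, here's a minimal sketch of the classic decoder pattern using the existing API (the function itself and the byte-to-code-point mapping are made up; the API calls are the current ones): over-allocate for the worst case, fill the buffer, then shrink to the actual size. This only works on a resizable PyUnicodeObject, not on the new compact forms.

#include <Python.h>

static PyObject *
decode_example(const char *input, Py_ssize_t size)
{
    PyObject *unicode = PyUnicode_FromUnicode(NULL, size);  /* worst-case size */
    Py_UNICODE *out;
    Py_ssize_t i, outpos = 0;

    if (unicode == NULL)
        return NULL;
    out = PyUnicode_AS_UNICODE(unicode);
    for (i = 0; i < size; i++) {
        /* ... a real codec maps input bytes to code points here ... */
        out[outpos++] = (Py_UNICODE)(unsigned char)input[i];
    }
    /* shrink to the number of code points actually produced */
    if (PyUnicode_Resize(&unicode, outpos) < 0) {
        Py_XDECREF(unicode);
        return NULL;
    }
    return unicode;
}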
Data structure
The data structure description in the PEP appears to be wrong:
PyASCIIObject has a wchar_t *wstr pointer - I guess this should be a char *str pointer; otherwise, where is the memory footprint advantage (especially on Linux, where sizeof(wchar_t) == 4)?
I also don't see a reason to limit the UCS1 storage version to ASCII. Accordingly, the object should be called PyLatin1Object or PyUCS1Object.
Here's the version from the PEP:
""" typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; struct { unsigned int interned:2; unsigned int kind:2; unsigned int compact:1; unsigned int ascii:1; unsigned int ready:1; } state; wchar_t *wstr; } PyASCIIObject;
typedef struct { PyASCIIObject _base; Py_ssize_t utf8_length; char *utf8; Py_ssize_t wstr_length; } PyCompactUnicodeObject; """
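For comparison, here is roughly what I would have expected a UCS1 object to look like (my own sketch, not from the PEP; the PyUCS1Object name and the str field are just illustrations of the suggestion above):

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;
    Py_hash_t hash;
    struct {
        unsigned int interned:2;
        unsigned int kind:2;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
    } state;
    char *str;          /* one byte per code point (full Latin-1 range) */
} PyUCS1Object;         /* hypothetical name, per the suggestion above */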
Typedef'ing Py_UNICODE to wchar_t and using wchar_t in existing code will cause problems on some systems where wchar_t is a signed type.
Python assumes that Py_UNICODE is unsigned and thus does not check for negative values or take them into account when doing range checks or code point arithmetic.
On platforms where wchar_t is signed, it is safer to typedef Py_UNICODE to an unsigned integer type of the same width as wchar_t.
Accordingly, and to prevent further breakage, Py_UNICODE should not be deprecated and should continue to be used instead of wchar_t throughout the code.
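To make the signedness issue concrete, here is a minimal sketch (my own, not from the PEP) of the kind of guard I have in mind, together with the sort of range check that silently breaks when the underlying type is signed:

#include <wchar.h>
#include <stdint.h>

/* Hypothetical guard: always map Py_UNICODE to an unsigned type of the
 * same width as wchar_t, so that code point comparisons and arithmetic
 * stay well-defined. */
#if WCHAR_MAX > 0xFFFF
typedef uint32_t Py_UNICODE;    /* 4-byte wchar_t platforms, e.g. Linux */
#else
typedef uint16_t Py_UNICODE;    /* 2-byte wchar_t platforms, e.g. Windows */
#endif

/* If Py_UNICODE were a signed 16-bit wchar_t instead, a high surrogate
 * such as 0xD83D would be stored as a negative value and this check
 * would silently return false for it. */
static int
is_surrogate(Py_UNICODE ch)
{
    return ch >= 0xD800 && ch <= 0xDFFF;
}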
Length information
Py_UNICODE access to the objects assumes that len(obj) == length of the Py_UNICODE buffer. The PEP suggests that length should not take surrogates into account on UCS2 platforms such as Windows. This causes len(obj) to no longer match len(wstr).
As a result, Py_UNICODE access to the Unicode objects breaks when surrogate code points are present in the Unicode object on UCS2 platforms.
The PEP also does not explain how lone surrogates will be handled with respect to the length information.
Furthermore, determining len(obj) will require a loop over the data, checking for surrogate code points. A simple memcpy() is no longer enough.
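Here is a minimal sketch (my own, not from the PEP) of what such a length calculation over a UCS2/UTF-16 buffer would look like: a full pass over the data instead of a simple lookup of the stored buffer size.

#include <stdint.h>
#include <Python.h>     /* for Py_ssize_t */

static Py_ssize_t
count_code_points(const uint16_t *buf, Py_ssize_t buflen)
{
    Py_ssize_t i, count = 0;

    for (i = 0; i < buflen; i++) {
        count++;
        /* a high surrogate followed by a low surrogate counts as one code point */
        if (buf[i] >= 0xD800 && buf[i] <= 0xDBFF &&
            i + 1 < buflen &&
            buf[i+1] >= 0xDC00 && buf[i+1] <= 0xDFFF)
            i++;
    }
    return count;
}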
I suggest dropping the idea of having len(obj) not count wstr surrogate code points, in order to maintain backwards compatibility and to allow working with lone surrogates.
Note that the whole surrogate debate does not have much to do with this PEP, since it's mainly about memory footprint savings. I'd also urge doing a reality check with respect to surrogates and non-BMP code points: in practice you only very rarely see any non-BMP code points in your data. Making all Python users pay for the needs of a tiny fraction is not really fair. Remember: practicality beats purity.
API
Victor already described the needed changes.
Performance
The PEP only lists a few low-level benchmarks as basis for the performance decrease. I'm missing some more adequate real-life tests, e.g. using an application framework such as Django (to the extent this is possible with Python3) or a server like the Radicale calendar server (which is available for Python3).
I'd also like to see a performance comparison which specifically uses the existing Unicode APIs to create and work with Unicode objects. Most extensions will use this way of working with the Unicode API, either because they want to support Python 2 and 3, or because the effort it takes to port to the new APIs is too high. The PEP makes some statements that this is slower, but doesn't quantify those statements.
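By "the existing Unicode APIs" I mean extension code along these lines (a sketch of my own, not taken from the PEP; under the new design, accessing an object this way first has to build and cache the wchar_t representation):

#include <Python.h>

/* Classic Py_UNICODE-based access, as used by most existing extensions. */
static Py_ssize_t
count_spaces(PyObject *unicode)
{
    Py_UNICODE *p = PyUnicode_AS_UNICODE(unicode);
    Py_ssize_t i, size = PyUnicode_GET_SIZE(unicode);
    Py_ssize_t count = 0;

    for (i = 0; i < size; i++) {
        if (p[i] == (Py_UNICODE)' ')
            count++;
    }
    return count;
}

Benchmarking exactly this kind of code path, before and after the change, would quantify the statements made in the PEP.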
Memory savings
The table only lists string sizes of up to 8 code points. The memory savings for these are really only significant for ASCII strings on 64-bit platforms, if you use the default UCS2 Python build as the basis.
For larger strings, I expect the savings to be more significant. OTOH, a single non-BMP code point in such a string would cause the savings to drop significantly again.
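To put rough numbers on that (my own back-of-the-envelope figures, ignoring the fixed per-object overhead; not taken from the PEP's table):

    1000 ASCII code points, narrow (UCS2) build:       ~2000 bytes of character data
    1000 ASCII code points, PEP 393, 1 byte/char:      ~1000 bytes of character data
    the same string plus one non-BMP code point:
        narrow (UCS2) build, as a surrogate pair:      ~2004 bytes
        PEP 393, forced to 4 bytes/char:               ~4004 bytes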
Complexity
In order to benefit from the new API, any code that has to deal with low-level Py_UNICODE access to the Unicode objects will have to be adapted.
For best performance, each algorithm will have to be implemented for all three storage types.
Not doing so will result in a slow-down, if I read the PEP correctly. It's difficult to say of what scale, since that information is not given in the PEP, but the added loop over the complete data array in order to determine the maximum code point value suggests that it is significant.
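This is my reading of what the PEP implies (a sketch of my own, not code from the PEP): before a string can be stored, the whole input has to be scanned once to find the largest code point, which then selects the 1-, 2- or 4-byte storage form; and for best performance, loops like this one would additionally need separate Py_UCS1/Py_UCS2/Py_UCS4 variants.

static Py_UCS4
scan_max_char(const Py_UNICODE *u, Py_ssize_t size)
{
    Py_UCS4 maxchar = 0;
    Py_ssize_t i;

    for (i = 0; i < size; i++) {
        if ((Py_UCS4)u[i] > maxchar)
            maxchar = (Py_UCS4)u[i];
    }
    /* <= 0x7F: ASCII, <= 0xFF: UCS1, <= 0xFFFF: UCS2, otherwise UCS4 */
    return maxchar;
}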
Summary
I am not convinced that the memory savings are big enough to warrant the performance penalty and added complexity suggested by the PEP.
In times where even smartphones come with multiple GB of RAM, performance is more important than memory savings.
In practice, using a UCS2 build of Python usually is a good compromise between memory savings, performance and standards compatibility. For the few cases where you have to deal with UCS4 code points, we have already made good progress in making these much easier to handle. IMHO, Python should be optimized for UCS2 usage, not the rare cases of UCS4 usage you find in practice.
I do see the advantage for large strings, though.
My personal conclusion
Given that I've been working on and maintaining the Python Unicode implementation actively or by providing assistance for almost 12 years now, I've also thought about whether it's still worth the effort.
My interests have shifted somewhat into other directions and I feel that helping Python reach world domination in other ways makes me happier than fighting over Unicode standards, implementations, special cases that aren't special enough, and all those other nitty-gritty details that cause long discussions :-)
So I feel that the PEP 393 change is a good time to draw a line and leave Unicode maintenance to Ezio, Victor, Martin, and all the others that have helped over the years. I know it's in good hands. So here it is:
Hey, that was easy :-)
PS: I'll stick around a bit more for the platform module, pybench and whatever else comes along where you might be interested in my input.
Thanks and cheers,
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 28 2011)
Python/Zope Consulting and Support ...          http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...               http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...          http://python.egenix.com/

2011-10-04: PyCon DE 2011, Leipzig, Germany                6 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/