[Python-Dev] proposal+patch: sys.gettickeraccumulation()

Ralf W. Grosse-Kunstleve rwgk at cci.lbl.gov
Mon Dec 6 00:37:12 CET 2004


Brett C. wrote:

Ralf W. Grosse-Kunstleve wrote:
> Brett C. wrote:
> > I have been mulling over this proposal and I think I am finally
> > settling on +0 as long as it can be shown that it does not hurt
> > performance at all (yes, it shouldn't but this is the eval loop we are
> > talking about and that thing is a touchy beast).
>
> Is there a standard benchmark I could use to demonstrate the
> impact of the two new additions on the runtime?
> Thanks!

=) Parrotbench and PyBench would be enough for me to sign off on any performance hit. There is also pystone.

I have done my homework now, based on parrotbench. Timings are below. The most important conclusion from my viewpoint is that my original goal can be achieved with a small patch and with a runtime penalty that is in the noise. However, mysterious runtime behavior of Objects/longobject.c led me to change my original patch, and to the unanswered question: could small rearrangements in longobject.c potentially boost performance by 45% on some platforms?

Please don't be put off by the size of this message. My patch is essentially just one line in ceval.c (_Py_TickerAccumulation++)!

To restate my original goal:

I am looking for a simple way to answer the question: How much of a speedup can I expect if I reimplement a piece of Python code in C or C++?

Note that I am not asking how much time is spent in Python including all the extensions that come with it, but how much time is spent in the Python bytecode interpreter loop. To clarify, look at the timings for two tests in parrotbench:

b1     5.8287 time     30049474 ticks    0.1940 us/tick
b2     1.5944 time       482584 ticks    3.3039 us/tick

Each of the tests was run with range(10) in b.py to increase accuracy. The first column shows time.time() differences, the second the number of bytecodes processed in ceval.c, and the last column shows micro-seconds per tick. Why is the us/tick value for b2 17 times larger than that for b1?

The answer is that b2 makes heavy use of long integers, and that a lot of time is spent in Objects/longobject.c doing long-integer arithmetic. Compared to b1, relatively little time is spent interpreting bytecodes.

Back to the original question: how much sense does it make to reimplement b1 or b2 in C? Simply looking at the us/tick values is telling me that I can expect much more of a performance boost by reimplementing b1 rather than b2. This is because b2 spends a lot of time in C already. A similar situation arises if extensions like Numeric are used. Computationally intensive operations like matrix multiplication and matrix inversion are already done at C speed.

The us/tick values let us quickly estimate how disappointed we would be after moving Python code to C or C++. We get this estimate without having to do a detailed source code analysis, and before we have wasted our time doing the recoding (we have been through this a couple of times ...).
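For concreteness, here is a minimal sketch of how such an estimate could be obtained with the proposed call; the helper name us_per_tick is just for illustration, and the snippet of course assumes the patched interpreter:

    import sys
    import time

    def us_per_tick(func):
        # Wall-clock time and interpreted bytecodes for one call of func().
        t0 = time.time()
        ticks0 = sys.gettickeraccumulation()
        func()
        seconds = time.time() - t0
        ticks = sys.gettickeraccumulation() - ticks0
        return seconds, ticks, seconds / ticks * 1e6

A low us/tick value (like b1) means the time goes into the interpreter loop itself, i.e. recoding in C should pay off; a high value (like b2) means the time is already spent in C.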

The timings below show that the patched Python which counts the number of interpreted bytecodes is in the worst case 0.6% slower than the original Python, but some tests even run faster at face value:

$gcc296_compiled/original/python -O b_timings.py | grep reporter
b0    16.8831 time
b1     5.9553 time
b2     1.5914 time
b3    10.8149 time
b4     5.2528 time
b5    11.6437 time
b6    11.0248 time
b7    27.6960 time
all   90.9276 time
$gcc296_compiled/work/python -O b_timings.py | grep reporter
b0    16.9193 time     50979585 ticks    0.3319 us/tick
b1     5.8287 time     30049474 ticks    0.1940 us/tick
b2     1.5944 time       482584 ticks    3.3039 us/tick
b3    10.7931 time     43577404 ticks    0.2477 us/tick
b4     5.2865 time     18463044 ticks    0.2863 us/tick
b5    11.6086 time     17065750 ticks    0.6802 us/tick
b6    10.0992 time     60000464 ticks    0.1683 us/tick
b7    27.6830 time      1928644 ticks   14.3536 us/tick
all   89.8760 time    222548399 ticks    0.4038 us/tick

The b7 test is the same as b2 but with xrange(10000) instead of xrange(1000), and with all print statements removed.

Ratios (rounded to 3 decimals):

16.9193/16.8831 = 1.002
 5.8287/5.9553  = 0.979
 1.5944/1.5914  = 1.002
10.7931/10.8149 = 0.998
 5.2865/5.2528  = 1.006
11.6086/11.6437 = 0.997
10.0992/11.0248 = 0.916
27.6830/27.6960 = 1.000
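The ratios are just the patched timings divided by the original ones from the tables above; for the record they can be recomputed with a few lines of Python (pairs transcribed from the tables):

    pairs = [(16.9193, 16.8831), (5.8287, 5.9553), (1.5944, 1.5914),
             (10.7931, 10.8149), (5.2865, 5.2528), (11.6086, 11.6437),
             (10.0992, 11.0248), (27.6830, 27.6960)]
    for patched, orig in pairs:
        print "%.4f/%.4f = %.3f" % (patched, orig, patched / orig)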

Therefore I'd argue that the runtime penalty for the one additional long long increment in ceval.c (_Py_TickerAccumulation++) is in the noise.

The timings were collected on a 2.8GHz Dual Xeon, Redhat WS 3. Python was compiled under Red Hat 7.3 with gcc 2.96. See below for why.

My latest patch can be found at the end of this posting. It can be applied to the Python-2.4 distribution like this:

tar zxf Python-2.4.tgz
patch --strip=1 -d Python-2.4 < patch_file
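
After building the patched tree in the usual way, a quick sanity check that the new call is present (nothing here is specific to the patch except the final line):

    cd Python-2.4
    ./configure && make
    ./python -c "import sys; print sys.gettickeraccumulation()"

The printed number is the count of bytecodes interpreted while starting up and running the -c snippet itself.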



Now to the mysterious runtime behavior of Objects/longobject.c

In the original Python 2.4 sources _Py_Ticker is decremented in two places:

In the interpreter loop in Python/ceval.c:

            if (--_Py_Ticker < 0) {

In Objects/longobject.c:

#define SIGCHECK(PyTryBlock) \
	if (--_Py_Ticker < 0) { \
		_Py_Ticker = _Py_CheckInterval; \
		if (PyErr_CheckSignals()) { PyTryBlock; } \
	}
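
This is what keeps a long-running long-integer operation, which is a single bytecode but a long C loop, responsive to Ctrl-C. A minimal illustration (plain Python 2.4, no patch required):

    x = 10L ** 100000
    y = x * x    # one bytecode, but a long C loop in longobject.c;
                 # Ctrl-C here raises KeyboardInterrupt promptly
                 # because SIGCHECK runs inside that loop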

This macro is used in four places. My initial patch was to insert updates of a new global _Py_TickerAccumulation variable in these two places:

Python/ceval.c:

		if (--_Py_Ticker < 0) {
			if (*next_instr == SETUP_FINALLY) {
				/* Make the last opcode before
				   a try: finally: block uninterruptable. */
				goto fast_next_opcode;
			}
+			_Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker;
			_Py_Ticker = _Py_CheckInterval;

Objects/longobject.c:

#define SIGCHECK(PyTryBlock) \
	if (--_Py_Ticker < 0) { \
+		_Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; \
		_Py_Ticker = _Py_CheckInterval; \
		if (PyErr_CheckSignals()) { PyTryBlock; } \
	}

This led to the timings below. All timings were collected on the same machine, but the original and the patched Pythons were each compiled with two different compilers, gcc 2.96 and gcc 3.2.3. The important timings to look at are again those for b2 and b7.

$gcc296_compiled/original/python -O b_timings.py | grep reporter
b0    16.8831 time
b1     5.9553 time
b2     1.5914 time
b3    10.8149 time
b4     5.2528 time
b5    11.6437 time
b6    11.0248 time
b7    27.6960 time
all   90.9276 time
$gcc323_compiled/original/python -O b_timings.py | grep reporter
b0    18.6390 time
b1     6.4681 time
b2     2.2588 time
b3    11.0215 time
b4     5.6490 time
b5    12.3022 time
b6    10.3815 time
b7    40.2735 time
all  107.0636 time

This shows that the gcc 2.96 optimizer is generally a little better than the gcc 3.2.3 optimizer, though not by much, except for the b2 and b7 tests: e.g. 40.2735/27.6960 = 1.454!

Now with the patches in both ceval.c and longobject.c:

$gcc296_compiled/ticker/python -O b_timings.py | grep reporter
b0    17.1190 time     60792625 ticks    0.2816 us/tick
b1     6.1171 time     30049474 ticks    0.2036 us/tick
b2     1.8764 time       705754 ticks    2.6587 us/tick
b3    10.7070 time     43577404 ticks    0.2457 us/tick
b4     5.2677 time     22944694 ticks    0.2296 us/tick
b5    11.7448 time     17433190 ticks    0.6737 us/tick
b6    10.9603 time     60000504 ticks    0.1827 us/tick
b7    33.2320 time      2837124 ticks   11.7133 us/tick
all   97.0893 time    238342235 ticks    0.4074 us/tick
$gcc323_compiled/ticker/python -O b_timings.py | grep reporter
b0    18.3115 time     60792625 ticks    0.3012 us/tick
b1     6.1193 time     30049474 ticks    0.2036 us/tick
b2     2.2522 time       705754 ticks    3.1912 us/tick
b3    11.2202 time     43577404 ticks    0.2575 us/tick
b4     5.4596 time     22944694 ticks    0.2379 us/tick
b5    11.8197 time     17433190 ticks    0.6780 us/tick
b6    10.7407 time     60000504 ticks    0.1790 us/tick
b7    40.2397 time      2837124 ticks   14.1833 us/tick
all  106.2283 time    238342235 ticks    0.4457 us/tick

The gcc 3.2.3 timings for b7 are hardly affected by my patch in longobject.c, but the gcc 2.96 timing shoots up from 27.6960 to 33.2320. Still not as bad as the gcc 3.2.3 timing, but why?

Interestingly, if I declare _Py_TickerAccumulation as long instead of PY_LONG_LONG, the gcc 2.96 timing is also hardly affected by my patch in longobject.c.

Even more interestingly, if I change longobject.c like this...

#define SIGCHECK(PyTryBlock) \
	{ \
		if (PyErr_CheckSignals()) { PyTryBlock; } \
	}

the runtime is also hardly affected (times not shown in this posting) even though PyErr_CheckSignals() is executed 100 times more often.

Curious to learn how other compilers deal with longobject.c, I collected timings with this compiler: Compaq C V6.3-028 on Compaq Tru64 UNIX V5.1 (Rev. 732).

$compaq_compiled/original/python -O b_timings.py | grep reporter
b1    19.4492 time
b2     4.9395 time
b3    23.6787 time
b5    28.2568 time
b6    21.6230 time
b7    86.5283 time
all  192.0498 time
$compaq_compiled/ticker/python -O b_timings.py | grep reporter
b1    19.4775 time     30049474 ticks    0.6482 us/tick
b2     4.9971 time       704784 ticks    7.0902 us/tick
b3    24.2559 time     43577404 ticks    0.5566 us/tick
b5    26.3320 time     17420746 ticks    1.5115 us/tick
b6    21.7227 time     60000464 ticks    0.3620 us/tick
b7    86.9395 time      2836154 ticks   30.6540 us/tick
all  190.8936 time    165555937 ticks    1.1530 us/tick

The b0 and b4 timings are missing because of exceptions that I didn't investigate. As with the gcc 3.2.3 compilation, my longobject.c patch hardly makes a difference.

What does all this mean? Does the gcc 2.96 surprise tell us that small rearrangements in longobject.c could potentially boost performance by 45% on some platforms?

Since I am not interested in optimizing longobject.c, this is where I stopped. Eventually I decided that the problematic patch in longobject.c does not help me with my original goal as stated near the beginning of this posting. I am only interested in counting the iterations of the interpreter loop. However, the _Py_Ticker decrements in longobject.c are not inside the interpreter loop, but in C loops! This means _Py_Ticker is useless for my purposes. Therefore I decoupled _Py_Ticker and _Py_TickerAccumulation.
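To illustrate what the decoupled counter measures, assuming the final patch below is applied: a long-integer multiplication is a single interpreter-loop iteration, so it adds only a handful of ticks no matter how long the C loops run:

    import sys
    a = 10L ** 50000
    before = sys.gettickeraccumulation()
    b = a * a    # one BINARY_MULTIPLY; all the work happens in C
    print sys.gettickeraccumulation() - before    # a few ticks, not thousands

This is exactly the property I need for estimating recoding gains.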

BTW: I think strictly speaking this documentation is incorrect because of the _Py_Ticker manipulations in longobject.c:

sys.setcheckinterval.__doc__:
Tell the Python interpreter to check for asynchronous events every
n instructions.

I feel very dizzy now after dealing with all the timings and the nasty details. A lot of fallout from the simple idea of adding one innocent line to ceval.c. I hope it makes sense to somebody!

Cheers, Ralf




diff -r -u original/Include/ceval.h work/Include/ceval.h
--- original/Include/ceval.h	2004-10-10 19:40:35.000000000 -0700
+++ work/Include/ceval.h	2004-11-30 23:27:11.000000000 -0800
@@ -68,6 +68,11 @@
 
 /* this used to be handled on a per-thread basis - now just two globals */
 PyAPI_DATA(volatile int) _Py_Ticker;
+#ifndef HAVE_LONG_LONG
+PyAPI_DATA(volatile long) _Py_TickerAccumulation;
+#else
+PyAPI_DATA(volatile PY_LONG_LONG) _Py_TickerAccumulation;
+#endif
 PyAPI_DATA(int) _Py_CheckInterval;
 
 /* Interface for threads.
diff -r -u original/PC/os2emx/python24.def work/PC/os2emx/python24.def
--- original/PC/os2emx/python24.def	2004-10-10 19:40:50.000000000 -0700
+++ work/PC/os2emx/python24.def	2004-11-30 23:27:11.000000000 -0800
@@ -743,6 +743,7 @@
   "_Py_CheckRecursionLimit"
   "_Py_CheckInterval"
   "_Py_Ticker"
+  "_Py_TickerAccumulation"
 
 ; From python24_s.lib(compile)
   "PyCode_New"
diff -r -u original/Python/ceval.c work/Python/ceval.c
--- original/Python/ceval.c	2004-11-23 10:06:08.000000000 -0800
+++ work/Python/ceval.c	2004-12-03 19:39:36.000000000 -0800
@@ -373,6 +373,7 @@
 	pendinglast = j;
 
 	_Py_Ticker = 0;
@@ -476,6 +477,11 @@
 	   per thread, now just a pair o' globals */
 int _Py_CheckInterval = 100;
 volatile int _Py_Ticker = 100;
+#ifndef HAVE_LONG_LONG
+volatile long _Py_TickerAccumulation = 0;
+#else
+volatile PY_LONG_LONG _Py_TickerAccumulation = 0;
+#endif
 
 PyObject *
 PyEval_EvalCode(PyCodeObject *co, PyObject *globals, PyObject *locals)
@@ -770,6 +776,7 @@
 		   async I/O handler); see Py_AddPendingCall() and
 		   Py_MakePendingCalls() above. */
+		_Py_TickerAccumulation++;
diff -r -u original/Python/sysmodule.c work/Python/sysmodule.c
--- original/Python/sysmodule.c	2004-08-12 11:19:17.000000000 -0700
+++ work/Python/sysmodule.c	2004-12-03 19:36:52.000000000 -0800
@@ -442,6 +442,20 @@
 "getcheckinterval() -> current check interval; see setcheckinterval()."
 );
 
+static PyObject *
+sys_gettickeraccumulation(PyObject *self, PyObject *args)
+{
+#ifndef HAVE_LONG_LONG
+	return PyInt_FromLong(_Py_TickerAccumulation);
+#else
+	return PyLong_FromLongLong(_Py_TickerAccumulation);
+#endif
+}
+
+PyDoc_STRVAR(gettickeraccumulation_doc,
+"gettickeraccumulation() -> current count of bytecodes processed by the interpreter."
+);
+
 #ifdef WITH_TSC
 static PyObject *
 sys_settscdump(PyObject *self, PyObject *args)
@@ -763,6 +777,8 @@
 	 setcheckinterval_doc},
 	{"getcheckinterval",	sys_getcheckinterval, METH_NOARGS,
 	 getcheckinterval_doc},
+	{"gettickeraccumulation", sys_gettickeraccumulation, METH_NOARGS,
+	 gettickeraccumulation_doc},
#ifdef HAVE_DLOPEN
 	{"setdlopenflags", sys_setdlopenflags, METH_VARARGS,
 	 setdlopenflags_doc},


