msg287112 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-02-06 09:41 |
Attached patch is an attempt to use the GCC/Clang __builtin_expect() builtin to provide the compiler with branch prediction information. It adds _Py_likely() and _Py_unlikely() macros, similar to the Linux kernel macros. I have always wanted to experiment with this change. I opened this issue to collect information and benchmark results before taking a decision. I doubt that the change will really make Python faster; there is a risk of modifying a lot of code for no value, or worse, making Python slower. Extract of the GCC doc: "as programmers are notoriously bad at predicting how their programs actually perform."

My patch marks error cases as unlikely. A profiler like Linux perf should be used to check which branches are less likely, but I used my guesses to experiment with a first change. Since there is a risk that using such macros makes the code slower, or has no effect, I tried to restrict the changes to hot code and the most common functions/macros:

* Py_EnterRecursiveCall()
* _PyArg_NoKeywords()
* Py_DECREF()... not sure about this case: is it really more likely that refcnt != 0 than refcnt == 0?
* Functions used to call functions: _PyObject_FastCallKeywords(), fast_function(), etc.
* etc.

I'm not sure about the methodology to benchmark such a patch either. Is it worth it to benchmark a Python compiled without PGO? By the way, do GCC/Clang use __builtin_expect() information when Python is compiled with PGO? If PGO ignores this information, it would be better not to use it, since I now strongly recommend using PGO; otherwise, Python performance is too random because of the joys of code placement.

Some links on __builtin_expect():

* http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html
* http://stackoverflow.com/questions/109710/likely-unlikely-macros-in-the-linux-kernel-how-do-they-work-whats-their
* https://news.ycombinator.com/item?id=9411227
|
|
msg287113 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-02-06 09:45 |
Copy of a comment from issue #29460 by Serhiy Storchaka (serhiy.storchaka):

> The next question might be: would it be worth it to try using the unlikely()
> macro on the (kwargs == NULL) test in the macro? ;-)

Perhaps this may help in some critical tight loops (in implementations of dict, string operations or long integer arithmetic), but I doubt it can have any measurable effect when used once per call of a function that calls PyArg_Parse* and does a lot of other work.
|
|
msg287115 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-02-06 10:01 |
http://blog.man7.org/2012/10/how-much-do-builtinexpect-likely-and.html

Using __builtin_expect() in a very long loop of 10^9 iterations (1,000,000,000) makes the loop 15% faster (2.67 sec => 2.28 sec), *but* using PGO avoids the need for using __builtin_expect() explicitly and makes the code 27% faster (2.67 sec => 1.95 sec):

"This optimized version runs significantly faster (1.95 versus 2.28 seconds) than our version that used __builtin_expect(). This is because, in addition to the branching in the if statement, the branching in the for loops was also optimized."

The article also confirms that if __builtin_expect() is misused, it makes the code 5% slower (2.67 sec => 2.79 sec).

--

Another story related to micro-optimization in the Linux kernel. The kernel used explicit prefetch in some tiny loops. After many benchmarks, it was concluded that letting the developer use prefetch makes the code slower, and so the micro-optimization was removed: https://lwn.net/Articles/444336/

"So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded."
|
|
msg287118 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-02-06 10:40 |
Benchmarks with Python compiled by gcc -O3 (without PGO).

haypo@smithers$ ./python -m perf timeit 'len("abc")' --duplicate=1000 --compare-to=../default-ref/python
Median +- std dev: [ref] 40.4 ns +- 0.8 ns -> [likely] 40.8 ns +- 2.1 ns: 1.01x slower (+1%)

haypo@smithers$ ./python -m perf timeit 'sum(())' --duplicate=1000 --compare-to=../default-ref/python -p3
Median +- std dev: [ref] 86.4 ns +- 2.8 ns -> [likely] 86.3 ns +- 0.3 ns: 1.00x faster (-0%)
Not significant!

haypo@smithers$ ./python -m perf timeit -s 's=list("abc")' 'sorted(s)' --duplicate=100 --compare-to=../default-ref/python -p3
Median +- std dev: [ref] 224 ns +- 3 ns -> [likely] 222 ns +- 1 ns: 1.01x faster (-1%)
|
|
msg287144 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2017-02-06 15:51 |
I had read that likely/unlikely were hard to use well and often not worth the trouble. IIRC, they are used extensively in the Linux kernel but people were finding zero measurable effect when enabling or disabling the constructs. |
|
|
msg287145 - (view) |
Author: Christian Heimes (christian.heimes) *  |
Date: 2017-02-06 15:58 |
I would expect that PGO/LTO builds give better improvements with less code bloat. Did you compare a PGO likely/unlikely build against a standard PGO build?
|
|
msg287148 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-02-06 16:04 |
I also always wanted to experiment with this change. But I was afraid that providing such macros could encourage overusing them. I don't want to wrap every test for NULL or -1 in these macros. If we expect some benefit from using these macros, I would play with them in hot loops. For example in the dict lookup function (it is unlikely that comparing keys raises an exception or that the dict is modified during the search), in codecs (it is unlikely that an illegal sequence occurs), etc.
|
|
msg297094 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-28 00:59 |
So, I played with likely/unlikely and... nothing, nothing interesting. Please use PGO compilation ;-) |
|
|