msg309806 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-11 13:52 |
|
|
|
I've noticed that replacing the for loop in the ins1 function in listobject.c with a memmove when the number of pointers to move is greater than 16 seems to speed up list.insert by about 3 to 4x on a contrived benchmark.

# Before
jeethu@dev:cpython (master)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
200 loops, best of 5: 3.07 msec per loop

# After
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 744 usec per loop

Both builds were configured with --enable-optimizations and --with-lto.
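The change being described is a bulk overlapping copy in place of a pointer-by-pointer shift loop. A rough Python model of that shift, using ctypes.memmove (this is an illustration of the idea only, not the actual C code in listobject.c; the names and values are made up):

```python
import ctypes

# Model of what insertion at index 0 does to the ob_item array: every
# existing pointer is shifted one slot right, then the new element is
# written at slot 0.  Here, C ints stand in for PyObject* pointers.
buf = (ctypes.c_int * 5)(1, 2, 3, 4, 0)   # last slot is free space
n = 4                                      # number of live elements

# The bulk variant: one overlapping copy, like the patch's memmove path.
item_size = ctypes.sizeof(ctypes.c_int)
ctypes.memmove(ctypes.byref(buf, item_size), buf, n * item_size)
buf[0] = 42

assert list(buf) == [42, 1, 2, 3, 4]
```

memmove (unlike memcpy) is defined for overlapping source and destination, which is exactly the situation when shifting an array in place.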
|
|
|
|
|
msg309809 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-01-11 14:17 |
|
|
|
What are the results when the threshold is set to 0 or 1?
|
|
|
|
|
msg309810 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-11 14:40 |
|
|
|
I tried it with a couple of different thresholds, twice each, ignoring the results of the first run. 16 seems to be the sweet spot.

THRESHOLD = 0
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 787 usec per loop

THRESHOLD = 4
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 781 usec per loop

THRESHOLD = 8
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 780 usec per loop

THRESHOLD = 16
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 758 usec per loop

THRESHOLD = 32
jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"
500 loops, best of 5: 764 usec per loop
|
|
|
|
|
msg309812 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2018-01-11 14:59 |
|
|
|
In your benchmarks the difference between thresholds is only 4%. It would not be worth keeping a special case for such a small benefit. But note that in your benchmarks you inserted into a list that grew to up to 50000 elements. The value of the threshold affects only the first few insertions. You need to test inserting into short lists. Recreate the list at every iteration.
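One way to follow this advice with the stock timeit module is to build the list inside the timed statement, so that every insert operates on a short list (a sketch of the methodology, not the exact command used later in this thread):

```python
import timeit

# Recreate l inside the measured statement: each of the 100 inserts now
# targets a list of at most 100 elements, so the small-list/threshold
# behavior is what actually gets measured.
stmt = "l = []\nfor _ in range(100): l.insert(0, None)"
elapsed = timeit.timeit(stmt, number=1000)
assert elapsed > 0.0
```

With the original `-s "l = []"` setup, the list persists across timing loops and keeps growing, so most inserts hit a long list and the threshold is irrelevant.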
|
|
|
|
|
msg309917 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-14 10:57 |
|
|
|
I managed to tune an i7-7700K desktop running Ubuntu 17.10 per this doc[1], and ran the pyperformance benchmarks[2]. I also tried various thresholds with this benchmark, and 16 still seems to be the sweet spot. The geometric mean of the relative changes across all benchmarks indicates a 12.85% improvement.

[1]: http://pyperformance.readthedocs.io/usage.html#how-to-get-stable-benchmarks
[2]: https://gist.github.com/jeethu/4d9a8bd3eecbd067c00f085f7d2e7595
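For reference, a geometric mean of relative changes like the one quoted here can be computed as follows (the speedup factors below are made up for illustration; the real numbers are in the linked gist):

```python
import math

# Hypothetical per-benchmark speedup factors (> 1.0 means faster).
speedups = [1.27, 1.33, 1.05, 0.97, 1.12]

# Geometric mean: the n-th root of the product of the factors.
geo_mean = math.prod(speedups) ** (1.0 / len(speedups))
percent_change = (geo_mean - 1.0) * 100.0
```

The geometric mean is the right aggregate for ratios: an equal speedup and slowdown (e.g. 2.0x and 0.5x) cancel to exactly 1.0, which an arithmetic mean would not do.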
|
|
|
|
|
msg309929 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2018-01-14 18:15 |
|
|
|
The result likely varies quite a bit from compiler to compiler, processor to processor, and OS to OS. It may also be affected by data size and caching.
|
|
|
|
|
msg309983 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2018-01-15 14:16 |
|
|
|
I'm quite surprised that so many benchmarks would be sped up so significantly by a list.insert() optimization (why a 27% speedup on computing digits of pi, or a 33% speedup on an N-body simulation?). Are you sure the two executables are similarly compiled?
|
|
|
|
|
msg310022 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-15 23:17 |
|
|
|
I rebased my branch off of master and rebuilt it, and also rebuilt the baseline from master. Both versions were configured with --with-lto and --enable-optimizations. The benchmark numbers are rather different this time[1]. pidigits is slower, but nbody is still inexplicably faster. Should I run the benchmark without a PGO build (i.e. without --enable-optimizations)? Also, I've attached the output from the pyperformance run to the aforementioned gist.

[1]: https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88
|
|
|
|
|
msg310023 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2018-01-15 23:19 |
|
|
|
Perhaps the patch is interfering weirdly with PGO?

> Should I run the benchmark without a PGO build (i.e. without --enable-optimizations)?

That would help clear things up, IMHO.
|
|
|
|
|
msg310056 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-16 09:09 |
|
|
|
Built and benchmarked both the baseline and the patch without PGO; the differences are less pronounced, but still present. https://gist.github.com/jeethu/abd404e39c6dfcbabb4c01661b9238d1 |
|
|
|
|
|
msg310059 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2018-01-16 09:13 |
|
|
|
Thanks. That's really surprising. I'll give it a try myself. |
|
|
|
|
|
msg310064 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-16 09:30 |
|
|
|
> jeethu@dev:cpython (3.7_list_insert_memmove)$ ./python -m timeit -s "l = []" "for _ in range(100): l.insert(0, None)"

Please don't use timeit, but perf timeit, to run such a *microbenchmark* (time smaller than 1 ms). Your benchmark also measures the performance of the loop, which might be significant in such a short loop (100 items). You may try to unroll the loop manually, and truncate the list:

vstinner@apu$ python3 -c 'print("l.insert(None); " * 3 + "l.clear();")'
l.insert(None); l.insert(None); l.insert(None); l.clear();

Example:

python3 -m perf timeit -s 'l=[]' $(python3 -c 'print("l.insert(0, None); " * 100 + "l.clear();")') --duplicate 100

Jeethu Rao: Are you using CPU isolation? What is your OS?
|
|
|
|
|
msg310094 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2018-01-16 17:05 |
|
|
|
Ok, I ran the benchmarks here (Ubuntu 16.04, Core i5-2500K, PGO and LTO disabled) and I don't get any consistent speedup, which is more in line with what I was expecting: https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a |
|
|
|
|
|
msg310120 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 00:07 |
|
|
|
Victor: I’m booting with the isolcpus and rcu_nocbs flags, and running pyperformance with the --affinity flag to pin the benchmark to the isolated CPU cores. I’ve also run `perf system tune`. And the OS is Ubuntu 17.10. Thanks for the tip on using perf timeit instead of timeit. I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR) and attached the output in a gist[1].

Antoine: Thanks for benchmarking it. After looking at the generated assembly[2], I found that ins1 is being inlined and the call to memmove was appearing before the loop (possibly because the compiler assumes that the call to memmove is more likely). I made a minor change and increased the threshold to 32. I’ve attached the generated assembly in a gist[3] (the relevant sequence is around line 8406, if you’re interested). And here’s the pyperformance comparison[4]. Could you please try benchmarking this version on your machine?

[1]: https://gist.github.com/jeethu/2d2de55afdb8ea4ad03b6a5d04d5227f
[2]: Generated with "gcc -DNDEBUG -fwrapv -O3 -std=c99 -I. -I./Include -DPy_BUILD_CORE -S -masm=intel Objects/listobject.c"
[3]: https://gist.github.com/jeethu/596bfc1251590bc51cc230046b52fb38
[4]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
|
|
|
|
|
msg310121 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-17 00:28 |
|
|
|
> I’ve run the benchmark that you've suggested with a minor change (to avoid the cost of LOAD_ATTR)

Be careful. Moving the "l.insert" lookup out of the loop might make the code slower. I never looked into why. But Python 3.7 was also optimized in many places for calling methods, so I'm not sure anymore :)
|
|
|
|
|
msg310122 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 00:56 |
|
|
|
> Be careful. Moving the "l.insert" lookup out of the loop might make the code slower.

Thanks again! Here's a gist without the hack[1].

[1]: https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470
|
|
|
|
|
msg310137 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2018-01-17 04:38 |
|
|
|
FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler. There was also variance between timings on 32-bit builds versus 64-bit builds. |
|
|
|
|
|
msg310146 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-17 09:45 |
|
|
|
https://gist.github.com/jeethu/19430d802aa08e28d1cb5eb20a47a470

Mean +- std dev: 10.5 us +- 1.4 us => Mean +- std dev: 9.68 us +- 0.89 us

It's 1.08x faster (-7.8%). That's small for a microbenchmark; usually an optimization should make a *microbenchmark* at least 10% faster.

For this optimization, I have no strong opinion. Using memmove() for a large copy is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

Previously, Python had a Py_MEMCPY() macro which also had such a threshold. Basically, it was a workaround for compiler performance issues:

#if defined(_MSC_VER)
#define Py_MEMCPY(target, source, length) do {                          \
        size_t i_, n_ = (length);                                       \
        char *t_ = (void*) (target);                                    \
        const char *s_ = (void*) (source);                              \
        if (n_ >= 16)                                                   \
            memcpy(t_, s_, n_);                                         \
        else                                                            \
            for (i_ = 0; i_ < n_; i_++)                                 \
                t_[i_] = s_[i_];                                        \
    } while (0)
#else
#define Py_MEMCPY memcpy
#endif

Fortunately, the macro has now been reduced to:

/* Py_MEMCPY is kept for backwards compatibility,
 * see https://bugs.python.org/issue28126 */
#define Py_MEMCPY memcpy

And it's no longer used. I recall a performance issue with GCC's memcmp() builtin function (replacing the libc function during compilation): https://bugs.python.org/issue17628#msg186012

See also:
* https://bugs.python.org/issue13134
* https://bugs.python.org/issue29782
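The threshold idea in Py_MEMCPY and in this patch is the same pattern: dispatch to a plain element-wise loop for tiny copies (where the call overhead dominates) and to a bulk copy for larger ones. A Python-level sketch of that dispatch, purely illustrative (the threshold value and function name are made up; the real decision happens in C):

```python
THRESHOLD = 16  # same spirit as INS1_MEMMOVE_THRESHOLD; value is illustrative

def shift_right(items, n):
    """Shift the first n elements of `items` one slot right (items[n] is free)."""
    if n <= THRESHOLD:
        # Small copy: a simple backwards loop, cheap to set up.
        for i in range(n, 0, -1):
            items[i] = items[i - 1]
    else:
        # Large copy: one bulk slice assignment (memmove-like in CPython).
        items[1:n + 1] = items[0:n]

data = [1, 2, 3, 4, None]
shift_right(data, 4)      # takes the small-copy path
data[0] = 0
assert data == [0, 1, 2, 3, 4]
```

Both branches compute the same result; the question debated in this thread is only where the crossover between them should sit, and whether it is worth having two branches at all.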
|
|
|
|
|
msg310160 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 13:36 |
|
|
|
> FWIW, we've encountered a number of situations in the past when something that improved the timings on one compiler would make timings worse on another compiler. There was also variance between timings on 32-bit builds versus 64-bit builds.

I've verified that both clang and gcc generate similar assembly (memcpy is not inlined and the loop is not vectorized) in 32-bit mode. I'd wager that the improvement from vectorization (in memmove) would be even more pronounced on 32-bit systems, given that pointers are half the size and cache lines are still 64 bytes wide.

> It's 1.08x faster (-7.8%). It's small for a microbenchmark, usually an optimization should make a *microbenchmark* at least 10% faster.

That's true if we assume lists have 100 or fewer elements. On the other hand, in the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x on the first seven benchmarks, and the worst slowdown is only 1.03x. Albeit, the improvement cannot be better than a constant factor, given the vectorized loop in memmove.

> Using memmove() for large copy is a good idea. The main question is the "if (n <= INS1_MEMMOVE_THRESHOLD)" test. Is it slower if we always call memmove()?

The overhead of calling memmove makes it slower for small lists. That's how I arrived at this patch in the first place: I tried replacing the loop with an unconditional memmove() and it was slower on pyperformance, while switching to memmove() only above a threshold was faster.

> Previously, Python had a Py_MEMCPY() macro which also had such a threshold. Basically, it's a workaround for compiler performance issues:

That's very interesting! I think it boils down to the pointer aliasing problem. The pointers in memcpy()'s signature have the `restrict` qualifier, which gives the compiler more leeway to optimize calls to memcpy, while the compiler has to be more conservative with memmove(). I wonder if it's worth trying out a Py_MEMMOVE() :)

[1]: https://gist.github.com/jeethu/d6e4045f7932136548a806380eddd030
|
|
|
|
|
msg310161 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2018-01-17 13:37 |
|
|
|
On 17/01/2018 at 14:36, Jeethu Rao wrote:
> On the other hand, in the pyperformance comparison I'd posted yesterday[1], there seems to be an average improvement of 1.27x on the first seven benchmarks, and the worst slowdown is only 1.03x.

I still think those numbers are misleading or downright bogus. There is no existing proof that list.insert() is a critical path in those benchmarks.
|
|
|
|
|
msg310163 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-17 13:43 |
|
|
|
> I still think those numbers are misleading or downright bogus. There is no existing proof that list.insert() is a critical path in those benchmarks.

Can someone check whether these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?
|
|
|
|
|
msg310184 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 15:07 |
|
|
|
> Can someone check whether these benchmarks really use list.insert() in hot code? If yes, why? :-) The cost of list.insert() is known, no?

Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294
|
|
|
|
|
msg310187 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-17 15:24 |
|
|
|
> Sure! https://gist.github.com/jeethu/000a2d3ecd9033c0ef51331f062ac294

I don't understand how to read this output:

"54640 ins1 called 40050 times"

What is 54640? I'm interested to know which benchmarks call list.insert() 40k times.
|
|
|
|
|
msg310189 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 15:29 |
|
|
|
> What is 54640?

That's the pid of the process.

> I'm interested to know which benchmarks call list.insert() 40k times.

The django_template benchmark.
|
|
|
|
|
msg310193 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-01-17 15:39 |
|
|
|
In https://gist.github.com/jeethu/dc0811d415dd6d1a1621761e43842f88 I read:

| django_template | 160 ms | 129 ms | 1.24x faster | Significant (t=15.05) |

So on this benchmark, the optimization seems significant. But in the same paste, I see other benchmarks that are 1.2x faster even though they don't abuse list.insert(), and maybe don't use list.insert() at all. Maybe the Django template engine should be fixed to use a deque instead ;-)
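The deque aside is sound in general: for workloads that repeatedly prepend, collections.deque avoids the O(n) shift entirely. A quick illustration (not taken from Django's code):

```python
from collections import deque

# list.insert(0, x) shifts every existing element on each call;
# deque.appendleft(x) is O(1) per call.
d = deque()
for i in range(100):
    d.appendleft(i)

l = []
for i in range(100):
    l.insert(0, i)

# Both containers end up in the same order.
assert list(d) == l == list(range(99, -1, -1))
```

Building a 100-element sequence by prepending costs ~100 appendleft steps with a deque versus ~100 inserts each shifting up to 99 pointers with a list, which is exactly the cost this patch tries to shave.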
msg310195 - (view) |
Author: Jeethu Rao (jeethu) * |
Date: 2018-01-17 15:50 |
|
|
|
It's also interesting that in https://gist.github.com/pitrou/29eb7592fa1eae2be390f3bfa3db0a3a :

| django_template | 307 ms | 312 ms | 1.02x slower | Not significant |

It seems to be slower, and the benchmarks before it (2to3, chameleon, chaos, crypto_pyaes, deltablue), which we now know barely use list.insert, are slightly faster :) ¯\_(ツ)_/¯
msg316647 - (view) |
Author: Stéphane Wirtel (matrixise) *  |
Date: 2018-05-15 12:19 |
|
|
|
Hi, just a small reminder about this issue, because I was reviewing the PR. What is the status? Thanks
|
|
|
|
|
msg317075 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2018-05-19 00:11 |
|
|
|
This issue is a micro-optimization which is only 1.08x faster: https://bugs.python.org/issue32534#msg310146

Moreover, it seems really hard to measure the benefit precisely on benchmarks. The results do not seem to be reliable. I suggest closing the issue as WONTFIX, as I cannot see any reliable speedup nor any significant impact on performance.
|
|
|
|
|