Issue 28618: Decorate hot functions using attribute((hot)) to optimize Python
Created on 2016-11-05 00:29 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (34)
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 00:29
When analyzing the results of Python performance benchmarks, I noticed that call_method was 70% slower (!) between revisions 83877018ef97 (Oct 18) and 3e073e7b4460 (Oct 22), inclusive, on the speed-python server.
On these revisions, the CPU L1 instruction cache is less efficient: 8% cache misses, whereas it was only 0.06% before and after these revisions.
Since the two mentioned revisions have no obvious impact on the call_method() benchmark, I understand that the performance difference is caused by a different layout of the machine code, maybe the exact location of functions.
IMO the best solution to such a compilation issue is to use PGO compilation. Problem: PGO doesn't work on Ubuntu 14.04, the OS used by speed-python (the server running benchmarks for http://speed.python.org/).
I propose to manually decorate the "hot" functions using the GCC attribute((hot)) attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes (search for "hot")
Attached patch adds Py_HOT_FUNCTION and decorates the following functions:
- _PyEval_EvalFrameDefault()
- PyFrame_New()
- call_function()
- lookdict_unicode_nodummy()
- _PyFunction_FastCall()
- frame_dealloc()
These functions are the top 6 according to the Linux perf tool when running the call_simple benchmark of the performance project:
32.66%: _PyEval_EvalFrameDefault
13.09%: PyFrame_New
12.78%: call_function
12.24%: lookdict_unicode_nodummy
9.85%: _PyFunction_FastCall
8.47%: frame_dealloc
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 09:07
I ran benchmarks. Globally, the impact of the patch seems positive. regex_v8 and call_simple are slower, but these are microbenchmarks impacted by low-level details like the CPU L1 cache. Well, my patch was supposed to optimize CPython for call_simple :-/ I should maybe investigate a little bit more.
Performance comparison (performance 0.3.2):
haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G
Slower (6):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
- call_simple: 12.6 ms +- 0.2 ms -> 13.2 ms +- 1.3 ms: 1.05x slower
- regex_effbot: 4.58 ms +- 0.07 ms -> 4.70 ms +- 0.05 ms: 1.03x slower
- sympy_integrate: 43.4 ms +- 0.3 ms -> 44.0 ms +- 0.2 ms: 1.01x slower
- nqueens: 239 ms +- 2 ms -> 241 ms +- 1 ms: 1.01x slower
- scimark_fft: 674 ms +- 12 ms -> 680 ms +- 75 ms: 1.01x slower
Faster (32):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
- scimark_sor: 488 ms +- 27 ms -> 467 ms +- 10 ms: 1.05x faster
- sqlite_synth: 9.16 us +- 1.03 us -> 8.82 us +- 0.23 us: 1.04x faster
- scimark_lu: 485 ms +- 20 ms -> 469 ms +- 14 ms: 1.03x faster
- xml_etree_process: 226 ms +- 30 ms -> 219 ms +- 4 ms: 1.03x faster
- logging_simple: 29.7 us +- 0.4 us -> 28.9 us +- 0.3 us: 1.03x faster
- pickle_list: 7.99 us +- 0.88 us -> 7.78 us +- 0.05 us: 1.03x faster
- raytrace: 1.26 sec +- 0.08 sec -> 1.23 sec +- 0.01 sec: 1.03x faster
- sympy_expand: 995 ms +- 31 ms -> 971 ms +- 35 ms: 1.03x faster
- deltablue: 17.0 ms +- 0.1 ms -> 16.6 ms +- 0.2 ms: 1.02x faster
- call_method_slots: 16.0 ms +- 0.1 ms -> 15.6 ms +- 0.2 ms: 1.02x faster
- fannkuch: 983 ms +- 12 ms -> 962 ms +- 29 ms: 1.02x faster
- pickle_pure_python: 1.25 ms +- 0.14 ms -> 1.22 ms +- 0.01 ms: 1.02x faster
- logging_format: 34.0 us +- 0.3 us -> 33.4 us +- 1.5 us: 1.02x faster
- xml_etree_parse: 274 ms +- 9 ms -> 270 ms +- 5 ms: 1.02x faster
- sympy_str: 441 ms +- 3 ms -> 433 ms +- 3 ms: 1.02x faster
- genshi_text: 87.6 ms +- 9.2 ms -> 86.0 ms +- 1.4 ms: 1.02x faster
- genshi_xml: 187 ms +- 17 ms -> 184 ms +- 1 ms: 1.02x faster
- django_template: 376 ms +- 4 ms -> 370 ms +- 2 ms: 1.02x faster
- json_dumps: 27.1 ms +- 0.4 ms -> 26.7 ms +- 0.4 ms: 1.02x faster
- sqlalchemy_declarative: 295 ms +- 3 ms -> 291 ms +- 3 ms: 1.01x faster
- call_method_unknown: 18.1 ms +- 0.1 ms -> 17.8 ms +- 0.1 ms: 1.01x faster
- nbody: 218 ms +- 4 ms -> 216 ms +- 2 ms: 1.01x faster
- regex_dna: 250 ms +- 24 ms -> 247 ms +- 2 ms: 1.01x faster
- go: 573 ms +- 2 ms -> 566 ms +- 3 ms: 1.01x faster
- richards: 173 ms +- 4 ms -> 171 ms +- 4 ms: 1.01x faster
- python_startup: 24.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.00x faster
- regex_compile: 404 ms +- 6 ms -> 403 ms +- 5 ms: 1.00x faster
- dulwich_log: 143 ms +- 11 ms -> 143 ms +- 1 ms: 1.00x faster
- pidigits: 290 ms +- 1 ms -> 289 ms +- 0 ms: 1.00x faster
- pickle_dict: 58.3 us +- 6.5 us -> 58.3 us +- 0.7 us: 1.00x faster
Benchmark hidden because not significant (26): 2to3, call_method, chaos, crypto_pyaes, float, hexiom, html5lib, json_loads, logging_silent, mako, meteor_contest, pathlib, pickle, python_startup_no_site, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse
--
More readable output, only display differences >= 5%:
haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G --min-speed=5
Slower (1):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
Faster (2):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
Benchmark hidden because not significant (61): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 09:08
Oh, I forgot to mention that I compiled Python with "./configure -C". The purpose of the patch is to optimize Python when LTO and/or PGO compilation are not explicitly used.
Author: Antoine Pitrou (pitrou) *
Date: 2016-11-05 09:59
Can you compare against a PGO build? Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Overall I think this manual approach is really the wrong way to look at it. Compilers can do better than us.
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 15:37
Antoine Pitrou added the comment:
Can you compare against a PGO build?
Do you mean comparison between current Python with PGO and patched Python without PGO?
The hot attribute is ignored by GCC when PGO compilation is used.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO are used. At least, it's common to quickly build Python using "./configure && make" for a quick benchmark.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-05 16:14
Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO is used.
Seconded.
Author: Antoine Pitrou (pitrou) *
Date: 2016-11-05 20:02
On 05/11/2016 at 16:37, STINNER Victor wrote:
Antoine Pitrou added the comment:
Can you compare against a PGO build?
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Why isn't the OS updated on that machine?
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 22:53
Antoine Pitrou added the comment:
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Oh ok, sure. I will try to run these 2 benchmarks.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Why isn't the OS updated on that machine?
I am not sure that I want to use PGO compilation to run benchmarks. Last time I checked, I noticed performance differences between two compilations. PGO compilation doesn't seem 100% deterministic.
Maybe PGO compilation is fine when you build Python to create a Linux package. But to get reliable benchmarks, I'm not sure that it's a good idea.
I'm still running benchmarks on Python recompiled many times using different compiler options, to measure the impact of the compiler options (especially LTO and/or PGO) on the benchmark stability.
Author: STINNER Victor (vstinner) *
Date: 2016-11-08 21:09
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Ok, here you go. As expected, PGO compilation is faster than default compilation with my patch. PGO applies more optimizations than just attribute((hot)); for example, it also optimizes branches.
haypo@smithers$ python3 -m perf compare_to pgo.json.gz patch.json.gz -G --min-speed=5
Slower (56):
- regex_effbot: 4.30 ms +- 0.26 ms -> 5.77 ms +- 0.33 ms: 1.34x slower
- telco: 16.0 ms +- 1.1 ms -> 20.6 ms +- 0.4 ms: 1.29x slower
- xml_etree_process: 174 ms +- 15 ms -> 218 ms +- 29 ms: 1.25x slower
- xml_etree_generate: 205 ms +- 16 ms -> 254 ms +- 4 ms: 1.24x slower
- unpickle_list: 6.04 us +- 1.12 us -> 7.47 us +- 0.18 us: 1.24x slower
- call_simple: 10.6 ms +- 1.4 ms -> 13.1 ms +- 0.3 ms: 1.24x slower
- mako: 33.5 ms +- 0.3 ms -> 41.3 ms +- 0.9 ms: 1.23x slower
- pathlib: 37.0 ms +- 2.3 ms -> 44.7 ms +- 2.0 ms: 1.21x slower
- sqlite_synth: 7.56 us +- 0.20 us -> 8.97 us +- 0.18 us: 1.19x slower
- unpickle: 24.2 us +- 3.9 us -> 28.7 us +- 0.3 us: 1.18x slower
- chameleon: 23.4 ms +- 2.6 ms -> 27.4 ms +- 1.5 ms: 1.17x slower
- spectral_norm: 214 ms +- 7 ms -> 249 ms +- 9 ms: 1.17x slower
- nqueens: 210 ms +- 2 ms -> 244 ms +- 36 ms: 1.16x slower
- unpickle_pure_python: 717 us +- 10 us -> 831 us +- 66 us: 1.16x slower
- pickle: 18.7 us +- 4.3 us -> 21.6 us +- 3.3 us: 1.15x slower
- sympy_expand: 829 ms +- 39 ms -> 957 ms +- 28 ms: 1.15x slower
- genshi_text: 73.1 ms +- 3.2 ms -> 84.3 ms +- 1.1 ms: 1.15x slower
- pickle_list: 6.82 us +- 0.20 us -> 7.86 us +- 0.05 us: 1.15x slower
- sympy_str: 372 ms +- 28 ms -> 428 ms +- 3 ms: 1.15x slower
- xml_etree_parse: 231 ms +- 7 ms -> 266 ms +- 9 ms: 1.15x slower
- call_method_slots: 14.0 ms +- 1.3 ms -> 16.1 ms +- 1.2 ms: 1.15x slower
- sympy_sum: 169 ms +- 6 ms -> 194 ms +- 19 ms: 1.15x slower
- logging_format: 29.3 us +- 2.5 us -> 33.7 us +- 1.6 us: 1.15x slower
- logging_simple: 25.7 us +- 2.1 us -> 29.3 us +- 0.4 us: 1.14x slower
- genshi_xml: 159 ms +- 15 ms -> 182 ms +- 1 ms: 1.14x slower
- xml_etree_iterparse: 178 ms +- 3 ms -> 203 ms +- 5 ms: 1.14x slower
- pickle_pure_python: 1.06 ms +- 0.17 ms -> 1.21 ms +- 0.16 ms: 1.14x slower
- logging_silent: 618 ns +- 11 ns -> 705 ns +- 62 ns: 1.14x slower
- hexiom: 19.0 ms +- 0.2 ms -> 21.7 ms +- 0.2 ms: 1.14x slower
- html5lib: 184 ms +- 11 ms -> 209 ms +- 31 ms: 1.14x slower
- call_method: 14.3 ms +- 0.7 ms -> 16.3 ms +- 0.1 ms: 1.14x slower
- django_template: 324 ms +- 18 ms -> 368 ms +- 3 ms: 1.14x slower
- sympy_integrate: 37.9 ms +- 0.3 ms -> 43.0 ms +- 2.7 ms: 1.13x slower
- deltablue: 15.0 ms +- 2.0 ms -> 16.9 ms +- 1.0 ms: 1.12x slower
- call_method_unknown: 16.0 ms +- 0.4 ms -> 17.9 ms +- 0.2 ms: 1.12x slower
- 2to3: 611 ms +- 12 ms -> 677 ms +- 57 ms: 1.11x slower
- regex_compile: 300 ms +- 3 ms -> 332 ms +- 21 ms: 1.11x slower
- json_loads: 50.5 us +- 2.5 us -> 55.8 us +- 1.2 us: 1.10x slower
- unpack_sequence: 111 ns +- 5 ns -> 122 ns +- 1 ns: 1.10x slower
- pickle_dict: 53.2 us +- 3.7 us -> 58.1 us +- 3.7 us: 1.09x slower
- scimark_sor: 420 ms +- 60 ms -> 458 ms +- 12 ms: 1.09x slower
- scimark_lu: 398 ms +- 74 ms -> 434 ms +- 18 ms: 1.09x slower
- regex_dna: 227 ms +- 1 ms -> 247 ms +- 9 ms: 1.09x slower
- pidigits: 266 ms +- 33 ms -> 290 ms +- 10 ms: 1.09x slower
- chaos: 243 ms +- 2 ms -> 265 ms +- 3 ms: 1.09x slower
- crypto_pyaes: 197 ms +- 16 ms -> 215 ms +- 28 ms: 1.09x slower
- dulwich_log: 129 ms +- 15 ms -> 140 ms +- 8 ms: 1.08x slower
- sqlalchemy_imperative: 50.8 ms +- 0.9 ms -> 55.0 ms +- 1.8 ms: 1.08x slower
- meteor_contest: 173 ms +- 22 ms -> 187 ms +- 5 ms: 1.08x slower
- sqlalchemy_declarative: 268 ms +- 11 ms -> 290 ms +- 3 ms: 1.08x slower
- tornado_http: 335 ms +- 4 ms -> 361 ms +- 3 ms: 1.08x slower
- python_startup: 20.6 ms +- 0.6 ms -> 22.1 ms +- 0.9 ms: 1.08x slower
- python_startup_no_site: 8.37 ms +- 0.08 ms -> 9.00 ms +- 0.07 ms: 1.08x slower
- go: 518 ms +- 36 ms -> 557 ms +- 39 ms: 1.07x slower
- raytrace: 1.14 sec +- 0.08 sec -> 1.22 sec +- 0.02 sec: 1.07x slower
- scimark_fft: 594 ms +- 29 ms -> 627 ms +- 13 ms: 1.06x slower
Benchmark hidden because not significant (8): fannkuch, float, json_dumps, nbody, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult
Author: Roundup Robot (python-dev)
Date: 2016-11-11 01:14
New changeset 59b91b4e9506 by Victor Stinner in branch 'default': Issue #28618: Make hot functions using attribute((hot)) https://hg.python.org/cpython/rev/59b91b4e9506
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 01:49
I tried different patches and ran many quick & dirty benchmarks.
I tried to use likely/unlikely macros (using GCC's __builtin_expect): the effect is not significant on the call_simple microbenchmark. I gave up on this part.
attribute((hot)) on a few Python core functions fixes the major slowdown on call_method on the revision 83877018ef97 (described in the initial message).
I noticed tiny differences when using attribute((hot)), a speedup in most cases. I sometimes noticed a slowdown, but a very small one (ex: 1%, but 1% on a microbenchmark doesn't mean anything).
I pushed my patch to try to keep stable performance when Python is not compiled with PGO.
If you would like to know more about the crazy effect of code placement on modern Intel CPUs, I suggest looking at the slides of this recent talk by an Intel engineer: https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86 "Causes of Performance Swings Due to Code Placement in IA by Zia Ansari (Intel), November 2016"
--
About PGO or not PGO: this question is not simple, so I suggest discussing it somewhere else to not flood this issue ;-)
For my use case, I'm not convinced yet that PGO with our current build system produces reliable performance.
Not all Linux distributions compile Python using PGO: Fedora and RHEL don't, for example. Bugzilla for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=613045
I guess that there are also some developers running benchmarks on Python compiled with "./configure && make". I'm trying to enhance the documentation and tools around Python benchmarks to advise developers to use LTO and/or PGO.
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 09:10
Final result on speed-python:
haypo@speed-python$ python3 -m perf compare_to json_8nov/2016-11-10_15-39-default-8ebaa546a033.json 2016-11-11_02-13-default-59b91b4e9506.json -G
Slower (12):
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower
- nbody: 244 ms +- 2 ms -> 252 ms +- 4 ms: 1.03x slower
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
- fannkuch: 1.07 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.01x slower
- scimark_lu: 502 ms +- 19 ms -> 509 ms +- 12 ms: 1.01x slower
- chaos: 302 ms +- 3 ms -> 305 ms +- 3 ms: 1.01x slower
- xml_etree_iterparse: 224 ms +- 3 ms -> 226 ms +- 6 ms: 1.01x slower
- regex_dna: 299 ms +- 1 ms -> 300 ms +- 1 ms: 1.00x slower
- pickle_list: 9.21 us +- 0.33 us -> 9.24 us +- 0.56 us: 1.00x slower
- crypto_pyaes: 245 ms +- 1 ms -> 246 ms +- 2 ms: 1.00x slower
- meteor_contest: 219 ms +- 1 ms -> 219 ms +- 1 ms: 1.00x slower
- unpack_sequence: 128 ns +- 2 ns -> 128 ns +- 0 ns: 1.00x slower
Faster (39):
- logging_silent: 997 ns +- 40 ns -> 803 ns +- 13 ns: 1.24x faster
- regex_effbot: 6.16 ms +- 0.24 ms -> 5.17 ms +- 0.27 ms: 1.19x faster
- mako: 45.9 ms +- 0.7 ms -> 42.9 ms +- 0.6 ms: 1.07x faster
- xml_etree_process: 253 ms +- 4 ms -> 237 ms +- 4 ms: 1.07x faster
- call_simple: 13.9 ms +- 0.3 ms -> 13.1 ms +- 0.4 ms: 1.06x faster
- spectral_norm: 274 ms +- 2 ms -> 260 ms +- 2 ms: 1.05x faster
- xml_etree_generate: 300 ms +- 4 ms -> 285 ms +- 5 ms: 1.05x faster
- call_method_slots: 17.1 ms +- 0.2 ms -> 16.2 ms +- 0.3 ms: 1.05x faster
- telco: 21.8 ms +- 0.5 ms -> 20.7 ms +- 0.3 ms: 1.05x faster
- call_method: 17.3 ms +- 0.3 ms -> 16.5 ms +- 0.2 ms: 1.05x faster
- pickle_pure_python: 1.42 ms +- 0.02 ms -> 1.36 ms +- 0.03 ms: 1.04x faster
- pathlib: 51.9 ms +- 0.8 ms -> 50.6 ms +- 0.4 ms: 1.03x faster
- xml_etree_parse: 295 ms +- 8 ms -> 287 ms +- 7 ms: 1.03x faster
- chameleon: 31.0 ms +- 0.3 ms -> 30.2 ms +- 0.2 ms: 1.03x faster
- deltablue: 19.3 ms +- 0.3 ms -> 18.8 ms +- 0.2 ms: 1.02x faster
- django_template: 484 ms +- 4 ms -> 472 ms +- 5 ms: 1.02x faster
- call_method_unknown: 18.7 ms +- 0.2 ms -> 18.3 ms +- 0.2 ms: 1.02x faster
- html5lib: 261 ms +- 6 ms -> 256 ms +- 6 ms: 1.02x faster
- unpickle_pure_python: 973 us +- 12 us -> 954 us +- 15 us: 1.02x faster
- regex_v8: 47.6 ms +- 0.8 ms -> 46.7 ms +- 0.4 ms: 1.02x faster
- richards: 202 ms +- 4 ms -> 198 ms +- 5 ms: 1.02x faster
- logging_simple: 37.8 us +- 0.6 us -> 37.1 us +- 0.4 us: 1.02x faster
- sympy_integrate: 50.8 ms +- 0.9 ms -> 49.9 ms +- 1.4 ms: 1.02x faster
- dulwich_log: 189 ms +- 2 ms -> 186 ms +- 1 ms: 1.02x faster
- sqlalchemy_declarative: 343 ms +- 3 ms -> 339 ms +- 3 ms: 1.01x faster
- hexiom: 25.0 ms +- 0.1 ms -> 24.7 ms +- 0.1 ms: 1.01x faster
- logging_format: 44.6 us +- 0.6 us -> 44.1 us +- 0.6 us: 1.01x faster
- 2to3: 787 ms +- 4 ms -> 777 ms +- 4 ms: 1.01x faster
- tornado_http: 440 ms +- 4 ms -> 435 ms +- 4 ms: 1.01x faster
- json_dumps: 30.7 ms +- 0.4 ms -> 30.5 ms +- 0.3 ms: 1.01x faster
- go: 637 ms +- 10 ms -> 632 ms +- 8 ms: 1.01x faster
- regex_compile: 397 ms +- 2 ms -> 394 ms +- 3 ms: 1.01x faster
- nqueens: 266 ms +- 2 ms -> 264 ms +- 2 ms: 1.01x faster
- python_startup: 16.8 ms +- 0.0 ms -> 16.7 ms +- 0.0 ms: 1.01x faster
- python_startup_no_site: 9.91 ms +- 0.01 ms -> 9.86 ms +- 0.01 ms: 1.01x faster
- scimark_sor: 513 ms +- 13 ms -> 510 ms +- 8 ms: 1.01x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.40 sec +- 0.02 sec: 1.00x faster
- genshi_text: 95.2 ms +- 1.1 ms -> 94.7 ms +- 0.8 ms: 1.00x faster
- sympy_str: 529 ms +- 5 ms -> 528 ms +- 4 ms: 1.00x faster
Benchmark hidden because not significant (13): float, genshi_xml, pickle, pickle_dict, pidigits, scimark_fft, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_sum, unpickle, unpickle_list
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 19:52
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
Hum, sadly this benchmark is still unstable after my change 59b91b4e9506 ("Make hot functions using attribute((hot))"; oops, I wanted to write Mark, not Make :-/).
This benchmark has been around 63.4 us for many months, whereas it reached 72.9 us at rev 59b91b4e9506, and the new run (also using the hot attribute) went back to 63.0 us...
I understand that json_loads depends on the code placement of some other functions which are not currently marked with the hot attribute.
https://speed.python.org/timeline/#/?exe=4&ben=json_loads&env=1&revs=50&equid=off&quarts=on&extr=on
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 19:58
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower
Same issue on this benchmark:
- average on one year: 8.8 ms
- peak at rev 59b91b4e9506: 9.3 ms
- run after rev 59b91b4e9506: 9.0 ms
The benchmark is unstable, but the difference is small, especially compared to the difference of call_method without the hot attribute.
Author: Yury Selivanov (yselivanov) *
Date: 2016-11-12 22:25
Can we commit this to 3.6 too?
Author: STINNER Victor (vstinner) *
Date: 2016-11-12 23:40
Can we commit this to 3.6 too?
I worked on patches to try to optimize json_loads and regex_effbot as well, but it's still unclear to me how the hot attribute works, and I'm not 100% sure that using the attribute explicitly does not introduce a performance regression.
So I prefer to experiment with such changes in default right now.
Author: Inada Naoki (methane) *
Date: 2016-11-14 10:41
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
Author: STINNER Victor (vstinner) *
Date: 2016-11-14 12:23
INADA Naoki added the comment:
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
I don't fully understand the effect of the hot attribute, so I suggest running benchmarks and checking that it has a non-negligible effect on benchmarks ;-)
Author: Inada Naoki (methane) *
Date: 2016-11-15 11:56
I don't understand well the effect of the hot attribute
I compared the lookdict_unicode_nodummy assembly using `objdump -d dictobject.o`. It looks completely the same.
So I think the only difference is placement: hot functions are in the .text.hot section, and the linker groups hot functions together. This reduces the possibility of cache conflicts.
When compiling Python with PGO, we can see which functions are hot using objdump.
~/work/cpython/Objects$ objdump -tj .text.hot dictobject.o
dictobject.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l d .text.hot 0000000000000000 .text.hot
00000000000007a0 l F .text.hot 0000000000000574 lookdict_unicode_nodummy
00000000000046d0 l F .text.hot 00000000000000e8 free_keys_object
00000000000001c0 l F .text.hot 0000000000000161 new_keys_object
00000000000003b0 l F .text.hot 00000000000003e8 insertdict
0000000000001180 l F .text.hot 000000000000081f dictresize
00000000000019a0 l F .text.hot 0000000000000165 find_empty_slot.isra.0
0000000000002180 l F .text.hot 00000000000005f1 lookdict
0000000000001b10 l F .text.hot 00000000000000c2 unicode_eq
0000000000002780 l F .text.hot 0000000000000184 dict_traverse
0000000000004c20 l F .text.hot 00000000000005f7 lookdict_unicode
0000000000006b20 l F .text.hot 0000000000000330 lookdict_split
...
The cold section of a hot function is placed in the .text.unlikely section.
$ objdump -t dictobject.o | grep lookdict
00000000000007a0 l F .text.hot 0000000000000574 lookdict_unicode_nodummy
0000000000002180 l F .text.hot 00000000000005f1 lookdict
000000000000013e l .text.unlikely 0000000000000000 lookdict_unicode_nodummy.cold.6
0000000000000a38 l .text.unlikely 0000000000000000 lookdict.cold.15
0000000000004c20 l F .text.hot 00000000000005f7 lookdict_unicode
0000000000006b20 l F .text.hot 0000000000000330 lookdict_split
0000000000001339 l .text.unlikely 0000000000000000 lookdict_unicode.cold.28
0000000000001d01 l .text.unlikely 0000000000000000 lookdict_split.cold.42
All lookdict* functions are put in the hot section, and all of their cold parts are 0 bytes. It seems PGO puts all lookdict* functions in the hot section.
compiler info:
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
Author: Inada Naoki (methane) *
Date: 2016-11-15 12:04
so I suggest to run benchmarks and check that it has a non negligible effect on benchmarks ;-)
When I added _Py_HOT_FUNCTION to lookdict_unicode, lookdict_unicode_nodummy and lookdict_split (I can't measure L1 misses via `perf stat -d` because I use EC2 for benchmarking):
$ ~/local/python-master/bin/python3 -m perf compare_to -G all-master.json all-patched.json
Slower (28):
- pybench.CompareFloats: 106 ns +- 1 ns -> 112 ns +- 1 ns: 1.07x slower
- pybench.BuiltinFunctionCalls: 1.62 us +- 0.00 us -> 1.68 us +- 0.03 us: 1.04x slower
- pybench.CompareFloatsIntegers: 180 ns +- 3 ns -> 185 ns +- 3 ns: 1.03x slower
- sympy_sum: 163 ms +- 7 ms -> 167 ms +- 7 ms: 1.03x slower
- deltablue: 13.7 ms +- 0.4 ms -> 14.1 ms +- 0.5 ms: 1.02x slower
- pickle_list: 5.77 us +- 0.09 us -> 5.90 us +- 0.07 us: 1.02x slower
- pybench.PythonFunctionCalls: 1.20 us +- 0.02 us -> 1.22 us +- 0.02 us: 1.02x slower
- pybench.SpecialClassAttribute: 1.46 us +- 0.02 us -> 1.49 us +- 0.03 us: 1.02x slower
- pybench.TryRaiseExcept: 207 ns +- 4 ns -> 210 ns +- 0 ns: 1.02x slower
- pickle_pure_python: 868 us +- 18 us -> 882 us +- 16 us: 1.02x slower
- genshi_text: 56.0 ms +- 0.7 ms -> 56.8 ms +- 0.6 ms: 1.01x slower
- json_dumps: 19.5 ms +- 0.3 ms -> 19.8 ms +- 0.2 ms: 1.01x slower
- richards: 137 ms +- 3 ms -> 139 ms +- 2 ms: 1.01x slower
- sqlalchemy_declarative: 272 ms +- 4 ms -> 276 ms +- 3 ms: 1.01x slower
- pickle_dict: 43.5 us +- 0.4 us -> 44.1 us +- 0.2 us: 1.01x slower
- go: 436 ms +- 4 ms -> 441 ms +- 4 ms: 1.01x slower
- pybench.SecondImport: 2.52 us +- 0.04 us -> 2.55 us +- 0.07 us: 1.01x slower
- pybench.NormalClassAttribute: 1.46 us +- 0.02 us -> 1.47 us +- 0.02 us: 1.01x slower
- genshi_xml: 118 ms +- 2 ms -> 118 ms +- 3 ms: 1.01x slower
- pybench.UnicodePredicates: 75.8 ns +- 0.6 ns -> 76.2 ns +- 0.9 ns: 1.01x slower
- pybench.ListSlicing: 415 us +- 4 us -> 417 us +- 4 us: 1.01x slower
- scimark_fft: 494 ms +- 2 ms -> 496 ms +- 12 ms: 1.01x slower
- logging_format: 23.7 us +- 0.3 us -> 23.9 us +- 0.4 us: 1.00x slower
- chaos: 200 ms +- 1 ms -> 201 ms +- 1 ms: 1.00x slower
- pybench.StringPredicates: 509 ns +- 3 ns -> 511 ns +- 4 ns: 1.00x slower
- call_method: 13.6 ms +- 0.1 ms -> 13.7 ms +- 0.2 ms: 1.00x slower
- pybench.StringSlicing: 530 ns +- 3 ns -> 532 ns +- 8 ns: 1.00x slower
- pybench.SimpleLongArithmetic: 535 ns +- 2 ns -> 536 ns +- 4 ns: 1.00x slower
Faster (47):
- html5lib: 169 ms +- 7 ms -> 158 ms +- 6 ms: 1.07x faster
- pybench.ConcatUnicode: 57.3 ns +- 3.0 ns -> 55.8 ns +- 1.3 ns: 1.03x faster
- pybench.IfThenElse: 60.5 ns +- 1.0 ns -> 59.0 ns +- 0.7 ns: 1.02x faster
- logging_silent: 606 ns +- 14 ns -> 593 ns +- 13 ns: 1.02x faster
- scimark_lu: 411 ms +- 5 ms -> 404 ms +- 4 ms: 1.02x faster
- pathlib: 29.1 ms +- 0.3 ms -> 28.7 ms +- 0.5 ms: 1.02x faster
- pybench.CreateStringsWithConcat: 2.87 us +- 0.01 us -> 2.82 us +- 0.00 us: 1.02x faster
- pybench.DictCreation: 621 ns +- 10 ns -> 612 ns +- 8 ns: 1.01x faster
- meteor_contest: 167 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x faster
- unpickle_pure_python: 656 us +- 19 us -> 647 us +- 9 us: 1.01x faster
- pybench.NestedForLoops: 20.2 ns +- 0.1 ns -> 20.0 ns +- 0.1 ns: 1.01x faster
- regex_effbot: 4.01 ms +- 0.07 ms -> 3.95 ms +- 0.06 ms: 1.01x faster
- pybench.CreateUnicodeWithConcat: 57.1 ns +- 0.2 ns -> 56.4 ns +- 0.2 ns: 1.01x faster
- chameleon: 18.3 ms +- 0.2 ms -> 18.0 ms +- 0.3 ms: 1.01x faster
- python_startup: 13.7 ms +- 0.1 ms -> 13.5 ms +- 0.1 ms: 1.01x faster
- pybench.SmallTuples: 967 ns +- 6 ns -> 955 ns +- 8 ns: 1.01x faster
- pybench.TryFinally: 200 ns +- 3 ns -> 198 ns +- 2 ns: 1.01x faster
- pybench.SimpleIntegerArithmetic: 425 ns +- 3 ns -> 420 ns +- 4 ns: 1.01x faster
- pybench.Recursion: 1.34 us +- 0.02 us -> 1.33 us +- 0.03 us: 1.01x faster
- pybench.SimpleIntFloatArithmetic: 424 ns +- 1 ns -> 420 ns +- 1 ns: 1.01x faster
- float: 222 ms +- 2 ms -> 220 ms +- 3 ms: 1.01x faster
- 2to3: 531 ms +- 4 ms -> 527 ms +- 5 ms: 1.01x faster
- python_startup_no_site: 8.30 ms +- 0.04 ms -> 8.23 ms +- 0.05 ms: 1.01x faster
- xml_etree_parse: 196 ms +- 5 ms -> 194 ms +- 2 ms: 1.01x faster
- pybench.ComplexPythonFunctionCalls: 794 ns +- 7 ns -> 788 ns +- 7 ns: 1.01x faster
- logging_simple: 20.4 us +- 0.2 us -> 20.3 us +- 0.4 us: 1.01x faster
- fannkuch: 795 ms +- 9 ms -> 790 ms +- 3 ms: 1.01x faster
- hexiom: 18.7 ms +- 0.1 ms -> 18.6 ms +- 0.1 ms: 1.01x faster
- regex_compile: 322 ms +- 9 ms -> 320 ms +- 8 ms: 1.01x faster
- mako: 36.0 ms +- 0.1 ms -> 35.8 ms +- 0.2 ms: 1.01x faster
- pybench.UnicodeProperties: 91.7 ns +- 0.9 ns -> 91.1 ns +- 0.8 ns: 1.01x faster
- pybench.SimpleComplexArithmetic: 577 ns +- 8 ns -> 573 ns +- 3 ns: 1.01x faster
- xml_etree_process: 147 ms +- 2 ms -> 146 ms +- 2 ms: 1.01x faster
- pybench.CompareUnicode: 22.4 ns +- 0.1 ns -> 22.2 ns +- 0.1 ns: 1.01x faster
- crypto_pyaes: 175 ms +- 1 ms -> 174 ms +- 1 ms: 1.01x faster
- unpickle_list: 5.43 us +- 0.04 us -> 5.41 us +- 0.02 us: 1.01x faster
- pybench.WithFinally: 257 ns +- 4 ns -> 256 ns +- 2 ns: 1.01x faster
- xml_etree_generate: 183 ms +- 2 ms -> 182 ms +- 2 ms: 1.00x faster
- pybench.WithRaiseExcept: 475 ns +- 4 ns -> 472 ns +- 6 ns: 1.00x faster
- pybench.SecondPackageImport: 2.85 us +- 0.08 us -> 2.84 us +- 0.09 us: 1.00x faster
- pybench.SimpleListManipulation: 444 ns +- 1 ns -> 442 ns +- 2 ns: 1.00x faster
- spectral_norm: 208 ms +- 2 ms -> 208 ms +- 1 ms: 1.00x faster
- pybench.ForLoops: 8.95 ns +- 0.19 ns -> 8.94 ns +- 0.01 ns: 1.00x faster
- scimark_sor: 371 ms +- 3 ms -> 371 ms +- 2 ms: 1.00x faster
- scimark_sparse_mat_mult: 5.61 ms +- 0.06 ms -> 5.61 ms +- 0.36 ms: 1.00x faster
- pybench.UnicodeMappings: 40.7 us +- 0.1 us -> 40.7 us +- 0.0 us: 1.00x faster
- pybench.CompareStrings: 22.2 ns +- 0.0 ns -> 22.2 ns +- 0.0 ns: 1.00x faster
Benchmark hidden because not significant (47): call_method_slots, call_method_unknown, call_simple, django_template, dulwich_log, json_loads, nbody, nqueens, pickle, pidigits, pybench.BuiltinMethodLookup, pybench.CompareIntegers, pybench.CompareInternedStrings, pybench.CompareLongs, pybench.ConcatStrings, pybench.CreateInstances, pybench.CreateNewInstances, pybench.DictWithFloatKeys, pybench.DictWithIntegerKeys, pybench.DictWithStringKeys, pybench.NestedListComprehensions, pybench.NormalInstanceAttribute, pybench.PythonMethodCalls, pybench.SecondSubmoduleImport, pybench.SimpleDictManipulation, pybench.SimpleFloatArithmetic, pybench.SimpleListComprehensions, pybench.SmallLists, pybench.SpecialInstanceAttribute, pybench.StringMappings, pybench.TryExcept, pybench.TupleSlicing, pybench.UnicodeSlicing, raytrace, regex_dna, regex_v8, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, telco, tornado_http, unpack_sequence, unpickle, xml_etree_iterparse
Author: Roundup Robot (python-dev)
Date: 2016-11-15 14:15
New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:18
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
OK, your benchmark results don't look bad, so I marked the following functions as hot:
- lookdict
- lookdict_unicode
- lookdict_unicode_nodummy
- lookdict_split
It's common to see these functions in the top 3 of "perf report".
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:21
hot3.patch: Mark additional functions as hot
- PyNumber_AsSsize_t()
- _PyUnicode_FromUCS1()
- json: scanstring_unicode()
- siphash24()
- sre_ucs1_match, sre_ucs2_match, sre_ucs4_match
I'm not sure about this patch. It's hard to get reliable benchmark results on microbenchmarks :-/ It's hard to understand if a speedup comes from the hot attribute, or if the compiler decided itself to change the code placement. Without the hot attribute, the code placement seems random.
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:28
I wrote hot3.patch when trying to make the following benchmarks more reliable:
- logging_silent: rev 8ebaa546a033 is 20% slower than the average in 2016
- json_loads: rev 0bd618fe0639 is 30% slower and rev 8ebaa546a033 is 15% slower than the average in 2016
- regex_effbot: rev 573bc1f9900e (nov 7) takes 6.0 ms, rev cf7711887b4a (nov 7) takes 5.2 ms, rev 8ebaa546a033 (nov 10) takes 6.1 ms, etc.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-15 14:40
- json: scanstring_unicode()
This doesn't look wise. This is specific to a single extension module and perhaps to a single particular benchmark. Most Python code doesn't use json at all.
What is the top of "perf report"? How does this list intersect with the list of functions in the .text.hot section of a PGO build? Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 15:42
New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2
Here are benchmark results on the speed-python server:
haypo@speed-python$ PYTHONPATH=~/perf python -m perf compare_to 2016-11-15_09-12-default-ac93d188ebd6.json 2016-11-15_15-13-default-cfc956f13ce2.json -G --min-speed=1
Slower (6):
- json_loads: 62.8 us +- 1.1 us -> 65.8 us +- 2.6 us: 1.05x slower
- nbody: 243 ms +- 2 ms -> 253 ms +- 6 ms: 1.04x slower
- mako: 42.7 ms +- 0.2 ms -> 43.5 ms +- 0.3 ms: 1.02x slower
- chameleon: 29.2 ms +- 0.3 ms -> 29.7 ms +- 0.2 ms: 1.02x slower
- spectral_norm: 261 ms +- 2 ms -> 266 ms +- 3 ms: 1.02x slower
- pickle: 26.6 us +- 0.4 us -> 27.0 us +- 0.4 us: 1.01x slower
Faster (26):
- xml_etree_generate: 290 ms +- 4 ms -> 275 ms +- 3 ms: 1.06x faster
- float: 306 ms +- 5 ms -> 292 ms +- 7 ms: 1.05x faster
- logging_simple: 37.7 us +- 0.4 us -> 36.1 us +- 0.4 us: 1.04x faster
- hexiom: 25.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.04x faster
- regex_effbot: 6.11 ms +- 0.31 ms -> 5.88 ms +- 0.43 ms: 1.04x faster
- sympy_expand: 1.19 sec +- 0.02 sec -> 1.15 sec +- 0.01 sec: 1.04x faster
- telco: 21.5 ms +- 0.4 ms -> 20.8 ms +- 0.4 ms: 1.03x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.37 sec +- 0.02 sec: 1.03x faster
- scimark_sor: 512 ms +- 11 ms -> 500 ms +- 12 ms: 1.03x faster
- logging_format: 44.6 us +- 0.5 us -> 43.6 us +- 0.7 us: 1.02x faster
- sympy_str: 532 ms +- 4 ms -> 520 ms +- 4 ms: 1.02x faster
- fannkuch: 1.11 sec +- 0.01 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- django_template: 475 ms +- 5 ms -> 467 ms +- 6 ms: 1.02x faster
- chaos: 308 ms +- 2 ms -> 303 ms +- 3 ms: 1.02x faster
- xml_etree_process: 244 ms +- 4 ms -> 240 ms +- 4 ms: 1.02x faster
- xml_etree_iterparse: 225 ms +- 5 ms -> 221 ms +- 4 ms: 1.02x faster
- pathlib: 51.1 ms +- 0.5 ms -> 50.3 ms +- 0.5 ms: 1.02x faster
- sqlite_synth: 10.5 us +- 0.2 us -> 10.3 us +- 0.2 us: 1.01x faster
- dulwich_log: 186 ms +- 1 ms -> 184 ms +- 1 ms: 1.01x faster
- sqlalchemy_imperative: 72.5 ms +- 1.6 ms -> 71.5 ms +- 1.6 ms: 1.01x faster
- deltablue: 18.5 ms +- 0.3 ms -> 18.3 ms +- 0.2 ms: 1.01x faster
- tornado_http: 438 ms +- 5 ms -> 433 ms +- 5 ms: 1.01x faster
- json_dumps: 30.4 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.01x faster
- genshi_xml: 212 ms +- 3 ms -> 210 ms +- 3 ms: 1.01x faster
- scimark_monte_carlo: 273 ms +- 5 ms -> 271 ms +- 5 ms: 1.01x faster
- call_simple: 13.3 ms +- 0.3 ms -> 13.2 ms +- 0.4 ms: 1.01x faster
Benchmark hidden because not significant (32): 2to3, call_method, call_method_slots, call_method_unknown, crypto_pyaes, genshi_text, go, html5lib, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_integrate, sympy_sum, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 15:50
Serhiy Storchaka:
- json: scanstring_unicode()
This doesn't look wise. This is specific to single extension module and perhaps to single particular benchmark. Most Python code don't use json at all.
Well, I tried different things to make these benchmarks more stable. I didn't say that we should merge hot3.patch as it is :-) It's just an attempt.
What is the top of "perf report"?
For json_loads, it's:
14.99% _json.cpython-37m-x86_64-linux-gnu.so scanstring_unicode
 8.34% python _PyUnicode_FromUCS1
 8.32% _json.cpython-37m-x86_64-linux-gnu.so scan_once_unicode
 8.01% python lookdict_unicode_nodummy
 6.72% python siphash24
 4.45% python PyDict_SetItem
 4.26% python _PyObject_Malloc
 3.38% python _PyEval_EvalFrameDefault
 3.16% python _Py_HashBytes
 2.72% python PyUnicode_New
 2.36% python PyLong_FromString
 2.25% python _PyObject_Free
 2.02% libc-2.19.so __memcpy_sse2_unaligned
 1.61% python PyDict_GetItem
 1.40% python dictresize
 1.24% python unicode_hash
 1.11% libc-2.19.so _int_malloc
 1.07% python unicode_dealloc
 1.00% python free_keys_object
Result produced with:
$ perf record ./python ~/performance/performance/benchmarks/bm_json_loads.py --worker -v -l 128 -w0 -n 100
$ perf report
How this list intersects with the list of functions in .text.hot section of PGO build?
I checked which functions are considered "hot" by a PGO build: I found more than 2,000 functions... I'm not interested in tagging so many functions with _Py_HOT_FUNCTION. I would prefer to tag only something like the top 10 or top 25 functions.
I don't know the recommendations for tagging functions as hot. I guess that what matters is the total size of the hot functions. Should it be smaller than the L2 cache? Smaller than the L3 cache? I'm talking about instructions, but data also shares these caches...
Make several PGO builds (perhaps on different computers). Is .text.hot section stable?
In my experience PGO builds don't provide stable performance, but I was never able to write an article on that because of so many bugs :-)
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 10:30
FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html
Sadly, it seems like I was just lucky when adding attribute((hot)) fixed the issue, because call_method is slow again!
- acde821520fc (Nov 21): 16.3 ms
- 2a14385710dc (Nov 22): 24.6 ms (+51%)
Author: Inada Naoki (methane) *
Date: 2016-11-22 11:07
Wow. It's sad that the tagged version is accidentally slow...
I want to reproduce it and check "perf record -e L1-icache-load-misses". But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 11:47
2016-11-22 12:07 GMT+01:00 INADA Naoki <report@bugs.python.org>:
I want to reproduce it and check "perf record -e L1-icache-load-misses". But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.
You don't need to go that far to check performance: just run call_method and check timings. You need to compare multiple revisions.
The speed.python.org Timeline helps to track performance, get an idea of the "average performance", and detect spikes.
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 11:50
Naoki: "Wow. It's sad that tagged version is accidentally slow..."
If you use PGO compilation, for example "./configure --enable-optimizations" (as suggested by configure when the option is not enabled), you don't get the issue.
I hope that most Linux distributions use PGO compilation. I'm quite sure that it's the case for Ubuntu. I don't know for Fedora.
Author: Inada Naoki (methane) *
Date: 2016-11-22 12:19
I set up Ubuntu 14.04 on Azure and built Python with neither PGO nor LTO. But I failed to reproduce it.
@haypo, would you give me two binaries?
$ ~/local/py-2a143/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:2a14385710dc, Nov 22 2016, 12:02:34)
[GCC 4.8.4]
$ ~/local/py-acde8/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:acde821520fc, Nov 22 2016, 11:31:16)
[GCC 4.8.4]
$ ~/local/py-2a143/bin/python3 bm_call_method.py
.....................
call_method: Median +- std dev: 16.1 ms +- 0.6 ms
$ ~/local/py-acde8/bin/python3 bm_call_method.py
.....................
call_method: Median +- std dev: 16.1 ms +- 0.7 ms
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 13:17
But I failed to reproduce it.
Hey, performance issues with code placement is a mysterious secret :-) Nobody understands it :-D
The server running the benchmarks has an Intel Xeon CPU from 2011. It seems like code placement issues matter more on this CPU than on my more recent laptop or desktop PC.
Author: STINNER Victor (vstinner) *
Date: 2017-02-01 17:21
Victor: "FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems like I was just lucky when adding attribute((hot)) fixed the issue, because call_method is slow again!"
I upgraded speed-python server (running benchmarks) to Ubuntu 16.04 LTS to support PGO compilation. I removed all old benchmark results and ran again benchmarks with LTO+PGO. It seems like benchmark results are much better now.
I'm not sure anymore that _Py_HOT_FUNCTION is really useful for getting stable benchmarks, but it may help code placement a little bit. I don't think that it hurts, so I suggest keeping it. Since benchmarks were still unstable with _Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with _Py_HOT_FUNCTION. I will now focus on LTO+PGO for stable benchmarks, and ignore small performance differences when PGO is not used.
I close this issue now.