Issue 28618: Decorate hot functions using attribute((hot)) to optimize Python
Created on 2016-11-05 00:29 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (34)
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 00:29
When analyzing the results of Python performance benchmarks, I noticed that call_method was 70% slower (!) between revisions 83877018ef97 (Oct 18) and 3e073e7b4460 (Oct 22), inclusive, on the speed-python server.
On these revisions, the CPU L1 instruction cache is less efficient: 8% cache misses, whereas it was only 0.06% before and after these revisions.
Since the two mentioned revisions have no obvious impact on the call_method() benchmark, I understand that the performance difference is caused by a different layout of the machine code, maybe the exact location of functions.
IMO the best solution to such a compilation issue is to use PGO compilation. Problem: PGO doesn't work on Ubuntu 14.04, the OS used by speed-python (the server running benchmarks for http://speed.python.org/).
I propose to manually decorate the "hot" functions using the GCC attribute((hot)) attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes (search for "hot")
Attached patch adds Py_HOT_FUNCTION and decorates the following functions:
- _PyEval_EvalFrameDefault()
- PyFrame_New()
- call_function()
- lookdict_unicode_nodummy()
- _PyFunction_FastCall()
- frame_dealloc()
These functions are the top 6 according to the Linux perf tool when running the call_simple benchmark of the performance project:
32.66%: _PyEval_EvalFrameDefault
13.09%: PyFrame_New
12.78%: call_function
12.24%: lookdict_unicode_nodummy
9.85%: _PyFunction_FastCall
8.47%: frame_dealloc
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 09:07
I ran benchmarks. Globally, the impact of the patch seems positive. regex_v8 and call_simple are slower, but these are microbenchmarks impacted by low-level details like the CPU L1 cache. Well, my patch was supposed to optimize CPython for call_simple :-/ I should maybe investigate a little bit more.
Performance comparison (performance 0.3.2):
haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G
Slower (6):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
- call_simple: 12.6 ms +- 0.2 ms -> 13.2 ms +- 1.3 ms: 1.05x slower
- regex_effbot: 4.58 ms +- 0.07 ms -> 4.70 ms +- 0.05 ms: 1.03x slower
- sympy_integrate: 43.4 ms +- 0.3 ms -> 44.0 ms +- 0.2 ms: 1.01x slower
- nqueens: 239 ms +- 2 ms -> 241 ms +- 1 ms: 1.01x slower
- scimark_fft: 674 ms +- 12 ms -> 680 ms +- 75 ms: 1.01x slower
Faster (32):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
- scimark_sor: 488 ms +- 27 ms -> 467 ms +- 10 ms: 1.05x faster
- sqlite_synth: 9.16 us +- 1.03 us -> 8.82 us +- 0.23 us: 1.04x faster
- scimark_lu: 485 ms +- 20 ms -> 469 ms +- 14 ms: 1.03x faster
- xml_etree_process: 226 ms +- 30 ms -> 219 ms +- 4 ms: 1.03x faster
- logging_simple: 29.7 us +- 0.4 us -> 28.9 us +- 0.3 us: 1.03x faster
- pickle_list: 7.99 us +- 0.88 us -> 7.78 us +- 0.05 us: 1.03x faster
- raytrace: 1.26 sec +- 0.08 sec -> 1.23 sec +- 0.01 sec: 1.03x faster
- sympy_expand: 995 ms +- 31 ms -> 971 ms +- 35 ms: 1.03x faster
- deltablue: 17.0 ms +- 0.1 ms -> 16.6 ms +- 0.2 ms: 1.02x faster
- call_method_slots: 16.0 ms +- 0.1 ms -> 15.6 ms +- 0.2 ms: 1.02x faster
- fannkuch: 983 ms +- 12 ms -> 962 ms +- 29 ms: 1.02x faster
- pickle_pure_python: 1.25 ms +- 0.14 ms -> 1.22 ms +- 0.01 ms: 1.02x faster
- logging_format: 34.0 us +- 0.3 us -> 33.4 us +- 1.5 us: 1.02x faster
- xml_etree_parse: 274 ms +- 9 ms -> 270 ms +- 5 ms: 1.02x faster
- sympy_str: 441 ms +- 3 ms -> 433 ms +- 3 ms: 1.02x faster
- genshi_text: 87.6 ms +- 9.2 ms -> 86.0 ms +- 1.4 ms: 1.02x faster
- genshi_xml: 187 ms +- 17 ms -> 184 ms +- 1 ms: 1.02x faster
- django_template: 376 ms +- 4 ms -> 370 ms +- 2 ms: 1.02x faster
- json_dumps: 27.1 ms +- 0.4 ms -> 26.7 ms +- 0.4 ms: 1.02x faster
- sqlalchemy_declarative: 295 ms +- 3 ms -> 291 ms +- 3 ms: 1.01x faster
- call_method_unknown: 18.1 ms +- 0.1 ms -> 17.8 ms +- 0.1 ms: 1.01x faster
- nbody: 218 ms +- 4 ms -> 216 ms +- 2 ms: 1.01x faster
- regex_dna: 250 ms +- 24 ms -> 247 ms +- 2 ms: 1.01x faster
- go: 573 ms +- 2 ms -> 566 ms +- 3 ms: 1.01x faster
- richards: 173 ms +- 4 ms -> 171 ms +- 4 ms: 1.01x faster
- python_startup: 24.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.00x faster
- regex_compile: 404 ms +- 6 ms -> 403 ms +- 5 ms: 1.00x faster
- dulwich_log: 143 ms +- 11 ms -> 143 ms +- 1 ms: 1.00x faster
- pidigits: 290 ms +- 1 ms -> 289 ms +- 0 ms: 1.00x faster
- pickle_dict: 58.3 us +- 6.5 us -> 58.3 us +- 0.7 us: 1.00x faster
Benchmark hidden because not significant (26): 2to3, call_method, chaos, crypto_pyaes, float, hexiom, html5lib, json_loads, logging_silent, mako, meteor_contest, pathlib, pickle, python_startup_no_site, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse
--
More readable output, only display differences >= 5%:
haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G --min-speed=5
Slower (1):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
Faster (2):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
Benchmark hidden because not significant (61): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 09:08
Oh, I forgot to mention that I compiled Python with "./configure -C". The purpose of the patch is to optimize Python when LTO and/or PGO compilation are not explicitly used.
Author: Antoine Pitrou (pitrou) *
Date: 2016-11-05 09:59
Can you compare against a PGO build? Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Overall I think this manual approach is really the wrong way to look at it. Compilers can do better than us.
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 15:37
Antoine Pitrou added the comment:
Can you compare against a PGO build?
Do you mean comparison between current Python with PGO and patched Python without PGO?
The hot attribute is ignored by GCC when PGO compilation is used.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO are used. At least, it's common to quickly build Python using "./configure && make" for a quick benchmark.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-05 16:14
Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO is used.
Seconded.
Author: Antoine Pitrou (pitrou) *
Date: 2016-11-05 20:02
On 05/11/2016 at 16:37, STINNER Victor wrote:
Antoine Pitrou added the comment:
Can you compare against a PGO build?
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Why isn't the OS updated on that machine?
Author: STINNER Victor (vstinner) *
Date: 2016-11-05 22:53
Antoine Pitrou added the comment:
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Oh ok, sure. I will try to run these 2 benchmarks.
Ubuntu 14.04 is old, and I don't think this is something we should worry about.
Well, it's a practical issue for me to run benchmarks for speed.python.org.
Why isn't the OS updated on that machine?
I am not sure that I want to use PGO compilation to run benchmarks. Last time I checked, I noticed performance differences between two compilations. PGO compilation doesn't seem 100% deterministic.
Maybe PGO compilation is fine when you build Python to create a Linux package. But to get reliable benchmarks, I'm not sure that it's a good idea.
I'm still running benchmarks on Python recompiled many times using different compiler options, to measure the impact of the compiler options (especially LTO and/or PGO) on the benchmark stability.
Author: STINNER Victor (vstinner) *
Date: 2016-11-08 21:09
Do you mean comparison between current Python with PGO and patched Python without PGO?
Yes.
Ok, here you go. As expected, PGO compilation is faster than default compilation with my patch. PGO applies more optimizations than just attribute((hot)); for example, it also optimizes branches.
haypo@smithers$ python3 -m perf compare_to pgo.json.gz patch.json.gz -G --min-speed=5
Slower (56):
- regex_effbot: 4.30 ms +- 0.26 ms -> 5.77 ms +- 0.33 ms: 1.34x slower
- telco: 16.0 ms +- 1.1 ms -> 20.6 ms +- 0.4 ms: 1.29x slower
- xml_etree_process: 174 ms +- 15 ms -> 218 ms +- 29 ms: 1.25x slower
- xml_etree_generate: 205 ms +- 16 ms -> 254 ms +- 4 ms: 1.24x slower
- unpickle_list: 6.04 us +- 1.12 us -> 7.47 us +- 0.18 us: 1.24x slower
- call_simple: 10.6 ms +- 1.4 ms -> 13.1 ms +- 0.3 ms: 1.24x slower
- mako: 33.5 ms +- 0.3 ms -> 41.3 ms +- 0.9 ms: 1.23x slower
- pathlib: 37.0 ms +- 2.3 ms -> 44.7 ms +- 2.0 ms: 1.21x slower
- sqlite_synth: 7.56 us +- 0.20 us -> 8.97 us +- 0.18 us: 1.19x slower
- unpickle: 24.2 us +- 3.9 us -> 28.7 us +- 0.3 us: 1.18x slower
- chameleon: 23.4 ms +- 2.6 ms -> 27.4 ms +- 1.5 ms: 1.17x slower
- spectral_norm: 214 ms +- 7 ms -> 249 ms +- 9 ms: 1.17x slower
- nqueens: 210 ms +- 2 ms -> 244 ms +- 36 ms: 1.16x slower
- unpickle_pure_python: 717 us +- 10 us -> 831 us +- 66 us: 1.16x slower
- pickle: 18.7 us +- 4.3 us -> 21.6 us +- 3.3 us: 1.15x slower
- sympy_expand: 829 ms +- 39 ms -> 957 ms +- 28 ms: 1.15x slower
- genshi_text: 73.1 ms +- 3.2 ms -> 84.3 ms +- 1.1 ms: 1.15x slower
- pickle_list: 6.82 us +- 0.20 us -> 7.86 us +- 0.05 us: 1.15x slower
- sympy_str: 372 ms +- 28 ms -> 428 ms +- 3 ms: 1.15x slower
- xml_etree_parse: 231 ms +- 7 ms -> 266 ms +- 9 ms: 1.15x slower
- call_method_slots: 14.0 ms +- 1.3 ms -> 16.1 ms +- 1.2 ms: 1.15x slower
- sympy_sum: 169 ms +- 6 ms -> 194 ms +- 19 ms: 1.15x slower
- logging_format: 29.3 us +- 2.5 us -> 33.7 us +- 1.6 us: 1.15x slower
- logging_simple: 25.7 us +- 2.1 us -> 29.3 us +- 0.4 us: 1.14x slower
- genshi_xml: 159 ms +- 15 ms -> 182 ms +- 1 ms: 1.14x slower
- xml_etree_iterparse: 178 ms +- 3 ms -> 203 ms +- 5 ms: 1.14x slower
- pickle_pure_python: 1.06 ms +- 0.17 ms -> 1.21 ms +- 0.16 ms: 1.14x slower
- logging_silent: 618 ns +- 11 ns -> 705 ns +- 62 ns: 1.14x slower
- hexiom: 19.0 ms +- 0.2 ms -> 21.7 ms +- 0.2 ms: 1.14x slower
- html5lib: 184 ms +- 11 ms -> 209 ms +- 31 ms: 1.14x slower
- call_method: 14.3 ms +- 0.7 ms -> 16.3 ms +- 0.1 ms: 1.14x slower
- django_template: 324 ms +- 18 ms -> 368 ms +- 3 ms: 1.14x slower
- sympy_integrate: 37.9 ms +- 0.3 ms -> 43.0 ms +- 2.7 ms: 1.13x slower
- deltablue: 15.0 ms +- 2.0 ms -> 16.9 ms +- 1.0 ms: 1.12x slower
- call_method_unknown: 16.0 ms +- 0.4 ms -> 17.9 ms +- 0.2 ms: 1.12x slower
- 2to3: 611 ms +- 12 ms -> 677 ms +- 57 ms: 1.11x slower
- regex_compile: 300 ms +- 3 ms -> 332 ms +- 21 ms: 1.11x slower
- json_loads: 50.5 us +- 2.5 us -> 55.8 us +- 1.2 us: 1.10x slower
- unpack_sequence: 111 ns +- 5 ns -> 122 ns +- 1 ns: 1.10x slower
- pickle_dict: 53.2 us +- 3.7 us -> 58.1 us +- 3.7 us: 1.09x slower
- scimark_sor: 420 ms +- 60 ms -> 458 ms +- 12 ms: 1.09x slower
- scimark_lu: 398 ms +- 74 ms -> 434 ms +- 18 ms: 1.09x slower
- regex_dna: 227 ms +- 1 ms -> 247 ms +- 9 ms: 1.09x slower
- pidigits: 266 ms +- 33 ms -> 290 ms +- 10 ms: 1.09x slower
- chaos: 243 ms +- 2 ms -> 265 ms +- 3 ms: 1.09x slower
- crypto_pyaes: 197 ms +- 16 ms -> 215 ms +- 28 ms: 1.09x slower
- dulwich_log: 129 ms +- 15 ms -> 140 ms +- 8 ms: 1.08x slower
- sqlalchemy_imperative: 50.8 ms +- 0.9 ms -> 55.0 ms +- 1.8 ms: 1.08x slower
- meteor_contest: 173 ms +- 22 ms -> 187 ms +- 5 ms: 1.08x slower
- sqlalchemy_declarative: 268 ms +- 11 ms -> 290 ms +- 3 ms: 1.08x slower
- tornado_http: 335 ms +- 4 ms -> 361 ms +- 3 ms: 1.08x slower
- python_startup: 20.6 ms +- 0.6 ms -> 22.1 ms +- 0.9 ms: 1.08x slower
- python_startup_no_site: 8.37 ms +- 0.08 ms -> 9.00 ms +- 0.07 ms: 1.08x slower
- go: 518 ms +- 36 ms -> 557 ms +- 39 ms: 1.07x slower
- raytrace: 1.14 sec +- 0.08 sec -> 1.22 sec +- 0.02 sec: 1.07x slower
- scimark_fft: 594 ms +- 29 ms -> 627 ms +- 13 ms: 1.06x slower
Benchmark hidden because not significant (8): fannkuch, float, json_dumps, nbody, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult
Author: Roundup Robot (python-dev)
Date: 2016-11-11 01:14
New changeset 59b91b4e9506 by Victor Stinner in branch 'default': Issue #28618: Make hot functions using attribute((hot)) https://hg.python.org/cpython/rev/59b91b4e9506
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 01:49
I tried different patches and ran many quick & dirty benchmarks.
I tried to use likely/unlikely macros (using GCC's __builtin_expect): the effect is not significant on the call_simple microbenchmark. I gave up on this part.
attribute((hot)) on a few Python core functions fixes the major slowdown on call_method on the revision 83877018ef97 (described in the initial message).
I noticed tiny differences when using attribute((hot)), a speedup in most cases. I sometimes noticed a slowdown, but a very small one (ex: 1%, but 1% on a microbenchmark doesn't mean anything).
I pushed my patch to try to keep stable performance when Python is not compiled with PGO.
If you would like to know more about the crazy effect of code placement on modern Intel CPUs, I suggest looking at the slides of this recent talk by an Intel engineer: https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86 "Causes of Performance Swings Due to Code Placement in IA by Zia Ansari (Intel), November 2016"
--
About PGO or not PGO: this question is not simple, so I suggest discussing it somewhere else to not flood this issue ;-)
For my use case, I'm not convinced yet that PGO with our current build system produces reliable performance.
Not all Linux distributions compile Python using PGO: Fedora and RHEL don't, for example. Bugzilla for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=613045
I guess that there are also some developers running benchmarks on Python compiled with "./configure && make". I'm trying to enhance the documentation and tools around Python benchmarks to advise developers to use LTO and/or PGO.
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 09:10
Final result on speed-python:
haypo@speed-python$ python3 -m perf compare_to json_8nov/2016-11-10_15-39-default-8ebaa546a033.json 2016-11-11_02-13-default-59b91b4e9506.json -G
Slower (12):
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower
- nbody: 244 ms +- 2 ms -> 252 ms +- 4 ms: 1.03x slower
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
- fannkuch: 1.07 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.01x slower
- scimark_lu: 502 ms +- 19 ms -> 509 ms +- 12 ms: 1.01x slower
- chaos: 302 ms +- 3 ms -> 305 ms +- 3 ms: 1.01x slower
- xml_etree_iterparse: 224 ms +- 3 ms -> 226 ms +- 6 ms: 1.01x slower
- regex_dna: 299 ms +- 1 ms -> 300 ms +- 1 ms: 1.00x slower
- pickle_list: 9.21 us +- 0.33 us -> 9.24 us +- 0.56 us: 1.00x slower
- crypto_pyaes: 245 ms +- 1 ms -> 246 ms +- 2 ms: 1.00x slower
- meteor_contest: 219 ms +- 1 ms -> 219 ms +- 1 ms: 1.00x slower
- unpack_sequence: 128 ns +- 2 ns -> 128 ns +- 0 ns: 1.00x slower
Faster (39):
- logging_silent: 997 ns +- 40 ns -> 803 ns +- 13 ns: 1.24x faster
- regex_effbot: 6.16 ms +- 0.24 ms -> 5.17 ms +- 0.27 ms: 1.19x faster
- mako: 45.9 ms +- 0.7 ms -> 42.9 ms +- 0.6 ms: 1.07x faster
- xml_etree_process: 253 ms +- 4 ms -> 237 ms +- 4 ms: 1.07x faster
- call_simple: 13.9 ms +- 0.3 ms -> 13.1 ms +- 0.4 ms: 1.06x faster
- spectral_norm: 274 ms +- 2 ms -> 260 ms +- 2 ms: 1.05x faster
- xml_etree_generate: 300 ms +- 4 ms -> 285 ms +- 5 ms: 1.05x faster
- call_method_slots: 17.1 ms +- 0.2 ms -> 16.2 ms +- 0.3 ms: 1.05x faster
- telco: 21.8 ms +- 0.5 ms -> 20.7 ms +- 0.3 ms: 1.05x faster
- call_method: 17.3 ms +- 0.3 ms -> 16.5 ms +- 0.2 ms: 1.05x faster
- pickle_pure_python: 1.42 ms +- 0.02 ms -> 1.36 ms +- 0.03 ms: 1.04x faster
- pathlib: 51.9 ms +- 0.8 ms -> 50.6 ms +- 0.4 ms: 1.03x faster
- xml_etree_parse: 295 ms +- 8 ms -> 287 ms +- 7 ms: 1.03x faster
- chameleon: 31.0 ms +- 0.3 ms -> 30.2 ms +- 0.2 ms: 1.03x faster
- deltablue: 19.3 ms +- 0.3 ms -> 18.8 ms +- 0.2 ms: 1.02x faster
- django_template: 484 ms +- 4 ms -> 472 ms +- 5 ms: 1.02x faster
- call_method_unknown: 18.7 ms +- 0.2 ms -> 18.3 ms +- 0.2 ms: 1.02x faster
- html5lib: 261 ms +- 6 ms -> 256 ms +- 6 ms: 1.02x faster
- unpickle_pure_python: 973 us +- 12 us -> 954 us +- 15 us: 1.02x faster
- regex_v8: 47.6 ms +- 0.8 ms -> 46.7 ms +- 0.4 ms: 1.02x faster
- richards: 202 ms +- 4 ms -> 198 ms +- 5 ms: 1.02x faster
- logging_simple: 37.8 us +- 0.6 us -> 37.1 us +- 0.4 us: 1.02x faster
- sympy_integrate: 50.8 ms +- 0.9 ms -> 49.9 ms +- 1.4 ms: 1.02x faster
- dulwich_log: 189 ms +- 2 ms -> 186 ms +- 1 ms: 1.02x faster
- sqlalchemy_declarative: 343 ms +- 3 ms -> 339 ms +- 3 ms: 1.01x faster
- hexiom: 25.0 ms +- 0.1 ms -> 24.7 ms +- 0.1 ms: 1.01x faster
- logging_format: 44.6 us +- 0.6 us -> 44.1 us +- 0.6 us: 1.01x faster
- 2to3: 787 ms +- 4 ms -> 777 ms +- 4 ms: 1.01x faster
- tornado_http: 440 ms +- 4 ms -> 435 ms +- 4 ms: 1.01x faster
- json_dumps: 30.7 ms +- 0.4 ms -> 30.5 ms +- 0.3 ms: 1.01x faster
- go: 637 ms +- 10 ms -> 632 ms +- 8 ms: 1.01x faster
- regex_compile: 397 ms +- 2 ms -> 394 ms +- 3 ms: 1.01x faster
- nqueens: 266 ms +- 2 ms -> 264 ms +- 2 ms: 1.01x faster
- python_startup: 16.8 ms +- 0.0 ms -> 16.7 ms +- 0.0 ms: 1.01x faster
- python_startup_no_site: 9.91 ms +- 0.01 ms -> 9.86 ms +- 0.01 ms: 1.01x faster
- scimark_sor: 513 ms +- 13 ms -> 510 ms +- 8 ms: 1.01x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.40 sec +- 0.02 sec: 1.00x faster
- genshi_text: 95.2 ms +- 1.1 ms -> 94.7 ms +- 0.8 ms: 1.00x faster
- sympy_str: 529 ms +- 5 ms -> 528 ms +- 4 ms: 1.00x faster
Benchmark hidden because not significant (13): float, genshi_xml, pickle, pickle_dict, pidigits, scimark_fft, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_sum, unpickle, unpickle_list
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 19:52
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
Hum, sadly this benchmark is still unstable after my change 59b91b4e9506 ("Make hot functions using attribute((hot))"; oops, I wanted to write Mark, not Make :-/).
This benchmark has been around 63.4 us for many months, whereas it reached 72.9 us at rev 59b91b4e9506, and the new run (also using the hot attribute) went back to 63.0 us...
I understand that json_loads depends on the code placement of some other functions which are not currently marked with the hot attribute.
https://speed.python.org/timeline/#/?exe=4&ben=json_loads&env=1&revs=50&equid=off&quarts=on&extr=on
Author: STINNER Victor (vstinner) *
Date: 2016-11-11 19:58
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower
Same issue on this benchmark:
- average on one year: 8.8 ms
- peak at rev 59b91b4e9506: 9.3 ms
- run after rev 59b91b4e9506: 9.0 ms
The benchmark is unstable, but the difference is small, especially compared to the difference of call_method without the hot attribute.
Author: Yury Selivanov (yselivanov) *
Date: 2016-11-12 22:25
Can we commit this to 3.6 too?
Author: STINNER Victor (vstinner) *
Date: 2016-11-12 23:40
Can we commit this to 3.6 too?
I worked on patches to try to optimize json_loads and regex_effbot as well, but it's still unclear to me how the hot attribute works, and I'm not 100% sure that using the attribute explicitly does not introduce a performance regression.
So I prefer to experiment with such changes in default right now.
Author: Inada Naoki (methane) *
Date: 2016-11-14 10:41
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
Author: STINNER Victor (vstinner) *
Date: 2016-11-14 12:23
INADA Naoki added the comment:
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
I don't fully understand the effect of the hot attribute, so I suggest running benchmarks and checking that it has a non-negligible effect on benchmarks ;-)
Author: Inada Naoki (methane) *
Date: 2016-11-15 11:56
I don't understand well the effect of the hot attribute
I compared the lookdict_unicode_nodummy assembly using `objdump -d dictobject.o`. It looks completely the same.
So I think the only difference is placement: hot functions are in the .text.hot section, and the linker groups hot functions together. This reduces the possibility of cache conflicts.
When compiling Python with PGO, we can see which functions are hot using objdump.
~/work/cpython/Objects$ objdump -tj .text.hot dictobject.o
dictobject.o: file format elf64-x86-64
SYMBOL TABLE:
0000000000000000 l d .text.hot 0000000000000000 .text.hot
00000000000007a0 l F .text.hot 0000000000000574 lookdict_unicode_nodummy
00000000000046d0 l F .text.hot 00000000000000e8 free_keys_object
00000000000001c0 l F .text.hot 0000000000000161 new_keys_object
00000000000003b0 l F .text.hot 00000000000003e8 insertdict
0000000000001180 l F .text.hot 000000000000081f dictresize
00000000000019a0 l F .text.hot 0000000000000165 find_empty_slot.isra.0
0000000000002180 l F .text.hot 00000000000005f1 lookdict
0000000000001b10 l F .text.hot 00000000000000c2 unicode_eq
0000000000002780 l F .text.hot 0000000000000184 dict_traverse
0000000000004c20 l F .text.hot 00000000000005f7 lookdict_unicode
0000000000006b20 l F .text.hot 0000000000000330 lookdict_split
...
The cold section of a hot function is placed in the .text.unlikely section.
$ objdump -t dictobject.o | grep lookdict
00000000000007a0 l F .text.hot 0000000000000574 lookdict_unicode_nodummy
0000000000002180 l F .text.hot 00000000000005f1 lookdict
000000000000013e l .text.unlikely 0000000000000000 lookdict_unicode_nodummy.cold.6
0000000000000a38 l .text.unlikely 0000000000000000 lookdict.cold.15
0000000000004c20 l F .text.hot 00000000000005f7 lookdict_unicode
0000000000006b20 l F .text.hot 0000000000000330 lookdict_split
0000000000001339 l .text.unlikely 0000000000000000 lookdict_unicode.cold.28
0000000000001d01 l .text.unlikely 0000000000000000 lookdict_split.cold.42
All lookdict* functions are put in the hot section, and all of their cold parts are 0 bytes. It seems PGO puts all lookdict* functions in the hot section.
compiler info:
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
Author: Inada Naoki (methane) *
Date: 2016-11-15 12:04
so I suggest to run benchmarks and check that it has a non negligible effect on benchmarks ;-)
When I added _Py_HOT_FUNCTION to lookdict_unicode, lookdict_unicode_nodummy and lookdict_split (I can't measure L1 misses via `perf stat -d` because I use EC2 for benchmarking):
$ ~/local/python-master/bin/python3 -m perf compare_to -G all-master.json all-patched.json
Slower (28):
- pybench.CompareFloats: 106 ns +- 1 ns -> 112 ns +- 1 ns: 1.07x slower
- pybench.BuiltinFunctionCalls: 1.62 us +- 0.00 us -> 1.68 us +- 0.03 us: 1.04x slower
- pybench.CompareFloatsIntegers: 180 ns +- 3 ns -> 185 ns +- 3 ns: 1.03x slower
- sympy_sum: 163 ms +- 7 ms -> 167 ms +- 7 ms: 1.03x slower
- deltablue: 13.7 ms +- 0.4 ms -> 14.1 ms +- 0.5 ms: 1.02x slower
- pickle_list: 5.77 us +- 0.09 us -> 5.90 us +- 0.07 us: 1.02x slower
- pybench.PythonFunctionCalls: 1.20 us +- 0.02 us -> 1.22 us +- 0.02 us: 1.02x slower
- pybench.SpecialClassAttribute: 1.46 us +- 0.02 us -> 1.49 us +- 0.03 us: 1.02x slower
- pybench.TryRaiseExcept: 207 ns +- 4 ns -> 210 ns +- 0 ns: 1.02x slower
- pickle_pure_python: 868 us +- 18 us -> 882 us +- 16 us: 1.02x slower
- genshi_text: 56.0 ms +- 0.7 ms -> 56.8 ms +- 0.6 ms: 1.01x slower
- json_dumps: 19.5 ms +- 0.3 ms -> 19.8 ms +- 0.2 ms: 1.01x slower
- richards: 137 ms +- 3 ms -> 139 ms +- 2 ms: 1.01x slower
- sqlalchemy_declarative: 272 ms +- 4 ms -> 276 ms +- 3 ms: 1.01x slower
- pickle_dict: 43.5 us +- 0.4 us -> 44.1 us +- 0.2 us: 1.01x slower
- go: 436 ms +- 4 ms -> 441 ms +- 4 ms: 1.01x slower
- pybench.SecondImport: 2.52 us +- 0.04 us -> 2.55 us +- 0.07 us: 1.01x slower
- pybench.NormalClassAttribute: 1.46 us +- 0.02 us -> 1.47 us +- 0.02 us: 1.01x slower
- genshi_xml: 118 ms +- 2 ms -> 118 ms +- 3 ms: 1.01x slower
- pybench.UnicodePredicates: 75.8 ns +- 0.6 ns -> 76.2 ns +- 0.9 ns: 1.01x slower
- pybench.ListSlicing: 415 us +- 4 us -> 417 us +- 4 us: 1.01x slower
- scimark_fft: 494 ms +- 2 ms -> 496 ms +- 12 ms: 1.01x slower
- logging_format: 23.7 us +- 0.3 us -> 23.9 us +- 0.4 us: 1.00x slower
- chaos: 200 ms +- 1 ms -> 201 ms +- 1 ms: 1.00x slower
- pybench.StringPredicates: 509 ns +- 3 ns -> 511 ns +- 4 ns: 1.00x slower
- call_method: 13.6 ms +- 0.1 ms -> 13.7 ms +- 0.2 ms: 1.00x slower
- pybench.StringSlicing: 530 ns +- 3 ns -> 532 ns +- 8 ns: 1.00x slower
- pybench.SimpleLongArithmetic: 535 ns +- 2 ns -> 536 ns +- 4 ns: 1.00x slower
Faster (47):
- html5lib: 169 ms +- 7 ms -> 158 ms +- 6 ms: 1.07x faster
- pybench.ConcatUnicode: 57.3 ns +- 3.0 ns -> 55.8 ns +- 1.3 ns: 1.03x faster
- pybench.IfThenElse: 60.5 ns +- 1.0 ns -> 59.0 ns +- 0.7 ns: 1.02x faster
- logging_silent: 606 ns +- 14 ns -> 593 ns +- 13 ns: 1.02x faster
- scimark_lu: 411 ms +- 5 ms -> 404 ms +- 4 ms: 1.02x faster
- pathlib: 29.1 ms +- 0.3 ms -> 28.7 ms +- 0.5 ms: 1.02x faster
- pybench.CreateStringsWithConcat: 2.87 us +- 0.01 us -> 2.82 us +- 0.00 us: 1.02x faster
- pybench.DictCreation: 621 ns +- 10 ns -> 612 ns +- 8 ns: 1.01x faster
- meteor_contest: 167 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x faster
- unpickle_pure_python: 656 us +- 19 us -> 647 us +- 9 us: 1.01x faster
- pybench.NestedForLoops: 20.2 ns +- 0.1 ns -> 20.0 ns +- 0.1 ns: 1.01x faster
- regex_effbot: 4.01 ms +- 0.07 ms -> 3.95 ms +- 0.06 ms: 1.01x faster
- pybench.CreateUnicodeWithConcat: 57.1 ns +- 0.2 ns -> 56.4 ns +- 0.2 ns: 1.01x faster
- chameleon: 18.3 ms +- 0.2 ms -> 18.0 ms +- 0.3 ms: 1.01x faster
- python_startup: 13.7 ms +- 0.1 ms -> 13.5 ms +- 0.1 ms: 1.01x faster
- pybench.SmallTuples: 967 ns +- 6 ns -> 955 ns +- 8 ns: 1.01x faster
- pybench.TryFinally: 200 ns +- 3 ns -> 198 ns +- 2 ns: 1.01x faster
- pybench.SimpleIntegerArithmetic: 425 ns +- 3 ns -> 420 ns +- 4 ns: 1.01x faster
- pybench.Recursion: 1.34 us +- 0.02 us -> 1.33 us +- 0.03 us: 1.01x faster
- pybench.SimpleIntFloatArithmetic: 424 ns +- 1 ns -> 420 ns +- 1 ns: 1.01x faster
- float: 222 ms +- 2 ms -> 220 ms +- 3 ms: 1.01x faster
- 2to3: 531 ms +- 4 ms -> 527 ms +- 5 ms: 1.01x faster
- python_startup_no_site: 8.30 ms +- 0.04 ms -> 8.23 ms +- 0.05 ms: 1.01x faster
- xml_etree_parse: 196 ms +- 5 ms -> 194 ms +- 2 ms: 1.01x faster
- pybench.ComplexPythonFunctionCalls: 794 ns +- 7 ns -> 788 ns +- 7 ns: 1.01x faster
- logging_simple: 20.4 us +- 0.2 us -> 20.3 us +- 0.4 us: 1.01x faster
- fannkuch: 795 ms +- 9 ms -> 790 ms +- 3 ms: 1.01x faster
- hexiom: 18.7 ms +- 0.1 ms -> 18.6 ms +- 0.1 ms: 1.01x faster
- regex_compile: 322 ms +- 9 ms -> 320 ms +- 8 ms: 1.01x faster
- mako: 36.0 ms +- 0.1 ms -> 35.8 ms +- 0.2 ms: 1.01x faster
- pybench.UnicodeProperties: 91.7 ns +- 0.9 ns -> 91.1 ns +- 0.8 ns: 1.01x faster
- pybench.SimpleComplexArithmetic: 577 ns +- 8 ns -> 573 ns +- 3 ns: 1.01x faster
- xml_etree_process: 147 ms +- 2 ms -> 146 ms +- 2 ms: 1.01x faster
- pybench.CompareUnicode: 22.4 ns +- 0.1 ns -> 22.2 ns +- 0.1 ns: 1.01x faster
- crypto_pyaes: 175 ms +- 1 ms -> 174 ms +- 1 ms: 1.01x faster
- unpickle_list: 5.43 us +- 0.04 us -> 5.41 us +- 0.02 us: 1.01x faster
- pybench.WithFinally: 257 ns +- 4 ns -> 256 ns +- 2 ns: 1.01x faster
- xml_etree_generate: 183 ms +- 2 ms -> 182 ms +- 2 ms: 1.00x faster
- pybench.WithRaiseExcept: 475 ns +- 4 ns -> 472 ns +- 6 ns: 1.00x faster
- pybench.SecondPackageImport: 2.85 us +- 0.08 us -> 2.84 us +- 0.09 us: 1.00x faster
- pybench.SimpleListManipulation: 444 ns +- 1 ns -> 442 ns +- 2 ns: 1.00x faster
- spectral_norm: 208 ms +- 2 ms -> 208 ms +- 1 ms: 1.00x faster
- pybench.ForLoops: 8.95 ns +- 0.19 ns -> 8.94 ns +- 0.01 ns: 1.00x faster
- scimark_sor: 371 ms +- 3 ms -> 371 ms +- 2 ms: 1.00x faster
- scimark_sparse_mat_mult: 5.61 ms +- 0.06 ms -> 5.61 ms +- 0.36 ms: 1.00x faster
- pybench.UnicodeMappings: 40.7 us +- 0.1 us -> 40.7 us +- 0.0 us: 1.00x faster
- pybench.CompareStrings: 22.2 ns +- 0.0 ns -> 22.2 ns +- 0.0 ns: 1.00x faster
Benchmark hidden because not significant (47): call_method_slots, call_method_unknown, call_simple, django_template, dulwich_log, json_loads, nbody, nqueens, pickle, pidigits, pybench.BuiltinMethodLookup, pybench.CompareIntegers, pybench.CompareInternedStrings, pybench.CompareLongs, pybench.ConcatStrings, pybench.CreateInstances, pybench.CreateNewInstances, pybench.DictWithFloatKeys, pybench.DictWithIntegerKeys, pybench.DictWithStringKeys, pybench.NestedListComprehensions, pybench.NormalInstanceAttribute, pybench.PythonMethodCalls, pybench.SecondSubmoduleImport, pybench.SimpleDictManipulation, pybench.SimpleFloatArithmetic, pybench.SimpleListComprehensions, pybench.SmallLists, pybench.SpecialInstanceAttribute, pybench.StringMappings, pybench.TryExcept, pybench.TupleSlicing, pybench.UnicodeSlicing, raytrace, regex_dna, regex_v8, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, telco, tornado_http, unpack_sequence, unpickle, xml_etree_iterparse
Author: Roundup Robot (python-dev)
Date: 2016-11-15 14:15
New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:18
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
OK, your benchmark results don't look bad, so I marked the following functions as hot:
- lookdict
- lookdict_unicode
- lookdict_unicode_nodummy
- lookdict_split
It's common to see these functions in the top 3 of "perf report".
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:21
hot3.patch: Mark additional functions as hot
- PyNumber_AsSsize_t()
- _PyUnicode_FromUCS1()
- json: scanstring_unicode()
- siphash24()
- sre_ucs1_match, sre_ucs2_match, sre_ucs4_match
I'm not sure about this patch. It's hard to get reliable benchmark results on microbenchmarks :-/ It's hard to understand if a speedup comes from the hot attribute, or if the compiler decided itself to change the code placement. Without the hot attribute, the code placement seems random.
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 14:28
I wrote hot3.patch when trying to make the following benchmarks more reliable:
- logging_silent: rev 8ebaa546a033 is 20% slower than the average in 2016
- json_loads: rev 0bd618fe0639 is 30% slower and rev 8ebaa546a033 is 15% slower than the average in 2016
- regex_effbot: rev 573bc1f9900e (nov 7) takes 6.0 ms, rev cf7711887b4a (nov 7) takes 5.2 ms, rev 8ebaa546a033 (nov 10) takes 6.1 ms, etc.
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2016-11-15 14:40
- json: scanstring_unicode()
This doesn't look wise. This is specific to a single extension module and perhaps to a single particular benchmark. Most Python code doesn't use json at all.
What is the top of "perf report"? How does this list intersect with the list of functions in the .text.hot section of a PGO build? Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 15:42
New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2
Here are benchmark results on the speed-python server:
haypo@speed-python$ PYTHONPATH=~/perf python -m perf compare_to 2016-11-15_09-12-default-ac93d188ebd6.json 2016-11-15_15-13-default-cfc956f13ce2.json -G --min-speed=1
Slower (6):
- json_loads: 62.8 us +- 1.1 us -> 65.8 us +- 2.6 us: 1.05x slower
- nbody: 243 ms +- 2 ms -> 253 ms +- 6 ms: 1.04x slower
- mako: 42.7 ms +- 0.2 ms -> 43.5 ms +- 0.3 ms: 1.02x slower
- chameleon: 29.2 ms +- 0.3 ms -> 29.7 ms +- 0.2 ms: 1.02x slower
- spectral_norm: 261 ms +- 2 ms -> 266 ms +- 3 ms: 1.02x slower
- pickle: 26.6 us +- 0.4 us -> 27.0 us +- 0.4 us: 1.01x slower
Faster (26):
- xml_etree_generate: 290 ms +- 4 ms -> 275 ms +- 3 ms: 1.06x faster
- float: 306 ms +- 5 ms -> 292 ms +- 7 ms: 1.05x faster
- logging_simple: 37.7 us +- 0.4 us -> 36.1 us +- 0.4 us: 1.04x faster
- hexiom: 25.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.04x faster
- regex_effbot: 6.11 ms +- 0.31 ms -> 5.88 ms +- 0.43 ms: 1.04x faster
- sympy_expand: 1.19 sec +- 0.02 sec -> 1.15 sec +- 0.01 sec: 1.04x faster
- telco: 21.5 ms +- 0.4 ms -> 20.8 ms +- 0.4 ms: 1.03x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.37 sec +- 0.02 sec: 1.03x faster
- scimark_sor: 512 ms +- 11 ms -> 500 ms +- 12 ms: 1.03x faster
- logging_format: 44.6 us +- 0.5 us -> 43.6 us +- 0.7 us: 1.02x faster
- sympy_str: 532 ms +- 4 ms -> 520 ms +- 4 ms: 1.02x faster
- fannkuch: 1.11 sec +- 0.01 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- django_template: 475 ms +- 5 ms -> 467 ms +- 6 ms: 1.02x faster
- chaos: 308 ms +- 2 ms -> 303 ms +- 3 ms: 1.02x faster
- xml_etree_process: 244 ms +- 4 ms -> 240 ms +- 4 ms: 1.02x faster
- xml_etree_iterparse: 225 ms +- 5 ms -> 221 ms +- 4 ms: 1.02x faster
- pathlib: 51.1 ms +- 0.5 ms -> 50.3 ms +- 0.5 ms: 1.02x faster
- sqlite_synth: 10.5 us +- 0.2 us -> 10.3 us +- 0.2 us: 1.01x faster
- dulwich_log: 186 ms +- 1 ms -> 184 ms +- 1 ms: 1.01x faster
- sqlalchemy_imperative: 72.5 ms +- 1.6 ms -> 71.5 ms +- 1.6 ms: 1.01x faster
- deltablue: 18.5 ms +- 0.3 ms -> 18.3 ms +- 0.2 ms: 1.01x faster
- tornado_http: 438 ms +- 5 ms -> 433 ms +- 5 ms: 1.01x faster
- json_dumps: 30.4 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.01x faster
- genshi_xml: 212 ms +- 3 ms -> 210 ms +- 3 ms: 1.01x faster
- scimark_monte_carlo: 273 ms +- 5 ms -> 271 ms +- 5 ms: 1.01x faster
- call_simple: 13.3 ms +- 0.3 ms -> 13.2 ms +- 0.4 ms: 1.01x faster
Benchmark hidden because not significant (32): 2to3, call_method, call_method_slots, call_method_unknown, crypto_pyaes, genshi_text, go, html5lib, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_integrate, sympy_sum, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse
Author: STINNER Victor (vstinner) *
Date: 2016-11-15 15:50
Serhiy Storchaka:
- json: scanstring_unicode()
This doesn't look wise. This is specific to single extension module and perhaps to single particular benchmark. Most Python code don't use json at all.
Well, I tried different things to make these benchmarks more stable. I didn't say that we should merge hot3.patch as it is :-) It's just an attempt.
What is the top of "perf report"?
For json_loads, it's:
14.99% _json.cpython-37m-x86_64-linux-gnu.so scanstring_unicode
 8.34% python _PyUnicode_FromUCS1
 8.32% _json.cpython-37m-x86_64-linux-gnu.so scan_once_unicode
 8.01% python lookdict_unicode_nodummy
 6.72% python siphash24
 4.45% python PyDict_SetItem
 4.26% python _PyObject_Malloc
 3.38% python _PyEval_EvalFrameDefault
 3.16% python _Py_HashBytes
 2.72% python PyUnicode_New
 2.36% python PyLong_FromString
 2.25% python _PyObject_Free
 2.02% libc-2.19.so __memcpy_sse2_unaligned
 1.61% python PyDict_GetItem
 1.40% python dictresize
 1.24% python unicode_hash
 1.11% libc-2.19.so _int_malloc
 1.07% python unicode_dealloc
 1.00% python free_keys_object
Result produced with:
$ perf record ./python ~/performance/performance/benchmarks/bm_json_loads.py --worker -v -l 128 -w0 -n 100
$ perf report
How this list intersects with the list of functions in .text.hot section of PGO build?
I checked which functions are considered "hot" by a PGO build: I found more than 2,000 functions... I'm not interested in tagging so many functions with _Py_HOT_FUNCTION. I would prefer to tag only something like the top 10 or top 25 functions.
I don't know the recommendations for tagging functions as hot. I guess that what matters is the total size of the hot functions. Should it be smaller than the L2 cache? Smaller than the L3 cache? I'm talking about instructions, but data also shares these caches...
Make several PGO builds (perhaps on different computers). Is .text.hot section stable?
In my experience PGO builds don't provide stable performance, but I was never able to write an article on that because of so many bugs :-)
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 10:30
FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html
Sadly, it seems like I was just lucky when adding attribute((hot)) fixed the issue, because call_method is slow again!
- acde821520fc (Nov 21): 16.3 ms
- 2a14385710dc (Nov 22): 24.6 ms (+51%)
Author: Inada Naoki (methane) *
Date: 2016-11-22 11:07
Wow. It's sad that the tagged version is accidentally slow...
I want to reproduce it and check "perf record -e L1-icache-load-misses". But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 11:47
2016-11-22 12:07 GMT+01:00 INADA Naoki <report@bugs.python.org>:
I want to reproduce it and check "perf record -e L1-icache-load-misses". But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.
You don't need to go that far to check performance: just run call_method and check timings. You need to compare multiple revisions.
The speed.python.org Timeline helps to track performance, get an idea of the "average performance", and detect spikes.
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 11:50
Naoki: "Wow. It's sad that tagged version is accidentally slow..."
If you use PGO compilation, for example "./configure --enable-optimizations" (as suggested by configure when the option is not enabled), you don't get the issue.
I hope that most Linux distributions use PGO compilation. I'm quite sure that it's the case for Ubuntu. I don't know for Fedora.
Author: Inada Naoki (methane) *
Date: 2016-11-22 12:19
I set up Ubuntu 14.04 on Azure and built Python with neither PGO nor LTO. But I failed to reproduce it.
@haypo, would you give me two binaries?
$ ~/local/py-2a143/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:2a14385710dc, Nov 22 2016, 12:02:34)
[GCC 4.8.4]
$ ~/local/py-acde8/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:acde821520fc, Nov 22 2016, 11:31:16)
[GCC 4.8.4]
$ ~/local/py-2a143/bin/python3 bm_call_method.py
.....................
call_method: Median +- std dev: 16.1 ms +- 0.6 ms
$ ~/local/py-acde8/bin/python3 bm_call_method.py
.....................
call_method: Median +- std dev: 16.1 ms +- 0.7 ms
Author: STINNER Victor (vstinner) *
Date: 2016-11-22 13:17
But I failed to reproduce it.
Hey, performance issues with code placement is a mysterious secret :-) Nobody understands it :-D
The server running the benchmarks has an Intel Xeon CPU from 2011. It seems like code placement issues matter more on this CPU than on my more recent laptop or desktop PC.
Author: STINNER Victor (vstinner) *
Date: 2017-02-01 17:21
Victor: "FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems like I was just lucky when adding attribute((hot)) fixed the issue, because call_method is slow again!"
I upgraded speed-python server (running benchmarks) to Ubuntu 16.04 LTS to support PGO compilation. I removed all old benchmark results and ran again benchmarks with LTO+PGO. It seems like benchmark results are much better now.
I'm not sure anymore that _Py_HOT_FUNCTION is really useful for getting stable benchmarks, but it may help code placement a little bit. I don't think that it hurts, so I suggest keeping it. Since benchmarks were still unstable with _Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with _Py_HOT_FUNCTION. I will now focus on LTO+PGO for stable benchmarks, and ignore small performance differences when PGO is not used.
I close this issue now.