Issue 28618: Decorate hot functions using __attribute__((hot)) to optimize Python

Created on 2016-11-05 00:29 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (34)

msg280097 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-05 00:29

When analyzing results of Python performance benchmarks, I noticed that call_method was 70% slower (!) between revisions 83877018ef97 (Oct 18) and 3e073e7b4460 (Oct 22), including these revisions, on the speed-python server.

On these revisions, the CPU L1 instruction cache is less efficient: 8% cache misses, whereas it was only 0.06% before and after these revisions.

Since the two mentioned revisions have no obvious impact on the call_method() benchmark, I understand that the performance difference is caused by a different layout of the machine code, maybe the exact location of functions.

IMO the best solution to such compilation issues is to use PGO compilation. Problem: PGO doesn't work on Ubuntu 14.04, the OS used by speed-python (the server running benchmarks for http://speed.python.org/).

I propose to manually decorate the "hot" functions using the GCC __attribute__((hot)) attribute: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes (search for "hot")

Attached patch adds Py_HOT_FUNCTION and decorates the following functions, which are the top 6 according to the Linux perf tool when running the call_simple benchmark of the performance project:

32,66%: _PyEval_EvalFrameDefault
13,09%: PyFrame_New
12,78%: call_function
12,24%: lookdict_unicode_nodummy
9,85%: _PyFunction_FastCall
8,47%: frame_dealloc
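For reference, a minimal sketch of what such a macro can look like (the exact guards and name in the attached patch may differ; on compilers without the attribute it expands to nothing):

/* Sketch of a Py_HOT_FUNCTION macro. On GCC/Clang, __attribute__((hot)) asks
   the compiler to optimize the function more aggressively and to group it with
   other hot functions in the .text.hot section; elsewhere it is a no-op. */
#if defined(__GNUC__) || defined(__clang__)
#  define Py_HOT_FUNCTION __attribute__((hot))
#else
#  define Py_HOT_FUNCTION
#endif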

msg280105 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-05 09:07

I ran benchmarks. Globally, it seems like the impact of the patch is positive. regex_v8 and call_simple are slower, but these benchmarks are microbenchmarks impacted by low level stuff like CPU L1 cache. Well, my patch was supposed to optimize CPython for call_simple :-/ I should maybe investigate a little bit more.

Performance comparison (performance 0.3.2):

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G
Slower (6):

Faster (32):

Benchmark hidden because not significant (26): 2to3, call_method, chaos, crypto_pyaes, float, hexiom, html5lib, json_loads, logging_silent, mako, meteor_contest, pathlib, pickle, python_startup_no_site, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse

--

More readable output, only displaying differences >= 5%:

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G --min-speed=5
Slower (1):

Faster (2):

Benchmark hidden because not significant (61): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process

msg280106 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-05 09:08

Oh, I forgot to mention that I compiled Python with "./configure -C". The purpose of the patch is to optimize Python when LTO and/or PGO compilation are not explicitly used.

msg280108 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2016-11-05 09:59

Can you compare against a PGO build? Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Overall I think this manual approach is really the wrong way to look at it. Compilers can do better than us.

msg280115 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-05 15:37

Antoine Pitrou added the comment:

Can you compare against a PGO build?

Do you mean comparison between current Python with PGO and patched Python without PGO?

The hot attribute is ignored by GCC when PGO compilation is used.

Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Well, it's a practical issue for me to run benchmarks for speed.python.org.

Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO are used. At least, it's common to quickly build Python using "./configure && make" for a quick benchmark.

msg280116 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-11-05 16:14

Moreover, I like the idea of getting a fast(er) Python even when no advanced optimization techniques like LTO or PGO is used.

Seconded.

msg280125 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2016-11-05 20:02

On 05/11/2016 at 16:37, STINNER Victor wrote:

Antoine Pitrou added the comment:

Can you compare against a PGO build?

Do you mean comparison between current Python with PGO and patched Python without PGO?

Yes.

Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Well, it's a practical issue for me to run benchmarks for speed.python.org.

Why isn't the OS updated on that machine?

msg280126 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-05 22:53

Antoine Pitrou added the comment:

Do you mean comparison between current Python with PGO and patched Python without PGO?

Yes.

Oh ok, sure. I will try to run these 2 benchmarks.

Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Well, it's a practical issue for me to run benchmarks for speed.python.org.

Why isn't the OS updated on that machine?

I am not sure that I want to use PGO compilation to run benchmarks. Last time I checked, I noticed performance differences between two compilations. PGO compilation doesn't seem 100% deterministic.

Maybe PGO compilation is fine when you build Python to create a Linux package. But to get reliable benchmarks, I'm not sure that it's a good idea.

I'm still running benchmarks on Python recompiled many times using different compiler options, to measure the impact of the compiler options (especially LTO and/or PGO) on the benchmark stability.

msg280350 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-08 21:09

Do you mean comparison between current Python with PGO and patched Python without PGO?

Yes.

Ok, here you go. As expected, PGO compilation is faster than the default compilation with my patch. PGO implements more optimizations than just __attribute__((hot)); for example, it also optimizes branches.

haypo@smithers$ python3 -m perf compare_to pgo.json.gz patch.json.gz -G --min-speed=5
Slower (56):

Benchmark hidden because not significant (8): fannkuch, float, json_dumps, nbody, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult

msg280556 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2016-11-11 01:14

New changeset 59b91b4e9506 by Victor Stinner in branch 'default': Issue #28618: Make hot functions using __attribute__((hot)) https://hg.python.org/cpython/rev/59b91b4e9506

msg280557 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-11 01:49

I tried different patches and ran many quick & dirty benchmarks.

I tried to use likely/unlikely macros (using GCC __builtin_expect): the effect is not significant on the call_simple microbenchmark. I gave up on this part.
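For context, such macros are typically defined as shown in this sketch (the exact names and call sites experimented with here are not part of any committed patch):

/* Sketch of likely/unlikely macros based on GCC's __builtin_expect.
   They tell the compiler which branch is expected, so the likely path
   can be laid out as the fall-through case. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Hypothetical call site: mark the error path as unlikely.
   if (unlikely(ptr == NULL)) { return -1; } */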

__attribute__((hot)) on a few Python core functions fixes the major slowdown on call_method at revision 83877018ef97 (described in the initial message).

I noticed tiny differences when using __attribute__((hot)): a speedup in most cases, and sometimes a slowdown, but a very small one (e.g. 1%, and 1% on a microbenchmark doesn't mean anything).

I pushed my patch to try to keep stable performance when Python is not compiled with PGO.

If you would like to know more about the crazy effect of code placement in modern Intel CPUs, I suggest you look at the slides of this recent talk by an Intel engineer: https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86 "Causes of Performance Swings Due to Code Placement in IA by Zia Ansari (Intel), November 2016"

--

About PGO or not PGO: this question is not simple, I suggest discussing it somewhere else so as not to flood this issue ;-)

For my use case, I'm not convinced yet that PGO with our current build system produces reliable performance.

Not all Linux distributions compile Python using PGO: Fedora and RHEL don't, for example. Bugzilla for Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=613045

I guess that there are also developers running benchmarks on Python compiled with "./configure && make". I'm trying to enhance the documentation and tools around Python benchmarks to advise developers to use LTO and/or PGO.

msg280568 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-11 09:10

Final result on speed-python:

haypo@speed-python$ python3 -m perf compare_to json_8nov/2016-11-10_15-39-default-8ebaa546a033.json 2016-11-11_02-13-default-59b91b4e9506.json -G

Slower (12):

Faster (39):

Benchmark hidden because not significant (13): float, genshi_xml, pickle, pickle_dict, pidigits, scimark_fft, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_sum, unpickle, unpickle_list

msg280606 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-11 19:52

Hum, sadly the json_loads benchmark is still unstable after my change 59b91b4e9506 ("Make hot functions using __attribute__((hot))"; oops, I wanted to write Mark, not Make :-/).

This benchmark has been around 63.4 us for many months, whereas it reached 72.9 us at rev 59b91b4e9506, and the new run (also using the hot attribute) has gone back to 63.0 us...

I understand that json_loads depends on the code placement of some other functions which are not currently marked with the hot attribute.

https://speed.python.org/timeline/#/?exe=4&ben=json_loads&env=1&revs=50&equid=off&quarts=on&extr=on

msg280607 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-11 19:58

Same issue on this benchmark:

The benchmark is unstable, but the difference is small, especially compared to the difference of call_method without the hot attribute.

msg280675 - (view)

Author: Yury Selivanov (yselivanov) * (Python committer)

Date: 2016-11-12 22:25

Can we commit this to 3.6 too?

msg280679 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-12 23:40

Can we commit this to 3.6 too?

I worked on patches to try to optimize json_loads and regex_effbot as well, but it's still unclear to me how the hot attribute works, and I'm not 100% sure that using the attribute explicitly does not introduce a performance regression.

So I prefer to experiment with such changes in the default branch right now.

msg280748 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2016-11-14 10:41

How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

msg280764 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-14 12:23

INADA Naoki added the comment:

How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

I don't understand the effect of the hot attribute well, so I suggest running benchmarks and checking that it has a non-negligible effect ;-)

msg280831 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2016-11-15 11:56

I don't understand the effect of the hot attribute well

I compared the lookdict_unicode_nodummy assembly with objdump -d dictobject.o. It looks exactly the same.

So I think the only difference is placement: hot functions are placed in the .text.hot section and the linker groups hot functions together. This reduces the possibility of cache conflicts.

When compiling Python with PGO, we can see which functions are hot with objdump.

~/work/cpython/Objects$ objdump -tj .text.hot dictobject.o

dictobject.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    d  .text.hot      0000000000000000 .text.hot
00000000000007a0 l     F .text.hot      0000000000000574 lookdict_unicode_nodummy
00000000000046d0 l     F .text.hot      00000000000000e8 free_keys_object
00000000000001c0 l     F .text.hot      0000000000000161 new_keys_object
00000000000003b0 l     F .text.hot      00000000000003e8 insertdict
0000000000001180 l     F .text.hot      000000000000081f dictresize
00000000000019a0 l     F .text.hot      0000000000000165 find_empty_slot.isra.0
0000000000002180 l     F .text.hot      00000000000005f1 lookdict
0000000000001b10 l     F .text.hot      00000000000000c2 unicode_eq
0000000000002780 l     F .text.hot      0000000000000184 dict_traverse
0000000000004c20 l     F .text.hot      00000000000005f7 lookdict_unicode
0000000000006b20 l     F .text.hot      0000000000000330 lookdict_split
...

The cold part of a hot function is placed in the .text.unlikely section.

$ objdump -t  dictobject.o  | grep lookdict
00000000000007a0 l     F .text.hot      0000000000000574 lookdict_unicode_nodummy
0000000000002180 l     F .text.hot      00000000000005f1 lookdict
000000000000013e l       .text.unlikely 0000000000000000 lookdict_unicode_nodummy.cold.6
0000000000000a38 l       .text.unlikely 0000000000000000 lookdict.cold.15
0000000000004c20 l     F .text.hot      00000000000005f7 lookdict_unicode
0000000000006b20 l     F .text.hot      0000000000000330 lookdict_split
0000000000001339 l       .text.unlikely 0000000000000000 lookdict_unicode.cold.28
0000000000001d01 l       .text.unlikely 0000000000000000 lookdict_split.cold.42

All lookdict* functions are put in the hot section, and all of the cold parts are 0 bytes. It seems PGO puts all lookdict* functions in the hot section.
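To illustrate the mechanism outside CPython, here is a minimal standalone example (hypothetical file and function names). Compiled with gcc -O2 -c hot_demo.c, the hot-marked function is expected to show up under objdump -tj .text.hot hot_demo.o:

/* hot_demo.c -- minimal, hypothetical example of the hot attribute.
 * Build and inspect with:
 *   gcc -O2 -c hot_demo.c
 *   objdump -tj .text.hot hot_demo.o
 */
__attribute__((hot)) int
hot_add(int a, int b)
{
    return a + b;   /* expected in the .text.hot section */
}

int
cold_add(int a, int b)
{
    return a + b;   /* expected in the regular .text section */
}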

compiler info:

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

msg280832 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2016-11-15 12:04

so I suggest running benchmarks and checking that it has a non-negligible effect ;-)

When I added _Py_HOT_FUNCTION to lookdict_unicode, lookdict_unicode_nodummy and lookdict_split (I can't measure L1 misses via perf stat -d because I use EC2 for benchmarks):

$ ~/local/python-master/bin/python3 -m perf compare_to -G all-master.json all-patched.json
Slower (28):

Faster (47):

Benchmark hidden because not significant (47): call_method_slots, call_method_unknown, call_simple, django_template, dulwich_log, json_loads, nbody, nqueens, pickle, pidigits, pybench.BuiltinMethodLookup, pybench.CompareIntegers, pybench.CompareInternedStrings, pybench.CompareLongs, pybench.ConcatStrings, pybench.CreateInstances, pybench.CreateNewInstances, pybench.DictWithFloatKeys, pybench.DictWithIntegerKeys, pybench.DictWithStringKeys, pybench.NestedListComprehensions, pybench.NormalInstanceAttribute, pybench.PythonMethodCalls, pybench.SecondSubmoduleImport, pybench.SimpleDictManipulation, pybench.SimpleFloatArithmetic, pybench.SimpleListComprehensions, pybench.SmallLists, pybench.SpecialInstanceAttribute, pybench.StringMappings, pybench.TryExcept, pybench.TupleSlicing, pybench.UnicodeSlicing, raytrace, regex_dna, regex_v8, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, telco, tornado_http, unpack_sequence, unpickle, xml_etree_iterparse

msg280844 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2016-11-15 14:15

New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2

msg280845 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-15 14:18

How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

Ok, your benchmark results don't look bad, so I marked the following functions as hot:

It's common to see these functions in the top 3 of "perf report".
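For illustration, marking a static function hot only means adding the macro to its definition; a self-contained sketch with a toy lookup function (not the real dictobject.c code or signature):

#include <stddef.h>

/* _Py_HOT_FUNCTION as sketched earlier; expands to nothing without GCC. */
#if defined(__GNUC__)
#  define _Py_HOT_FUNCTION __attribute__((hot))
#else
#  define _Py_HOT_FUNCTION
#endif

/* Toy stand-in for a dict lookup routine: the only change needed to mark it
   hot is the macro on the definition; callers and behavior stay the same. */
static size_t _Py_HOT_FUNCTION
toy_lookup(const int *keys, size_t n, int key)
{
    for (size_t i = 0; i < n; i++) {
        if (keys[i] == key) {
            return i;
        }
    }
    return n;   /* not found */
}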

msg280846 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-15 14:21

hot3.patch: Mark additional functions as hot

I'm not sure about this patch. It's hard to get reliable benchmark results on microbenchmarks :-/ It's hard to understand whether a speedup comes from the hot attribute, or whether the compiler itself decided to change the code placement. Without the hot attribute, the code placement seems random.

msg280849 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-15 14:28

I wrote hot3.patch when trying to make the following benchmarks more reliable:

msg280853 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-11-15 14:40

This doesn't look wise. This is specific to a single extension module and perhaps to a single particular benchmark. Most Python code doesn't use json at all.

What is the top of "perf report"? How does this list intersect with the list of functions in the .text.hot section of a PGO build? Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?

msg280859 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-15 15:42

New changeset cfc956f13ce2 by Victor Stinner in branch 'default': Issue #28618: Mark dict lookup functions as hot https://hg.python.org/cpython/rev/cfc956f13ce2

Here are benchmark results on the speed-python server:

haypo@speed-python$ PYTHONPATH=~/perf python -m perf compare_to 2016-11-15_09-12-default-ac93d188ebd6.json 2016-11-15_15-13-default-cfc956f13ce2.json -G --min-speed=1
Slower (6):

Faster (26):

Benchmark hidden because not significant (32): 2to3, call_method, call_method_slots, call_method_unknown, crypto_pyaes, genshi_text, go, html5lib, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_integrate, sympy_sum, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse

msg280860 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-15 15:50

Serhiy Storchaka:

  • json: scanstring_unicode()

This doesn't look wise. This is specific to a single extension module and perhaps to a single particular benchmark. Most Python code doesn't use json at all.

Well, I tried different things to make these benchmarks more stable. I didn't say that we should merge hot3.patch as it is :-) It's just an attempt.

What is the top of "perf report"?

For json_loads, it's:

14.99% _json.cpython-37m-x86_64-linux-gnu.so scanstring_unicode
 8.34% python _PyUnicode_FromUCS1
 8.32% _json.cpython-37m-x86_64-linux-gnu.so scan_once_unicode
 8.01% python lookdict_unicode_nodummy
 6.72% python siphash24
 4.45% python PyDict_SetItem
 4.26% python _PyObject_Malloc
 3.38% python _PyEval_EvalFrameDefault
 3.16% python _Py_HashBytes
 2.72% python PyUnicode_New
 2.36% python PyLong_FromString
 2.25% python _PyObject_Free
 2.02% libc-2.19.so __memcpy_sse2_unaligned
 1.61% python PyDict_GetItem
 1.40% python dictresize
 1.24% python unicode_hash
 1.11% libc-2.19.so _int_malloc
 1.07% python unicode_dealloc
 1.00% python free_keys_object

Result produced with:

$ perf record ./python ~/performance/performance/benchmarks/bm_json_loads.py --worker -v -l 128 -w0 -n 100
$ perf report

How does this list intersect with the list of functions in the .text.hot section of a PGO build?

I checked which functions are considered "hot" by a PGO build: I found more than 2,000 functions... I'm not interested in tagging so many functions with _Py_HOT_FUNCTION. I would prefer to only tag something like the top 10 or top 25 functions.

I don't know the recommendations for tagging functions as hot. I guess that what matters is the total size of the hot functions. Should it be smaller than the L2 cache? Smaller than the L3 cache? I'm talking about instructions, but data also shares these caches...

Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?

In my experience PGO builds don't provide stable performance, but I was never able to write an article on that because of so many bugs :-)

msg281459 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-22 10:30

FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html

Sadly, it seems like I was just lucky that adding __attribute__((hot)) fixed the issue, because call_method is slow again!

msg281463 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2016-11-22 11:07

Wow. It's sad that the tagged version is accidentally slow...

I want to reproduce it and check perf record -e L1-icache-load-misses. But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.

msg281466 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-22 11:47

2016-11-22 12:07 GMT+01:00 INADA Naoki <report@bugs.python.org>:

I want to reproduce it and check perf record -e L1-icache-load-misses. But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.

You don't need to go that far to check performance: just run call_method and check the timings. You need to compare multiple revisions.

The speed.python.org timeline helps track performance, get an idea of the "average performance" and detect spikes.

msg281467 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-22 11:50

Naoki: "Wow. It's sad that tagged version is accidentally slow..."

If you use PGO compilation (for example "./configure --enable-optimizations", as configure suggests when you don't enable the option), you don't get the issue.

I hope that most Linux distributions use PGO compilation. I'm quite sure that it's the case for Ubuntu. I don't know about Fedora.

msg281473 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2016-11-22 12:19

I set up Ubuntu 14.04 on Azure and built Python with neither PGO nor LTO. But I failed to reproduce it.

@haypo, would you give me two binaries?

$ ~/local/py-2a143/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:2a14385710dc, Nov 22 2016, 12:02:34) [GCC 4.8.4]

$ ~/local/py-acde8/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:acde821520fc, Nov 22 2016, 11:31:16) [GCC 4.8.4]

$ ~/local/py-2a143/bin/python3 bm_call_method.py
..................... call_method: Median +- std dev: 16.1 ms +- 0.6 ms

$ ~/local/py-acde8/bin/python3 bm_call_method.py
..................... call_method: Median +- std dev: 16.1 ms +- 0.7 ms

msg281477 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-11-22 13:17

But I failed to reproduce it.

Hey, performance issues with code placement are a mysterious secret :-) Nobody understands them :-D

The server running the benchmarks has an Intel Xeon CPU from 2011. It seems like code placement issues are more important on this CPU than on my more recent laptop or desktop PC.

msg286662 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-02-01 17:21

Victor: "FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems like I was just lucky when adding attribute((hot)) fixed the issue, because call_method is slow again!"

I upgraded the speed-python server (which runs the benchmarks) to Ubuntu 16.04 LTS to support PGO compilation. I removed all old benchmark results and ran the benchmarks again with LTO+PGO. The benchmark results seem much better now.

I'm not sure anymore that _Py_HOT_FUNCTION is really useful for getting stable benchmarks, but it may help code placement a little bit. I don't think that it hurts, so I suggest keeping it. Since benchmarks were still unstable with _Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with _Py_HOT_FUNCTION. I will now focus on LTO+PGO for stable benchmarks, and ignore small performance differences when PGO is not used.

I close this issue now.