Issue 29465: Modify _PyObject_FastCall() to reduce stack consumption (original) (raw)

Created on 2017-02-06 17:08 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
pyobject_fastcall.patch vstinner,2017-02-06 17:08 review
meth_fastcall_stacksize.patch vstinner,2017-02-06 17:11 review
pyobject_fastcall-2.patch vstinner,2017-02-07 01:05 review
pyobject_fastcall-3.patch vstinner,2017-02-07 01:12 review
pyobject_fastcall-4.patch vstinner,2017-02-07 14:46 review
pyobject_fastcall-5.patch vstinner,2017-02-08 17:35 review
pyobject_fastcall-6.patch vstinner,2017-02-08 17:38 review
Messages (18)
msg287153 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-06 17:08
While testing issue #29464 patch, I failed to see a major enhancement on the stack usage of fast calls without keyword arguments. The problem is that functions like _PyObject_FastCallKeywords() and _PyObject_FastCallDict() still have to pass a NULL argument for kwargs/kwnames, and have code to handle keyword arguments. Attached patch adds _PyObject_FastCall() to reduce the stack consumption. At the C level, keyword arguments are almost never used. For example, PyObject_CallFunctionObjArgs() is commonly used, whereas it only uses positional arguments. The patch changes also _PyObject_FastCallKeywords() and _PyObject_FastCallDict() to move the "slow" path creating temporary tuple and dict in a subfunction which is not inlined. The slow path requires more stack memory. Morecall, _PyObject_FastCallKeywords() and _PyObject_FastCallDict() are modified to call _PyObject_FastCall() if there is no keyword. The patch might make function calls without keyword arguments faster, I didn't check. Stack usage. $ ./python -c 'import _testcapi, sys; sys.setrecursionlimit(10**5); n=1000; s=_testcapi.meth_fastcall_stacksize(n); print("%.1f B/call" % (s/n))' * Reference: 832.8 B/call * Patch: 656.6 B/call (-176.2 B) I don't know why the stack usage is not an integer number of bytes? Combined with the issue #29464 "Specialize FASTCALL for functions with positional-only parameters", the stack usage can be even more reduced by a few bytes. See the issue #28870 for the previous work on reducing stack consumption.
msg287154 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-06 17:11
meth_fastcall_stacksize.patch: Patch adding _testcapi.meth_fastcall_stacksize() to measure the stack usage to call a METH_FASTCALL function.
msg287183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 01:05
pyobject_fastcall-2.patch: More complete changes. Sorry, the patch also contains unrelated refactoring! It's a more advanced implementation which tries to reduce the depth of the C backtrace. For example, _PyObject_FastCall() is now inlined manually in _PyObject_FastCallDict(). PyObject_Call() is also rewritten. If the overall approach is validated, I will rewriten the patch differently to limit changes, or push some changes in multiple commits. Results of testcapi_stacksize.patch + stack_overflow_28870-sp.py (from issue #28870). Reference: haypo@smithers$ ../default-ref/python stack_overflow_28870-sp.py test_python_call: 8586 calls before crash, stack: 976 bytes/call test_python_getitem: 9188 calls before crash, stack: 912 bytes/call test_python_iterator: 7936 calls before crash, stack: 1056 bytes/call => total: 25710 calls, 2944 bytes Patch: haypo@smithers$ ./python stack_overflow_28870-sp.py test_python_call: 9883 calls before crash, stack: 848 bytes/call (-128 B) test_python_getitem: 10476 calls before crash, stack: 800 bytes/call (- 112 B) test_python_iterator: 8878 calls before crash, stack: 944 bytes/call (- 112 B) => total: 29237 calls (+3616), 2592 bytes (- 352 B)
msg287184 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 01:12
Oh, I forgot to rebase my local git branch: patch version 2 contains unrelated changes. Please see instead the path version 3 which was rebased.
msg287204 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 08:17
Benchmarks result. Some are slower (code placement issue?), the faster benchmark are not really much faster. haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-3_ref_e06af4027546.json -G --min-speed=4 Slower (4): - scimark_lu: 352 ms +- 11 ms -> 374 ms +- 15 ms: 1.06x slower (+6%) - float: 238 ms +- 3 ms -> 251 ms +- 4 ms: 1.05x slower (+5%) - logging_silent: 558 ns +- 15 ns -> 585 ns +- 16 ns: 1.05x slower (+5%) - raytrace: 1.04 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.05x slower (+5%) Faster (5): - nbody: 233 ms +- 2 ms -> 216 ms +- 2 ms: 1.08x faster (-7%) - pickle_dict: 56.6 us +- 4.3 us -> 52.9 us +- 2.2 us: 1.07x faster (-6%) - nqueens: 217 ms +- 2 ms -> 205 ms +- 2 ms: 1.06x faster (-6%) - unpickle_pure_python: 670 us +- 11 us -> 642 us +- 8 us: 1.04x faster (-4%) - scimark_sor: 405 ms +- 11 ms -> 389 ms +- 10 ms: 1.04x faster (-4%) Benchmark hidden because not significant (55): (...)
msg287224 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 10:16
Oh, pyobject_fastcall-3.patch contains a performance bug :-p PyObject_Call() "unpacks" the tuple and then recreates a new tuple to call functions in for functions other than Python and C functions.
msg287234 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 14:46
Ok, I fixed the PyObject_Call() bug, and tried a new approach to help the compiler for code placement: I moved all "call" functions into a new Objects/call.c file. It should help the compiler to inline more code, and I move functions which call themself closer: _PyObject_FastCall(), _PyCFunction_FastCall() and _PyFunction_FastCall() are now very close in the same file for example. Not only the stack usage is better, but the performance seems also better: see benchmark results below. Again, if you like the _PyObject_FastCall() idea (to reduce the stack consumption), I will split my big patch into smaller patches to have a "smoother" hg/git history. Benchmark results on speed-python, CPython compiled with LTO+PGO: haypo@speed-python$ python3 -m perf compare_to /home/haypo/benchmarks/2017-02-06_07-15-default-e06af4027546.json pyobject_fastcall-4_ref_e06af4027546.json -G --min-speed=4 Faster (11): - sympy_expand: 949 ms +- 12 ms -> 878 ms +- 7 ms: 1.08x faster (-8%) - chameleon: 21.9 ms +- 0.2 ms -> 20.5 ms +- 0.3 ms: 1.07x faster (-7%) - sympy_str: 425 ms +- 5 ms -> 399 ms +- 5 ms: 1.07x faster (-6%) - sympy_integrate: 40.8 ms +- 0.3 ms -> 38.3 ms +- 0.4 ms: 1.06x faster (-6%) - sympy_sum: 192 ms +- 6 ms -> 181 ms +- 7 ms: 1.06x faster (-6%) - scimark_lu: 352 ms +- 11 ms -> 332 ms +- 13 ms: 1.06x faster (-6%) - xml_etree_generate: 208 ms +- 2 ms -> 197 ms +- 3 ms: 1.05x faster (-5%) - nbody: 233 ms +- 2 ms -> 222 ms +- 3 ms: 1.05x faster (-5%) - telco: 14.7 ms +- 0.3 ms -> 14.0 ms +- 0.3 ms: 1.05x faster (-5%) - scimark_fft: 662 ms +- 10 ms -> 633 ms +- 8 ms: 1.05x faster (-4%) - genshi_text: 71.0 ms +- 0.8 ms -> 68.3 ms +- 0.7 ms: 1.04x faster (-4%) Benchmark hidden because not significant (53): (...)
msg287236 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 15:05
Oh, with pyobject_fastcall-4.patch, the stack usage of test_python_iterator is even better. The reference is 1056 bytes/call, pyobject_fastcall-2.patch was 944 bytes/call. haypo@smithers$ ./python ../default/stack_overflow_28870-sp.py test_python_call: 9883 calls before crash, stack: 848 bytes/call test_python_getitem: 10476 calls before crash, stack: 800 bytes/call test_python_iterator: 9189 calls before crash, stack: 912 bytes/call => total: 29548 calls, 2560 bytes
msg287254 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-02-07 19:31
Awesome! This looks fantastic! I need to check the patch very carefully to be sure that we haven't missed something important. Isn't the Python directory more appropriate place for call.c?
msg287257 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 21:49
Serhiy Storchaka added the comment: > Isn't the Python directory more appropriate place for call.c? I moved code from other .c files in Objects/, so for me it seems more natural to add the code in Objects/ as well. It seems like most of the code in Python/ is not "aware" of types defined in Objects/. But I don't have a strong opinion on the right directory. Objects/call.c is 1500 lines long. IMHO the code became big enough to justify to move it to a new file ;-)
msg287261 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-07 22:42
pyobject_fastcall-4.patch combines a lot of changes. I wrote it to experiment _PyObject_FastCall(). I will rewrite these changes as a patch serie with smaller changes. The design of PyObject_FastCall*() is to be short and simple. Changes: * Add _PyObject_FastCall(), _PyFunction_FastCall(), _PyCFunction_FastCall(): similar to the _PyObject_FastCallKeywords() but without keyword arguments. Without keyword arguments, the checks on keyword arguments can be removed, and it allows to reduce the stack usage: one less parameter to C functions. * Move the slow path out of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in a new subfunction which is tagged with _Py_NO_INLINE to reduce the stack consumption of _PyObject_FastCallDict() and _PyObject_FastCallKeywords() in their fast paths, which are the most code paths. * Move all "call" functions into a new Objects/call.c function. This change should help code placement to enhance the usage of the CPU caches (especially the CPU L1 instruction cache). In my long experience with benchmarking last year, I notice huge performance differences caused by code placement. See my blog post: https://haypo.github.io/analysis-python-performance-issue.html Sadly, _Py_HOT_FUNCTION didn't fix the issue completely. * _PyObject_FastCallDict() and _PyObject_FastCallKeywords() "call" _PyObject_FastCall(). In fact, _PyObject_FastCall() is inlined manually in these functions, against, to minimize the stack usage. Similar change in _PyCFunction_FastCallDict() and PyCFunction_Call(). * Since the slow path is moved to a subfunction, I removed _Py_NO_INLINE from _PyStack_AsTuple() to allow the compiler to inline it if it wants to ;-) Since the stack usage is better with the patch, it seems like this strange has no negative impact on the stack usage. * Optimize PyMethodDescr_Type (call _PyMethodDescr_FastCallKeywords()) in _PyObject_FastCallKeywords() * I moved Py_EnterRecursiveCall() closer to the final function call, to simplify error handling. It would be nice to fix the issue #29306 before :-/ * I had to copy/paste the null_error() function from Objects/abstract.c. I don't think that it's worth it to add a _PyErr_NullError() function shared by abstract.c and call.c. Compilers are even able to merge duplicated functions ;-) * A few other changes.
msg287358 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-08 17:35
I splitted my big patch into smaller changes, see my pull request: https://github.com/python/cpython/pull/74 If you prefer Rietveld ([Review] button), I also attached the squashed change: pyobject_fastcall-5.patch.
msg287360 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-08 17:38
Crap, pyobject_fastcall-5.patch was not rebased correctly :-/ Please see pyobject_fastcall-6.patch instead.
msg287373 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-02-09 01:03
New changeset 31342913fb1e by Victor Stinner in branch 'default': Fix PyCFunction_Call() performance issue https://hg.python.org/cpython/rev/31342913fb1e
msg287377 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-02-09 02:00
New changeset e9807e307f1f561a2dfe3e9f97a38444528dba86 by Victor Stinner in branch 'master': Fix PyCFunction_Call() performance issue https://github.com/python/cpython/commit/e9807e307f1f561a2dfe3e9f97a38444528dba86
msg287531 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017-02-10 13:15
New changeset f23fa1f7b68f by Victor Stinner in branch 'default': Issue #29465: Add Objects/call.c file https://hg.python.org/cpython/rev/f23fa1f7b68f
msg288526 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-24 15:23
Hum, code to call functions changed a lot recently, and other issues already enhanced the stack usage. I should measure the stack usage to check if it's still worth it.
msg325823 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-09-19 23:21
> I should measure the stack usage to check if it's still worth it. IMHO we gone far enough in optimizing the stack usage: https://vstinner.github.io/contrib-cpython-2017q1.html I close the issue.
History
Date User Action Args
2022-04-11 14:58:42 admin set github: 73651
2018-09-19 23:21:23 vstinner set status: open -> closedresolution: fixedmessages: + stage: patch review -> resolved
2017-12-13 10:28:28 vstinner set pull_requests: - <pull%5Frequest25>
2017-02-24 15:23:19 vstinner set messages: +
2017-02-23 15:55:50 vstinner set title: Add _PyObject_FastCall() to reduce stack consumption -> Modify _PyObject_FastCall() to reduce stack consumption
2017-02-10 13:15:39 python-dev set messages: +
2017-02-09 02:00:23 python-dev set messages: +
2017-02-09 01:03:06 python-dev set nosy: + python-devmessages: +
2017-02-08 17:38:08 vstinner set files: + pyobject_fastcall-6.patchmessages: +
2017-02-08 17:35:47 vstinner set files: + pyobject_fastcall-5.patchmessages: +
2017-02-08 17:33:47 vstinner set pull_requests: + <pull%5Frequest25>
2017-02-07 22:42:44 vstinner set messages: +
2017-02-07 21:49:25 vstinner set messages: +
2017-02-07 19:31:17 serhiy.storchaka set messages: + stage: patch review
2017-02-07 15:05:00 vstinner set messages: +
2017-02-07 14:46:05 vstinner set files: + pyobject_fastcall-4.patchmessages: +
2017-02-07 10:16:13 vstinner set messages: +
2017-02-07 08:17:32 vstinner set messages: +
2017-02-07 01:12:06 vstinner set files: + pyobject_fastcall-3.patchmessages: +
2017-02-07 01:05:56 vstinner set files: + pyobject_fastcall-2.patchmessages: +
2017-02-06 17:11:05 vstinner set files: + meth_fastcall_stacksize.patchmessages: +
2017-02-06 17:08:27 vstinner create