Issue 45439: [C API] Move usage of tp_vectorcall_offset from public headers to the internal C API (original) (raw)

Created on 2021-10-11 22:18 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
test_bench.patch	vstinner,2021-10-12 06:58
bench.py	vstinner,2021-10-12 06:59
bench_no_args.patch	vstinner,2021-10-12 07:38
bench_no_args_inline.py	vstinner,2021-10-12 07:38
bench_no_args_public.py	vstinner,2021-10-12 07:38
test_bench2.patch	vstinner,2021-10-12 20:18
bench2.py	vstinner,2021-10-12 20:18
sys_call.patch	vstinner,2021-10-13 21:54
stack_overflow-4.py	vstinner,2021-10-13 22:03

Pull Requests
URL	Status	Linked	Edit
PR 28890	merged	vstinner,2021-10-11 22:18
PR 28891	merged	vstinner,2021-10-11 22:19
PR 28893	merged	vstinner,2021-10-11 23:04
PR 28895	merged	vstinner,2021-10-12 00:58

Messages (10)
msg403695 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-11 22:18
The public C API should avoid accessing directly PyTypeObject members: see bpo-40170. I propose to move static inline functions to the internal C API, and only expose opaque function calls to the public C API.
msg403696 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-11 22:18
New changeset fb8f208a4ddb38eedee71f9ecd0f22058802dab1 by Victor Stinner in branch 'main': bpo-45439: _PyObject_Call() only checks tp_vectorcall_offset once (GH-28890) https://github.com/python/cpython/commit/fb8f208a4ddb38eedee71f9ecd0f22058802dab1
msg403698 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-11 22:42
New changeset ce3489cfdb9f0e050bdc45ce5d3902c2577ea683 by Victor Stinner in branch 'main': bpo-45439: Rename _PyObject_CallNoArg() to _PyObject_CallNoArgs() (GH-28891) https://github.com/python/cpython/commit/ce3489cfdb9f0e050bdc45ce5d3902c2577ea683
msg403708 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-12 06:38
New changeset d943d19172aa93ce88bade15b9f23a0ce3bc72ff by Victor Stinner in branch 'main': bpo-45439: Move _PyObject_CallNoArgs() to pycore_call.h (GH-28895) https://github.com/python/cpython/commit/d943d19172aa93ce88bade15b9f23a0ce3bc72ff
msg403768 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-13 00:28
I should also check again the stack consumption. Old issues: * bpo-29465: Modify _PyObject_FastCall() to reduce stack consumption * bpo-29234: Disable inlining of _PyStack_AsTuple() to reduce the stack consumption * bpo-29227: Reduce C stack consumption in function calls * bpo-28858: Fastcall uses more C stack See also: "Stack consumption" of https://vstinner.github.io/contrib-cpython-2017q1.html
msg403770 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-13 00:31
5 years ago, I added _PyObject_CallArg1() (similar to PyObject_CallOneArg()) and then I removed it since it consumed more stack memory than existing function, whereas I added _PyObject_CallArg1() to reduce the stack consumption. commit 7bfb42d5b7721ca26e33050d025fec5c43c00058 Author: Victor Stinner <victor.stinner@gmail.com> Date: Mon Dec 5 17:04:32 2016 +0100 Issue #28858: Remove _PyObject_CallArg1() macro Replace _PyObject_CallArg1(func, arg) with PyObject_CallFunctionObjArgs(func, arg, NULL) Using the _PyObject_CallArg1() macro increases the usage of the C stack, which was unexpected and unwanted. PyObject_CallFunctionObjArgs() doesn't have this issue.
msg403877 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-13 22:03
I measured the stack consumption using attached sys_call.patch and stack_overflow-4.py. Using gcc -O3, the stack consumption with PR 28893 is way better on the 6 benchmarks (6 ways to call functions), especially: PyObject_CallOneArg(): 624 bytes/call => 528 bytes/call (-96 bytes) PyObject_CallNoArg(): 608 bytes/call => 512 bytes/call (-96 bytes) _PyObject_CallNoArg(): 608 bytes/call => 512 bytes/call (-96 bytes) Python built in release mode with gcc -O3: ./configure && make === ref === $ ./python stack_overflow-4.py test_python_call: 10070 calls before crash, stack: 832 bytes/call test_python_getitem: 16894 calls before crash, stack: 496 bytes/call test_python_iterator: 12773 calls before crash, stack: 656 bytes/call test_callonearg: 13428 calls before crash, stack: 624 bytes/call test_callnoargs: 13779 calls before crash, stack: 608 bytes/call test_callnoargs_inline: 13782 calls before crash, stack: 608 bytes/call => total: 80726 calls, 3824 bytes === PR === $ ./python stack_overflow-4.py test_python_call: 11901 calls before crash, stack: 704 bytes/call test_python_getitem: 18703 calls before crash, stack: 448 bytes/call test_python_iterator: 14961 calls before crash, stack: 560 bytes/call test_callonearg: 15868 calls before crash, stack: 528 bytes/call test_callnoargs: 16366 calls before crash, stack: 512 bytes/call test_callnoargs_inline: 16365 calls before crash, stack: 512 bytes/call => total: 94164 calls, 3264 bytes
msg403878 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-13 22:37
Using LTO, the PR 28893 increases the stack memory usage. It's the opposite :-) PyObject_CallOneArg(): 672 bytes/call => 688 bytes/call (+16 bytes) PyObject_CallNoArg(): 640 bytes/call => 672 bytes/call (+32 bytes) _PyObject_CallNoArg(): 640 bytes/call => 672 bytes/call (+32 bytes) clang with LTO: ./configure --with-lto CC=clang LD=lld LDFLAGS="-fuse-ld=lld" make === ref === $ ./python stack_overflow-4.py test_python_call: 9187 calls before crash, stack: 912 bytes/call test_python_getitem: 15868 calls before crash, stack: 528 bytes/call test_python_iterator: 11901 calls before crash, stack: 704 bytes/call test_callonearg: 12468 calls before crash, stack: 672 bytes/call test_callnoargs: 13091 calls before crash, stack: 640 bytes/call test_callnoargs_inline: 13092 calls before crash, stack: 640 bytes/call => total: 75607 calls, 4096 bytes === PR === $ ./python stack_overflow-4.py test_python_call: 9186 calls before crash, stack: 912 bytes/call test_python_getitem: 15400 calls before crash, stack: 544 bytes/call test_python_iterator: 11384 calls before crash, stack: 736 bytes/call test_callonearg: 12177 calls before crash, stack: 688 bytes/call test_callnoargs: 12468 calls before crash, stack: 672 bytes/call test_callnoargs_inline: 12467 calls before crash, stack: 672 bytes/call => total: 73082 calls, 4224 bytes
msg403937 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-14 19:53
New changeset 3cc56c828d2d8f8659ea49447234bf0d2b87cd64 by Victor Stinner in branch 'main': bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h (GH-28893) https://github.com/python/cpython/commit/3cc56c828d2d8f8659ea49447234bf0d2b87cd64
msg403939 - (view)	Author: STINNER Victor (vstinner) *	Date: 2021-10-14 19:59
I decided to merge my PR to address https://bugs.python.org/issue45439 initial issue: "[C API] Move usage of tp_vectorcall_offset from public headers to the internal C API". Last years, I added `tstate` parameters to internal C functions. The agreement was that only internal functions should use it, and indirectly that this `tstate` parameter should be hidden. I'm now sure exactly, but `tstate` started to pop up in `Include/cpython/abstract.h` around "call" functions. This PR fix this issue. About the impact on performances: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage. IMO the fact that public C API functions are now regular functions should not prevent us to continue (micro) optimizing Python. We can always add a variant to the internal C API using an API a little bit different (e.g. add `tstate` parameter) or defined as a static inline function, rather than a regular function. The unclear part is if PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend if it's called in the Python executable or in a dynamic library (PLT indirection which may be avoided by `gcc -fno-semantic-interposition`). Well, happy hacking and let's continue continuous benchmarking Python!

History
Date	User	Action	Args
2022-04-11 14:59:51	admin	set	github: 89602
2021-10-15 00:42:16	vstinner	set	status: open -> closedresolution: fixedstage: patch review -> resolved
2021-10-14 19:59:41	vstinner	set	messages: +
2021-10-14 19:53:12	vstinner	set	messages: +
2021-10-13 22:37:51	vstinner	set	messages: +
2021-10-13 22:03:50	vstinner	set	files: + stack_overflow-4.pymessages: +
2021-10-13 21:54:14	vstinner	set	files: + sys_call.patch
2021-10-13 00:31:56	vstinner	set	messages: +
2021-10-13 00:28:53	vstinner	set	messages: +
2021-10-12 20🔞48	vstinner	set	files: + bench2.py
2021-10-12 20🔞41	vstinner	set	files: + test_bench2.patch
2021-10-12 07:38:16	vstinner	set	files: + bench_no_args_public.py
2021-10-12 07:38:10	vstinner	set	files: + bench_no_args_inline.py
2021-10-12 07:38:04	vstinner	set	files: + bench_no_args.patch
2021-10-12 06:59:00	vstinner	set	files: + bench.py
2021-10-12 06:58:51	vstinner	set	files: + test_bench.patch
2021-10-12 06:38:27	vstinner	set	messages: +
2021-10-12 00:58:49	vstinner	set	pull_requests: + <pull%5Frequest27190>
2021-10-11 23:04:51	vstinner	set	pull_requests: + <pull%5Frequest27188>
2021-10-11 22:42:26	vstinner	set	messages: +
2021-10-11 22:19:46	vstinner	set	pull_requests: + <pull%5Frequest27186>
2021-10-11 22🔞36	vstinner	set	keywords: + patchstage: patch reviewpull_requests: + <pull%5Frequest27185>
2021-10-11 22🔞33	vstinner	set	messages: +
2021-10-11 22🔞05	vstinner	create