msg403695 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-11 22:18 |
The public C API should avoid directly accessing PyTypeObject members: see bpo-40170. I propose moving the static inline functions to the internal C API and exposing only opaque function calls in the public C API. |
|
|
msg403696 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-11 22:18 |
New changeset fb8f208a4ddb38eedee71f9ecd0f22058802dab1 by Victor Stinner in branch 'main': bpo-45439: _PyObject_Call() only checks tp_vectorcall_offset once (GH-28890) https://github.com/python/cpython/commit/fb8f208a4ddb38eedee71f9ecd0f22058802dab1 |
|
|
msg403698 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-11 22:42 |
New changeset ce3489cfdb9f0e050bdc45ce5d3902c2577ea683 by Victor Stinner in branch 'main': bpo-45439: Rename _PyObject_CallNoArg() to _PyObject_CallNoArgs() (GH-28891) https://github.com/python/cpython/commit/ce3489cfdb9f0e050bdc45ce5d3902c2577ea683 |
|
|
msg403708 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-12 06:38 |
New changeset d943d19172aa93ce88bade15b9f23a0ce3bc72ff by Victor Stinner in branch 'main': bpo-45439: Move _PyObject_CallNoArgs() to pycore_call.h (GH-28895) https://github.com/python/cpython/commit/d943d19172aa93ce88bade15b9f23a0ce3bc72ff |
|
|
msg403768 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-13 00:28 |
I should also check the stack consumption again. Old issues:
* bpo-29465: Modify _PyObject_FastCall() to reduce stack consumption
* bpo-29234: Disable inlining of _PyStack_AsTuple() to reduce the stack consumption
* bpo-29227: Reduce C stack consumption in function calls
* bpo-28858: Fastcall uses more C stack
See also: "Stack consumption" of https://vstinner.github.io/contrib-cpython-2017q1.html |
|
|
msg403770 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-13 00:31 |
5 years ago, I added _PyObject_CallArg1() (similar to PyObject_CallOneArg()) to reduce the stack consumption, and then I removed it because it consumed more stack memory than the existing function.

commit 7bfb42d5b7721ca26e33050d025fec5c43c00058
Author: Victor Stinner <victor.stinner@gmail.com>
Date: Mon Dec 5 17:04:32 2016 +0100

Issue #28858: Remove _PyObject_CallArg1() macro

Replace _PyObject_CallArg1(func, arg) with PyObject_CallFunctionObjArgs(func, arg, NULL)

Using the _PyObject_CallArg1() macro increases the usage of the C stack, which was unexpected and unwanted. PyObject_CallFunctionObjArgs() doesn't have this issue. |
|
|
msg403877 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-13 22:03 |
I measured the stack consumption using the attached sys_call.patch and stack_overflow-4.py. Using gcc -O3, the stack consumption with PR 28893 is *way better* on the 6 benchmarks (6 ways to call functions), especially:

PyObject_CallOneArg(): 624 bytes/call => 528 bytes/call (-96 bytes)
PyObject_CallNoArgs(): 608 bytes/call => 512 bytes/call (-96 bytes)
_PyObject_CallNoArgs(): 608 bytes/call => 512 bytes/call (-96 bytes)

Python built in release mode with gcc -O3: ./configure && make

=== ref ===
$ ./python stack_overflow-4.py
test_python_call: 10070 calls before crash, stack: 832 bytes/call
test_python_getitem: 16894 calls before crash, stack: 496 bytes/call
test_python_iterator: 12773 calls before crash, stack: 656 bytes/call
test_callonearg: 13428 calls before crash, stack: 624 bytes/call
test_callnoargs: 13779 calls before crash, stack: 608 bytes/call
test_callnoargs_inline: 13782 calls before crash, stack: 608 bytes/call
=> total: 80726 calls, 3824 bytes

=== PR ===
$ ./python stack_overflow-4.py
test_python_call: 11901 calls before crash, stack: 704 bytes/call
test_python_getitem: 18703 calls before crash, stack: 448 bytes/call
test_python_iterator: 14961 calls before crash, stack: 560 bytes/call
test_callonearg: 15868 calls before crash, stack: 528 bytes/call
test_callnoargs: 16366 calls before crash, stack: 512 bytes/call
test_callnoargs_inline: 16365 calls before crash, stack: 512 bytes/call
=> total: 94164 calls, 3264 bytes |
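The two reported columns are linked: with a fixed amount of stack available, fewer bytes per call means more calls before the crash. A minimal sketch, using only the "ref" figures from the gcc -O3 run, multiplies the two columns for each benchmark. The names `REF` and `stack_used` are made up for this illustration, and the actual stack limit of the test machine is not stated in this thread, so the code does not assume one.

```python
# Figures copied from the "ref" run (gcc -O3, release build):
# test name -> (calls before crash, bytes per call).
REF = {
    "test_python_call": (10070, 832),
    "test_python_getitem": (16894, 496),
    "test_python_iterator": (12773, 656),
    "test_callonearg": (13428, 624),
    "test_callnoargs": (13779, 608),
    "test_callnoargs_inline": (13782, 608),
}


def stack_used(results):
    """Estimate the stack consumed at crash time as calls * bytes/call."""
    return {name: calls * size for name, (calls, size) in results.items()}


used = stack_used(REF)
for name, total in used.items():
    print(f"{name}: {total} bytes")

# A small spread suggests every benchmark hit roughly the same limit.
spread = max(used.values()) - min(used.values())
print(f"spread between benchmarks: {spread} bytes")
```

If the products stay close to each other, the "calls before crash" column carries the same information as "bytes/call", just inverted.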
|
|
msg403878 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-13 22:37 |
With LTO, PR 28893 *increases* the stack memory usage. It's the opposite :-)

PyObject_CallOneArg(): 672 bytes/call => 688 bytes/call (+16 bytes)
PyObject_CallNoArgs(): 640 bytes/call => 672 bytes/call (+32 bytes)
_PyObject_CallNoArgs(): 640 bytes/call => 672 bytes/call (+32 bytes)

clang with LTO:
./configure --with-lto CC=clang LD=lld LDFLAGS="-fuse-ld=lld"
make

=== ref ===
$ ./python stack_overflow-4.py
test_python_call: 9187 calls before crash, stack: 912 bytes/call
test_python_getitem: 15868 calls before crash, stack: 528 bytes/call
test_python_iterator: 11901 calls before crash, stack: 704 bytes/call
test_callonearg: 12468 calls before crash, stack: 672 bytes/call
test_callnoargs: 13091 calls before crash, stack: 640 bytes/call
test_callnoargs_inline: 13092 calls before crash, stack: 640 bytes/call
=> total: 75607 calls, 4096 bytes

=== PR ===
$ ./python stack_overflow-4.py
test_python_call: 9186 calls before crash, stack: 912 bytes/call
test_python_getitem: 15400 calls before crash, stack: 544 bytes/call
test_python_iterator: 11384 calls before crash, stack: 736 bytes/call
test_callonearg: 12177 calls before crash, stack: 688 bytes/call
test_callnoargs: 12468 calls before crash, stack: 672 bytes/call
test_callnoargs_inline: 12467 calls before crash, stack: 672 bytes/call
=> total: 73082 calls, 4224 bytes |
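To make the "opposite" result easier to see, here is a small sketch that computes the per-benchmark change in bytes/call (PR minus ref) for both builds measured in this issue. All figures are copied from the two runs in this thread. The names `RESULTS` and `deltas` are hypothetical helpers for this illustration and are not part of stack_overflow-4.py.

```python
TESTS = (
    "test_python_call", "test_python_getitem", "test_python_iterator",
    "test_callonearg", "test_callnoargs", "test_callnoargs_inline",
)

# bytes/call per benchmark, in the order of TESTS.
RESULTS = {
    "gcc -O3": {
        "ref": (832, 496, 656, 624, 608, 608),
        "pr": (704, 448, 560, 528, 512, 512),
    },
    "clang LTO": {
        "ref": (912, 528, 704, 672, 640, 640),
        "pr": (912, 544, 736, 688, 672, 672),
    },
}


def deltas(ref, pr):
    """Return {test: pr - ref} in bytes/call; negative means less stack used."""
    return {name: p - r for name, r, p in zip(TESTS, ref, pr)}


all_deltas = {
    build: deltas(runs["ref"], runs["pr"]) for build, runs in RESULTS.items()
}

for build, per_test in all_deltas.items():
    print(build)
    for name, delta in per_test.items():
        print(f"  {name}: {delta:+d} bytes/call")
    print(f"  total: {sum(per_test.values()):+d} bytes")
```

The sign of the total flips between the two builds, which is why it is hard to draw a single conclusion about the PR's effect on stack usage.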
|
|
msg403937 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-14 19:53 |
New changeset 3cc56c828d2d8f8659ea49447234bf0d2b87cd64 by Victor Stinner in branch 'main': bpo-45439: Move _PyObject_VectorcallTstate() to pycore_call.h (GH-28893) https://github.com/python/cpython/commit/3cc56c828d2d8f8659ea49447234bf0d2b87cd64 |
|
|
msg403939 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2021-10-14 19:59 |
I decided to merge my PR to address the initial issue of https://bugs.python.org/issue45439: "[C API] Move usage of **tp_vectorcall_offset** from public headers to the internal C API".

In recent years, I added `tstate` parameters to internal C functions. The agreement was that only internal functions should use it and, indirectly, that this `tstate` parameter should be hidden. I'm not sure exactly when, but `tstate` started to pop up in `Include/cpython/abstract.h` around "call" functions. This PR fixes this issue.

About the impact on performance: well, it's really hard to draw a clear conclusion. Inlining, LTO and PGO give different results on runtime performance and stack memory usage. IMO the fact that public C API functions are now regular functions should not prevent us from continuing to (micro) optimize Python. We can always add a variant to the internal C API with a slightly different API (e.g. with a `tstate` parameter) or defined as a static inline function rather than a regular function.

The unclear part is whether PyObject_CallOneArg() (regular function call) is faster than _PyObject_CallOneArg() (static inline function, inlined). The performance may depend on whether it's called in the Python executable or in a dynamic library (PLT indirection, which may be avoided by `gcc -fno-semantic-interposition`).

Well, happy hacking and let's continue *continuous* benchmarking Python! |
|
|