msg282079
Author: STINNER Victor (vstinner)
Date: 2016-11-30 11:32
The attached patch is a minor optimization for _PyFunction_FastCallDict(): avoid creating a tuple to pass keyword arguments, and use a simple C array allocated by PyMem_Malloc() instead. It also uses a small stack of 80 bytes (2*5*sizeof(PyObject*)) allocated on the C stack to pass up to 5 keyword arguments (5 key+value pairs): in that case it avoids the heap allocation entirely.

I wrote _PyFunction_FastCallDict() (issue #27809): the code was based on function_call(), which also uses PyTuple_New(). When I wrote the function, I didn't notice that PyEval_EvalCodeEx() doesn't expect a Python tuple object, but a C array.

The patch also modifies function_call() to call _PyFunction_FastCallDict(), so it benefits from the _PyFunction_FastCallDict() optimizations.
|
|
msg282080
Author: STINNER Victor (vstinner)
Date: 2016-11-30 11:49
fastcalldict.patch avoided INCREF/DECREF on keyword keys and values. This is wrong: we must hold strong references because the keyword dictionary can technically be modified during the call: see issue #2016 and test_extcall.

Hum, I'm quite sure that it's not the first time I was bitten by this bug. That's maybe why I didn't try to implement this optimization the first time.

fastcalldict-2.patch keeps the INCREF/DECREF and so doesn't crash on test_extcall.
|
|
msg282095
Author: Josh Rosenberg (josh.r)
Date: 2016-11-30 19:35
Given you can't avoid the refcounting overhead, how much does this really help? Are there meaningful benefits in microbenchmarks? I'd worry that unconditional allocation from PyMem_Malloc might lose out relative to PyTuple_New, which is likely not to involve memory allocation at all (pulling from the tuple free list).
|
|
msg282098
Author: Josh Rosenberg (josh.r)
Date: 2016-11-30 19:49
Minor correction: no allocation when the small stack is used, so you'd only see (possible) regressions with 6+ keyword arguments (assuming the tuple free list applies for tuples that large). Admittedly a minor concern; keyword processing is already pretty slow, and large numbers of keywords being passed are likely a tiny fraction of call cases, so I'd guess microbenchmarks wouldn't show a big change, and broad benchmarks would be completely unaffected, but worth checking.
|
|
msg282152
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-12-01 09:14
I agree with Josh: PyTuple_New() can be faster than PyMem_Malloc() due to the tuple free list. small_stack also increases C stack consumption even for calls without keyword arguments. This is a serious problem, since we can't control stack overflow.
|
|
msg282154
Author: STINNER Victor (vstinner)
Date: 2016-12-01 10:05
Updated patch: fix the "if (0)" so that the small stack allocated on the C stack is actually used.
|
|
msg282161
Author: STINNER Victor (vstinner)
Date: 2016-12-01 12:07
bench_fastcalldict.py: hardcore microbenchmark on _PyFunction_FastCallDict(). Pass keyword arguments to the tp_init slot of a Python constructor. Results for 1, 5 and 10 keyword arguments:

kw1: Median +- std dev: [ref] 329 ns +- 21 ns -> [patch] 306 ns +- 17 ns: 1.07x faster (-7%)
kw5: Median +- std dev: [ref] 508 ns +- 22 ns -> [patch] 481 ns +- 25 ns: 1.05x faster (-5%)
kw10: Median +- std dev: [ref] 829 ns +- 45 ns -> [patch] 805 ns +- 39 ns: 1.03x faster (-3%)

As expected, the difference is small, but it's faster :-) An indirect benefit is that the garbage collector should be less stressed :-) (tuples are tracked by the GC.)

Note: Using a simple printf() in the C code, I noticed that it is not uncommon for _PyFunction_FastCallDict() to be called with an empty dictionary for keyword arguments. Without the patch, an empty tuple was created. With my patch, "unpacking" the empty dictionary costs nothing.
|
|
msg282162
Author: STINNER Victor (vstinner)
Date: 2016-12-01 12:08
(Oops, I attached the wrong benchmark script. It's now fixed.)
|
|
msg282168
Author: STINNER Victor (vstinner)
Date: 2016-12-01 12:44
> Note: Using a simple printf() in the C code, I noticed that it is not uncommon that _PyFunction_FastCallDict() is called with an empty dictionary for keyword arguments.

Simplified Python example where _PyFunction_FastCallDict() is called with an empty dictionary:
---
def f2():
    pass

def wrapper(func, *args, **kw):
    # CALL_FUNCTION_EX: func(**{}) calls PyObject_Call() with kwargs={} which
    # calls _PyFunction_FastCallDict()
    func(*args, **kw)

def f():
    # CALL_FUNCTION: calling wrapper calls fast_function() which calls
    # _PyEval_EvalCodeWithName() which creates an empty dictionary for kw
    wrapper(f2)

f()
---

But in this specific case, the speedup is *very* small: 3 nanoseconds :-)

./python -m perf timeit -s 'kw={}' -s 'def func(): pass' --duplicate=1000 'func(**kw)'
(...)
Median +- std dev: [ref] 108 ns +- 4 ns -> [patch] 105 ns +- 5 ns: 1.02x faster (-2%)
|
|
msg282169
Author: STINNER Victor (vstinner)
Date: 2016-12-01 12:47
Serhiy: "small_stack increases C stack consumption even for calls without keyword arguments. This is serious problem since we can't control stack overflow."

This problem is not new, and it is worked around by the Py_EnterRecursiveCall() macro, which counts the depth of the Python stack. I didn't notice any stack overflow crash with my patch. We can reduce the "small_stack" size later if needed.
|
|
msg282175
Author: STINNER Victor (vstinner)
Date: 2016-12-01 13:25
> I agree with Josh, PyTuple_New() can be faster than PyMem_Malloc() due to tuple free list.

According to benchmarks, PyTuple_New() is slower than PyMem_Malloc(). That's not surprising to me: using a tuple object requires extra work:

* Track and then untrack the object from the garbage collector
* The destructor uses the Py_TRASHCAN_SAFE_BEGIN/Py_TRASHCAN_SAFE_END macros
* Some additional indirections

When I started working on "fastcall", I was surprised that not creating tuples has a *significant* (positive) effect on performance. It seems to be between 5% and 45% faster. Obviously, it depends on the speed of the function body: the speedup is higher for faster functions, like fast functions implemented in C.
|
|
msg282181
Author: Serhiy Storchaka (serhiy.storchaka)
Date: 2016-12-01 14:27
The problem with C stack overflow is not new, but your patch may make it worse (I don't know whether it actually makes it worse). Py_EnterRecursiveCall() is used for limiting the Python stack; it can't prevent C stack overflow.
|
|
msg282191
Author: STINNER Victor (vstinner)
Date: 2016-12-01 16:03
I pushed the change b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() with _PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are simpler and don't allocate memory on the C stack.

In Python 3.6 I made PyObject_CallFunctionObjArgs() similar to this issue: don't create a tuple, but use a small stack allocated on the C stack, or allocate heap memory, to pass a C array of PyObject* to _PyObject_FastCall(). Currently, PyObject_CallFunctionObjArgs() uses a small stack for up to 5 positional arguments: it allocates 40 bytes on the C stack.

Serhiy Storchaka: "The problem with C stack overflow is not new, but your patch may make it worse (I don't know if it actually make it worse)."

I consider 80 bytes small enough for the C stack. As I wrote, we can reduce this "small stack" in the future if someone reports issues.

"Py_EnterRecursiveCall() is used for limiting Python stack. It can't prevent C stack overflow."

I know that the protection is not perfect; it's a heuristic. I don't even know if it counts Python builtin functions, or only pure Python functions. But I'm not sure that I understood your comment: do you suggest using a tuple and rejecting this issue? Reducing the size of the small stack? Or only ever allocating the memory on the heap?

If the issue is the memory allocated on the C stack, maybe we can use a free list for "stacks" (C arrays of PyObject*), as is done for tuples? I'm not sure that a free list on top of PyMem_Malloc()/PyMem_Free() is useful, since that already uses our efficient pymalloc.
|
|
msg284518
Author: STINNER Victor (vstinner)
Date: 2017-01-03 00:48
Quick update on the fastcall work.

> I pushed the change b9c9691c72c5 to replace PyObject_CallFunctionObjArgs() with _PyObject_CallNoArg() or _PyObject_CallArg1(). These functions are simpler and don't allocate memory on the C stack.

Using _PyObject_CallArg1() increased the usage of the C stack: see issue #28858. The fix was to remove the _PyObject_CallArg1() macro, replacing it with PyObject_CallFunctionObjArgs(func, arg, NULL).

There is still an open issue #28870 to reduce the usage of C stack memory even more, but that is more of a general optimization.
|
|
msg284521
Author: Roundup Robot (python-dev)
Date: 2017-01-03 01:05
New changeset 5f7cd3b6c9b1 by Victor Stinner in branch 'default':
Issue #28839: Optimize function_call()
https://hg.python.org/cpython/rev/5f7cd3b6c9b1

New changeset f9dd607dc04c by Victor Stinner in branch 'default':
Optimize _PyFunction_FastCallDict() when kwargs is {}
https://hg.python.org/cpython/rev/f9dd607dc04c
|
|
msg284522
Author: STINNER Victor (vstinner)
Date: 2017-01-03 01:10
I pushed the two obvious and safe optimizations from fastcalldict-3.patch.
|
|
msg284523
Author: STINNER Victor (vstinner)
Date: 2017-01-03 01:22
fastcalldict-4.patch: rebased patch.

kw1: Median +- std dev: [ref] 290 ns +- 3 ns -> [patch] 253 ns +- 21 ns: 1.14x faster (-13%)
kw5: Median +- std dev: [ref] 438 ns +- 33 ns -> [patch] 394 ns +- 27 ns: 1.11x faster (-10%)
kw10: Median +- std dev: [ref] 663 ns +- 14 ns -> [patch] 617 ns +- 16 ns: 1.07x faster (-7%)

This patch still always allocates 10 pointers (80 bytes) on the C stack: see issue #28870, which discusses options to use less memory.
|
|
msg285177
Author: STINNER Victor (vstinner)
Date: 2017-01-11 01:22
Ok, I give up on this one. I don't think that it's worth it.
|
|