Issue 29735: Optimize functools.partial() for positional arguments
Created on 2017-03-06 13:22 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (10)
Author: STINNER Victor (vstinner) *
Date: 2017-03-06 13:22
The pull request makes functools.partial() faster for positional arguments by avoiding the creation of a tuple for them. It allocates a small stack buffer for up to 5 parameters. Curiously, even when the small buffer is not used, calls are still faster.
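For readers unfamiliar with the internals, here is a simplified pure-Python model of what partial.__call__ does today: each call builds a new tuple (self.args + args) before invoking the wrapped function. The C patch avoids this temporary tuple by copying the arguments into a small stack buffer and using the FASTCALL convention. SimplePartial is an illustrative stand-in, not the real implementation.

```python
from functools import partial

class SimplePartial:
    """Illustrative pure-Python model of functools.partial."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __call__(self, *args):
        # This tuple concatenation is the per-call allocation that the
        # C patch eliminates for functions supporting FASTCALL.
        return self.func(*(self.args + args))

add = lambda x, y: x + y
g = SimplePartial(add, 1)
assert g(2) == partial(add, 1)(2) == 3
```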
Use small buffer, total: 2 positional arguments.

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 138 ns +- 1 ns
patch: ..................... 121 ns +- 1 ns

Median +- std dev: [ref] 138 ns +- 1 ns -> [patch] 121 ns +- 1 ns: 1.14x faster (-12%)
Don't use small buffer, total: 6 positional arguments.

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 156 ns +- 1 ns
patch: ..................... 136 ns +- 0 ns

Median +- std dev: [ref] 156 ns +- 1 ns -> [patch] 136 ns +- 0 ns: 1.15x faster (-13%)
Another benchmark with 10 positional arguments:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6, a7, a8, a9, a10: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6, 7, 8, 9, 10)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 193 ns +- 1 ns
patch: ..................... 166 ns +- 2 ns

Median +- std dev: [ref] 193 ns +- 1 ns -> [patch] 166 ns +- 2 ns: 1.17x faster (-14%)
Author: STINNER Victor (vstinner) *
Date: 2017-03-06 13:32
functools.partial() is commonly used in the asyncio module. The asyncio documentation suggests using it, because of deliberate limitations of the asyncio API.
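For context, the asyncio pattern looks roughly like this: callback APIs such as loop.call_soon() deliberately accept only positional arguments for the callback, so keyword arguments must be bound in advance with functools.partial. The log() helper below is illustrative, not from the original discussion.

```python
import asyncio
from functools import partial

results = []

def log(msg, *, upper=False):
    results.append(msg.upper() if upper else msg)

async def main():
    loop = asyncio.get_running_loop()
    # loop.call_soon(log, "hi", upper=True) would raise a TypeError:
    # call_soon() would try to consume upper=True itself, so the
    # keyword argument has to be bound with partial() beforehand.
    loop.call_soon(partial(log, "hi", upper=True))
    await asyncio.sleep(0)  # yield so the scheduled callback runs

asyncio.run(main())
```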
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2017-03-06 14:58
What about C stack consumption? Doesn't this increase it?

Since nested partial()s are collapsed, you need to interleave them with another wrapper for testing.
from functools import partial

def decorator(f):
    def wrapper(*args):
        return f(*args)
    return wrapper

def f(*args):
    pass

n = 100  # depth of the wrapper chain
for i in range(n):
    f = partial(f)   # would be collapsed without the interleaved wrapper
    f = decorator(f)

f(1, 2)
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2017-03-06 16:52
If the underlying function doesn't support fast call, and either args or pto->args is empty, partial_call() makes two unneeded copies: arguments are copied from a tuple to the raw array, and from the array to a new tuple. This is what the current code does, but it can be avoided.

If the underlying function doesn't support fast call, and both args and pto->args are non-empty, the patched partial_call() makes one unneeded copy: arguments are copied from the tuples to the raw array, and from the array to the new tuple. Only one copy is needed (from the tuples to the new tuple).
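The copy counts above can be illustrated in Python terms (a hedged sketch; "pto_args" stands for the frozen arguments and "args" for the call-time ones, mirroring the C variable names):

```python
pto_args = (1, 2)  # frozen arguments stored on the partial object
args = (3, 4)      # arguments supplied at call time

# Patched path for a non-FASTCALL callee: tuples -> raw array -> new
# tuple, i.e. one more copy than necessary.
stack = list(pto_args) + list(args)  # copy into the raw array
merged_two_copies = tuple(stack)     # copy again into the final tuple

# The single-copy path suggested here: concatenate the tuples directly.
merged_one_copy = pto_args + args

assert merged_two_copies == merged_one_copy == (1, 2, 3, 4)
```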
Author: STINNER Victor (vstinner) *
Date: 2017-03-14 10:35
> If the underlying function doesn't support fast call, and both args and pto->args are non-empty, the patched partial_call() makes one unneeded copy.
The simple workaround is to revert changes using FASTCALL in partial_call().
But for best performance, it seems like we need two code paths, depending on whether the function supports fastcall or not. I will try to write a patch for that.
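The proposed dispatch can be sketched like this. The helper names (supports_fastcall, partial_fastcall) mirror the C functions mentioned in the later messages, but these Python stand-ins are illustrative, not real CPython APIs:

```python
def supports_fastcall(func):
    # In C this would inspect the callable's type; here it is a stub
    # driven by an illustrative attribute.
    return getattr(func, "_fastcall", False)

def partial_fastcall(func, frozen_args, call_args):
    # Stand-in for the specialized path: the C version copies both
    # argument sets into a small stack buffer, allocating no tuple.
    return func(*frozen_args, *call_args)

def partial_call(func, frozen_args, call_args):
    if supports_fastcall(func):
        return partial_fastcall(func, frozen_args, call_args)
    # Legacy tp_call path: build the positional-arguments tuple once,
    # directly from the two source tuples (the single-copy path).
    return func(*(frozen_args + call_args))

assert partial_call(lambda a, b, c: a + b + c, (1, 2), (3,)) == 6
```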
Author: STINNER Victor (vstinner) *
Date: 2017-03-14 12:08
bench_fastcall_partial.py: a more complete microbenchmark.
I rewrote my patch:
- I added _PyObject_HasFastCall(callable): return 1 if callable supports FASTCALL calling convention for positional arguments
- I split partial_call() into 2 subfunctions: partial_fastcall() is specialized for FASTCALL, partial_call_impl() uses PyObject_Call() with a tuple for positional arguments
The patch fixes the performance regression for VARARGS and optimizes FASTCALL:
haypo@smithers$ ./python -m perf compare_to ref.json patch.json --table
+-----------------------------+---------+------------------------------+
| Benchmark                   | ref     | patch                        |
+=============================+=========+==============================+
| partial Python, 1+1 arg     | 135 ns  | 118 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 2+0 arg     | 114 ns  | 91.4 ns: 1.25x faster (-20%) |
+-----------------------------+---------+------------------------------+
| partial Python, 5+1 arg     | 151 ns  | 135 ns: 1.12x faster (-11%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 5+5 arg     | 192 ns  | 168 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial C VARARGS, 2+0 arg  | 153 ns  | 127 ns: 1.20x faster (-17%)  |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 1+1 arg | 111 ns  | 93.7 ns: 1.18x faster (-15%) |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 2+0 arg | 63.9 ns | 64.6 ns: 1.01x slower (+1%)  |
+-----------------------------+---------+------------------------------+
Not significant (1): partial C VARARGS, 1+1 arg
Author: STINNER Victor (vstinner) *
Date: 2017-03-14 12:10
> What about C stack consumption? Doesn't this increase it?
Yes, my optimization consumes more C stack: small_stack allocates 80 bytes on the stack (for 5 positional arguments). Is that an issue?
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2017-03-14 13:25
Nice results.
You did great work decreasing C stack consumption. It would be sad to lose it without a good reason. Could you please compare the two variants, with and without the small stack?
Author: STINNER Victor (vstinner) *
Date: 2017-03-14 15:02
I measured that my patch (pull request) increases stack usage by 64 bytes per partial_call() call. I consider that acceptable for a speedup of between 1.12x and 1.25x.
Attached partial_stack_usage.py requires testcapi_stack_pointer.patch of issue #28870.
Original:
f():  [1000 calls] 624.0 B per call
f2(): [1000 calls] 624.0 B per call
Patched:
f():  [1000 calls] 688.0 B per call (+64 B)
f2(): [1000 calls] 688.0 B per call (+64 B)
Author: STINNER Victor (vstinner) *
Date: 2017-03-24 22:19
New changeset 0f7b0b397e12514ee213bc727c9939b66585cbe2 by Victor Stinner in branch 'master': bpo-29735: Optimize partial_call(): avoid tuple (#516) https://github.com/python/cpython/commit/0f7b0b397e12514ee213bc727c9939b66585cbe2
History
Date
User
Action
Args
2022-04-11 14:58:43
admin
set
github: 73921
2017-03-24 22:19:27
vstinner
set
messages: +
2017-03-14 20:42:37
vstinner
set
status: open -> closed
resolution: fixed
stage: patch review -> resolved
2017-03-14 15:02:03
vstinner
set
files: + partial_stack_usage.py
messages: +
2017-03-14 13:25:02
serhiy.storchaka
set
messages: +
2017-03-14 12:10:04
vstinner
set
messages: +
2017-03-14 12:08:12
vstinner
set
files: + bench_fastcall_partial.py
messages: +
2017-03-14 10:35:05
vstinner
set
messages: +
2017-03-06 16:52:11
serhiy.storchaka
set
messages: +
2017-03-06 14:58:01
serhiy.storchaka
set
messages: +
components: + Extension Modules
stage: patch review
2017-03-06 13:32:55
vstinner
set
nosy: + rhettinger, ncoghlan
2017-03-06 13:32:22
vstinner
set
nosy: + methane, serhiy.storchaka, yselivanov
messages: +
2017-03-06 13:29:15
vstinner
set
pull_requests: + pull_request425
2017-03-06 13:22:45
vstinner
create