Issue 29735: Optimize functools.partial() for positional arguments

Created on 2017-03-06 13:22 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (10)

msg289100

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-06 13:22

The pull request makes functools.partial() faster for positional arguments by avoiding the creation of a tuple for the positional arguments. It allocates a small buffer on the C stack for up to 5 arguments. Interestingly, even when the small buffer is not used, the call is still faster.
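The cost being removed is easy to see in a pure-Python model of partial (a sketch of the behavior, not the C implementation): every call concatenates the stored and incoming arguments into a fresh tuple before calling the wrapped function.

```python
class PyPartial:
    """Pure-Python model of functools.partial (positional arguments only)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __call__(self, *args):
        # This concatenation builds a temporary tuple on every call.
        # The C patch instead copies both halves into a small
        # stack-allocated array (up to 5 pointers) and uses the
        # fast-call protocol, skipping the tuple entirely.
        return self.func(*(self.args + args))

g = PyPartial(lambda x, y: x + y, 1)
```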

Using the small buffer (2 positional arguments in total):

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda x, y: None; g = partial(f, 1)' 'g(2)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 138 ns +- 1 ns
patch: ..................... 121 ns +- 1 ns

Median +- std dev: [ref] 138 ns +- 1 ns -> [patch] 121 ns +- 1 ns: 1.14x faster (-12%)

Not using the small buffer (6 positional arguments in total):

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 156 ns +- 1 ns
patch: ..................... 136 ns +- 0 ns

Median +- std dev: [ref] 156 ns +- 1 ns -> [patch] 136 ns +- 0 ns: 1.15x faster (-13%)

Another benchmark with 10 positional arguments:

haypo@smithers$ ./python -m perf timeit -s 'from functools import partial; f = lambda a1, a2, a3, a4, a5, a6, a7, a8, a9, a10: None; g = partial(f, 1, 2, 3, 4, 5)' 'g(6, 7, 8, 9, 10)' --duplicate=100 --compare-to ../master-ref/python --python-names=ref:patch
ref: ..................... 193 ns +- 1 ns
patch: ..................... 166 ns +- 2 ns

Median +- std dev: [ref] 193 ns +- 1 ns -> [patch] 166 ns +- 2 ns: 1.17x faster (-14%)

msg289103

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-06 13:32

functools.partial() is commonly used with the asyncio module. The asyncio documentation suggests using it because of deliberate limitations of the asyncio API.
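For example, loop.call_soon() forwards only positional arguments to its callback, so keyword arguments must be bound in advance with partial(). A small sketch (record() and its tag parameter are made up for illustration):

```python
import asyncio
from functools import partial

results = []

def record(value, *, tag):
    results.append((tag, value))

async def main():
    loop = asyncio.get_running_loop()
    # call_soon() cannot pass keyword arguments to the callback,
    # so they are bound into the callable with partial() first.
    loop.call_soon(partial(record, "done", tag="cb"))
    await asyncio.sleep(0)  # yield so the scheduled callback runs

asyncio.run(main())
```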

msg289112

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2017-03-06 14:58

What about C stack consumption? Doesn't this increase it?

Since nested partial()s are collapsed, you need to interleave them with another wrapper for testing.

def decorator(f):
    def wrapper(*args):
        return f(*args)
    return wrapper

def f(*args):
    pass

for i in range(n):
    f = partial(f)
    f = decorator(f)

f(1, 2)

msg289120

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2017-03-06 16:52

If the underlying function doesn't support fast call, and either args or pto->args is empty, partial_call() makes two unneeded copies: arguments are copied from a tuple to the raw array, and then from the array to a new tuple. This is what the current code does, but it can be avoided.

If the underlying function doesn't support fast call, and both args and pto->args are non-empty, the patched partial_call() makes one unneeded copy: arguments are copied from the tuples to the raw array, and then from the array to the new tuple. Only one copy is needed (from the tuples directly to the new tuple).
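The avoidable copies are easy to see in a pure-Python sketch (combine_args() is a hypothetical helper, not the C code): when either side is empty, the existing tuple can be reused as-is instead of being copied through an intermediate array.

```python
def combine_args(preset, args):
    """Combine stored and incoming positional argument tuples,
    reusing an existing tuple when no concatenation is needed."""
    if not preset:
        return args       # pass the caller's tuple through, no copy
    if not args:
        return preset     # pass the stored tuple through, no copy
    return preset + args  # a single copy, directly into the new tuple
```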

msg289578

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-14 10:35

If the underlying function doesn't support fast call, and both args and pto->args are non-empty, the patched partial_call() makes one unneeded copy.

The simple workaround is to revert changes using FASTCALL in partial_call().

But for best performance, it seems we need two code paths, depending on whether the function supports fastcall or not. I will try to write a patch for that.
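In outline, the two paths could look like this pure-Python sketch (supports_fastcall() and fastcall() are hypothetical stand-ins for the C-level METH_FASTCALL machinery, not real APIs):

```python
def supports_fastcall(func):
    # Hypothetical stand-in for the C-level METH_FASTCALL check.
    return getattr(func, "_fastcall", False)

def fastcall(func, stack):
    # Hypothetical stand-in for _PyObject_FastCall(): the callee
    # receives a flat array of arguments; no tuple is built for it.
    return func(*stack)

def partial_call(func, preset, args):
    if supports_fastcall(func):
        # FASTCALL path: fill a flat buffer (the C patch uses a
        # small on-stack array for up to 5 arguments) and never
        # materialize an argument tuple.
        stack = list(preset) + list(args)
        return fastcall(func, stack)
    # VARARGS path: the callee needs a real tuple anyway, so build
    # the combined tuple directly, with no intermediate array.
    return func(*(preset + args))
```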

msg289579

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-14 12:08

bench_fastcall_partial.py: a more complete microbenchmark.

I rewrote my patch:

The patch fixes the performance regression for VARARGS and optimizes FASTCALL:

haypo@smithers$ ./python -m perf compare_to ref.json patch.json --table
+-----------------------------+---------+------------------------------+
| Benchmark                   | ref     | patch                        |
+=============================+=========+==============================+
| partial Python, 1+1 arg     | 135 ns  | 118 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 2+0 arg     | 114 ns  | 91.4 ns: 1.25x faster (-20%) |
+-----------------------------+---------+------------------------------+
| partial Python, 5+1 arg     | 151 ns  | 135 ns: 1.12x faster (-11%)  |
+-----------------------------+---------+------------------------------+
| partial Python, 5+5 arg     | 192 ns  | 168 ns: 1.15x faster (-13%)  |
+-----------------------------+---------+------------------------------+
| partial C VARARGS, 2+0 arg  | 153 ns  | 127 ns: 1.20x faster (-17%)  |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 1+1 arg | 111 ns  | 93.7 ns: 1.18x faster (-15%) |
+-----------------------------+---------+------------------------------+
| partial C FASTCALL, 2+0 arg | 63.9 ns | 64.6 ns: 1.01x slower (+1%)  |
+-----------------------------+---------+------------------------------+

Not significant (1): partial C VARARGS, 1+1 arg

msg289580

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-14 12:10

What about C stack consumption? Doesn't this increase it?

Yes, my optimization consumes more C stack: small_stack allocates 80 bytes on the stack (for 5 positional arguments). Is it an issue?

msg289582

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2017-03-14 13:25

Nice results.

You did great work on decreasing C stack consumption. It would be sad to lose it without a good reason. Could you please compare the two variants, with and without the small stack?

msg289594

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-14 15:02

I measured that my patch (pull request) increases the stack usage by 64 bytes per partial_call() call. I consider that acceptable for a speedup between 1.12x and 1.25x.

Attached partial_stack_usage.py requires testcapi_stack_pointer.patch of issue #28870.

Original:

f():  [1000 calls] 624.0 B per call
f2(): [1000 calls] 624.0 B per call

Patched:

f():  [1000 calls] 688.0 B per call (+64 B)
f2(): [1000 calls] 688.0 B per call (+64 B)

msg290183

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2017-03-24 22:19

New changeset 0f7b0b397e12514ee213bc727c9939b66585cbe2 by Victor Stinner in branch 'master': bpo-29735: Optimize partial_call(): avoid tuple (#516) https://github.com/python/cpython/commit/0f7b0b397e12514ee213bc727c9939b66585cbe2

History

Date                 User              Action  Args
2022-04-11 14:58:43  admin             set     github: 73921
2017-03-24 22:19:27  vstinner          set     messages: +
2017-03-14 20:42:37  vstinner          set     status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-03-14 15:02:03  vstinner          set     files: + partial_stack_usage.py; messages: +
2017-03-14 13:25:02  serhiy.storchaka  set     messages: +
2017-03-14 12:10:04  vstinner          set     messages: +
2017-03-14 12:08:12  vstinner          set     files: + bench_fastcall_partial.py; messages: +
2017-03-14 10:35:05  vstinner          set     messages: +
2017-03-06 16:52:11  serhiy.storchaka  set     messages: +
2017-03-06 14:58:01  serhiy.storchaka  set     messages: +; components: + Extension Modules; stage: patch review
2017-03-06 13:32:55  vstinner          set     nosy: + rhettinger, ncoghlan
2017-03-06 13:32:22  vstinner          set     nosy: + methane, serhiy.storchaka, yselivanov; messages: +
2017-03-06 13:29:15  vstinner          set     pull_requests: + pull_request425
2017-03-06 13:22:45  vstinner          create