bpo-47009: Let PRECALL_NO_KW_LIST_APPEND do its own POP_TOP by sweeneyde · Pull Request #32239 · python/cpython
Most code won't do `y = L.append(x)` or the like, so `PRECALL_NO_KW_LIST_APPEND` is almost always followed by `POP_TOP`, and we can verify that at specialization time. Having the instruction do its own pop saves a `Py_INCREF(Py_None)`, a `SET_TOP(Py_None)`, and `POP_TOP`'s `Py_DECREF(POP()); DISPATCH();`.
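A minimal sketch of why the assumption holds: when `list.append` is called as a plain statement, the compiler must discard the unused `None` return value, so the call instruction is immediately followed by `POP_TOP`. The `dis` module makes this visible (exact call opcode names vary across CPython versions):

```python
import dis

def f(arr, x):
    # Plain statement call: the returned None is unused,
    # so the compiler emits POP_TOP right after the call.
    arr.append(x)

instructions = [i.opname for i in dis.get_instructions(f)]
# The call's unused result is discarded by POP_TOP.
assert "POP_TOP" in instructions
```

Only the rare `y = L.append(x)` form stores the result instead of popping it, which is why the specialized instruction can require the following `POP_TOP` at specialization time.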
Some microbenchmarks:
```python
from pyperf import Runner, perf_counter

def bench_append(loops, length):
    src = list(map(float, range(length)))
    arr = []
    t0 = perf_counter()
    for i in range(loops):
        arr.clear()
        for x in src:
            arr.append(x)
    return perf_counter() - t0

def bench_append_less_gc(loops, length):
    src = list(map(float, range(length)))
    out = [None] * loops
    t0 = perf_counter()
    for i in range(loops):
        arr = []
        for x in src:
            arr.append(x)
        out[i] = arr
    return perf_counter() - t0

runner = Runner()
for n in [100, 1_000, 10_000, 100_000]:
    runner.bench_time_func(f"append {n}", bench_append, n, inner_loops=n)
    runner.bench_time_func(f"append-less-gc {n}", bench_append_less_gc, n, inner_loops=n)
```
Results with GCC, `--enable-optimizations`, `--with-lto`:
- append 100000: 14.9 ns +- 0.3 ns -> 13.3 ns +- 0.4 ns: 1.12x faster
- append 10000: 15.1 ns +- 0.3 ns -> 13.6 ns +- 0.5 ns: 1.11x faster
- append-less-gc 100000: 16.4 ns +- 0.5 ns -> 14.9 ns +- 0.4 ns: 1.10x faster
- append 1000: 15.6 ns +- 0.3 ns -> 14.2 ns +- 0.3 ns: 1.09x faster
- append 100: 18.9 ns +- 0.6 ns -> 17.3 ns +- 0.6 ns: 1.09x faster
- append-less-gc 100: 27.4 ns +- 1.1 ns -> 25.2 ns +- 1.2 ns: 1.09x faster
- append-less-gc 10000: 19.2 ns +- 0.3 ns -> 17.8 ns +- 0.2 ns: 1.08x faster
- append-less-gc 1000: 22.0 ns +- 0.6 ns -> 20.8 ns +- 0.3 ns: 1.06x faster
Geometric mean: 1.09x faster