bpo-47009: Let PRECALL_NO_KW_LIST_APPEND do its own POP_TOP by sweeneyde · Pull Request #32239 · python/cpython

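Most code doesn't use the return value of list.append (nobody writes y = L.append(x) or whatnot), so PRECALL_NO_KW_LIST_APPEND is almost always followed by POP_TOP. We can verify this at specialization time and let the specialized instruction do the POP_TOP's work itself.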

This saves a Py_INCREF(Py_None), a SET_TOP(Py_None), and POP_TOP's Py_DECREF(POP()); DISPATCH();.
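The claim that a bare append call is followed by POP_TOP can be checked from Python with the public dis module. This is only a rough sketch: the two sample functions and the helper `opname_after_call` are hypothetical names made up for illustration, and the exact call opcode names vary between CPython versions, so the helper just looks for the first opcode whose name starts with "CALL".

```python
import dis


def discards_result(arr, x):
    # The common case: the None returned by append is thrown away.
    arr.append(x)


def keeps_result(arr, x):
    # The rare case: the return value is stored, so no POP_TOP follows the call.
    y = arr.append(x)
    return y


def opname_after_call(func):
    """Return the name of the instruction right after the first CALL-like opcode.

    Hypothetical helper for illustration only; it assumes the call opcode's
    name starts with "CALL", which is an assumption about opcode naming.
    """
    instructions = [
        ins for ins in dis.get_instructions(func) if ins.opname != "CACHE"
    ]
    for i, ins in enumerate(instructions[:-1]):
        if ins.opname.startswith("CALL"):
            return instructions[i + 1].opname
    return None


print(opname_after_call(discards_result))
print(opname_after_call(keeps_result))
```

When the result is discarded, the instruction after the call should be POP_TOP; when it is assigned, a store instruction takes its place, which is the case the specializer has to rule out.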

Some microbenchmarks:

from pyperf import Runner, perf_counter

def bench_append(loops, length):
    src = list(map(float, range(length)))
    arr = []
    t0 = perf_counter()

    for i in range(loops):
        arr.clear()
        for x in src:
            arr.append(x)

    return perf_counter() - t0

def bench_append_less_gc(loops, length):
    src = list(map(float, range(length)))
    out = [None] * loops
    t0 = perf_counter()

    for i in range(loops):
        arr = []
        for x in src:
            arr.append(x)
        out[i] = arr

    return perf_counter() - t0

runner = Runner()
for n in [100, 1_000, 10_000, 100_000]:
    runner.bench_time_func(f"append {n}", bench_append, n, inner_loops=n)
    runner.bench_time_func(f"append-less-gc {n}", bench_append_less_gc, n, inner_loops=n)

Results from a GCC build with --enable-optimizations and --with-lto:

- append 100000: 14.9 ns +- 0.3 ns -> 13.3 ns +- 0.4 ns: 1.12x faster
- append 10000: 15.1 ns +- 0.3 ns -> 13.6 ns +- 0.5 ns: 1.11x faster
- append-less-gc 100000: 16.4 ns +- 0.5 ns -> 14.9 ns +- 0.4 ns: 1.10x faster
- append 1000: 15.6 ns +- 0.3 ns -> 14.2 ns +- 0.3 ns: 1.09x faster
- append 100: 18.9 ns +- 0.6 ns -> 17.3 ns +- 0.6 ns: 1.09x faster
- append-less-gc 100: 27.4 ns +- 1.1 ns -> 25.2 ns +- 1.2 ns: 1.09x faster
- append-less-gc 10000: 19.2 ns +- 0.3 ns -> 17.8 ns +- 0.2 ns: 1.08x faster
- append-less-gc 1000: 22.0 ns +- 0.6 ns -> 20.8 ns +- 0.3 ns: 1.06x faster

Geometric mean: 1.09x faster
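As a sanity check on the summary figure, the geometric mean can be recomputed from the rounded timings in the list above. This is a sketch rather than how pyperf derives the number internally (it works from the unrounded means), and the variable names are made up for illustration.

```python
from statistics import geometric_mean

# (before_ns, after_ns) pairs copied from the result list above.
timings = {
    "append 100000": (14.9, 13.3),
    "append 10000": (15.1, 13.6),
    "append-less-gc 100000": (16.4, 14.9),
    "append 1000": (15.6, 14.2),
    "append 100": (18.9, 17.3),
    "append-less-gc 100": (27.4, 25.2),
    "append-less-gc 10000": (19.2, 17.8),
    "append-less-gc 1000": (22.0, 20.8),
}

# Speedup of each benchmark is old time divided by new time.
speedups = [before / after for before, after in timings.values()]

overall = geometric_mean(speedups)
print(f"Geometric mean: {overall:.2f}x faster")
```

The geometric mean is used instead of the arithmetic mean because the values are ratios: it treats a 1.10x speedup and a 1.10x slowdown as cancelling out exactly.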

https://bugs.python.org/issue47009