bpo-33234: Improve list() pre-sizing for inputs with known lengths by pablogsal · Pull Request #6493 · python/cpython
I have collected more metrics. These are the L1 and L2 cache misses and related counters:
THIS PR
Performance counter stats for './python -c for _ in range(10000): list([0]*10000)' (200 runs):
75916384 cache-references ( +- 2.52% )
659899 cache-misses # 0.869 % of all cache refs ( +- 3.55% )
4242376466 cycles ( +- 0.84% )
5316694198 instructions # 1.25 insn per cycle ( +- 0.02% )
1086705824 branches ( +- 0.01% )
71006 faults ( +- 0.00% )
0 migrations
1.708574749 seconds time elapsed ( +- 0.86% )
MASTER
Performance counter stats for './python -c for _ in range(10000): list([0]*10000)' (200 runs):
91939960 cache-references ( +- 1.98% )
730659 cache-misses # 0.795 % of all cache refs ( +- 2.53% )
4595555189 cycles ( +- 0.95% )
5383346835 instructions # 1.17 insn per cycle ( +- 0.02% )
1097849443 branches ( +- 0.02% )
91006 faults ( +- 0.00% )
0 migrations
1.851745158 seconds time elapsed ( +- 0.96% )
As you can see, this shows a clear improvement in cache efficiency and branch prediction. Since there has been some concern about the overhead of calling __length_hint__, I have collected more metrics in that regard:
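For context, the hint the patch relies on is exposed at the Python level as operator.length_hint(), which consults __len__ first, then __length_hint__, and finally falls back to a default. A minimal sketch of that lookup order:

```python
# Sketch: how an iterable's size can be estimated before allocating.
# operator.length_hint() tries __len__, then __length_hint__, then a default.
import operator

data = [0] * 10000
# A list defines __len__, so the hint is exact.
assert operator.length_hint(data) == 10000

# A list_iterator has no __len__, but implements __length_hint__
# (the number of remaining items), which is what costs the extra call.
assert operator.length_hint(iter(data)) == 10000

def gen():
    yield from range(10)

# Plain generators provide no hint at all; the default (0) is returned.
assert operator.length_hint(gen()) == 0
```

The hint is only an estimate: list() can still grow or shrink the buffer if the iterator produces more or fewer items than announced.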
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*11))" -o old.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*10000))" -o old_big.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*11))" -o new.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*10000))" -o new_big.json
And these are the results:
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf compare_to old.json new.json -vv
Mean +- std dev: [old] 1.55 us +- 0.11 us -> [new] 1.66 us +- 0.11 us: 1.07x slower (+7%)
Significant (t=-5.31)
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf compare_to old_big.json new_big.json
Mean +- std dev: [old_big] 206 us +- 18 us -> [new_big] 155 us +- 11 us: 1.33x faster (-25%)
As you can see, the only case where this patch is slower is for very small iterables without __len__ (where the length hint has to be computed per call), but that is a 7% difference (0.11 us) versus the 25% improvement that preallocation gives for larger inputs, which scales with the input size.
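To make the trade-off concrete, here is a minimal sketch of a user-defined iterator that opts into the pre-sizing path by implementing __length_hint__. The CountDown class is a hypothetical example, not part of the patch; the point is that a single extra method call up front lets list() allocate the right buffer once instead of growing it repeatedly.

```python
# Sketch (hypothetical example class): an iterator whose __length_hint__
# lets list() pre-size its internal buffer before consuming items.
class CountDown:
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n

    def __length_hint__(self):
        # Consulted once before iteration; a wrong-but-close
        # value is fine, since list() still adjusts as needed.
        return self.n

assert list(CountDown(3)) == [2, 1, 0]
```

For tiny inputs the hint call is pure overhead (the 7% case above); for large inputs it replaces a sequence of reallocations with one allocation, which is where the 25% win comes from.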