bpo-33234 Improve list() pre-sizing for inputs with known lengths by pablogsal · Pull Request #6493 · python/cpython (original) (raw)

I have collected more metrics. These are the L1 and L2 cache misses and related information:

THIS PR
Performance counter stats for './python -c for _ in range(10000): list([0]*10000)' (200 runs):

      75916384      cache-references                                              ( +-  2.52% )
        659899      cache-misses              #    0.869 % of all cache refs      ( +-  3.55% )
    4242376466      cycles                                                        ( +-  0.84% )
    5316694198      instructions              #    1.25  insn per cycle           ( +-  0.02% )
    1086705824      branches                                                      ( +-  0.01% )
         71006      faults                                                        ( +-  0.00% )
             0      migrations

   1.708574749 seconds time elapsed                                          ( +-  0.86% )

MASTER
Performance counter stats for './python -c for _ in range(10000): list([0]*10000)' (200 runs):

      91939960      cache-references                                              ( +-  1.98% )
        730659      cache-misses              #    0.795 % of all cache refs      ( +-  2.53% )
    4595555189      cycles                                                        ( +-  0.95% )
    5383346835      instructions              #    1.17  insn per cycle           ( +-  0.02% )
    1097849443      branches                                                      ( +-  0.02% )
         91006      faults                                                        ( +-  0.00% )
             0      migrations

   1.851745158 seconds time elapsed                                          ( +-  0.96% )

As you can see, this shows a clear improvement in cache references, instructions per cycle, and page faults. Since there was some concern about the overhead of calling __length_hint__, I have collected more metrics on that:
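For context, the pre-sizing relies on the hint protocol exposed as operator.length_hint(): it uses __len__ when available, falls back to __length_hint__, and otherwise returns a caller-supplied default. A quick sketch of the three cases (the generator example is just an illustration of an object with neither method):

```python
import operator

# A plain list defines __len__, so the hint is exact.
exact = operator.length_hint([0] * 11)

# A list_iterator has no __len__ but does implement __length_hint__,
# which is what the benchmarks above exercise.
hinted = operator.length_hint(iter([0] * 11))

# A generator defines neither, so the supplied default (0) is returned
# and list() has to fall back to incremental growth.
unknown = operator.length_hint((x for x in range(5)), 0)

print(exact, hinted, unknown)  # 11 11 0
```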

root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*11))" -o old.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*10000))" -o old_big.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*11))" -o new.json
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf timeit "list(iter([0]*10000))" -o new_big.json

And these are the results:

root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf compare_to old.json new.json -vv
Mean +- std dev: [old] 1.55 us +- 0.11 us -> [new] 1.66 us +- 0.11 us: 1.07x slower (+7%)
Significant (t=-5.31)
root@ubuntu-s-2vcpu-4gb-sfo2-01:~/cpython# ./python -m perf compare_to old_big.json new_big.json
Mean +- std dev: [old_big] 206 us +- 18 us -> [new_big] 155 us +- 11 us: 1.33x faster (-25%)

As you can see, the only case where this patch is slower is very small iterables without __len__, and even there we are talking about a 7% difference (0.11 us), versus the linear improvement that preallocation gives for larger inputs.
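To make the trade-off concrete, here is a hypothetical iterator (CountDown is my own illustrative class, not from the patch) that advertises its remaining length via __length_hint__. A consumer like list() can use the hint to preallocate up front, but the hint is advisory only: the consumer must still handle iterators that over- or under-report.

```python
import operator

class CountDown:
    """Illustrative iterator that exposes a length hint so consumers
    such as list() can preallocate instead of growing incrementally."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n

    def __length_hint__(self):
        # A hint, not a guarantee: the hinted length may differ from
        # the number of items actually produced.
        return self.n

print(operator.length_hint(CountDown(5)))  # 5
print(list(CountDown(5)))                  # [4, 3, 2, 1, 0]
```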