Message 259570 - Python tracker
tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark more unstable... I expected the opposite...
Patch version 2:
- also patch performance/bm_pickle.py
- change min_time from 100 ms to 500 ms with --fast
- compute the number of runs from a maximum total time; the maximum time changes with --fast and --rigorous:
if options.rigorous:
    min_time = 1.0
    max_time = 100.0  # 100 runs
elif options.fast:
    min_time = 0.5
    max_time = 25.0  # 50 runs
else:
    min_time = 0.5
    max_time = 50.0  # 100 runs
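For reference, here is a minimal sketch of how the patched calibration could work, reconstructed from the min_time/max_time values above and the "Calibration:" lines in the raw data below; bench_once() is a hypothetical stand-in for the benchmark function, not the actual perf.py API:

import time

def calibrate(bench_once, min_time, max_time):
    # Sketch only: double num_loops until a single run takes at least
    # min_time (this is what produces num_loops=16 in the raw data below).
    num_loops = 1
    while True:
        start = time.perf_counter()
        bench_once(num_loops)
        run_time = time.perf_counter() - start
        if run_time >= min_time:
            break
        num_loops *= 2

    # max_time is a total time budget: with runs of at least min_time
    # each, at most max_time / min_time runs fit into it (hence the
    # "# 100 runs" comments above).
    num_runs = int(max_time / min_time)
    print("Calibration: num_runs=%s, num_loops=%s "
          "(%.2f sec per run > min_time %.2f sec, estimated total: %.1f sec)"
          % (num_runs, num_loops, run_time, min_time, num_runs * run_time))
    return num_runs, num_loops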
To measure the stability of perf.py, I pinned perf.py to CPU cores isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled full tickless mode ("nohz_full") on these cores.
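The pinning part can also be scripted. A minimal sketch in Python, assuming the isolated cores are CPUs 2 and 3 (the core numbers are illustrative; taskset works as well):

import os

# Hypothetical set of cores isolated with the isolcpus= kernel parameter.
ISOLATED_CPUS = {2, 3}

# Pin the current process (and any perf.py child it spawns) to those cores.
os.sched_setaffinity(0, ISOLATED_CPUS)

# Check that the cpufreq governor is "performance" on each isolated core.
for cpu in sorted(ISOLATED_CPUS):
    path = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor" % cpu
    with open(path) as fp:
        governor = fp.read().strip()
    if governor != "performance":
        raise SystemExit("CPU %d uses governor %r, expected 'performance'"
                         % (cpu, governor))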
I ran perf.py 5 times on regex_v8.
Calibration (original => patched):
- --fast: 1 iteration x 5 runs => 16 iterations x 50 runs
- (no option): 1 iteration x 50 runs => 16 iterations x 100 runs
Approximate duration of the benchmark (original => patched):
- --fast: 7 sec => 7 min 34 sec
- (no option): 30 sec => 14 min 35 sec
(I made a mistake, so I was unable to get the exact duration.)
Hmm, maybe the timings are not well chosen, because the benchmark becomes really slow (minutes instead of seconds) :-/
Standard deviation, --fast:
- (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723)
- (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248)
- (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000)
Standard deviation, (no option):
- (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028)
- (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680)
- (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600)
Variance, --fast:
- (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723)
- (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248)
- (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000)
Variance, (no option):
- (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028)
- (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680)
- (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600)
Legend:
- (python2) are the timings of python2 run by perf.py (from the "Min" line)
- (python3) are the timings of python3 run by perf.py (from the "Min" line)
- (faster) are the "1.34x" factors of the "faster" / "slower" part of the "Min" line
- percentages are: value * 100 / mean
It's not easy to compare these values since the numbers of iterations are very different (1 => 16), and so the timings are very different (e.g. 0.059 sec => 0.950 sec). I guess that it's ok to compare the percentages.
I used the stability.py script, attached to this issue, to compute the standard deviation and variance from the "Min" lines of the 5 runs. The script takes the output of perf.py as input.
I'm not sure that 5 runs are enough to compute statistics.
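For the record, a minimal sketch of what stability.py computes (the real script is attached to the issue; the parsing and output format here are my reconstruction): it reads perf.py output on stdin, keeps the values from the "Min" lines, and prints the sample standard deviation and variance as percentages of the mean, i.e. value * 100 / mean.

import re
import statistics
import sys

# "Min: <python2> -> <python3>: <factor>x faster" lines from perf.py output.
MIN_RE = re.compile(r"^Min: ([0-9.]+) -> ([0-9.]+): ([0-9.]+)x (faster|slower)")

python2, python3, faster = [], [], []
for line in sys.stdin:
    match = MIN_RE.match(line)
    if match is None:
        continue
    python2.append(float(match.group(1)))
    python3.append(float(match.group(2)))
    faster.append(float(match.group(3)))

for name, values in (("python2", python2), ("python3", python3),
                     ("faster", faster)):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)        # sample standard deviation
    variance = statistics.variance(values)  # sample variance
    print("(%s) stdev=%.5f (%.1f%%, mean=%.5f), variance=%.5f (%.3f%%)"
          % (name, stdev, stdev * 100 / mean, mean,
             variance, variance * 100 / mean))

Running it as "python3 stability.py < original.fast" would then reproduce the kind of percentages listed above.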
--
Raw data.
Original perf.py.
$ grep ^Min original.fast
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster
$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster
Patched perf.py.
$ grep ^Min patched.fast
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster
$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster
$ grep ^Calibration patched.fast
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)
$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)