Message 259570 - Python tracker
tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark more unstable... I expected the opposite...
Patch version 2:
- also patch performance/bm_pickle.py
- change min_time from 100 ms to 500 ms with --fast
- compute the number of runs from a maximum total time; the maximum time changes with --fast and --rigorous:
if options.rigorous:
    min_time = 1.0
    max_time = 100.0  # 100 runs
elif options.fast:
    min_time = 0.5
    max_time = 25.0  # 50 runs
else:
    min_time = 0.5
    max_time = 50.0  # 100 runs
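For reference, here is a minimal sketch of how the patched calibration could work, reconstructed from the min_time/max_time values above and the "Calibration:" lines in the raw data below; bench_once() is a hypothetical stand-in for the benchmark function, not the actual perf.py API:

import time

def calibrate(bench_once, min_time, max_time):
    # Sketch only: double num_loops until a single run takes at least
    # min_time (this is what produces num_loops=16 in the raw data below).
    num_loops = 1
    while True:
        start = time.perf_counter()
        bench_once(num_loops)
        run_time = time.perf_counter() - start
        if run_time >= min_time:
            break
        num_loops *= 2

    # max_time is a total time budget: with runs of at least min_time
    # each, at most max_time / min_time runs fit into it (hence the
    # "# 100 runs" comments above).
    num_runs = int(max_time / min_time)
    print("Calibration: num_runs=%s, num_loops=%s "
          "(%.2f sec per run > min_time %.2f sec, estimated total: %.1f sec)"
          % (num_runs, num_loops, run_time, min_time, num_runs * run_time))
    return num_runs, num_loops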
To measure the stability of perf.py, I pinned perf.py to CPU cores isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled full tickless mode ("nohz_full") on these cores.
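The pinning part can also be scripted. A minimal sketch in Python, assuming the isolated cores are CPUs 2 and 3 (the core numbers are illustrative; taskset works as well):

import os

# Hypothetical set of cores isolated with the isolcpus= kernel parameter.
ISOLATED_CPUS = {2, 3}

# Pin the current process (and any perf.py child it spawns) to those cores.
os.sched_setaffinity(0, ISOLATED_CPUS)

# Check that the cpufreq governor is "performance" on each isolated core.
for cpu in sorted(ISOLATED_CPUS):
    path = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor" % cpu
    with open(path) as fp:
        governor = fp.read().strip()
    if governor != "performance":
        raise SystemExit("CPU %d uses governor %r, expected 'performance'"
                         % (cpu, governor))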
I ran perf.py 5 times on regex_v8.
Calibration (original => patched):
- --fast: 1 iteration x 5 runs => 16 iterations x 50 runs
- (no option): 1 iteration x 50 runs => 16 iterations x 100 runs
Approximate duration of the benchmark (original => patched):
- --fast: 7 sec => 7 min 34 sec
- (no option): 30 sec => 14 min 35 sec
(I made a mistake, so I was unable to get the exact duration.)
Hmm, maybe the timings are not well chosen, because the benchmark becomes really slow (minutes instead of seconds) :-/
Standard deviation, --fast:
- (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723)
- (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248)
- (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000)
Standard deviation, (no option):
- (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028)
- (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680)
- (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600)
Variance, --fast:
- (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723)
- (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248)
- (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000)
Variance, (no option):
- (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028)
- (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680)
- (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600)
Legend:
- (python2) are the timings of python2 run by perf.py (from the "Min" line)
- (python3) are the timings of python3 run by perf.py (from the "Min" line)
- (faster) are the "1.34x" factors of the "faster" / "slower" part of the "Min" line
- percentages are: value * 100 / mean
It's not easy to compare these values since the numbers of iterations are very different (1 => 16), and so the timings are very different (e.g. 0.059 sec => 0.950 sec). I guess that it's ok to compare the percentages.
I used the stability.py script, attached to this issue, to compute the standard deviation and variance from the "Min" lines of the 5 runs. The script takes the output of perf.py as input.
I'm not sure that 5 runs are enough to compute statistics.
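For the record, a minimal sketch of what stability.py computes (the real script is attached to the issue; the parsing and output format here are my reconstruction): it reads perf.py output on stdin, keeps the values from the "Min" lines, and prints the sample standard deviation and variance as percentages of the mean, i.e. value * 100 / mean.

import re
import statistics
import sys

# "Min: <python2> -> <python3>: <factor>x faster" lines from perf.py output.
MIN_RE = re.compile(r"^Min: ([0-9.]+) -> ([0-9.]+): ([0-9.]+)x (faster|slower)")

python2, python3, faster = [], [], []
for line in sys.stdin:
    match = MIN_RE.match(line)
    if match is None:
        continue
    python2.append(float(match.group(1)))
    python3.append(float(match.group(2)))
    faster.append(float(match.group(3)))

for name, values in (("python2", python2), ("python3", python3),
                     ("faster", faster)):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)        # sample standard deviation
    variance = statistics.variance(values)  # sample variance
    print("(%s) stdev=%.5f (%.1f%%, mean=%.5f), variance=%.5f (%.3f%%)"
          % (name, stdev, stdev * 100 / mean, mean,
             variance, variance * 100 / mean))

Running it as "python3 stability.py < original.fast" would then reproduce the kind of percentages listed above.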
--
Raw data.
Original perf.py.
$ grep ^Min original.fast
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster
$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster
Patched perf.py.
$ grep ^Min patched.fast
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster
$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster
$ grep ^Calibration patched.fast
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)
$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)