[Python-Dev] Stop using timeit, use perf.timeit!

Steven D'Aprano steve at pearwood.info
Fri Jun 10 09:20:51 EDT 2016


On Fri, Jun 10, 2016 at 01:13:10PM +0200, Victor Stinner wrote:

Hi,

In recent weeks, I have been researching how to get stable and reliable benchmarks, especially for the corner case of microbenchmarks. The first result is a series of articles; here are the first three:

Thank you for this! I am very interested in benchmarking.

https://haypo.github.io/journey-to-stable-benchmark-system.html
https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
https://haypo.github.io/journey-to-stable-benchmark-average.html

I strongly question your statement in the third:

[quote]
But how can we compare performances if results are random? 
Take the minimum?

No! You must never (ever again) use the minimum for 
benchmarking! Compute the average and some statistics like
the standard deviation:
[end quote]

While I'm happy to see a real-world use for the statistics module, I disagree with your logic.

The problem is that random noise can only ever slow the code down; it cannot speed it up. To put it another way, the random errors in the timings are always positive.

Suppose you micro-benchmark some code snippet and get a series of timings. We can model the measured times as:

measured time t = T + ε

where T is the unknown "true" timing we wish to estimate, and ε is some variable error due to noise in the system. But ε is always positive, never negative, and we always measure something larger than T.

Let's suppose we somehow (magically) know what the epsilons are:

measurements = [T + 0.01, T + 0.02, T + 0.04, T + 0.01]

The average is (4*T + 0.08)/4 = T + 0.02

But the minimum is T + 0.01, which is a better estimate than the average. Taking the average means that the worse epsilons affect your estimate, while taking the minimum means that only the smallest epsilon affects your estimate.
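To make this concrete, here is a small simulation (not from the original post) with one-sided noise. The "true" time T and the noise scale are arbitrary values chosen purely for illustration:

    import random
    import statistics

    random.seed(42)
    T = 1.0  # hypothetical "true" timing, in seconds
    # One-sided noise: the system can only ever add time, never remove it.
    measurements = [T + random.expovariate(1 / 0.02) for _ in range(1000)]

    print(min(measurements))              # very close to T (about 1.00002)
    print(statistics.mean(measurements))  # biased upward (about 1.02)

With noise that is always positive, the mean carries the full bias of the noise, while the minimum converges towards T as you take more samples.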

Taking the average is appropriate if the error terms can be positive or negative, e.g. if they are measurement error rather than noise:

measurements = [T + 0.01, T - 0.02, T + 0.04, T - 0.01]

The average is (4*T + 0.02)/4 = T + 0.005

The minimum is T - 0.02, which is worse than the average.
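Again as an illustrative sketch (same hypothetical T as above), symmetric measurement error flips the conclusion: the mean cancels the errors out, while the minimum chases the most negative one:

    import random
    import statistics

    random.seed(42)
    T = 1.0  # hypothetical "true" timing, in seconds
    # Symmetric error: measurements can fall on either side of T.
    measurements = [T + random.gauss(0, 0.02) for _ in range(1000)]

    print(statistics.mean(measurements))  # about 1.000, the errors cancel out
    print(min(measurements))              # about 0.94, dragged below T by the most negative error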

Unless you have good reason to think that the timing variation is mostly caused by some error which can be both positive and negative, the minimum is the right statistic to use, not the average. But ask yourself: what sort of error, noise or external influence will cause the code snippet to run FASTER than the fastest the CPU can execute it?
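In practice, that is the pattern timeit's repeat() interface supports: run the measurement several times and report only the fastest run. A minimal sketch (the statement being timed is just a placeholder):

    import timeit

    # Repeat the whole measurement several times and keep only the fastest run;
    # slower runs are assumed to reflect added noise, not the code itself.
    timings = timeit.repeat("sorted(range(1000))", repeat=5, number=10_000)
    print(min(timings) / 10_000)  # estimated per-call time, in seconds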

-- Steve
