The mean misleads: why the minimum is the true measure of a function’s run time
Imagine: you’ve got a function and you’d like to know how long it takes to run, on average. So you run the function a few times, measuring how long it takes each time, then take the average of those times.
That couldn’t possibly be wrong, could it?
In this post I will make the case that taking the mean is indeed the wrong approach most of the time, and that taking the minimum will give superior results. The reasoning behind this assertion is quite straightforward, and I’m pretty sure that by the time you’ve finished reading this article, you’ll agree.
I am not claiming that taking the minimum is better than analysing and understanding your data. Only that if you aren’t going to analyse your data, don’t know anything about its distribution, and are using a single-value aggregation to represent it, then the min is a better choice than the mean most of the time.
Before I start defaming the mean, let’s talk about…
What means are good for
The below chart represents a distribution with the mean value marked.
If you had to place a bet on which number would be drawn next from this distribution, the smart money would be on the mean: 100.
You can think of this distribution as representing some central value (100) combined with random noise. Sometimes the noise is positive, sometimes negative, but it averages out to zero, giving a symmetric distribution.
Lots of distributions are symmetric, so the mean is a good choice as a representative value a lot of the time.
But some distributions are not symmetric, such as the run times of functions, and in these cases, the smart money is not on the mean.
Let’s look at this asymmetry in a bit more detail...
Intuition from a continuously running function
I’d like to offer a slightly odd way of looking at things, before moving on to show that this gives a good feel for what’s happening whenever we try to quantify a function’s run time.
When you time how long a function takes to run, you are of course doing so at a particular moment in time. But what if you were to run the function continuously, again and again, and plot the results over time, what would that look like?
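If you wanted to gather such a trace yourself, a minimal sketch might look something like this (work() here is just a stand-in for whatever you’re actually measuring; the one-minute window is arbitrary):

```python
import time

def work():
    # Stand-in for whatever function you actually want to measure.
    sum(i * i for i in range(100_000))

timestamps, run_times = [], []
end = time.perf_counter() + 60            # keep measuring for one minute
while time.perf_counter() < end:
    start = time.perf_counter()
    work()
    timestamps.append(start)
    run_times.append(time.perf_counter() - start)

# Plotting run_times against timestamps gives a trace like the charts below:
# long flat stretches (the function running at full speed) with spikes
# whenever something else on the machine steals time.
```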
Below is some real data gathered while training a machine learning model. It shows run times (per iteration) for three different approaches, collected over several hours. Each of the three approaches involves work on the CPU and GPU, and this is running on my local computer while I’m also doing other things, so there is plenty of fighting over resources.
Please pardon the non-zero-based y axis; the relative difference between the lines is not the main point here.
When we look at this chart, our pattern-detecting brains can see plain as day that orange is the fastest function (when allowed to run at full speed), followed by green, then blue, despite the fact that green takes only 0.2% longer than orange.
It would be a very odd thing to take the mean of each line above to try and work out which function is going to give the best performance in the long run, wouldn’t it? That would include the interruptions in the calculation! You can see that taking the minimum value of each of those lines is a much better representation of the function’s performance potential (and don’t forget, you will often be measuring run times on a local machine when what you really want to know is how the function will run on a production server with fewer interruptions).
At this point, you might be thinking “OK so the minimum is the best metric to use in some cases, but does that apply to all measures of function run time?”
A good question, to which the answer is no, not all, but most.
The distribution of function run times
If you run a function twice, it’s unlikely that it will take exactly as long each time. But why not, exactly? Why should a function complete in 401ms on one run, but take 408ms on the next? The answer is of course: interruptions. Your computer has other stuff to do. Render the mouse, check for updates, take out the garbage, and a thousand other things.
So you can think of the run time of any function as being some perfect-world, theoretical minimum run time (defined by the number of operations it performs, clock speed, the speed of light, etc.), plus all the interruptions.
Another way to think about this is in terms of signal and noise. When quantifying the run time of a function, our task is to collect some data and separate the signal from the noise. The signal we seek is the function’s potential. The noise is the interruptions: time that elapsed while the function was running, but that we shouldn’t attribute to a representative run time of the function. In the chart above, the flat lines are the signal we want to extract, while the mountainous ranges contain additional noise; and since the noise will always be a positive value, we can extract the signal from the noise with a simple minimum operation.
The fact that the noise is always positive is what gives us the asymmetric distribution.
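Here’s a toy illustration of that model (the 100 ms floor and the log-normal noise are made up, purely to show the shape of the problem): the min lands almost exactly on the floor, while the mean gets dragged upwards by however much interruption happened to occur.

```python
import numpy as np

rng = np.random.default_rng(0)

floor = 100.0                                        # theoretical minimum run time (ms)
noise = rng.lognormal(mean=0.0, sigma=1.0, size=20)  # interruptions: always positive
observed = floor + noise                             # what you actually measure

print(f"min:  {observed.min():6.2f} ms")   # close to the 100 ms floor
print(f"mean: {observed.mean():6.2f} ms")  # floor plus the average interruption
```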
When visualising functions that take several seconds to run, the asymmetry is quite obvious, but even for functions that finish in a fraction of a second, a skew to the left is evident.
(My apologies to statistician readers who may refer to a distribution where values are bunched up to the left as being “right skewed”. Since this terminology is rather counterintuitive and I suspect not common knowledge, I will refer to such distributions as being skewed to the left.)
The below charts show 1,000 runs each for three different functions, using a mix of CPU calculations, disk reads, and moving data to the GPU for some CUDA operations:
The x axes are scaled from 0 to the maximum value, so although barely visible, there are tiny marks all the way to the right in each case.
Even calculating primes — which is purely on the CPU — has a few outliers, occasionally taking well over twice as long to run. Just by thinking about it you can know that these functions are not going to have outliers to the left; a function is not going to run in half the time just for the fun of it (unless of course you’ve got a function that exits early in some situations, but I’m assuming here that all functions are completing their tasks).
By this point, you might be thinking: sure, the mean is suited to symmetric distributions, and function run times tend to be skewed to the left, but does it really matter? If I take the mean instead of the minimum, will it ever lead me to make an incorrect decision?
The answer is yes.
To explore the boundaries of this assertion, I’ll switch to synthetic data…
Synthetic setup
To generate synthetic data that fits the typical left skew of a function’s run time distribution, one could use beta, F, or chi-squared distributions. I tried these but found that a log-normal distribution is nice and simple and a good fit; this is a distribution where the logs of the values form a normal distribution.
I tweaked the offset and spread to create two distributions, one simulating run times of a fast function, and one simulating a function that’s 3% slower.
For the experiments that follow, I will sample 20 values from the fast function distribution, and take the mean. Then I’ll sample 20 values from the slow function distribution and take the mean. If the mean for the slow function is higher than the mean for the fast function, I’ll count this as a ‘correct’ result. I’ll repeat all this 500 times, so we can see what percentage of the time we get the correct answer. Then I’ll do it all over again, taking the minimum instead of the mean.
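Here’s roughly how that experiment can be set up. The offsets, sigma and scale below are illustrative stand-ins rather than the exact parameters behind the charts:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_run_times(n, offset):
    """Simulate n run times: a fixed floor plus log-normal noise."""
    return offset + rng.lognormal(mean=0.0, sigma=0.5, size=n)

def fraction_correct(aggregate, n_experiments=500, n_runs=20):
    """How often the aggregate ranks the slow function as slower."""
    correct = 0
    for _ in range(n_experiments):
        fast = aggregate(sample_run_times(n_runs, offset=100.0))
        slow = aggregate(sample_run_times(n_runs, offset=103.0))  # ~3% slower
        correct += slow > fast
    return correct / n_experiments

print("mean correct:", fraction_correct(np.mean))
print("min correct: ", fraction_correct(np.min))
```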
Would you like to hazard a guess what percentage of the time the mean will give the correct answer (that the slow function is slower)? And what about the min?
My guess is that the mean will be correct 67.4% of the time, while the min will be correct 94.2% of the time.
Let’s see how we did…
Nailed it
This chart might take a moment to wrap your head around... Looking at just the means, each of the 500 triangles represents the mean run time (over 20 runs) of the fast function on the x axis, and the mean run time (over 20 runs) of the slow function on the y axis. If the means for both are the same, the mark will be on the diagonal line. If a mark is above the line, it means the mean value for the slow function was larger, which we count as correct (and make green). The red marks below the diagonal line are incorrect: they represent the scenario where you ran each function 20 times, took the means of each, and the comparison told you that the slow function was faster.
I think this is rather alarming: we’ve got a function that’s 3% faster than an alternative (not a bad speed improvement), and 20 runs (a decent sample size, especially for longer-running operations), yet taking the means will lead you to choose the incorrect function almost a third of the time! This makes me wonder how many slow functions have been selected over the years as a result of taking the mean.
Above, I’ve focused on the binary question of whether the result was correct or not. But you can also see in the chart that there is greater variance in the mean values than the mins. Let’s focus in on this variance and ponder the question: “how good is each method at reporting the true speed difference between two functions?”
In the below chart, each of the 500 experiments gets a blue line (the difference between the minimums of 20 runs each of the fast and slow functions) and an orange line (same, but using the means). The white dotted line is the true difference in run times of 3%.
As you can see, not only will the mean give you the entirely incorrect result more of the time, but the value you get when quantifying the difference is more likely to overstate or understate the true difference.
So, should you throw out the mean and always use the min?
Oh if only life were so simple.
Here’s another pair of simulated functions, with distributions that are skewed, but closer to symmetric than the previous pair.
This just happens to be the crossover point (between left skewed and symmetric) where mean and min are equally accurate:
And it’s not impossible to imagine a function whose run time is a truly symmetric distribution. Perhaps you’re hitting an API served from one of many servers and the performance of those servers is normally distributed, or the amount of network traffic follows a normal distribution. So here’s another pair of functions, this time with symmetric distributions:
In this case, of course the mean is the correct choice, because (say it with me now): means are for symmetric distributions.
Another case where the mean won’t mislead (much) is if you have a skewed distribution, but quite a large difference in run times (in the below, the slow function is 10% slower):
In this case, the mean is correct 95.2% of the time (leading you to choose a 10% slower function only 1 in 20 times — that’s totally fine, right?).
Despite the less-awful performance of the mean, the min is the logical choice and here is comfortably correct 100% of the time.
For the final synthetic test, let’s look at the effect of sample size. If we go back to the first pair (with 3% difference) and simulate running each function 200 times instead of 20 (and again, repeating the simulation 500 times per function), we see this:
As expected, we get less variance and taking the mean will give the correct result a lot of the time, but again the min is the sensible choice.
So we can see that it doesn’t matter how small or large your sample size is, or how big the difference in performance between functions: taking the min will give a more accurate result, unless the distribution of your run times is symmetric (or close to it).
I’d hate to be accused of only using synthetic data, so I’d better show these charts for some real functions. I’ll need to create two functions such that I can state definitively that one is faster than the other, without measuring how long they take to run. For this, I have created:
- fast_function, which does some stuff involving the disk and CPU
- slow_function, which calls fast_function and then does more stuff
I ran each function 10,000 times, then split the (shuffled) times into 500 batches of 20 runs each to simulate taking the mean/min of 20 runs, 500 times; the same data structure as the synthetic setup above.
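In code, the batching looks something like this (the bodies of fast_function and slow_function here are placeholders, not the exact workloads I used):

```python
import random
import time

def fast_function():
    # Placeholder body: a bit of disk and CPU work.
    with open("scratch.bin", "wb") as f:
        f.write(b"x" * 100_000)
    sum(i * i for i in range(50_000))

def slow_function():
    fast_function()
    sum(i * i for i in range(20_000))     # ...and then does more stuff

def time_runs(fn, n=10_000):
    times = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    random.shuffle(times)
    return times

def batches(xs, size=20):
    return [xs[i:i + size] for i in range(0, len(xs), size)]

fast_batches = batches(time_runs(fast_function))
slow_batches = batches(time_runs(slow_function))

for name, agg in [("mean", lambda b: sum(b) / len(b)), ("min", min)]:
    correct = sum(agg(s) > agg(f) for f, s in zip(fast_batches, slow_batches))
    print(f"{name}: correct in {correct / len(fast_batches):.1%} of batches")
```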
The functions are quite fast and the distribution doesn’t have much skew, so it’s going to be close:
This gives results that aren’t quite as compelling as the synthetic data:
Even though empirically this isn’t a landslide win for the min, no matter how I tweaked this setup, the minimum was always correct more often than the mean, and always more densely packed (some of the outlier mean results show a difference of +/- 50%, but the min never suffered from such misleading outliers).
By this point, you may be willing to agree that for any sufficiently left-skewed set of values, the minimum is a better single-value representation than the mean. This only leaves us with the question of “what on earth is a sufficiently left-skewed distribution?”
There’s no way to quantify this that wouldn’t also quantify the performance of your functions, and thus negate the requirement to quantify the skew. So here I’m going to offer some practical advice for how to proceed when you have the desire to assess a function’s run time:
- If you don’t know anything about the distribution of your run times, can’t be bothered investigating, and have no reason to believe that the distribution would be symmetric, assume that the minimum will give the most accurate representation.
- If the decision is important, plot the values for each function and if it isn’t visually clear which function is faster, get more data or flip a coin and move on to more impactful decisions.
Now, we’re mostly done with the core point of the article, but since I have your attention, I’d like to talk to you about our lord … no just kidding, I’d like to talk to you about compression and visualisation. Feel free to wriggle about in your chair a bit before reading on.
Lossy and lossless representations
If you run a function 10 times, and record how long it takes to run each time, you will have 10 numbers and a choice to make:
- Compress those values into a single value (a lossy operation)
- Look at all ten values to assess the function (a lossless operation)
If you need to compare two functions in an ‘objective’ way then you probably need to discard some information to boil each function down to a single number, and you know that the best way to do that is to simply take the minimum value (it will take a while before this stops feeling like a weird thing to do).
But there may be times when you want to plot the results in a chart for a richer understanding of the data you’ve gathered, and in the world of charts there are lossy and lossless options.
Let’s return to the very first chart, three functions with their run times plotted over time.
This is a lossless chart. Every point in the data has a place on the chart. But this is not a normal way to look at run times.
The most common choice for representing data like this is the histogram, a lossy chart. These work pretty well if you have hundreds of data points, or a fairly regular distribution. They work less well for smaller samples, messier distributions, or comparing multiple distributions. Here’s the same data in histogram form:
This one in particular isn’t terrible, but I had to fiddle with the bin sizes to coax it into showing me what I knew to be true. And that’s the problem with histograms: the story they tell depends on the bin size, and since each bin is a lossy compression of the data that goes into it, you’re losing information and potentially getting the wrong impression.
A better option is the event plot, a lossless chart. As with the line chart, this makes it perfectly clear which function is the best performer (when you have in mind that a function’s run time is a combination of potential + positive noise).
(Strictly speaking, this chart type is not always lossless, because even with semi-transparent lines, 11 lines on the exact same value look the same as 10 lines. If this displeases you, a strip plot with jitter is another option.)
Another lossless option is the Empirical distribution function, which again shows clearly which function is capable of running the fastest:
Thanks to the reader in the comments who informed me of their existence!
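If you’d like to draw these yourself, here’s a minimal matplotlib sketch of an event plot and an empirical distribution function for two sets of run times (the data is made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
run_times_a = 100 + rng.lognormal(0.0, 1.0, 200)   # made-up run times (ms)
run_times_b = 103 + rng.lognormal(0.0, 1.0, 200)

fig, (ax_event, ax_ecdf) = plt.subplots(2, 1, figsize=(8, 5))

# Event plot: one short vertical line per measurement, nothing discarded.
ax_event.eventplot([run_times_a, run_times_b], lineoffsets=[1, 0], alpha=0.5)
ax_event.set_yticks([1, 0])
ax_event.set_yticklabels(["function A", "function B"])
ax_event.set_xlabel("run time (ms)")

# Empirical distribution function: proportion of runs at or below each value.
for times, label in [(run_times_a, "function A"), (run_times_b, "function B")]:
    x = np.sort(times)
    ax_ecdf.step(x, np.arange(1, len(x) + 1) / len(x), where="post", label=label)
ax_ecdf.set_xlabel("run time (ms)")
ax_ecdf.set_ylabel("proportion of runs")
ax_ecdf.legend()

plt.tight_layout()
plt.show()
```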
I will default to a histogram when communicating with other humans, since a histogram requires less explanation. But when I want to make an intelligent decision based on some data, I use an event plot.
The point of all this is to prompt you to think about when you’re discarding data with lossy aggregations and lossy charts, and to be aware of the lossless options.
OK now we’re at the end. Did I mention there would be a test at the end?
Final exam
Here’s a little test to see if you’ve been picking up what I’m putting down.
Forget about programming for a moment and imagine you’re a plumber tasked with selecting the pipe design with the best flow rate. For each pipe design, you collect 10 measurements (flow rate, in litres per second) taken at various times.
How would you aggregate the 10 measurements into a single value so that you could compare between pipe designs?
If the answer isn’t immediately clear, perhaps looking at a plot of the values for two pipes will help:
If that doesn’t help, think about the signal and the noise. Each pipe presumably has some theoretical maximum flow rate based on its physical properties (the signal), but other factors reduce the flow at different times (the noise). The noise is always negative, meaning that the maximum is the logical way to aggregate these values. (Unless you have additional information to suggest that this will not lead to the optimal selection, but we’re talking about sensible defaults in the absence of such information).
This feels like a good place for an intro…
The story behind this story
About a year ago, I was reading through the Python documentation when I came across this paragraph on the timeit module page.
TL;DR use the min, not the mean.
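In practice, following that advice looks something like this: timeit.repeat runs your snippet several times and hands back every timing, so you can take the min yourself (the snippet being timed here is an arbitrary example):

```python
import timeit

# Time the snippet 5 separate times, each time running it 1,000 times.
results = timeit.repeat(
    stmt="sorted(data)",
    setup="import random; data = [random.random() for _ in range(10_000)]",
    repeat=5,
    number=1_000,
)

# Per the docs: the min is the number you should care about.
best = min(results) / 1_000            # seconds per call
print(f"best of 5: {best * 1e6:.1f} µs per call")
```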
My first thought was pffft, that can’t be right. I had been a lifelong keen bean for the mean and it seemed rather absurd that taking the mean could be wrong. To me, “mean” was synonymous with “a good single-value representation of a set of values”. And besides, everyone uses the mean. Surely they can’t all be wrong. What kind of world would this be if people went around being incorrect about stuff?
Although I didn’t like the idea that I’d been wrong all this time, the one thing worse than being wrong in the past is being wrong in the future, so I set out to test it for myself, looking at the distributions of all sorts of functions, working out how to represent this synthetically so I could find the crossover point where one approach (mean or min) made more sense. Not to mention testing all this for harmonic means, geometric means, log means, medians, modes, and various percentiles (all are inferior to min, for perhaps obvious reasons).
There’s an emotional side to this too: taking the mean feels like the better option: it incorporates information from all data points, which is surely a good thing, while taking the min throws away all but a single value, which is surely a bad thing. My monkey brain appears to have some heuristic that favours the mean, and is irrationally apprehensive about taking the min, even though my logical brain clearly understands why the min gives superior results. In fact, I’d be willing to bet that some people will read this article, agree with the premises, and accept the conclusion, but continue to use the mean.
Eventually though it all clicked and I was convinced. At that point, I thought: this seems like the sort of thing other mean-fiends might like to have their minds changed about. And here we are.
Hey, thanks for reading! Bye now.