PEP 450 – Adding A Statistics Module To The Standard Library | peps.python.org

Author:

Steven D’Aprano

Status:

Final

Type:

Standards Track

Created:

01-Aug-2013

Python-Version:

3.4

Post-History:

13-Sep-2013


Table of Contents

Abstract

This PEP proposes the addition of a module for common statistics functions such as mean, median, variance and standard deviation to the Python standard library. See also http://bugs.python.org/issue18606

Rationale

The proposed statistics module is motivated by the “batteries included” philosophy towards the Python standard library. Raymond Hettinger and other senior developers have requested a quality statistics library that falls somewhere in between high-end statistics libraries and ad hoc code. [1] Statistical functions such as mean, standard deviation and others are obvious and useful batteries, familiar to any Secondary School student. Even cheap scientific calculators typically include multiple statistical functions such as:

Graphing calculators aimed at Secondary School students typically include all of the above, plus some or all of:

and others [2]. Likewise spreadsheet applications such as Microsoft Excel, LibreOffice and Gnumeric include rich collections of statistical functions [3].

In contrast, Python currently has no standard way to calculate even the simplest and most obvious statistical functions such as mean. For those who need statistical functions in Python, there are two obvious solutions:

Numpy is perhaps the most full-featured solution, but it has a few disadvantages:

This leads to option number 2, DIY statistics functions. At first glance, this appears to be an attractive option, due to the apparent simplicity of common statistical functions. For example:

def mean(data):
    return sum(data)/len(data)

def variance(data):
    # Use the Computational Formula for Variance.
    n = len(data)
    ss = sum(x**2 for x in data) - (sum(data)**2)/n
    return ss/(n-1)

def standard_deviation(data):
    return math.sqrt(variance(data))

The above appears to be correct with a casual test:

>>> data = [1, 2, 4, 5, 8]
>>> variance(data)
7.5

But adding a constant to every data point should not change the variance:

>>> data = [x+1e12 for x in data]
>>> variance(data)
0.0

And variance should never be negative:

>>> variance(data*100)
-1239429440.1282566

By contrast, the proposed reference implementation gets the exactly correct answer 7.5 for the first two examples, and a reasonably close answer for the third: 6.012. numpy does no better [6].

Even simple statistical calculations contain traps for the unwary, starting with the Computational Formula itself. Despite the name, it is numerically unstable and can be extremely inaccurate, as can be seen above. It is completely unsuitable for computation by computer [7]. This problem plagues users of many programming languages, not just Python [8], as coders reinvent the same numerically inaccurate code over and over again [9], or advise others to do so [10].
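The standard remedy, and a sketch of the idea behind the more careful implementations, is a two-pass algorithm that subtracts the mean before squaring, avoiding the catastrophic cancellation of the Computational Formula (the reference implementation layers additional error compensation on top of this; the function name below is illustrative only):

```python
def variance_twopass(data):
    # First pass: compute the mean.
    n = len(data)
    m = sum(data) / n
    # Second pass: sum the squared deviations from the mean.
    # The deviations are small numbers, so no catastrophic
    # cancellation occurs when they are squared and summed.
    ss = sum((x - m)**2 for x in data)
    return ss / (n - 1)

# The shifted data that broke the Computational Formula:
data = [x + 1e12 for x in [1, 2, 4, 5, 8]]
print(variance_twopass(data))  # 7.5
```

The same data that yielded 0.0 with the one-pass formula now gives the correct 7.5.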

It isn’t just the variance and standard deviation. Even the mean is not quite as straightforward as it might appear. The above implementation seems too simple to have problems, but it does:

While the above mean implementation does not fail quite as catastrophically as the naive variance does, a standard library function can do much better than the DIY versions.

The example above involves an especially bad set of data, but even for more realistic data sets accuracy is important. The first step in interpreting variation in data (including dealing with ill-conditioned data) is often to standardize it to a series with variance 1 (and often mean 0). This standardization requires accurate computation of the mean and variance of the raw series. Naive computation of mean and variance can lose precision very quickly. Because precision bounds accuracy, it is important to use the most precise algorithms for computing mean and variance that are practical, or the results of standardization are themselves useless.
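As an illustration of the standardization step described above, using the function names from the reference implementation:

```python
from statistics import mean, stdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = mean(data), stdev(data)

# Standardize: shift to mean 0, scale to standard deviation 1.
z = [(x - mu) / sigma for x in data]
```

The standardized series `z` has mean 0 and sample standard deviation 1 (up to floating-point rounding); any inaccuracy in `mu` or `sigma` propagates directly into every standardized value.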

Comparison To Other Languages/Packages

The proposed statistics library is not intended to be a competitor to such third-party libraries as numpy/scipy, or of proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

Most programming languages have little or no built-in support for statistics functions. Some exceptions:

R

R (and its proprietary cousin, S) is a programming language designed for statistics work. It is extremely popular with statisticians and is extremely feature-rich [11].

C#

The C# LINQ package includes extension methods to calculate the average of enumerables [12].

Ruby

Ruby does not ship with a standard statistics module, despite some apparent demand [13]. Statsample appears to be a feature-rich third-party library, aiming to compete with R [14].

PHP

PHP has an extremely feature-rich (although mostly undocumented) set of advanced statistical functions [15].

Delphi

Delphi includes standard statistical functions including Mean, Sum, Variance, TotalVariance, MomentSkewKurtosis in its Math library [16].

GNU Scientific Library

The GNU Scientific Library includes standard statistical functions, percentiles, median and others [17]. One innovation I have borrowed from the GSL is to allow the caller to optionally specify the pre-calculated mean of the sample (or an a priori known population mean) when calculating the variance and standard deviation [18].

Design Decisions Of The Module

My intention is to start small and grow the library as needed, rather than try to include everything from the start. Consequently, the current reference implementation includes only a small number of functions: mean, variance, standard deviation, median, mode. (See the reference implementation for a full list.)

I have aimed for the following design features:

API

The initial version of the library will provide univariate (single variable) statistics functions. The general API will be based on a functional model function(data, ...) -> result, where data is a mandatory iterable of (usually) numeric data.

The author expects that lists will be the most common data type used, but any iterable type should be acceptable. Where necessary, functions may convert to lists internally. Where possible, functions are expected to conserve the type of the data values, for example, the mean of a list of Decimals should be a Decimal rather than float.
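For example, the type-conservation behaviour described above looks like this in practice (values chosen for illustration):

```python
from decimal import Decimal
from statistics import mean

prices = [Decimal("1.50"), Decimal("2.25"), Decimal("2.25")]
avg = mean(prices)

# The result is a Decimal (here exactly 2), not a float.
print(type(avg).__name__, avg)
```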

Calculating mean, median and mode

The mean, median and mode functions take a single mandatory argument and return the appropriate statistic, e.g.:

Functions provided are:

mode is the sole exception to the rule that the data argument must be numeric. It will also accept an iterable of nominal data, such as strings.
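For instance, mode applied to nominal (non-numeric) data:

```python
from statistics import mode

colours = ["red", "blue", "blue", "red", "red"]

# mode accepts non-numeric data: the most common value is returned.
print(mode(colours))  # 'red'
```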

Calculating variance and standard deviation

In order to be similar to scientific calculators, the statistics module will include separate functions for population and sample variance and standard deviation. All four functions have similar signatures, with a single mandatory argument, an iterable of numeric data, e.g.:

>>> variance([1, 2, 2, 2, 3])
0.5

All four functions also accept a second, optional, argument, the mean of the data. This is modelled on a similar API provided by the GNU Scientific Library [18]. There are three use-cases for using this argument, in no particular order:

  1. The value of the mean is known a priori.
  2. You have already calculated the mean, and wish to avoid calculating it again.
  3. You wish to (ab)use the variance functions to calculate the second moment about some given point other than the mean.

In each case, it is the caller’s responsibility to ensure that given argument is meaningful.
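Use-case 2 above, re-using an already-computed mean, looks like this with the reference implementation:

```python
from statistics import mean, variance

data = [1, 2, 2, 2, 3]

# Compute the mean once, then pass it in to avoid a second pass
# over the data inside variance().
xbar = mean(data)
print(variance(data, xbar))  # 0.5
```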

Functions provided are:

Other functions

There is one other public function:

Specification

As the proposed reference implementation is in pure Python, other Python implementations can easily make use of the module unchanged, or adapt it as they see fit.

What Should Be The Name Of The Module?

This will be a top-level module statistics.

There was some interest in turning math into a package, and making this a sub-module of math, but the general consensus eventually agreed on a top-level module. Other potential but rejected names included stats (too much risk of confusion with existing stat module), and statslib (described as “too C-like”).

Discussion And Resolved Issues

This proposal has been previously discussed here [21].

A number of design issues were resolved during the discussion on Python-Ideas and the initial code review. There was a lot of concern about the addition of yet another sum function to the standard library, see the FAQs below for more details. In addition, the initial implementation of sum suffered from some rounding issues and other design problems when dealing with Decimals. Oscar Benjamin’s assistance in resolving this was invaluable.

Another issue was the handling of data in the form of iterators. The first implementation of variance silently swapped between a one- and two-pass algorithm, depending on whether the data was in the form of an iterator or sequence. This proved to be a design mistake, as the calculated variance could differ slightly depending on the algorithm used, and variance etc. were changed to internally generate a list and always use the more accurate two-pass implementation.

One controversial design involved the functions to calculate median, which were implemented as attributes on the median callable, e.g. median, median.low, median.high etc. Although there is at least one existing use of this style in the standard library, in unittest.mock, the code reviewers felt that this was too unusual for the standard library. Consequently, the design has been changed to a more traditional design of separate functions with a pseudo-namespace naming convention, median_low, median_high, etc.
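Under the revised naming convention, the median family is used like this:

```python
from statistics import median, median_low, median_high

data = [1, 3, 5, 7]

print(median(data))       # 4.0 (interpolates between the two middle values)
print(median_low(data))   # 3 (the smaller of the two middle values)
print(median_high(data))  # 5 (the larger of the two middle values)
```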

Another issue that was of concern to code reviewers was the existence of a function calculating the sample mode of continuous data, with some people questioning the choice of algorithm, and whether it was a sufficiently common need to be included. So it was dropped from the API, and mode now implements only the basic schoolbook algorithm based on counting unique values.

Another significant point of discussion was calculating statistics of timedelta objects. Although the statistics module will not directly support timedelta objects, it is possible to support this use-case by converting them to numbers first using the timedelta.total_seconds method.
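The conversion approach described above can be sketched as follows (sample durations are illustrative):

```python
from datetime import timedelta
from statistics import mean

durations = [timedelta(minutes=5), timedelta(minutes=7), timedelta(minutes=6)]

# Convert each timedelta to a plain number of seconds, take the
# mean, then convert back to a timedelta.
avg = timedelta(seconds=mean(td.total_seconds() for td in durations))
print(avg)  # 0:06:00
```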

Frequently Asked Questions

Shouldn’t this module spend time on PyPI before being considered for the standard library?

Older versions of this module have been available on PyPI [22] since 2010. Being much simpler than numpy, it does not require many years of external development.

Does the standard library really need yet another version of sum?

This proved to be the most controversial part of the reference implementation. In one sense, clearly three sums is two too many. But in another sense, yes. The reasons why the two existing versions are unsuitable are described here [23] but the short summary is:

There was some interest in “fixing” one or the other of the existing sums. If this occurs before 3.4 feature-freeze, the decision to keep statistics.sum can be re-considered.

Will this module be backported to older versions of Python?

The module currently targets 3.3, and I will make it available on PyPI for 3.3 for the foreseeable future. Backporting to older versions of the 3.x series is likely (but not yet decided). Backporting to 2.7 is less likely but not ruled out.

Is this supposed to replace numpy?

No. While it is likely to grow over the years (see open issues below) it is not aimed to replace, or even compete directly with, numpy. Numpy is a full-featured numeric library aimed at professionals, the nuclear reactor of numeric libraries in the Python ecosystem. This is just a battery, as in “batteries included”, and is aimed at an intermediate level somewhere between “use numpy” and “roll your own version”.

Future Work

References

This document has been placed in the public domain.