[Python-Dev] PEP 450 adding statistics module

Steven D'Aprano steve at pearwood.info
Sun Sep 8 21:19:54 CEST 2013


On Sun, Sep 08, 2013 at 10:25:22AM -0700, Guido van Rossum wrote:

> Steven, I'd like to just approve the PEP, given the amount of
> discussion that's happened already (though I didn't follow much of
> it). I quickly glanced through the PEP and didn't find anything I'd
> personally object to, but then I found your section of open issues,
> and I realized that you don't actually specify the proposed API in
> the PEP itself. It's highly unusual to approve a PEP that doesn't
> contain a specification. What did I miss?

You didn't miss anything, but I may have.

Should the PEP go through each public function in the module (there are only 11)? That may be a little repetitive, since most have the same, or almost the same, signatures. Or is it acceptable to just include an overview? I've come up with this:

API

The initial version of the library will provide univariate (single
variable) statistics functions.  The general API will be based on a
functional model ``function(data, ...) -> result``, where ``data``
is a mandatory iterable of (usually) numeric data.

The author expects that lists will be the most common data type used,
but any iterable type should be acceptable.  Where necessary, functions
may convert to lists internally.  Where possible, functions are
expected to conserve the type of the data values, for example, the mean
of a list of Decimals should be a Decimal rather than float.
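
As a rough illustration of the ``function(data, ...) -> result`` model and
of type conservation, here is a minimal sketch. It assumes the module as it
later shipped in the standard library under the name ``statistics``; details
may differ from this draft.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# A list of Decimals gives a Decimal back, not a float.
decimal_mean = mean([Decimal("0.5"), Decimal("0.75"),
                     Decimal("0.625"), Decimal("0.375")])

# The same holds for Fractions.
fraction_mean = mean([Fraction(3, 7), Fraction(1, 21),
                      Fraction(5, 3), Fraction(1, 3)])

# Any iterable should be acceptable, including a generator.
generator_mean = mean(x / 2 for x in [3, 5])
```

Here ``decimal_mean`` is ``Decimal('0.5625')``, ``fraction_mean`` is
``Fraction(13, 21)`` and ``generator_mean`` is ``2.0``.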


Calculating the mean, median and mode

    The ``mean``, ``median`` and ``mode`` functions take a single
    mandatory argument and return the appropriate statistic, e.g.:

    >>> mean([1, 2, 3])
    2.0

    ``mode`` is the sole exception to the rule that the data argument
    must be numeric.  It will also accept an iterable of nominal data,
    such as strings.
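
A short sketch of the three functions, again assuming the ``statistics``
module as it later shipped in the standard library. The colour data is
invented for illustration and has a single, unambiguous most common value.

```python
from statistics import median, mode

# median of an odd number of points is the middle value;
# for an even number it is the mean of the two middle values.
odd_median = median([1, 3, 5])
even_median = median([1, 3, 5, 7])

# mode works on numeric data ...
numeric_mode = mode([1, 1, 2, 3, 3, 3, 3, 4])

# ... and also on nominal data such as strings.
colour_mode = mode(["red", "blue", "blue", "red", "green", "red", "red"])
```

Here ``odd_median`` is ``3``, ``even_median`` is ``4.0``, ``numeric_mode``
is ``3`` and ``colour_mode`` is ``'red'``.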


Calculating variance and standard deviation

    In order to be similar to scientific calculators, the statistics
    module will include separate functions for population and sample
    variance and standard deviation.  All four functions have similar
    signatures, with a single mandatory argument, an iterable of
    numeric data, e.g.:

    >>> variance([1, 2, 2, 2, 3])
    0.5
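
The four functions can be sketched side by side. The names ``pvariance``,
``pstdev`` and ``stdev`` are not given in the overview above; they are taken
from the ``statistics`` module as it later shipped in the standard library
and should be read as an assumption here.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 2, 2, 3]

# Sample statistics divide the sum of squared deviations by n - 1.
sample_variance = variance(data)
sample_stdev = stdev(data)

# Population statistics divide by n instead.
population_variance = pvariance(data)
population_stdev = pstdev(data)

# Each standard deviation is the square root of the matching variance.
assert math.isclose(sample_stdev, math.sqrt(sample_variance))
assert math.isclose(population_stdev, math.sqrt(population_variance))
```

For this data the sum of squared deviations is 2, so the sample variance is
2/4 and the population variance is 2/5.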

    All four functions also accept a second, optional, argument, the
    mean of the data.  This is modelled on a similar API provided by
    the GNU Scientific Library[18].  There are three use-cases for
    using this argument, in no particular order:

        1)  The value of the mean is known *a priori*.

        2)  You have already calculated the mean, and wish to avoid
            calculating it again.

        3)  You wish to (ab)use the variance functions to calculate
            the second moment about some given point other than the
            mean.

    In each case, it is the caller's responsibility to ensure that
    the given argument is meaningful.
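
The use-cases above can be sketched as follows, assuming the ``variance``
and ``pvariance`` functions of the ``statistics`` module as it later shipped.
For use-case 3, the second moment about a point is computed by hand with a
hypothetical helper, ``second_moment_about``, rather than by passing a
non-mean value to the library functions, whose behaviour in that case is not
assumed here.

```python
from fractions import Fraction
from statistics import mean, pvariance, variance

data = [1, 2, 2, 2, 3]

# Use-cases 1 and 2: the mean is already known or already computed,
# so pass it in rather than have it calculated again.
m = mean(data)
sample_var = variance(data, m)
population_var = pvariance(data, m)

def second_moment_about(values, point):
    """Hypothetical helper: mean squared deviation about an arbitrary point."""
    values = list(values)
    return sum((Fraction(x) - point) ** 2 for x in values) / len(values)

# Use-case 3: the second moment about zero rather than about the mean.
moment_about_zero = second_moment_about(data, 0)

# It exceeds the population variance by the squared distance
# between the mean and the chosen point.
assert moment_about_zero == second_moment_about(data, m) + (m - 0) ** 2
```

With this data ``sample_var`` is ``0.5``, ``population_var`` is ``0.4`` and
``moment_about_zero`` is ``Fraction(22, 5)``.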

Is this satisfactory or do I need to go into more detail?

-- Steven


