[Python-Dev] PEP 450 adding statistics module (original) (raw)

Steven D'Aprano steve at pearwood.info
Thu Aug 15 15:08:22 CEST 2013

Previous message: [Python-Dev] PEP 450 adding statistics module
Next message: [Python-Dev] PEP 450 adding statistics module
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 15/08/13 21:42, Mark Dickinson wrote:

The PEP and code look generally good to me.

I think the API for median and its variants deserves some wider discussion: the reference implementation has a callable 'median', and variant callables 'median.low', 'median.high', 'median.grouped'. The pattern of attaching the variant callables as attributes on the main callable is unusual, and isn't something I've seen elsewhere in the standard library. I'd like to see some explanation in the PEP for why it's done this way. (There was already some discussion of this on the issue, but that was more centered around the implementation than the API.) I'd propose two alternatives for this: either have separate functions 'median', 'medianlow', 'medianhigh', etc., or have a single function 'median' with a "method" argument that takes a string specifying computation using a particular method. I don't see a really good reason to deviate from standard patterns here, and fear that users would find the current API surprising.

Alexander Belopolsky has convinced me (off-list) that my current implementation is better changed to a more conservative one of a callable singleton instance with methods implementing the alternative computations. I'll have something like:

def _singleton(cls): return cls()

@_singleton class median: def call(self, data): ... def low(self, data): ... ...

In my earlier stats module, I had a single median function that took a argument to choose between alternatives. I called it "scheme":

median(data, scheme="low")

R uses parameter called "type" to choose between alternate calculations, not for median as we are discussing, but for quantiles:

quantile(x, probs ... type = 7, ...).

SAS also uses a similar system, but with different numeric codes. I rejected both "type" and "method" as the parameter name since it would cause confusion with the usual meanings of those words. I eventually decided against this system for two reasons:

Each scheme ended up needing to be a separate function, for ease of both implementation and testing. So I had four private median functions, which I put inside a class to act as namespace and avoid polluting the main namespace. Then I needed a "master function" to select which of the methods should be called, with all the additional testing and documentation that entailed.
The API doesn't really feel very Pythonic to me. For example, we write:

mystring.rjust(width) dict.items()

rather than mystring.justify(width, "right") or dict.iterate("items"). So I think individual methods is a better API, and one which is more familiar to most Python users. The only innovation (if that's what it is) is to have median a callable object.

As far as having four separate functions, median, median_low, etc., it just doesn't feel right to me. It puts four slight variations of the same function into the main namespace, instead of keeping them together in a namespace. Names like median_low merely simulates a namespace with pseudo-methods separated with underscores instead of dots, only without the advantages of a real namespace.

(I treat variance and std dev differently, and make the sample and population forms separate top-level functions rather than methods, simply because they are so well-known from scientific calculators that it is unthinkable to me to do differently. Whenever I use numpy, I am surprised all over again that it has only a single variance function.)

-- Steven

Previous message: [Python-Dev] PEP 450 adding statistics module
Next message: [Python-Dev] PEP 450 adding statistics module
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list