[Python-Dev] PEP 450 adding statistics module

Terry Reedy tjreedy at udel.edu
Mon Sep 9 09:06:24 CEST 2013

On 9/8/2013 5:41 PM, Guido van Rossum wrote:

On Sun, Sep 8, 2013 at 1:48 PM, Oscar Benjamin <oscar.j.benjamin at gmail.com> wrote:

On 8 September 2013 18:32, Guido van Rossum <guido at python.org> wrote:

Going over the open issues:

- Parallel arrays or arrays of tuples? I think the API should require an array of tuples. It is trivial to zip up parallel arrays to the required format, while if you have an array of tuples, extracting the parallel arrays is slightly more cumbersome. Also, for manipulation of the raw data, an array of tuples makes it easier to do insertions or removals without worrying about losing the correspondence between the arrays.

For something like this, where there are multiple obvious formats for the input data, I think it's reasonable to just request whatever is convenient for the implementation.

Not really. The implementation may change, or its needs may not be obvious to the caller. I would say the right thing to do is request something easy to remember, which often means consistent. In general, Python APIs definitely skew towards lists of tuples rather than parallel arrays, and for good reasons -- that way you benefit most from built-in operations like slices and insert/append.
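
To make the zip/unzip asymmetry in the exchange above concrete, here is a minimal sketch. The names xs, ys and pairs and the sample values are made up for illustration and do not come from the PEP.

```python
# Hypothetical sample data: two parallel sequences of equal length.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# Parallel arrays -> list of tuples: a single zip() call.
pairs = list(zip(xs, ys))

# List of tuples -> parallel arrays: zip(*pairs) transposes the data,
# but it yields tuples, so an extra conversion is needed to get lists back.
xs2, ys2 = (list(column) for column in zip(*pairs))

# Inserting or removing a case touches a single list, so the x and y
# values of the remaining cases cannot drift out of step.
pairs.insert(1, (1.5, 3.0))
del pairs[0]
```

With parallel arrays, the same insertion and deletion would each have to be repeated on xs and ys, and forgetting one of them would silently misalign the data.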

This question has been discussed in the statistical software community for decades, going back to when storage was on magnetic tape, where contiguity was even more important than cache locality. In my experience with multiple packages, the most common input format is a table in which each row represents a case, sample, or whatever, which translates to a list of records (or tuples), just as with relational databases. Each column then represents a 'variable'. So I think we should go with that.

Some packages might transpose the data internally, but that is an internal matter. The tradeoff is that storing by cases makes adding a new case easier, while storing by variables makes adding a new variable easier.
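
The tradeoff can be sketched roughly as follows. The variable names and values are hypothetical, and a dict of lists stands in for "storing by variables"; real packages may use other layouts.

```python
# Hypothetical data: each case is (height_cm, weight_kg).
by_case = [(170, 65), (182, 80)]
by_variable = {"height_cm": [170, 182], "weight_kg": [65, 80]}

# Adding a new case: one append when stored by cases,
# but one append per column when stored by variables.
new_case = (165, 58)
by_case.append(new_case)
for name, value in zip(by_variable, new_case):
    by_variable[name].append(value)

# Adding a new variable: one new column when stored by variables,
# but every row has to be rebuilt when stored by cases.
ages = [31, 45, 27]
by_variable["age"] = ages
by_case = [row + (age,) for row, age in zip(by_case, ages)]

# An internal transpose converts the column layout back into rows.
transposed = list(zip(*by_variable.values()))
```

After these steps both layouts hold the same three cases, and transposed equals by_case, which is why the choice of internal storage can stay hidden behind a row-oriented API.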

-- Terry Jan Reedy
