[Python-ideas] minmax() function returning (minimum, maximum) tuple of a sequence

Steven D'Aprano steve at pearwood.info
Thu Oct 14 13:23:31 CEST 2010


On Thu, 14 Oct 2010 08:54:31 am you wrote:

After some thought, I've found a way to make running several "running calculations" in parallel fast. Speed should be comparable to having used the non-running variants.

Speed "should be" comparable? Are you guessing or have you actually timed it?

And surely the point of all this extra coding is to make something run faster than, not merely "comparable to", the sequential algorithm?

The method is to give each running calculation "blocks" of values instead of just one at a time. The applyinparallel(iterable, blocksize=1000, *runningcalcs) function would get blocks of values from the iterable and pass them to each running calculation separately. So RunningMax would look something like this:

class RunningMax(RunningCalc):
    def __init__(self):
        self.maxvalue = None

    def feed(self, value):
        if self.maxvalue is None or value > self.maxvalue:
            self.maxvalue = value

    def feedMultiple(self, values):
        self.feed(max(values))
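Filling in the pieces not quoted above (the base class and the driver are my guesses at the proposal's shape, following the applyinparallel(iterable, blocksize, *runningcalcs) signature described earlier):

```python
from itertools import islice

class RunningCalc:
    """Base class for incremental ("running") calculations.
    (My sketch; only RunningMax was shown in the original post.)"""
    def feed(self, value):
        raise NotImplementedError

    def feedMultiple(self, values):
        # Default: fall back to feeding one value at a time.
        for v in values:
            self.feed(v)

class RunningMax(RunningCalc):
    def __init__(self):
        self.maxvalue = None

    def feed(self, value):
        if self.maxvalue is None or value > self.maxvalue:
            self.maxvalue = value

    def feedMultiple(self, values):
        self.feed(max(values))

def applyinparallel(iterable, blocksize=1000, *runningcalcs):
    """Hand successive blocks of the iterable to each calculation in turn."""
    it = iter(iterable)
    while True:
        block = list(islice(it, blocksize))
        if not block:
            break
        for calc in runningcalcs:
            calc.feedMultiple(block)

rm = RunningMax()
applyinparallel([1, 2, 1, 5, 7, 3, 4, 6, 8, 6], 4, rm)
print(rm.maxvalue)  # 8
```

Note that the driver still calls each calculation sequentially on every block, which is the crux of the objection below.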

As I understand it, you have a large list of data, and you want to calculate a number of statistics on it. The naive algorithm does each calculation sequentially:

a = min(data)
b = max(data)
c = f(data)  # some other statistics
d = g(data)
...
x = z(data)

If the calculations take t1, t2, t3, ..., tx time, then the sequential calculation takes sum(t1, t2, ..., tx) plus a bit of overhead. If you do it in parallel, this should reduce the time to max(t1, t2, ..., tx) plus a bit of overhead, potentially a big saving.
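For contrast, genuinely concurrent evaluation would look something like this (my sketch using concurrent.futures, which is not part of the proposal; note that in CPython the GIL limits the actual speedup for pure-Python callables):

```python
from concurrent.futures import ThreadPoolExecutor

data = [1, 2, 1, 5, 7, 3, 4, 6, 8, 6]

# Submit each statistic as its own task, so total wall time approaches
# max(t1, ..., tx) rather than sum(t1, ..., tx).
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(fn, data) for fn in (min, max, sum)]
    a, b, c = [f.result() for f in futures]

print(a, b, c)  # 1 8 43
```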

But unless I've missed something significant, all you are actually doing is slicing data up into small pieces, then calling each function min, max, f, g, ..., z on each piece sequentially:

block = data[:size]
a = min(block)
b = max(block)
c = f(block)
...
block = data[size:2*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...
block = data[2*size:3*size]
a = min(a, min(block))
b = max(b, max(block))
c = f(c, f(block))
...

Each function still runs sequentially, but you've increased the amount of overhead a lot: you're slicing and dicing the data, plus calling each function multiple times.
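That overhead is easy to measure rather than guess at; a quick sketch (my choice of data, functions, and block size):

```python
import timeit

data = list(range(100_000))

def whole():
    # One sequential pass per statistic over the full list.
    return min(data), max(data)

def blocked(size=1000):
    # The same statistics computed block by block, merging as we go.
    a = b = None
    for i in range(0, len(data), size):
        block = data[i:i + size]
        a = min(block) if a is None else min(a, min(block))
        b = max(block) if b is None else max(b, max(block))
    return a, b

assert whole() == blocked()  # same answers either way
print("whole:  ", timeit.timeit(whole, number=20))
print("blocked:", timeit.timeit(blocked, number=20))
```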

And since each "running calculation" class needs to be hand-written to suit the specifics of the calculation, that's a lot of extra work just to get something which I expect will run slower than the naive sequential algorithm.

I'm also distracted by the names, RunningMax and RunningCalc. RunningFoo doesn't mean "do Foo in parallel", it means to return intermediate calculations. For example, if I ran a function called RunningMax on this list:

[1, 2, 1, 5, 7, 3, 4, 6, 8, 6]

I would expect it to yield or return:

[1, 2, 5, 7, 8]
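Such a running maximum is straightforward to write as a generator; a sketch (mine, not part of the proposal), yielding each value that sets a new maximum:

```python
from itertools import accumulate

def running_max(iterable):
    """Yield each value that sets a new maximum."""
    last = None
    for m in accumulate(iterable, max):  # cumulative maxima
        if m != last:
            yield m
            last = m

print(list(running_max([1, 2, 1, 5, 7, 3, 4, 6, 8, 6])))  # [1, 2, 5, 7, 8]
```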

-- Steven D'Aprano


