[Python-Dev] Accumulation module (original) (raw)

Raymond Hettinger python at rcn.com
Wed Jan 14 07:40:43 EST 2004


> * Is there a way to compute the standard deviation without multiple > passes over the data (one to compute the mean and a second to sum the > squares of the differences from the mean?

[Tim]

The textbook formula does it in one pass, by accumulating running totals of the elements and of their squares. It's numerically atrocious, though, and should never be used (it can even end up delivering a negative result!). Knuth gives a much better recurrence formula for this, doing one-pass computation of variance in a numerically stable way.

I found it.

There is a suggestion that the formula would be better off if the recurrence sums are accumulated along with a cumulative error term that tracks the loss of significance when adding a small adjustment to a larger sum. The textbook exercise indicates that that technique is the most useful in the presence of crummy rounding. Do you think there is any value in using the error term tracking?

def stddev(data): # Knuth -- Seminumerical Algorithms 4.2.2 p.216 it = iter(data) m = float(it.next()) # new running mean s = 0.0 # running sum((x-mean)**2) k = 1 # number of items dm = ds = 0 # carried forward error term for x in it: k += 1

    adjm = (x-m)/k - dm
    newm = m + adjm
    dm = (newm - m) - adjm
    
    adjs = (x-m)*(x-newm) - ds
    news = s + adjs
    ds = (news - s) - adjs
    s = news

    m = newm
return (s / (k-1)) ** 0.5  # sample standard deviation

Without the error term tracking, the inner loop simplifies to:

    k += 1
    newm = m+(x-m)/k
    s += (x-m)*(x-newm)
    m = newm

Raymond Hettinger



More information about the Python-Dev mailing list