[Python-Dev] Re: heaps

Tim Peters tim.one@comcast.net
Mon, 05 May 2003 22:35:28 -0400


[David Eppstein, on the bar-raising behavior of
    person > q[0]
]

> Good point.  If any permutation of the input sequence is equally
> likely, and you're selecting the best k out of n items, the expected
> number of times you have to hit the data structure in your heapq
> solution is roughly k ln n, so the total expected time is
> O(n + k log k log n), with a really small constant factor on the
> O(n) term.  The sorting solution I suggested has total time
> O(n log k), and even though sorting is built-in and fast it can't
> compete when k is small.  Random pivoting is O(n + k), but with a
> larger constant factor, so your heapq solution looks like a winner.

In real Python Life, it's the fastest way I know (depending ...).
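For reference, here's a minimal sketch of the heapq approach under
discussion (my reconstruction, not David's actual test code; it assumes
the 2.3 heapq module, and the name k_best_heap is made up):

import heapq
from itertools import islice

def k_best_heap(iterable, k):
    # Seed a min-heap with the first k items; q[0] is then "the bar"
    # a later item must exceed to displace the smallest survivor.
    it = iter(iterable)
    q = list(islice(it, k))
    heapq.heapify(q)
    for item in it:
        if item > q[0]:
            heapq.heapreplace(q, item)
    q.sort()    # smallest-to-largest; these are the k best overall
    return q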

> For fairness, it might be interesting to try another run of your
> test in which the input sequence is sorted in increasing order
> rather than random.

Comparing the worst case of one against the best case of the other isn't my idea of fairness, but sure. On the best-1000 of a million floats test, with the floats sorted first, the heap method ran about 30x slower than on random data, while the sort method ran about 1.3x faster than on random data. OTOH, if I undo my speed tricks and call a function in the sort method (instead of doing it all micro-optimized inline), that slows the sort method by a bit over a factor of 2.

> I.e., replace the random generation of seq by
>
>     seq = range(M)
>
> I'd try it myself, but I'm still running python 2.2 and haven't
> installed heapq.  I'd have to know more about your application to
> have an idea whether the sorted or randomly-permuted case is more
> representative.

Of course -- so would I.
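If anyone wants to try that run, here's a hedged sketch of the kind of
harness involved -- M, the use of time.clock(), and the k_best_heap
helper from the sketch above are all my assumptions, not David's actual
test code:

import time

M = 1000000
seq = range(M)      # already-sorted input, per David's suggestion
start = time.clock()
result = k_best_heap(seq, 1000)
print 'best-1000 of %d sorted ints: %.2f seconds' % (
    M, time.clock() - start)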

Here's a surprise: I coded a variant of the quicksort-like partitioning method (at the bottom of this mail). On the largest-1000 of a million random-float case, times were remarkably steady across trials (i.e., using a different set of a million random floats each time):

    heapq                   0.96 seconds
    sort (micro-optimized)  3.4  seconds
    KBest (below)           2.6  seconds

The KBest code creates new lists with wild abandon. I expect it does better than the sort method anyway because it gets to exploit its own form of "raise the bar" behavior as more elements come in. For example, on the first run, len(buf) exceeded 3000 only 14 times; each time, the final pivot value is then used by put() as an "ignore the input unless it's bigger than that" cutoff:

    pivoted w/ 0.247497558554
    pivoted w/ 0.611006884768
    pivoted w/ 0.633565558936
    pivoted w/ 0.80516673256
    pivoted w/ 0.814304890889
    pivoted w/ 0.884660572175
    pivoted w/ 0.89986744075
    pivoted w/ 0.946575251872
    pivoted w/ 0.980386533221
    pivoted w/ 0.983743795382
    pivoted w/ 0.992381911217
    pivoted w/ 0.994243625292
    pivoted w/ 0.99481443021
    pivoted w/ 0.997044443344

The already-sorted case is also a bad case for this method, because then the pivot is never big enough to trigger the early exit in put().

def split(seq, pivot):
    # Partition seq into three lists: items < pivot, == pivot, > pivot.
    lt, eq, gt = [], [], []
    lta, eqa, gta = lt.append, eq.append, gt.append
    for x in seq:
        c = cmp(x, pivot)
        if c < 0:
            lta(x)
        elif c:
            gta(x)
        else:
            eqa(x)
    return lt, eq, gt
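For concreteness, a made-up interactive example of what split returns:

>>> split([3, 1, 4, 1, 5], 3)
([1, 1], [3], [4, 5])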

# KBest(k, minusinf) remembers the largest k objects
# from a sequence of objects passed one at a time to
# put().  minusinf must be smaller than any object
# passed to put().  After feeding in all the objects,
# call get() to retrieve a list of the k largest (or
# as many as were passed to put(), if put() was called
# fewer than k times).

class KBest(object):
    __slots__ = 'k', 'buflim', 'buf', 'cutoff'

    def __init__(self, k, minusinf):
        self.k = k
        self.buflim = 3*k
        self.buf = []
        self.cutoff = minusinf

    def put(self, obj):
        if obj <= self.cutoff:
            return

        buf = self.buf
        buf.append(obj)
        if len(buf) <= self.buflim:
            return

        # Reduce len(buf) by at least one, by retaining
        # at least k, and at most len(buf)-1, of the
        # largest objects in buf.
        from random import choice
        sofar = []
        k = self.k
        while len(sofar) < k:
            pivot = choice(buf)
            buf, eq, gt = split(buf, pivot)
            sofar.extend(gt)
            if len(sofar) < k:
                sofar.extend(eq[:k - len(sofar)])

        self.buf = sofar
        self.cutoff = pivot

    def get(self):
        from random import choice
        buf = self.buf
        k = self.k
        if len(buf) <= k:
            return buf

        # Retain only the k largest.
        sofar = []
        needed = k
        while needed:
            pivot = choice(buf)
            lt, eq, gt = split(buf, pivot)
            if len(gt) <= needed:
                sofar.extend(gt)
                needed -= len(gt)
                if needed:
                    takefromeq = min(len(eq), needed)
                    sofar.extend(eq[:takefromeq])
                    needed -= takefromeq
                # If we still need more, they have to
                # come out of things < pivot.
                buf = lt
            else:
                # gt alone is too large.
                buf = gt

        assert len(sofar) == k
        self.buf = sofar
        return sofar
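A hedged usage sketch (mine, not part of the original code), on the same
kind of data as the timing test; -1e300 plays the minusinf role here,
since random() only produces floats in [0.0, 1.0):

from random import random

kb = KBest(1000, -1e300)    # -1e300 is smaller than any float put() sees
for i in xrange(1000000):
    kb.put(random())
best = kb.get()             # the 1000 largest, in no particular order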