[Python-Dev] Re: heaps (original) (raw)

David Eppstein eppstein@ics.uci.edu
Sat, 03 May 2003 22:46:58 -0700


On 5/4/03 1:26 AM -0400 Tim Peters <tim.one@comcast.net> wrote:

it's normal in N-best applications that N is much smaller than the number of items being ranked, and you don't want to consume more than O(N) memory (for example, google wants to show you the best-scoring 25 documents of the 6 million matches it found):

Ok, I think you're right, for this sort of thing heapq is better. One could extend my priorityDictionary code to limit memory like this but it would be unnecessary work when the extra features it has over heapq are not used for this sort of algorithm.

On the other hand, if you really want to find the n best items in a data stream large enough that you care about using only space O(n), it might also be preferable to take constant amortized time per item rather than the O(log n) that heapq would use, and it's not very difficult nor does it require any fancy data structures. Some time back I needed some Java code for this, haven't had an excuse to port it to Python. In case anyone's interested, it's online at <http://www.ics.uci.edu/~eppstein/161/KBest.java>. Looking at it now, it seems more complicated than it needs to be, but maybe that's just the effect of writing in Java instead of Python (I've seen an example of a three-page Java implementation of an algorithm in a textbook that could easily be done in a dozen Python lines).

David Eppstein http://www.ics.uci.edu/~eppstein/ Univ. of California, Irvine, School of Information & Computer Science