[Python-Dev] Re: heaps (original) (raw)

Tim Peters tim.one@comcast.net
Sun, 04 May 2003 01:26:09 -0400

Previous message: [Python-Dev] Re: heaps
Next message: [Python-Dev] Re: heaps
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

[Tim]

... priorityDictionary looks like an especially nice API for this specific algorithm, but, e.g., impossible to use directly for maintaining an N- best queue (priorityDictionary doesn't support multiple values with the same priority, right?

That was wrong: the dict maps items to priorities, and I read it backwards. Sorry!

if we're trying to find the 10,000 poorest people in America, counting only one as dead broke would be too Republican for some peoples' tastes ). OTOH, heapq is easy and efficient for that class of heap application.

[David Eppstein]

I agree with your main points (heapq's inability to handle certain priority queue applications doesn't mean it's useless, and its implementation-specific API helps avoid fooling programmers into thinking it's any more than what it is). But I am confused at this example. Surely it's just as easy to store (income,identity) tuples in either data structure.

As above, I was inside out.

"Just as easy" can't be answered without trying to write actual code, though. Given that heapq and priorityDictionary are both min-heaps, to avoid artificial pain let's look for the people with the N highest incomes instead.

For an N-best queue using heapq, "the natural" thing is to define people like so:

class Person: def init(self, income): self.income = income

def __cmp__(self, other):
    return cmp(self.income, other.income)

and then the N-best calculation is as follows; it's normal in N-best applications that N is much smaller than the number of items being ranked, and you don't want to consume more than O(N) memory (for example, google wants to show you the best-scoring 25 documents of the 6 million matches it found):

"""

N-best queue for people with the N largest incomes.

import heapq

dummy = Person(-1) # effectively an income of -Inf q = [dummy] * N # it's fine to use the same object N times

for person in people: if person > q[0]: heapq.heapreplace(q, person)

The result list isn't sorted.

result = [person for person in q if q is not dummy] """

I'm not as experienced with priorityDictionary. For heapq, the natural cmp is the one that compares objects' priorities. For priorityDictionary, we can't use that, because Person instances will be used as dict keys, and then two Persons with the same income couldn't be in the queue at the same time. So Person.cmp will have to change in such a way that distinct Persons never compare equal. I also have to make sure that a Person is hashable. I see there's another subtlety, apparent only from reading the implementation code: in the heap maintained alongside the dict, it's actually

(priority, object)

tuples that get compared. Since I expect to see Persons with equal income, when two such tuples get compared, they'll tie on the priority, and go on to compare the Persons. So I have to be sure too that comparing two Persons is cheap.

Pondering all that for a while, it seems best to make sure Person doesn't define cmp or hash at all. Then instances will get compared by memory address, distinct Persons will never compare equal, comparing Persons is cheap, and hashing by memory address is cheap too:

class Person: def init(self, income): self.income = income

The N-best code is then:

""" q = priorityDictionary()

for dummy in xrange(N): q[Person(-1)] = -1 # have to ensure these are distinct Persons

for person in people: if person.income > q.smallest().income: del q[q.smallest()] q[person] = person.income

The result list is sorted.

result = [person for person in q if person.income != -1] """

Perhaps paradoxically, I had to know essentially everything about how priorityDictionary is implemented to write a correct and efficient algorithm here. That was true of heapq too, of course, but there were fewer subtleties to trip over there, and heapq isn't trying to hide its implementation.

BTW, there's a good use of heapq for you: you could use it to maintain the under-the-covers heap inside priorityDictionary! It would save much of the code, and possibly speed it too (heapq's implementation of popping usually requires substantially fewer comparisons than priorityDictionary.smallest uses; this is explained briefly in the comments before _siftup, deferring to Knuth for the gory details).

If you mean, you want to find the 10k smallest income values (rather than the people having those incomes), then it may be that a better data structure would be a simple list L in which the value of L[i] is the count of people with income i.

Well, leaving pennies out of it, incomes in the USA span 9 digits, so something taking O(N) memory would still be most attractive.

Previous message: [Python-Dev] Re: heaps
Next message: [Python-Dev] Re: heaps
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]