[Python-Dev] Re: heaps (original) (raw)

Tim Peters tim_one@email.msn.com
Thu, 1 May 2003 02:13:51 -0400


[Raymond Hettinger]

I'm quite pleased with the version already in CVS. It is a small masterpiece of exposition, sophistication, simplicity, and speed. A class based interface is not necessary for every algorithm.

[David Eppstein]

It has some elegance, but omits basic operations that are necessary for many heap-based algorithms and are not provided by this interface.

I think Raymond was telling you it isn't intended to be "an interface", rather it's quite up-front about being a collection of functions that operate directly on a Python list, implementing a heap in a very straightforward way, and deliberately not trying to hide any of that. IOW, it's a concrete data type, not an abstract one. I asked, and it doesn't feel like apologizing for being what it is .

That's not to say Python couldn't benefit from providing an abstract heap API too, and umpteen different implementations specialized to different kinds of heap applications. It is saying that heapq isn't trying to be that, so pointing out that it isn't falls kinda flat.

Specifically, the three algorithms that use heaps in my upper-division undergraduate algorithms classes are heapsort (for which heapq works fine, but you would generally want to use L.sort() instead), Dijkstra's algorithm (and its relatives such as A* and Prim), which needs the ability to decrease keys, and event-queue-based plane sweep algorithms (e.g. for finding all crossing pairs in a set of line segments) which need the ability to delete items from other than the top.

Then some of those will want a different implementation of a heap. The algorithms in heapq are still suitable for many heap applications, such as maintaining an N-best list (like retaining only the 10 best-scoring items in a long sequence), and A* on a search tree (when there's only one path to a node, decrease-key isn't needed; A* on a graph is harder).

To see how important the lack of these operations is, I decided to compare two implementations of Dijkstra's algorithm.

I don't think anyone claimed-- or would claim --that a heapq is suitable for all heap purposes.

The priority-dict implementation from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/119466 takes as input a graph, coded as nested dicts {vertex: {neighbor: edge length}}. This is a variation of a graph coding suggested in one of Guido's essays that, as Raymond suggests, avoids using a separate class based interface.

Here's a simplification of my dictionary-based Dijkstra implementation: def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = priorityDictionary() # est.dist. of non-final vert. Q[start] = 0 for v in Q: D[v] = Q[v] for w in G[v]: vwLength = D[v] + G[v][w] if w not in D and (w not in Q or vwLength < Q[w]): Q[w] = vwLength P[w] = v return (D,P) Here's a translation of the same implementation to heapq (untested since I'm not running 2.3). Since there is no decrease in heapq, nor any way to find and remove old keys,

A heapq is a list, so you could loop over the list to find an old object. I wouldn't recommend that in general , but it's easy, and if the need is rare then the advertised fact that a heapq is a plain list can be very convenient. Deleting an object from "the interior" still isn't supported directly, of course. It's possible to do so efficiently with this implementation of a heap, but since it doesn't support an efficient way to find an old object to begin with, there seemed little point to providing an efficient delete-by-index function. Here's one such:

import heapq

def delete_obj_at_index(heap, pos): lastelt = heap.pop() if pos >= len(heap): return

# The rest is a lightly fiddled variant of heapq._siftup.
endpos = len(heap)
# Bubble up the smaller child until hitting a leaf.
childpos = 2*pos + 1    # leftmost child position
while childpos < endpos:
    # Set childpos to index of smaller child.
    rightpos = childpos + 1
    if rightpos < endpos and heap[rightpos] <= heap[childpos]:
        childpos = rightpos
    # Move the smaller child up.
    heap[pos] = heap[childpos]
    pos = childpos
    childpos = 2*pos + 1
# The leaf at pos is empty now.  Put lastelt there, and and bubble
# it up to its final resting place (by sifting its parents down).
heap[pos] = lastelt
heapq._siftdown(heap, 0, pos)

I changed the algorithm to add new tuples for each new key, leaving the old tuples in place until they bubble up to the top of the heap.

def Dijkstra(G,start,end=None): D = {} # dictionary of final distances P = {} # dictionary of predecessors Q = [(0,None,start)] # heap of (est.dist., pred., vert.) while Q: dist,pred,v = heappop(Q) if v in D: continue # tuple outdated by decrease-key, ignore D[v] = dist P[v] = pred for w in G[v]: heappush(Q, (D[v] + G[v][w], v, w)) return (D,P) My analysis of the differences between the two implementations: - The heapq version is slightly complicated (the two lines if...continue) by the need to explicitly ignore tuples with outdated priorities. This need for inserting low-level data structure maintenance code into higher-level algorithms is intrinsic to using heapq, since its data is not structured in a way that can support efficient decrease key operations.

It surprised me that you tried using heapq at all for this algorithm. I was also surprised that you succeeded <0.9 wink>.

- Since the heap version had no way to determine when a new key was smaller than an old one, the heapq implementation needed two separate data structures to maintain predecessors (middle elements of tuples for items in queue, dictionary P for items already removed from queue). In the dictionary implementation, both types of items stored their predecessors in P, so there was no need to transfer this information from one structure to another.

- The dictionary version is slightly complicated by the need to look up old heap keys and compare them with the new ones instead of just blasting new tuples onto the heap. So despite the more-flexible heap structure of the dictionary implementation, the overall code complexity of both implementations ends up being about the same. - Heapq forced me to build tuples of keys and items, while the dictionary based heap did not have the same object-creation overhead (unless it's hidden inside the creation of dictionary entries).

Rest easy, it's not.

On the other hand, since I was already building tuples, it was convenient to also store predecessors in them instead of in some other structure.

- The heapq version uses significantly more storage than the dictionary: proportional to the number of edges instead of the number of vertices. - The changes I made to Dijkstra's algorithm in order to use heapq might not have been obvious to a non-expert; more generally I think this lack of flexibility would make it more difficult to use heapq for cookbook-type implementation of textbook algorithms.

Depends on the specific algorithms in question, of course. No single heap implementation is the best choice for all algorithms, and heapq would be misleading people if, e.g., it did offer a decrease_key function -- it doesn't support an efficient way to do that, and it doesn't pretend to.

- In Dijkstra's algorithm, it was easy to identify and ignore outdated heap entries, sidestepping the inability to decrease keys. I'm not convinced that this would be as easy in other applications of heaps.

All that is explaining why this specific implementation of a heap isn't suited to the task at hand. I don't believe that was at issue, though. An implementation of a heap that is suited for this task may well be less suited for other tasks.

- One of the reasons to separate data structures from the algorithms that use them is that the data structures can be replaced by ones with equivalent behavior, without changing any of the algorithm code. The heapq Dijkstra implementation is forced to include code based on the internal details of heapq (specifically, the line initializing the heap to be a one element list), making it less flexible for some uses. The usual reason one might want to replace a data structure is for efficiency, but there are others: for instance, I teach various algorithms classes and might want to use an implementation of Dijkstra's algorithm as a testbed for learning about different priority queue data structures. I could do that with the dictionary-based implementation (since it shows nothing of the heap details) but not the heapq one.

You can wrap any interface you like around heapq (that's very easy to do in Python), but it won't change that heapq's implementation is poorly suited to this application. priorityDictionary looks like an especially nice API for this specific algorithm, but, e.g., impossible to use directly for maintaining an N-best queue (priorityDictionary doesn't support multiple values with the same priority, right? if we're trying to find the 10,000 poorest people in America, counting only one as dead broke would be too Republican for some peoples' tastes ). OTOH, heapq is easy and efficient for that class of heap application.

Overall, while heapq was usable for implementing Dijkstra, I think it has significant shortcomings that could be avoided by a more well-thought-out interface that provided a little more functionality and a little clearer separation between interface and implementation.

heapq isn't trying to separate them at all -- quite the contrary! It's much like the bisect module that way. They find very good uses in practice.

I should note that I objected to heapq at the start, because there are several important heap implementation techniques, and just one doesn't fit anyone all the time. My objection went away when Guido pointed out how much like bisect it is: since it doesn't pretend one whit to generality or opaqueness, it can't be taken as promising more than it actually does, nor can it interfere with someone (so inclined) defining a general heap API: it's not even a class, just a handful of functions. Useful, too, just as it is. A general heap API would be nice, but it wouldn't have much (possibly nothing) to do with heapq.