Compare algorithms for heapq.nsmallest « Python recipes « ActiveState Code

```python
from random import *
from heapq import *

cmps = 0

class Int(int):
    'int subclass that counts every comparison made on it'
    def __lt__(self, other):
        global cmps
        cmps += 1
        return int.__lt__(self, other)
    def __le__(self, other):
        global cmps
        cmps += 1
        return int.__le__(self, other)

def count_cmps(f, data, k):
    'Count comparisons in a call to f(k, data)'
    global cmps
    data = data[:]
    shuffle(data)
    cmps = 0
    result = f(k, data)
    assert result[:10] == list(range(10))
    return cmps

# -------- variants of nsmallest -------

def heapifying_smallest(k, data):
    'Heapify the full dataset, then pop off the k smallest'
    heapify(data)
    result = [heappop(data) for j in range(k)]
    data.extend(result)
    return result

def select_nth(data, n):
    'Return the nth smallest element (0-indexed) using quickselect'
    if len(data) == 1:
        return data[0]
    pivot = choice(data)
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem < pivot else rhs).append(elem)
    if len(lhs) >= n+1:
        return select_nth(lhs, n)
    else:
        return select_nth(rhs, n - len(lhs))

def selecting_smallest(k, data):
    'Find the kth smallest, then gather and sort everything at or below it'
    pivot = select_nth(data, k)
    return sorted(elem for elem in data if elem <= pivot)[:k]

def partitioning_smallest(n, data):
    'Recursively partition around a random pivot, keeping the low side'
    if len(data) <= 1:
        return data
    pivot = choice(data)
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem <= pivot else rhs).append(elem)
    if n < len(lhs):
        return partitioning_smallest(n, lhs)
    else:
        return sorted(lhs) + partitioning_smallest(n - len(lhs), rhs)

if __name__ == '__main__':
    # compare nsmallest implementations
    n, k = 100000, 100
    print('n: %d\tk: %d' % (n, k))
    data = list(map(Int, range(n)))
    for f in [nsmallest, heapifying_smallest, selecting_smallest, partitioning_smallest]:
        counts = sorted(count_cmps(f, data, k) for i in range(5))
        print(counts, f.__name__)
```

This benchmarking tool was created to show the relative performance of three different approaches to writing heapq.nsmallest().

The first approach, used by heapq.nsmallest() itself, is space efficient: it needs only a k-length list (maintained as a heap) and makes a single pass over the input iterable. For most datasets this algorithm is very efficient because only a handful of entries ever need to be inserted in the heap. Most elements need just one comparison against the largest of the k smallest elements seen so far. So, for a small k and large n, the total number of comparisons is only a little higher than n.
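For illustration, here is a minimal sketch of that bounded-heap idea. The name bounded_heap_smallest and the value-negation trick (to simulate a max-heap with heapq's min-heap) are mine; the real heapq.nsmallest implementation is more refined:

```python
from heapq import heapify, heapreplace

def bounded_heap_smallest(k, iterable):
    'Illustrative sketch only -- assumes k >= 1'
    it = iter(iterable)
    # Seed a max-heap of the first k items by storing negated values.
    heap = [-elem for _, elem in zip(range(k), it)]
    heapify(heap)
    for elem in it:
        # One comparison against the largest of the k smallest so far;
        # only the rare winners pay the log(k) cost of a heap update.
        if -heap[0] > elem:
            heapreplace(heap, -elem)
    return sorted(-elem for elem in heap)
```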

The second approach, heapifying_smallest(), is not space efficient; it consumes space for an n-length list. It makes one pass over the input iterable to build that list and another pass to heapify it. Pulling off the k smallest elements then takes O(k log n) time.
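A quick self-contained demonstration of the heapify-then-pop idea on a small input (the sample values are arbitrary):

```python
from heapq import heapify, heappop

data = [9, 4, 7, 1, 8, 2, 6]
heapify(data)                               # O(n) pass over the full list
print([heappop(data) for _ in range(3)])    # -> [1, 2, 4], O(k log n) pops
```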

The third approach, the partition-based variants selecting_smallest() and partitioning_smallest(), takes the most space: n elements for pulling the whole dataset into memory, plus about half that size (on average) for the initial partition.
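To make the space cost concrete, here is a toy run of a single partition step (the names and sample data are illustrative); lhs and rhs together duplicate all n elements before the recursion discards one side:

```python
from random import choice, sample

data = sample(range(1000), 100)
pivot = choice(data)
lhs = [e for e in data if e <= pivot]
rhs = [e for e in data if e > pivot]
print(len(data), len(lhs), len(rhs))   # lhs and rhs together hold n extra references
```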

Here are the comparative results for n=100000 and k=100:

```
[105714, 105861, 105987, 106048, 106452] nsmallest
[166525, 166534, 166569, 166605, 166629] heapifying_smallest
[222351, 290743, 327718, 355590, 496865] selecting_smallest
[131171, 149186, 191046, 208020, 217504] partitioning_smallest
```

The results for partitioning do much better if an effort is made to choose a pivot that is close to the intended cut-off point. For example:

```python
pivot = sorted(choice(data) for i in range(9))[9*n//len(data)]
```
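Putting that together, here is one possible sketch of the sampled-pivot variant. The name sampled_partitioning_smallest is hypothetical, and the min(..., 8) clamp is my addition to keep the sample index in range when n approaches len(data):

```python
from random import choice

def sampled_partitioning_smallest(n, data):
    'Sketch: partitioning_smallest with a pivot drawn from a sorted sample'
    if len(data) <= 1:
        return data
    # Sample nine candidates and pick the one whose rank in the sample
    # roughly matches the cut-off fraction n/len(data).
    samples = sorted(choice(data) for i in range(9))
    pivot = samples[min(9 * n // len(data), 8)]   # clamp added for safety
    lhs, rhs = [], []
    for elem in data:
        (lhs if elem <= pivot else rhs).append(elem)
    if n < len(lhs):
        return sampled_partitioning_smallest(n, lhs)
    return sorted(lhs) + sampled_partitioning_smallest(n - len(lhs), rhs)
```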