[Python-Dev] Drastically improving list.sort() for lists of strings/ints (original) (raw)

Tim Peters tim.peters at gmail.com
Sun Sep 11 17:42:22 EDT 2016

Previous message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
Next message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

[redirected from python-dev, to python-ideas; please send followups only to python-ideas]

[Elliot Gorokhovsky <elgo8537 at colorado.edu>]

... TL;DR: Should I spend time making list.sort() detect if it is sorting lexicographically and, if so, use a much faster algorithm?

It will be fun to find out ;-)

As Mark, and especially Terry, pointed out, a major feature of the current sort is that it can exploit many kinds of pre-existing order. As the paper you referenced says, "Realistic sorting problems are usually far from random." But, although they did run some tests against data with significant order, they didn't test against any algorithms aiming at exploiting uniformity. Just against their radix sort variants, and against a quicksort.

That's where it's likely next to impossible to guess in advance whether radix sort will have a real advantage. All the kinds of order the current sort can exploit are far from obvious, because the mechanisms it employs are low-level & very general. For example, consider arrays created by this function, for even n:

def bn(n):
    r = [None] * n
    r[0::2] = range(0, n//2)
    r[1::2] = range(n//2, n)
    return r

Then, e.g.,

bn(16) [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

This effectively takes range(n), cuts it in half, and does a "perfect shuffle" on the two halves.

You'll find nothing in the current code looking for this as a special case, but it nevertheless sorts such arrays in "close to" O(n) time, and despite that there's no natural run in the input longer than 2 elements.

That said, I'd encourage you to write your code as a new list method at first, to make it easiest to run benchmarks. If that proves promising, then you can worry about how to make a single method auto-decide which algorithm to use.

Also use the two-array version. It's easier to understand and to code, and stability is crucial now. The extra memory burden doesn't bother me - an array of C pointers consumes little memory compared to the memory consumed by the Python objects they point at.

Most of all, have fun! :-)

Previous message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
Next message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list