[Python-Dev] Drastically improving list.sort() for lists of strings/ints (original) (raw)
Elliot Gorokhovsky elliot.gorokhovsky at gmail.com
Sun Sep 11 20:01:37 EDT 2016
- Previous message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
- Next message (by thread): [Python-Dev] Python 3.7: remove all private C functions from the Python C API?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Wow, Tim himself! Regarding performance on semi-ordered data: we'll have to benchmark to see, but intuitively I imagine radix would meet Timsort because verifying that a list of strings is sorted takes Omega(nw) (which gives a lower bound on Timsort), where w is the word length. Radix sort is Theta(nw). So at least asymptotically it checks out. I think if one uses the two-array algorithm, other semi-sortings can also be exploited, since the items get placed into their respective buckets in the order in which they appear in the list. So, for the example you gave, one pass would sort it correctly (since the list has the property if x_1 and x_2 are in bucket b, x1 comes before x2 in the list, so x1 will also come before x2 in the bucket. Except possibly for one "border bucket" that includes n/2). And then it would just be Theta(nw/b) in each bucket to verify sorted. I mean honestly the cool thing about radix is that the best case for Timsort on strings, Omega(nw), is the worst case for radix! So the point is I think the two array version, at least, preserves a lot of structure. Anyway, I hope to have benchmarks (relatively) soon! (I'm a senior in high school so I'm pretty busy...but I'll try to get on this as soon as I can).
On Sun, Sep 11, 2016 at 3:42 PM Tim Peters <tim.peters at gmail.com> wrote:
[redirected from python-dev, to python-ideas; please send followups only to python-ideas]
[Elliot Gorokhovsky <elgo8537 at colorado.edu>] > ... > TL;DR: Should I spend time making list.sort() detect if it is sorting > lexicographically and, if so, use a much faster algorithm? It will be fun to find out ;-) As Mark, and especially Terry, pointed out, a major feature of the current sort is that it can exploit many kinds of pre-existing order. As the paper you referenced says, "Realistic sorting problems are usually far from random." But, although they did run some tests against data with significant order, they didn't test against any algorithms aiming at exploiting uniformity. Just against their radix sort variants, and against a quicksort. That's where it's likely next to impossible to guess in advance whether radix sort will have a real advantage. All the kinds of order the current sort can exploit are far from obvious, because the mechanisms it employs are low-level & very general. For example, consider arrays created by this function, for even
n
: def bn(n): r = [None] * n r[0::2] = range(0, n//2) r[1::2] = range(n//2, n) return r Then, e.g., >>> bn(16) [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15] This effectively takes range(n), cuts it in half, and does a "perfect shuffle" on the two halves. You'll find nothing in the current code looking for this as a special case, but it nevertheless sorts such arrays in "close to" O(n) time, and despite that there's no natural run in the input longer than 2 elements. That said, I'd encourage you to write your code as a new list method at first, to make it easiest to run benchmarks. If that proves promising, then you can worry about how to make a single method auto-decide which algorithm to use. Also use the two-array version. It's easier to understand and to code, and stability is crucial now. The extra memory burden doesn't bother me - an array of C pointers consumes little memory compared to the memory consumed by the Python objects they point at. Most of all, have fun! :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20160912/891c1497/attachment.html>
- Previous message (by thread): [Python-Dev] Drastically improving list.sort() for lists of strings/ints
- Next message (by thread): [Python-Dev] Python 3.7: remove all private C functions from the Python C API?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]