bpo-28685: Optimize sorted() list.sort() with type-specialized comparisons by embg · Pull Request #582 · python/cpython

Description of the optimization (see also this poster)

The idea is simple: in practice, it's very uncommon to sort type-heterogeneous lists. This is because lists in general tend to be used in a homogeneous way (if you're iterating and the type is changing, your code may break, depending on what you're doing), and because comparison is often not defined in the heterogeneous context ("apples and oranges").

So, instead of checking types during every single compare in the sort (dynamic dispatch), we can simply iterate once in a pre-sort check and see if the list is type-homogeneous. If it is, we can replace PyObject_RichCompareBool with whatever compare function would have ended up being dispatched for that type. Since this check is cheap and very unlikely to fail, and checking types every time we compare is expensive, this is a reasonable optimization to consider.

This is, however, only the beginning of what's possible. Namely, there are many safety checks that have to be performed during every compare for the common cases (string, int, float, tuple) that one encounters in practice. For example, character width has to be checked for both strings every time two strings are compared. Since these checks almost never fail in practice (e.g., non-latin strings are uncommon), we can move them out of the comparison function and into the pre-sort check as well. We then write special-case compare functions (I implemented one for each of the four types mentioned above) that are selected if and only if the assumptions necessary to use them have been verified for every list element.
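
To make the shape of the pre-sort check concrete, here is a minimal standalone sketch in plain C. It is not CPython code: the tagged Val type, select_compare, and the compare functions are hypothetical stand-ins for PyObject, the pre-sort loop, and the specialized compares in the patch. The point it illustrates is that one O(n) pass selects a compare function, after which the sort never checks types again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tagged value standing in for PyObject (sketch only). */
typedef enum { T_INT, T_FLOAT } Tag;
typedef struct { Tag tag; union { long i; double f; } v; } Val;

typedef int (*cmp_fn)(const Val *, const Val *);

/* Specialized compares: no type checks, just the raw comparison. */
static int cmp_int(const Val *a, const Val *b) {
    return (a->v.i > b->v.i) - (a->v.i < b->v.i);
}
static int cmp_float(const Val *a, const Val *b) {
    return (a->v.f > b->v.f) - (a->v.f < b->v.f);
}

/* Generic compare: dynamic dispatch on every call -- the cost we avoid. */
static int cmp_generic(const Val *a, const Val *b) {
    double x = a->tag == T_INT ? (double)a->v.i : a->v.f;
    double y = b->tag == T_INT ? (double)b->v.i : b->v.f;
    return (x > y) - (x < y);
}

/* The pre-sort check: one pass over the list. If every element has the
   first element's type, return the specialized compare; otherwise fall
   back to the generic one (bailing out at the first mismatch). */
static cmp_fn select_compare(const Val *v, size_t n) {
    if (n == 0)
        return cmp_generic;
    Tag t = v[0].tag;
    for (size_t i = 1; i < n; i++)
        if (v[i].tag != t)
            return cmp_generic;
    return t == T_INT ? cmp_int : cmp_float;
}
```

The check costs n tag comparisons up front; every subsequent compare in the sort (of which there are O(n log n)) then skips the dispatch entirely, which is where the payoff comes from.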

Benchmarks

I considered two sets of benchmarks: one organized by type (random lists of that type), and one organized by structure. Full benchmark scripts can be found here. The results are below (standard deviations were less than 0.3% of the mean for all measurements):

By type

| Type | Percent improvement on random lists of [type] (1 - patched/unpatched) |
| --- | --- |
| heterogeneous (lots of float with an int at the end; worst case) | -1.5% |
| float | 48% |
| bounded int (magnitude smaller than 2^32) | 48.4% |
| latin string (all characters in [0, 255]) | 32.7% |
| general int (reasonably uncommon?) | 17.2% |
| general string (reasonably uncommon?) | 9.2% |
| tuples of float | 63.2% |
| tuples of bounded int | 64.8% |
| tuples of latin string | 55.8% |
| tuples of general int | 50.3% |
| tuples of general string | 44.1% |
| tuples of heterogeneous | 41.5% |

By structure

These are just the benchmarks described in Objects/listsort.txt. The first table shows the loss we experience if we sort structured heterogeneous lists. This is the worst case: the list is already sorted, so we go all the way through doing n type-checks, and then we only end up doing n comparisons. Tragic, but extremely unlikely in practice. Normally we would find the first heterogeneous element early in the pre-sort check and break out; here, the single, lonely float is hiding all the way at the end of the list of ints, so we don't find it until we've done all n checks:

| Benchmark (for heterogeneous lists, worst case) | Percent improvement (1 - patched/unpatched) |
| --- | --- |
| \sort | -17.2% |
| /sort | -19.8% |
| 3sort | -18.0% |
| +sort | -18.8% |
| %sort | -10.0% |
| ~sort | -2.1% |
| =sort | -21.3% |

The second table is the same benchmark, but on homogeneous lists (int):

| Benchmark (for homogeneous lists) | Percent improvement (1 - patched/unpatched) |
| --- | --- |
| \sort | 54.6% |
| /sort | 56.5% |
| 3sort | 53.5% |
| +sort | 55.3% |
| %sort | 52.4% |
| ~sort | 48.0% |
| =sort | 45.2% |

Patch summary

Here we describe at a high level what each section of the patch does:

| Line numbers in Objects/listobject.c | What the lines do |
| --- | --- |
| 1053-1069 | Define a struct to hold the function pointers we will select in the pre-sort check. This struct then has to be passed in to every function that performs a comparison (to keep things in local scope). |
| 1075-1080 | Compare function for heterogeneous lists; just a wrapper for PyObject_RichCompareBool. Selected if all of our pre-checks fail. |
| 1086-1108 | Compare function for general homogeneous lists; just a wrapper for ob_type->tp_richcompare, which is stored by the pre-sort check at compare_funcs.key_richcompare. This yields a modest optimization (in the neighbourhood of 10%), but we generally hope we can do better. |
| 1111-1127 | Compare function for lists of latin strings. During the pre-sort check, we verify that every string in the list uses one character per byte; otherwise, we default to the general homogeneous compare. If this check is even somewhat likely to pass, it's worth it, because the payoff is large, as can be seen in the Benchmarks section. The compare function directly accesses the data buffers of the two strings and memcmps them. |
| 1130-1154 | Compare function for lists of bounded longs. During the pre-sort check, we verify that every int in the list fits in a single machine word. If that check passes, we can use this optimized compare function, which directly compares the machine words representing the two ints (taking sign into account). This is faster than the general comparison, which has to figure out which word is most significant for both inputs, etc., in addition to all the type-checking. |
| 1157-1166 | Compare function for lists of floats. Assumes nothing; just directly compares the two floats, skipping all the unnecessary type-checking. Because PyFloat_Type->tp_richcompare does a lot of type-checking that we want to move out of the sort loop, it pays to have this optimized compare available. |
| 1173-1233 | Compare function for lists of non-empty tuples. Tuple comparison is optimized on two levels. Namely, after selecting compare_funcs.key_compare in the pre-sort check, we run the pre-sort check again on the list T = [x[0] for x in L] (we don't actually run the check twice, but we do something functionally equivalent). If T is type-homogeneous, or, even better, satisfies the requirements for one of our special-case compares, we can replace the call to PyObject_RichCompareBool for the first tuple element with a call to compare_funcs.tuple_elem_compare. This allows us to bypass two levels of wasteful safety checks. If the first elements of the two tuples are equal, of course, we have to call PyObject_RichCompareBool on subsequent elements; the idea is that this is uncommon in practice. |
| 2168-2212 | First part of the pre-sort check: we set the variables key_type, keys_are_all_same_type, ints_are_bounded, strings_are_latin, and keys_are_in_tuples (which is 1 if and only if every list element is a non-empty tuple, in which case all the other variables refer to the list [x[0] for x in L]). |
| 2215-2243 | Second part of the pre-sort check: given values for those variables, select the appropriate compare function. If keys_are_in_tuples and key_type != &PyTuple_Type, then use the other variables to select compare_funcs.tuple_elem_compare, and set compare_funcs.key_compare = unsafe_tuple_compare. |
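
To give a feel for the latin-string fast path, here is a minimal standalone sketch, not the patch itself: the real code reads CPython's compact one-byte-per-character string buffers, while this illustration uses a hypothetical LatinStr struct holding a plain byte buffer and a length. Once the pre-sort check has guaranteed one byte per character, comparison reduces to a memcmp with length as the tie-breaker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a compact latin-1 string (sketch only). */
typedef struct { const unsigned char *buf; size_t len; } LatinStr;

/* Compare two one-byte-per-character strings: memcmp the common
   prefix, then break ties on length (the shorter string sorts first). */
static int latin_compare(const LatinStr *a, const LatinStr *b) {
    size_t n = a->len < b->len ? a->len : b->len;
    int r = memcmp(a->buf, b->buf, n);
    if (r != 0)
        return r < 0 ? -1 : 1;
    return (a->len > b->len) - (a->len < b->len);
}
```

This is why the pre-check pays off so heavily for strings: memcmp over raw bytes replaces a per-compare dance of kind checks, width checks, and codepoint-by-codepoint comparison.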
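
The bounded-int fast path can be sketched similarly. This is an illustration, not the patch: the hypothetical BoundedInt struct loosely mirrors how a small CPython int is one digit plus a sign, and the compare reconstructs a signed machine word from each before comparing, exactly the "taking sign into account" step described above.

```c
#include <assert.h>

/* Hypothetical stand-in for a single-word int: one magnitude digit
   plus a sign in {-1, 0, 1} (loosely mirroring ob_digit / Py_SIZE). */
typedef struct { long digit; int sign; } BoundedInt;

/* Reconstruct the signed value; valid because the pre-sort check
   guaranteed the magnitude fits in one machine word. */
static long as_word(const BoundedInt *x) {
    return x->sign < 0 ? -x->digit : x->digit;
}

static int bounded_int_compare(const BoundedInt *a, const BoundedInt *b) {
    long x = as_word(a), y = as_word(b);
    return (x > y) - (x < y);
}
```

The general path, by contrast, must first compare digit counts to find the more significant word, then walk digits; the pre-check makes all of that statically unnecessary.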
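
Finally, the two-level tuple strategy can be sketched as follows. Again a standalone illustration under simplifying assumptions: tuples of machine ints only, with elem_compare standing in for the specialized compare_funcs.tuple_elem_compare chosen by the pre-sort check, and the element loop standing in for the PyObject_RichCompareBool fallback on later elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical int-only tuple (sketch only). */
typedef struct { const long *items; size_t len; } Tuple;

/* Stand-in for the specialized first-element compare selected by the
   pre-sort check (compare_funcs.tuple_elem_compare in the patch). */
static int elem_compare(long a, long b) {
    return (a > b) - (a < b);
}

static int unsafe_tuple_compare(const Tuple *a, const Tuple *b) {
    /* Fast path: specialized compare on the first elements. */
    int r = elem_compare(a->items[0], b->items[0]);
    if (r != 0)
        return r;
    /* Slow path (uncommon by assumption): compare remaining elements,
       standing in for the general PyObject_RichCompareBool fallback. */
    size_t n = a->len < b->len ? a->len : b->len;
    for (size_t i = 1; i < n; i++) {
        r = elem_compare(a->items[i], b->items[i]);
        if (r != 0)
            return r;
    }
    /* All shared elements equal: the shorter tuple sorts first. */
    return (a->len > b->len) - (a->len < b->len);
}
```

The bet encoded here is that first elements almost always differ, so nearly every compare takes only the fast path; that is why the "tuples of X" rows in the benchmarks improve even more than plain lists of X.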

Selected quotes from the python-ideas thread

Terry Reedy:

Do reference this thread, and quote Tim's approval in principle, if he did not post on the tracker.

Tim Peters:

Would someone please move the patch along? I expect it's my fault it's languished so long, since I'm probably the natural person to review it, but I've been buried under other stuff.

But the patch doesn't change anything about the sorting algorithm itself - even shallow knowledge of how timsort works is irrelevant. It's just plugging in a different bottom-level object comparison function when that appears valuable.

I've said from the start that it's obvious (to me ;-) ) that it's an excellent tradeoff. At worst it adds one simple (pre)pass over the list doing C-level pointer equality comparisons. That's cheap. The worst-case damage is obviously small, the best-case gain is obviously large, and the best cases are almost certainly far more common than the worst cases in most code.

Later in that message, Tim also pointed out a bug, which has been fixed in this version of the patch.

https://bugs.python.org/issue28685