[Python-Dev] Drastically improving list.sort() for lists of strings/ints
Terry Reedy tjreedy at udel.edu
Sun Sep 11 16:48:50 EDT 2016
On 9/11/2016 3:45 AM, Elliot Gorokhovsky wrote:
> Hello all,
> I am interested in making a non-trivial improvement to list.sort(),
This is non-trivial, and harder than you seem to think it is.
> but before I put in the work, I want to test the waters and see if this is something the community would accept.
The debate on proposed enhancements is usually whether they are really enhancements, all things considered. For special-case speedups, the 'all things' include the frequency of the special cases, the ease of detecting them, the thoroughness of testing, and the maintainability of the proposed, and likely more complicated, code.
> Basically, I want to implement radix sort for lists of strings.
Radix sort was invented for fixed-length strings of digits, as in all-digit id 'numbers', so 10 bins. Ascii strings need 128 bins, general byte strings need 256, still manageable. General unicode strings require 1,114,112 bins, most of which will be empty for most character positions. This is harder to manage. So are variable-length strings.
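To make the binning concrete, here is a rough pure-Python sketch of an MSD radix sort for lists of bytes objects, using 257 bins (256 byte values plus one bin for keys that run out at the current position, which is one way to handle variable lengths). It is only an illustration of the idea, not the C-level code a patch would need:

def radix_sort_bytes(keys, pos=0):
    """MSD radix sort for a list of bytes objects.

    Bin 0 holds keys exhausted at this position (they sort first);
    bins 1..256 hold byte values 0..255.
    """
    if len(keys) <= 1:
        return list(keys)
    bins = [[] for _ in range(257)]
    for k in keys:
        if pos >= len(k):
            bins[0].append(k)           # shorter key sorts before any extension
        else:
            bins[k[pos] + 1].append(k)
    out = bins[0]                       # these keys are all equal, already in order
    for b in bins[1:]:
        out.extend(radix_sort_bytes(b, pos + 1))
    return out

data = [b"banana", b"band", b"ban", b"apple"]
assert radix_sort_bytes(data) == sorted(data)

Even this toy version shows the space problem: every recursion level allocates 257 lists whether or not they get used.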
In CPython 3.3+, strings at the C level are not just strings but are stored as 1, 2, or 4 bytes per character. So you could specifically target lists of bytes (and bytearrays) and lists of strings limited to 1-byte characters. The same code should pretty much work for both.
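From pure Python you can only approximate the check for those cases (a real patch would look at the compact string representation, PyUnicode_1BYTE_KIND, directly in C), but roughly it amounts to something like this sketch, with a made-up name:

def eligible_for_byte_radix(items):
    """Approximate check: every item is a bytes object, or every item
    is a str whose code points all fit in one byte."""
    if not items:
        return False
    if all(isinstance(x, bytes) for x in items):
        return True
    return all(isinstance(x, str) and (not x or max(map(ord, x)) < 256)
               for x in items)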
> ... In-place radix sort is significantly faster for lexicographic sorting than Timsort (or in general any comparison-based sort, since radix can beat the nlogn barrier).
This unqualified statement is doubly wrong.
First, with respect to sorting in general: 'asymptotically faster' only means faster for 'large enough' tasks. Many real world tasks are small, and big tasks get broken down into multiple small tasks. 'Asymptotically slower' algorithms may be faster for small tasks. Tim Peters investigated and empirically determined that an O(n*n) binary insort, as he optimized it on real machines, is faster than O(n*logn) sorting for up to around 64 items. So timsort uses binary insertion to sort up to 64 items. Similarly, you would have to investigate whether there is a size below which timsort is faster.
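For reference, the binary insertion strategy looks roughly like this in pure Python (the real code is C in Objects/listobject.c; this is just the shape of the algorithm):

import bisect

def binary_insertion_sort(a):
    """Sort a short list in place by binary insertion: O(n*logn)
    comparisons but O(n*n) data movement, which is cheap for small n."""
    for i in range(1, len(a)):
        x = a[i]
        # Find where a[i] belongs in the already-sorted prefix a[:i].
        j = bisect.bisect_right(a, x, 0, i)
        # Shift the tail of the prefix right and drop x into place.
        a[j + 1:i + 1] = a[j:i]
        a[j] = x
    return a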
Second, with respect to timsort in particular: timsort is designed to exploit structure and run faster than O(n*logn) in special cases. If a list is already sorted, timsort will do one O(n) scan and stop. Any radix sort will take several times longer. If a list is reverse sorted, timsort will do one O(n) scan and do an O(n) reverse. If a list is the concatenation of two sorted lists, timsort will find the two sorted sublists and merge them. If a sorted list has unsorted items appended to the end, timsort will sort the appended items and then do a merge. I expect any radix sort to be slower for all these cases. Tim Peters somewhere documented his experiments and results with various special but plausible cases of non-randomness.
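You can get a feel for this with a quick (unscientific) timeit experiment on strings; exact numbers will vary by machine, but sorting the already-sorted and mostly-sorted lists should be dramatically cheaper than sorting the random one, and a radix sort would not get that benefit:

import random
import string
import timeit

n = 100000
rand = ["".join(random.choice(string.ascii_lowercase) for _ in range(10))
        for _ in range(n)]
asc = sorted(rand)                  # already sorted: timsort does one O(n) scan
appended = asc[:-100] + rand[:100]  # sorted list with a few items tacked on the end

for label, data in [("random", rand), ("sorted", asc), ("appended", appended)]:
    t = timeit.timeit(lambda d=data: sorted(d), number=20)
    print("%9s: %.3fs for 20 sorts of %d strings" % (label, t, n))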
What you might propose is this: if the initial up or down sequence already detected by timsort is less than, say, 64 items, and the length of the list is more than some empirically determined value, and all items are bytes or 1-byte-char strings (and whatever further checks turn out to be needed pass), then switch to rsort. But you should start by putting a module on PyPI.
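In outline, and with thresholds that are placeholders to be measured rather than designed, that dispatch might look something like this (radix_sort here could be the byte-string sketch above; a real version would of course live inside list.sort() in C):

def hybrid_sort(items, radix_sort, min_size=1000, max_initial_run=64):
    """Sketch of the suggested dispatch: use the built-in sort unless the
    list is long, shows no useful pre-sorted run at the front, and every
    key is a bytes object."""
    n = len(items)
    # Length of the initial non-decreasing run, the structure timsort
    # would otherwise exploit for free.
    run = 1
    while run < n and items[run - 1] <= items[run]:
        run += 1
    if (n >= min_size
            and run <= max_initial_run
            and all(isinstance(x, bytes) for x in items)):
        return radix_sort(items)
    return sorted(items)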
-- Terry Jan Reedy