[Python-Dev] Drastically improving list.sort() for lists of strings/ints (original) (raw)

Chris Angelico rosuav at gmail.com
Sun Sep 11 12:14:06 EDT 2016


On Sun, Sep 11, 2016 at 5:45 PM, Elliot Gorokhovsky <elgo8537 at colorado.edu> wrote:

I am interested in making a non-trivial improvement to list.sort(), but before I put in the work, I want to test the waters and see if this is something the community would accept. Basically, I want to implement radix sort for lists of strings. So list.sort() would detect if it is sorting a list of strings (which is one of the more common things you sort in python) and, if so, use in-place radix sort (see https://xlinux.nist.gov/dads/HTML/americanFlagSort.html). In-place radix sort is significantly faster for lexicographic sorting than Timsort (or in general any comparison-based sort, since radix can beat the nlogn barrier). If you don't believe that last claim, suppose for the sake of the argument that it's true (because if I actually implemented this I could supply benchmarks to prove it).

I'd like to see these benchmarks, actually. Sounds interesting. How does it fare on different distributions of data - for instance, strings consisting exclusively of ASCII digits and punctuation eg "01:12:35,726 --> 01:12:36,810", or strings consisting primarily of ASCII but with occasional BMP or astral characters, or strings primarily of Cyrillic text, etc? What if every single string begins with a slash character (eg if you're sorting a bunch of path names)? At what list size does it become visibly faster?

Could this be put onto PyPI and then benchmarked as lst.sort() vs flagsort(lst) ?

ChrisA



More information about the Python-Dev mailing list