[Python-Dev] collections.sortedtree (original) (raw)

Raymond Hettinger raymond.hettinger at gmail.com
Fri Mar 28 22:29:51 CET 2014

Previous message: [Python-Dev] collections.sortedtree
Next message: [Python-Dev] collections.sortedtree
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mar 26, 2014, at 1:31 PM, Marko Rauhamaa <marko at pacujo.net> wrote:

I have made a full implementation of a balanced tree and would like to know what the process is to have it considered for inclusion in Python 3.

To summarize, the implementation closely parallels dict() features and resides in collectionsmodule.c under the name collections.sortedtree. It uses solely the "<" operator to compare keys. I have chosen the AVL tree as an implementation technique.

FWIW, I think there may be room for a sorted collection somewhere in the standard library.

As others have said, the best place to start is by putting a module on PyPi to let it mature and to compete with other approaches to the problem.

Here are a few random thoughts on the over-all idea:

An AVL balanced tree isn't the only solution or necessarily the best solution to the problem. Tree nodes tend to take more space than denser structures and they have awful cache locality (these are the same reasons that deques use doubly-linked blocks rather than a plain doubly linked lists).
Another approach I have experimented with uses lazy sorting. That lets insertion be an O(1) step and incurs a one-time sorting cost upon the next lookup (and because Timsort exploits partial orderings, this can be very cheap). A lazy sorted list is dense and sequentially ordered in memory (reducing space overhead, taking full advantage of cache locality and memory controller auto-prefetch, and providing fast iteration speed by not needing to chase pointers). The lazy sort approach works best in applications that spend most of the time doing lookups and have only infrequent deletions and insertions.
The name of the tool probably should not be sortedtree. Ideally, the tool should be named for what it does, not how it does it (for the most part, users don't need to know whether the underlying implementation is a red-black tree, b-tree, judy array, sqlite database, or lazy list). That is why (I think) that Python dicts are called dicts rather than hash tables (the implementation) or an associative array (the computer science term for the abstract datatype).
There are plenty of data structures that have had good utility and popularity outside of the standard library. I that it is a good thing that blists, numpy arrays, databases, and panda dataframes all live outside the standard library. Those tools are easier to maintain externally and it is easier for you to keep control over the design. Remember the saying, "the standard library is where code goes to die" (and I would add that it should already be mature (or nearly dead) by the time it gets there).
That said, it is a reasonable possibility that the standard library would benefit from some kind sorted collection (the idea comes up from time to time).

Raymond Hettinger -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20140328/aba86088/attachment.html>

Previous message: [Python-Dev] collections.sortedtree
Next message: [Python-Dev] collections.sortedtree
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list