[Python-Dev] More compact dictionaries with faster iteration (original) (raw)
Gregory P. Smith greg at krypto.org
Mon Dec 10 07:38:19 CET 2012
- Previous message: [Python-Dev] More compact dictionaries with faster iteration
- Next message: [Python-Dev] More compact dictionaries with faster iteration
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Sun, Dec 9, 2012 at 5:44 PM, Raymond Hettinger < raymond.hettinger at gmail.com> wrote:
The current memory layout for dictionaries is unnecessarily inefficient. It has a sparse table of 24-byte entries containing the hash value, key pointer, and value pointer.
Instead, the 24-byte entries should be stored in a dense table referenced by a sparse table of indices. For example, the dictionary: d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'} is currently stored as: entries = [['--', '--', '--'], [-8522787127447073495, 'barry', 'green'], ['--', '--', '--'], ['--', '--', '--'], ['--', '--', '--'], [-9092791511155847987, 'timmy', 'red'], ['--', '--', '--'], [-6480567542315338377, 'guido', 'blue']] Instead, the data should be organized as follows: indices = [None, 1, None, None, None, 0, None, 2] entries = [[-9092791511155847987, 'timmy', 'red'], [-8522787127447073495, 'barry', 'green'], [-6480567542315338377, 'guido', 'blue']] Only the data layout needs to change. The hash table algorithms would stay the same. All of the current optimizations would be kept, including key-sharing dicts and custom lookup functions for string-only dicts. There is no change to the hash functions, the table search order, or collision statistics. The memory savings are significant (from 30% to 95% compression depending on the how full the table is). Small dicts (size 0, 1, or 2) get the most benefit. For a sparse table of size t with n entries, the sizes are: currsize = 24 * t newsize = 24 * n + sizeof(index) * t In the above timmy/barry/guido example, the current size is 192 bytes (eight 24-byte entries) and the new size is 80 bytes (three 24-byte entries plus eight 1-byte indices). That gives 58% compression. Note, the sizeof(index) can be as small as a single byte for small dicts, two bytes for bigger dicts and up to sizeof(Pyssizet) for huge dict. In addition to space savings, the new memory layout makes iteration faster. Currently, keys(), values, and items() loop over the sparse table, skipping-over free slots in the hash table. Now, keys/values/items can loop directly over the dense table, using fewer memory accesses. Another benefit is that resizing is faster and touches fewer pieces of memory. Currently, every hash/key/value entry is moved or copied during a resize. In the new layout, only the indices are updated. For the most part, the hash/key/value entries never move (except for an occasional swap to fill a hole left by a deletion). With the reduced memory footprint, we can also expect better cache utilization. For those wanting to experiment with the design, there is a pure Python proof-of-concept here: http://code.activestate.com/recipes/578375 YMMV: Keep in mind that the above size statics assume a build with 64-bit Pyssizet and 64-bit pointers. The space savings percentages are a bit different on other builds. Also, note that in many applications, the size of the data dominates the size of the container (i.e. the weight of a bucket of water is mostly the water, not the bucket).
+1 I like it. The plethora of 64-bit NULL values we cart around in our internals bothers me, this gets rid of some.
The worst case of ~30% less memory consumed for the bucket (ignoring the water) is still decent. For a program with a lot of instances of small to medium sized objects I'd expect a measurable win.
I'm interested in seeing performance and memory footprint numbers from a C implementation of this.
-gps -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20121209/ae6f53fd/attachment.html>
- Previous message: [Python-Dev] More compact dictionaries with faster iteration
- Next message: [Python-Dev] More compact dictionaries with faster iteration
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]