Issue 9685: tuples should remember their hash value


Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Anthony Sottile, benjamin.peterson, christian.heimes, dtorp, georg.brandl, mark.dickinson, meador.inge, python-dev, rhettinger, serhiy.storchaka, tim.peters
Priority: low Keywords: easy, patch

Created on 2010-08-25 19:33 by dtorp, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
tuplehash.patch christian.heimes, 2013-01-07 09:10
Messages (14)
msg114929 - Author: David Albert Torpey (dtorp) Date: 2010-08-25 19:33
Dictionary keys are commonly numbers, strings, or tuples. Python has optimized numbers and strings to remember their hash values on successive calls. Tuples should do this too, since their recursive hash function can take a long time to compute. Tuples are Python's official record type and the one obvious way of making non-scalar dictionary keys. The code to do this in stringobject.c is short and sweet, so this major speed boost should be an easy thing to do.

```c
static long
string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)              /* <== */
        return a->ob_shash;             /* <== */
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)                        /* <== */
        x = -2;                         /* <== */
    a->ob_shash = x;
    return x;
}
```

The code in tupleobject.c would just need to add the four lines marked above. Here's what it looks like now:

```c
static long
tuplehash(PyTupleObject *v)
{
    register long x, y;
    register Py_ssize_t len = Py_SIZE(v);
    register PyObject **p;
    long mult = 1000003L;

    x = 0x345678L;
    p = v->ob_item;
    while (--len >= 0) {
        y = PyObject_Hash(*p++);
        if (y == -1)
            return -1;
        x = (x ^ y) * mult;
        /* the cast might truncate len; that doesn't change hash stability */
        mult += (long)(82520L + len + len);
    }
    x += 97531L;
    if (x == -1)
        x = -2;
    return x;
}
```

Thank you guys for all of your work.

*David
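[Editorial note: an illustrative Python micro-benchmark, not part of the original message, contrasting the two behaviors David describes; the object sizes are arbitrary.]

```python
import timeit

# str caches its hash after the first call; tuple recomputes the full
# recursive hash on every call.
deep = tuple(tuple(range(100)) for _ in range(100))   # 10,000 ints total
s = "x" * 10_000

print(timeit.timeit(lambda: hash(deep), number=10_000))  # pays full cost each time
print(timeit.timeit(lambda: hash(s), number=10_000))     # near-free after first hash
```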
msg114930 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-25 19:36
This seems reasonable. Will look at it in the next few days.
msg114932 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2010-08-25 20:00
- Tuple objects don't currently reserve space to store their hash code, so it's likely this would increase the size of every tuple.

- It's unclear to me which natural use patterns would actually enjoy a major speed boost. Note that dicts remember the hash codes of keys already, regardless of whether the key type remembers them too. A tuple is typically constructed right before being used in a dict lookup, so for a one-shot use no time would be saved. If the new tuple is used in multiple dict lookups, sure - but is that common?
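[Editorial note: a small sketch of the dict behavior Tim describes; CountingKey is a hypothetical class added for illustration.]

```python
# A dict keeps each stored key's hash internally: resizing never re-hashes
# stored keys, but every fresh lookup still hashes the probe key.
class CountingKey:
    def __init__(self):
        self.hash_calls = 0

    def __hash__(self):
        self.hash_calls += 1
        return 12345

    def __eq__(self, other):
        return self is other


k = CountingKey()
d = {k: "value"}
assert k.hash_calls == 1      # hashed once on insertion

for i in range(10_000):       # force several dict resizes
    d[i] = i
assert k.hash_calls == 1      # resizes reused the stored hash

d[k]                          # but each new lookup hashes the key again
assert k.hash_calls == 2
```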
msg114936 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-25 21:37
FWIW, I'm -1 on this without a demonstrable improvement on some real-world cases.
msg114937 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-08-25 22:04
Hello Tim! If you have a chance, please also take a look at which I was planning to work on in the next couple of weeks.

For memoizing tuple hashes, I'm inclined to think the one extra field is worth it. That would help all the cases where people are concerned about double accesses to dicts in a look-before-you-leap pattern, or for a pattern of fetch-item, update-value, store-new-item.

It looks like the code for collections.OrderedDict() would benefit because it does multiple lookups and stores on the same key: http://svn.python.org/view/python/branches/release27-maint/Lib/collections.py?revision=84148&view=markup

It would also help the multiple lookups and stores in caching code such as that at http://code.activestate.com/recipes/498245-lru-and-lfu-cache-decorators

I suppose we could prepare a patch, instrument it, and try it with Twisted, SQLAlchemy, and Django to find out how many tuple hash calculations would be saved by memoizing.
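[Editorial note: a sketch making the double hashing in the look-before-you-leap pattern concrete; CountingTuple is hypothetical, not from the thread.]

```python
# The common look-before-you-leap pattern hashes the same tuple more than
# once, which is exactly what a memoized hash would save.
class CountingTuple(tuple):
    hash_calls = 0

    def __hash__(self):
        CountingTuple.hash_calls += 1
        return super().__hash__()


counts = {}
key = CountingTuple(("alpha", 42))
if key in counts:        # hash call 1
    counts[key] += 1     # would be hash calls 2 and 3 (getitem + setitem)
else:
    counts[key] = 1      # hash call 2
print(CountingTuple.hash_calls)  # 2
```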
msg114939 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2010-08-25 22:14
2010/8/25 Raymond Hettinger <report@bugs.python.org>:
> I suppose we could prepare a patch, instrument it, and try it with Twisted, SQLAlchemy, and Django to find out how many tuple hash calculations would be saved by memoizing.

You should also check memory usage.
msg179202 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-06 18:32
Sorry, Raymond. It was a bad day for Roundup.
msg179235 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-06 22:38
My apologies once again, Raymond. I mistakenly thought that I had unassigned the issue from you (there was a Roundup bug that day). As for the issue, I totally agree with Tim.
msg179248 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2013-01-07 08:40
Given the responses so far, I suggest closing this as rejected.
msg179250 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-01-07 09:10
I'm not too worried about the slightly increased memory usage. For example, one of our largest application instances consumes about 8 GB memory right now. It has just about 22k tuples in gc.get_objects(). An additional Py_hash_t in tuple's struct would increase the memory usage by less than 200kB. I've attached a simple patch.
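[Editorial note: a quick check of that estimate, assuming an 8-byte Py_hash_t on a 64-bit build.]

```python
# One extra Py_hash_t per live tuple.
n_tuples = 22_000            # ~22k tuples reported by gc.get_objects()
extra_bytes = n_tuples * 8   # 8-byte Py_hash_t on a 64-bit build
print(extra_bytes / 1024)    # ~171.9 KiB, i.e. "less than 200kB"
```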
msg179254 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-01-07 09:59
Still, actual benefits in some kind of benchmark will be needed to show that this is not a premature optimization.
msg179277 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-01-07 19:27
Benchmark doesn't show any serious improvements. Some test cases are even slower.

```
Report on Linux freimann 3.2.0-35-generic #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012 x86_64 x86_64
Total CPU cores: 6

### call_simple ###
Min: 0.201824 -> 0.208248: 1.03x slower
Avg: 0.210608 -> 0.217300: 1.03x slower
Significant (t=-2.23)
Stddev: 0.00818 -> 0.00829: 1.0134x larger
Timeline: http://tinyurl.com/axqoqp4

### go ###
Min: 0.534641 -> 0.550004: 1.03x slower
Avg: 0.537874 -> 0.552782: 1.03x slower
Significant (t=-11.89)
Stddev: 0.00184 -> 0.00211: 1.1495x larger
Timeline: http://tinyurl.com/b5k3ua4

### pathlib ###
Min: 0.121589 -> 0.117025: 1.04x faster
Avg: 0.126679 -> 0.122279: 1.04x faster
Significant (t=3.64)
Stddev: 0.00429 -> 0.00427: 1.0048x smaller
Timeline: http://tinyurl.com/acbb69o

### spectral_norm ###
Min: 0.280749 -> 0.305213: 1.09x slower
Avg: 0.281194 -> 0.305390: 1.09x slower
Significant (t=-120.69)
Stddev: 0.00044 -> 0.00011: 4.1101x smaller
Timeline: http://tinyurl.com/awyeejp

The following not significant results are hidden, use -v to show them:
call_method, call_method_slots, call_method_unknown, chaos, fannkuch, fastpickle,
fastunpickle, float, formatted_logging, iterative_count, json_dump_v2, json_load,
meteor_contest, nbody, normal_startup, nqueens, pidigits, raytrace, regex_compile,
regex_effbot, regex_v8, richards, silent_logging, simple_logging, startup_nosite,
telco, threaded_count, unpack_sequence
```
msg179281 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-07 20:24
New changeset 17038de56fd4 by Christian Heimes in branch 'default':
Add a comment about *not* caching the hash value. Issue #9685 suggested to memorize the hash value, but the feature request was rejected because no speed ups were found.
http://hg.python.org/cpython/rev/17038de56fd4
msg364258 - Author: Anthony Sottile (Anthony Sottile) * Date: 2020-03-15 20:46
hit this today unfortunately -- I'm working with some pretty complex (and nested) `typing.NamedTuple` objects and the lack of caching here results in quite the slowdown (in my macro benchmarks it's the difference between a render pass taking 180ms and 25ms)

the class in question is being used as a cache key: https://github.com/asottile/babi/blob/1be4e80eddc1bff0eb8047cc89337fdf006ad148/babi/highlight.py#L217

with the hash cached:

```
μs      event
123744  startup
27833   kEND5
2859    ^X
```

without the hash cached:

```
μs      event
122575  startup
180415  kEND5
3725    ^X
```

(my hash cache of course being slower than it could be in C)

```diff
@@ -214,10 +214,21 @@ class Region(NamedTuple):
     scope: Scope
 
 
+_state_hash_cache = {}
+
+
 class State(NamedTuple):
     entries: Tuple['Entry', ...]
     while_stack: Tuple[Tuple['WhileRule', int], ...]
 
+    def __hash__(self):
+        k = id(self)
+        try:
+            return _state_hash_cache[k]
+        except KeyError:
+            ret = _state_hash_cache[k] = super().__hash__()
+            return ret
+
     @classmethod
     def root(cls, entry: 'Entry') -> 'State':
         return cls((entry,), ())
```

___

I get that this is a pretty extreme case and it's unlikely to change the resolution of this issue, but figured I'd point it out in case anyone else is hitting similar issues
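[Editorial note: one caveat for anyone copying the workaround above is that keying the cache on id(self) keeps entries alive forever and can hand back a stale hash once an id is reused after garbage collection. The sketch below is an alternative, not from the thread: it gives up NamedTuple for a plain __slots__ class and computes the hash once, eagerly; the field names mirror Anthony's State class.]

```python
class State:
    __slots__ = ('entries', 'while_stack', '_hash')

    def __init__(self, entries, while_stack):
        self.entries = entries
        self.while_stack = while_stack
        # pay the recursive tuple hash exactly once, at construction time
        self._hash = hash((entries, while_stack))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, State):
            return NotImplemented
        return (self.entries, self.while_stack) == (other.entries, other.while_stack)
```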
History
Date User Action Args
2022-04-11 14:57:05 admin set github: 53894
2020-03-15 20:46:20 Anthony Sottile set nosy: + Anthony Sottile; messages: +
2013-01-07 20:24:29 python-dev set nosy: + python-dev; messages: +
2013-01-07 19:35:41 benjamin.peterson set status: open -> closed; resolution: rejected
2013-01-07 19:27:45 christian.heimes set messages: +
2013-01-07 09:59:45 georg.brandl set messages: +
2013-01-07 09:10:28 christian.heimes set files: + tuplehash.patch; nosy: + christian.heimes; messages: +; keywords: + patch; stage: needs patch -> patch review
2013-01-07 08:40:53 mark.dickinson set nosy: + mark.dickinson; messages: +
2013-01-06 22:38:11 serhiy.storchaka set messages: +
2013-01-06 21:29:17 rhettinger set assignee: rhettinger ->
2013-01-06 18:32:22 serhiy.storchaka set assignee: rhettinger; messages: +; nosy: + serhiy.storchaka
2013-01-06 17:53:44 georg.brandl set nosy: + georg.brandl
2013-01-05 21:47:47 meador.inge set nosy: + meador.inge
2013-01-04 15:37:26 serhiy.storchaka set type: resource usage -> performance; components: + Interpreter Core; versions: + Python 3.4, - Python 3.2
2010-10-31 16:28:58 rhettinger set assignee: rhettinger -> (no value)
2010-08-25 22:14:50 benjamin.peterson set messages: +
2010-08-25 22:04:14 rhettinger set messages: +
2010-08-25 21:37:33 benjamin.peterson set nosy: + benjamin.peterson; messages: +
2010-08-25 20:00:02 tim.peters set nosy: + tim.peters; messages: +
2010-08-25 19:36:11 rhettinger set priority: normal -> low; assignee: rhettinger; versions: + Python 3.2, - Python 2.6; keywords: + easy; nosy: + rhettinger; messages: +; stage: needs patch
2010-08-25 19:33:15 dtorp create