Issue 1772851: Decimal and long hash, compatibly and efficiently (original) (raw)

Created on 2007-08-13 04:48 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
decimal_hash.patch mark.dickinson,2007-08-13 04:48 New Decimal.__hash__ and long.__hash__
long_hash.patch mark.dickinson,2007-09-15 04:06
decimal_hash_v2.patch mark.dickinson,2007-09-15 04:07
Messages (7)
msg53024 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-08-13 04:48
Make hash of int/long periodic with period 2**32-1 (or 2**64-1 on 64-bit systems). This is makes it possible to define an efficient Decimal hash. Hash for int/long is already close to periodic with period 2**32-1; it's a small change to make it genuinely periodic. With this change, it's possible to write a Decimal __hash__ such that: (1) hash(Decimal(n))==hash(n) for all integers n, and (2) Decimal hash can be computed without gross inefficiencies. The current implementation of Decimal hash is almost unusable for very large Decimals: hash(Decimal("1E999999999")) first converts its argument to a long, taking gigabytes of memory and a lot of time. It's simple to satisfy either (1) or (2) above, but it seems impossible to satisfy both without touching the long hash. The patch alters int_hash and long_hash to make them periodic, adds some tests to Lib/test/test_hash.py, and implements an efficient Decimal.__hash__. If there's any chance of the patch being accepted I'll write some additional tests in Lib/test/test_decimal.py to check hash compatibility between longs and Decimals. This patch fixes (most of) bug 1770416.
msg55845 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2007-09-12 17:09
IMHO this patch should be considered for (at least) py3k, in which long becomes the new int. As there is current interest in long/int performance[1] this seems like a good time to raise this kind of issue. Mark, could you please outline the semantic changes (specially at wraparound)? Would the hash value changes happen only in "wraparound (= subtraction + of 2**(bits_in_long)) by incrementing" cases? Would hash(-1) become -1? Sorry to ask all that, but my C (and maths) is (are) *very* limited. I guess an invalid hash value is the reason to have hash(-1) == hash(-2) == -2 currently, if so the period would have to skip that, right? I don't see the need for "y" and would chain some IFs (if x == -1 OR abs(x) == ULONG_MAX), but those are weak guesses. Maybe it's worth discussing this patch without linking its most important change (long hash becoming periodic) to Decimal: it would have many impacts beyond Decimal and is rather a core change (as opposed to Lib). Also, a recent discussion on hashing[2] shows that sometimes the current behavior was well planned :) 1 - http://mail.python.org/pipermail/python-3000/2007-September/010374.html 2 - http://mail.python.org/pipermail/python-3000/2007-September/010327.html
msg55872 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-12 20:54
To help explain what's going on, here's some Python code. The Python function long_hash1 below has the properties that: (1) long_hash1(n) == hash(n) for almost all integers n, with a very small set of exceptions. (The first positive exception is 2**33-1, and the next after that is 3*2**33 - 1. The negative exceptions have the same absolute value as the positive ones, so the first negative exception is n = 1-2**33). (2) long_hash1(n) has a simple closed form, making it possible to compute an equivalent Decimal hash efficiently. (3) The current long_hash in Objects/longobject.c can be changed, by adding a single line of C code, so that hash(n) == long_hash1(n) always. That line is: if ((unsigned long)x < v->ob_digit[i]) x++; added directly after the addition. Explanation: the exceptions in (1) arise precisely when the addition x += v->ob_digit[i] in the long_hash code overflows (in the *unsigned* sense---equivalently, when the addition turns a negative x into a nonnegative one). Since ob_digit[i] only has 15 significant bits, and x has 32 (or 64), such overflow is going to be rare---it'll occur for roughly one addition in every 2**18 (or about 4 additions in every million), for `random' n. So for `most' n, hash(n) and long_hash1(n) are equal. So what about long_hash2(n)? This is what the patched long_hash is actually equivalent to; it's essentially the same as long_hash1(n), but fixed up to be genuinely periodic; that is, long_hash1(n+ULONG_MAX) == long_hash1(n) for any integer n. long_hash1 and long_hash2 are equal almost always for positive n (the exceptions being multiples of ULONG_MAX), equal about half the time for negative n, and off-by-one from each other about half the time. I don't really know whether the periodicity is that useful---the predictability is really what matters, so the 1-line change to produce a hash function equivalent to long_hash1 would do just fine as far as making Decimal work. For what it's worth, I regard this patch as an ugly hack. I don't like the idea of changing something as fundamental as long.__hash__, and I don't like having Decimal.__hash__ depend on knowing exactly how long.__hash__ gets its values. But there's a real problem here, namely that, for example, >>> set([Decimal("1E1000000")]) takes several minutes to complete. (Try it!) And I can't think of any other easy way to solve this. Alternative suggestions welcomed! LONG_BITS = 32 W = 2**LONG_BITS HW = W // 2 ULONG_MAX = W - 1 def long_hash1(n): if n == 0: h = 0 else: h = (1 + (abs(n)-1) % ULONG_MAX) * (1 if n > 0 else -1) # reduce mod W to lie in the usual range for a (signed) C long h = (h + HW) % W - HW if h == -1: h = -2 return int(h) def long_hash2(n): h = n % ULONG_MAX h = (h + HW) % W - HW if h == -1: h = -2 return int(h)
msg55875 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2007-09-12 22:28
Thanks a lot for the explanation. I still think it could be valuable (C?)py3k change (but now for the predictability, as you say :) Regarding the performance of Decimal.__hash__: I believe (int) hash performance would be a moot issue if Decimal.__int__ was (much) faster. I've been, for the last three days, trying to convert decimal.py to use longs instead of tuples for Decimal._int. Here are some (hand picked) results (on a Celeron M 1.46GHz): In [906]: D = int_decimal.Decimal (...) In [914]: g = D("1E1000000") (...) In [916]: g Out[916]: Decimal("1E+1000000") (...) In [918]: %timeit k = int(g) 10 loops, best of 3: 3.22 s per loop (...) In [920]: %timeit k = hash(int(g)) 10 loops, best of 3: 3.28 s per loop (...) In [922]: log(int(g)) Out[922]: 2302585.0929940455 In [923]: log(10**1000000) Out[923]: 2302585.0929940455 I'm not very competent in many skills needed for the job AND I'm working between field trips (next starts tomorrow), so, unfortunately, the results are crude and awfully buggy so far. Most "(...)" above were exceptions for simple things like "hash(g)" and there are lots of these issues. The conversion is based on using divisions and multiplications to truncate/extend Decimal._int, from the notion that the bottleneck is the tuple->str->int conversion. I predict lots of currently fast things getting slower and a higher memory footprint, but it might still be worth it and I'm willing to test the idea. As I told Facundo in a private email (y otro llegará hoy o mañana), the simple cases seem workable for me, but I'm not able to tell right now if this has any real chances of succeeding (because I'm bad at maths :/). I've not even looked at __pow__, __mul__ and __div__ yet! Anyway, I'll try to tidy things up as much as I can and email Facundo (and yourself, if you are interested) the results of this large ugly hack before now + 24h.
msg55924 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-15 04:06
Here's a revised patch, that only touches longobject.c and does the minimum necessary to alter long.__hash__ from being *almost* completely predictable to being completely predictable. To summarize the effects of this patch: - the current long.__hash__ is predictable except for a very small number of (unpredictable) inputs; this patch makes it predictable for all inputs, thereby making it possible to define an efficient Decimal.__hash__. To be precise, the patched long.__hash__ has the property that it's periodic of period ULONG_MAX in the ranges [1, infinity) and (-infinity, -1]. - the value of hash(n) is unchanged for any n that's small enough to be representable as an int, and also unchanged for the vast majority of long integers n of reasonable size. (For integers up to 450 digits long, the new hash value differs from the old for around 1 in every 2500 integers on a 32-bit system; on a 64-bit system the figure is 1 in every 10000 billion.) - On my test system (Dual Xeon 2.8GHz/SuSE Linux 10.2/gcc 4.1), using timeit.timeit('hash(n)') there's no measurable slowdown for small integers. Unfortunately there *is* slowdown for larger integers: around 20% slowdown for 100 digit integers and around 70% extra time for 1000 digit integers, though in all cases the times are in fractions of a microsecond. It doesn't help that gcc doesn't optimize the inner loop perfectly: with the -O3 flag, the addition and subsequent correction are compiled to (with x in eax and v->ob_digit[i] in edx): addl %eax, %edx cmpl %eax, %edx adcl $0, %edx and the cmpl here is entirely redundant. Maybe there's some way of writing the C code that makes it easier for gcc to pick up on this optimization?
msg55925 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2007-09-15 04:07
And here's the updated decimal.__hash__ patch that goes with the long.__hash__ patch.
msg56043 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2007-09-19 17:57
Applied the patchs long_hash.patch (rev 58208) and decimal_hash_v2.patch (rev 58211)
History
Date User Action Args
2022-04-11 14:56:25 admin set github: 45306
2007-09-19 17:57:19 facundobatista set status: open -> closedresolution: fixedmessages: +
2007-09-15 04:07:43 mark.dickinson set files: + decimal_hash_v2.patchmessages: +
2007-09-15 04:06:42 mark.dickinson set files: + long_hash.patchmessages: +
2007-09-12 22:28:27 ajaksu2 set messages: +
2007-09-12 20:54:42 mark.dickinson set messages: +
2007-09-12 17:09:03 ajaksu2 set nosy: + ajaksu2messages: +
2007-08-13 04:48:04 mark.dickinson create