Issues with hash-function for float64 version of klib's hash-map · Issue #21866 · pandas-dev/pandas (original) (raw)
Problem description
Hash-maps for float64 use the following hash-function
#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)
However, in order to guarantee consistent behavior, the following constrains must be met:
==
must be an equivalence relationx==y
=>hash(x)==hash(y)
Following IEEE-754, floats aren't an equivalence relation, this is fixed defining "equal" as
#define kh_float64_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))
Thus, apart from trivial equivalence classes, there are the following two:
{0.0, -0.0}
{x | x is NAN}
for which the second constrain doesn't hold:
hash(0.0)=0 != 1073741824=2^30=hash(-0.0)
hash(float("nan")) = 1073479680 != 2147221504 =hash(-float("nan"))
Due to the way klib uses this hash value, the values "0.0" and "-0.0" end up in different buckets only if there are more than 2^29 (ca. 6e8) elements in the hash-map already. The same holds for float("nan")
and -float("nan")
. However in the case of not-a-number there are much more elements in the equivalence class, so the inconsistent behavior can be triggered for much smaller sizes, for example for
import struct
NAN1=struct.unpack("d", struct.pack("L", 9221120237041090560))[0]
NAN2=struct.unpack("d", struct.pack("L", 9221120237041090561))[0]
Proposed solution:
There are already some workarounds for this problem scattered throughout pandas, for example for pd.unique(...)
here or for isin
here.
However, this
- is not the right place to work around the bug, as it makes the code not really clearer
- fixes only the not-a-number case, but not
{0.0, -0.0}
- most probably everyone reusing the hash-map-implementation in the fututre will first get the NAN-case wrong and then fix it with a further workaround.
A better approach would be to fix the hash-function of float64-hash-map, a possibility would be:
#include <math.h>
//right for all but not -0.0 and NANs
#define kh_float64_hash_func_0_NAN(key) (khint32_t)((f64_to_i64(key))>>33^(f64_to_i64(key))^(f64_to_i64(key))<<11)
//right for all except NANs
#define kh_float64_hash_func_NAN(key) ((key)==0.0 ? kh_float64_hash_func_0_NAN(0.0) : kh_float64_hash_func_0_NAN(key))
//right for all, also- 0.0 and NANs
#define kh_float64_hash_func(key) ((key) != (key) ? kh_float64_hash_func_NAN(NAN) : kh_float64_hash_func_NAN(key))