Issues with hash-function for float64 version of klib's hash-map · Issue #21866 · pandas-dev/pandas (original) (raw)

Problem description

Hash-maps for float64 use the following hash-function

#define kh_float64_hash_func(key) (khint32_t)((asint64(key))>>33^(asint64(key))^(asint64(key))<<11)

However, in order to guarantee consistent behavior, the following constrains must be met:

  1. == must be an equivalence relation
  2. x==y => hash(x)==hash(y)

Following IEEE-754, floats aren't an equivalence relation, this is fixed defining "equal" as

#define kh_float64_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))

Thus, apart from trivial equivalence classes, there are the following two:

  1. {0.0, -0.0}
  2. {x | x is NAN}

for which the second constrain doesn't hold:

Due to the way klib uses this hash value, the values "0.0" and "-0.0" end up in different buckets only if there are more than 2^29 (ca. 6e8) elements in the hash-map already. The same holds for float("nan") and -float("nan"). However in the case of not-a-number there are much more elements in the equivalence class, so the inconsistent behavior can be triggered for much smaller sizes, for example for

import struct
NAN1=struct.unpack("d", struct.pack("L", 9221120237041090560))[0]
NAN2=struct.unpack("d", struct.pack("L", 9221120237041090561))[0]

Proposed solution:

There are already some workarounds for this problem scattered throughout pandas, for example for pd.unique(...) here or for isin here.

However, this

  1. is not the right place to work around the bug, as it makes the code not really clearer
  2. fixes only the not-a-number case, but not {0.0, -0.0}
  3. most probably everyone reusing the hash-map-implementation in the fututre will first get the NAN-case wrong and then fix it with a further workaround.

A better approach would be to fix the hash-function of float64-hash-map, a possibility would be:

   #include <math.h>
    //right for all but not -0.0 and NANs
    #define kh_float64_hash_func_0_NAN(key) (khint32_t)((f64_to_i64(key))>>33^(f64_to_i64(key))^(f64_to_i64(key))<<11)
    //right for all except NANs
    #define kh_float64_hash_func_NAN(key) ((key)==0.0 ? kh_float64_hash_func_0_NAN(0.0) : kh_float64_hash_func_0_NAN(key))
    //right for all, also- 0.0 and NANs
    #define kh_float64_hash_func(key) ((key) != (key) ? kh_float64_hash_func_NAN(NAN) : kh_float64_hash_func_NAN(key))