[Python-Dev] Counting collisions for the win (original) (raw)
Frank Sievertsen pydev at sievertsen.de
Mon Jan 23 09:53:06 CET 2012
- Previous message: [Python-Dev] Counting collisions for the win
- Next message: [Python-Dev] Counting collisions for the win
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hello,
I'd still prefer to see a randomized hash()-function (at least for 3.3).
But to protect against the attacks it would be sufficient to use randomization for collision resolution in dicts (and sets).
What if we use a second (randomized) hash-function in case there are many collisions in ONE lookup. This hash-function is used only for collision resolution and is not cached.
The benefits:
- protection against the known attacks
- hash(X) stays stable and the same
- dict order is only changed when there are many collisions
- doctests will not break
- enhanced collision resolution
- RNG doesn't have to be initialized in smaller programs
- nearly no slowdown of most dicts
- second hash-function is only used for keys with higher collision-rate
- lower probability to leak secrets
- possibility to use different secrets for each dict
The drawback:
- need to add a second hash-function
- slower than using one hash-function only, when > 20 collisions
- need to add this to container-types? (if used for py3.3)
- need to expose this to the user? (if used for py3.3)
- works only for datatypes with this new function
- possible to implement without breaking ABI?
The following code is meant for explanation purpose only:
for(perturb = hash; ; perturb >>= 5) { i = (i << 2) + i + perturb + 1;
if((collisions++) == 20) {
// perturb is already zero after 13 rounds.
// 20 collisions are rare.
// you can add && (ma_mask > 256) to make 100% sure
// that it's not used for smaller dicts.
if(Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) {
// If type has a randomized hash, use this now for lookup
i = perturb = PyObject_RandomizedHash(key));
}
.....
If I got this right we could add a new function "tp_randomized_hash" to 3.3 release. But can we also add functions in older releases, without breaking ABI?
If not, can we implement this somehow using a flag?
FOR OLDER RELEASE < 3.3:
Py_hash_t PyObject_RandomizedHash(PyVarObject *o) { PyTypeObject *tp = Py_TYPE(v); if(! (tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH)) return -1;
global_flags_somewhere->USE_RANDOMIZED_HASH = 1;
return (*tp->tp_hash)(v);
}
.... and in unicodeobject.c: (and wherever we need randomization)
static Py_hash_t unicode_hash(PyUnicodeObject *self) { Py_ssize_t len; Py_UNICODE *p; Py_hash_t x; Py_hash_t prefix=0; Py_hash_t suffix=0;
if(global_flags_somewhere->USE_RANDOMIZED_HASH) {
global_flags_somewhere->USE_RANDOMIZED_HASH = 0;
initialize_rng_if_not_already_done_and_return_seed(&prefix,
&suffix);
..... (and don't cache in this case) .....
It's ugly, but if I understand this correctly, the GIL will protect us against race-conditions, right?
Hello, internals experts: Would this work or is there a better way to do this without breaking the ABI?
Frank
- Previous message: [Python-Dev] Counting collisions for the win
- Next message: [Python-Dev] Counting collisions for the win
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]