[Python-Dev] Hash collision security issue (now public) (original) (raw)

Paul McMillan paul at mcmillan.ws
Sun Jan 1 04:29:59 CET 2012

Previous message: [Python-Dev] Hash collision security issue (now public)
Next message: [Python-Dev] Hash collision security issue (now public)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I'm not too concerned about a 3rd party being able to guess the random seed -- this would require much more effort on their part, since they would have to generate a new set of colliding keys each time they think they have guessed the hash

This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process.

The goal isn't perfection, but we need to do better than a simple salt. I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5

This code is based on PyPy's implementation, but the concept is universal. Rather than choosing a single short random seed per process, we generate a much larger random seed (r). As we hash, we deterministically choose a portion of that seed and incorporate it into the hash process. This modification is a minimally intrusive change to the existing hash function, and so should not introduce unexpected side effects which might come from switching to a different class of hash functions.

I've worked through this code with Alex Gaynor, Antoine Pitrou, and Victor Stinner, and have asked several mathematicians and security experts to review the concept. The reviewers who have gotten back to me thus far have agreed that if the initial random seed is not flawed, this should not overly change the properties of the hash function, but should make it quite difficult for an attacker to deduce the necessary information to predictably cause hash collisions. This function is not designed to protect against timing attacks, but should be nontrivial to reverse even with access to timing data.

Empirical testing shows that this unoptimized python implementation produces ~10% slowdown in the hashing of ~20 character strings. This is probably an acceptable trade off, and actually provides better performance in the case of short strings than a high-entropy fixed-length seed prefix.

-Paul

Previous message: [Python-Dev] Hash collision security issue (now public)
Next message: [Python-Dev] Hash collision security issue (now public)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list