[Python-Dev] Hash collision security issue (now public) (original) (raw)

Victor Stinner victor.stinner at haypocalc.com
Sun Jan 1 18:32:51 CET 2012


Le 01/01/2012 04:29, Paul McMillan a écrit :

This is incorrect. Once an attacker has guessed the random seed, any operation which reveals the ordering of hashed objects can be used to verify the answer. JSON responses would be ideal. In fact, an attacker can do a brute-force attack of the random seed offline. Once they have the seed, generating collisions is a fast process.

If we want to protect a website against this attack for example, we must suppose that the attacker can inject arbitrary data and can get (indirectly) the result of hash(str) (e.g. with the representation of a dict in a traceback, with a JSON output, etc.).

The goal isn't perfection, but we need to do better than a simple salt.

I disagree. I don't want to break backward compatibility and have a hash() function different for each process, if the change is not an effective protection against the "hash vulnerability".

It's really hard to write a good (secure) hash function: see for example the recent NIST competition (started in 2008, will end this year). Even good security researcher are unable to write a strong and fast hash function. It's easy to add a weakness in the function if you don't have a good background in cryptography. The NIST competition gives 4 years to analyze new hash functions. We should not rush to add a quick "hack" if it doesn't solve correctly the problem (protect against a collision attack and preimage attack).

http://en.wikipedia.org/wiki/NIST_hash_function_competition http://en.wikipedia.org/wiki/Collision_attack

Runtime performance does matter, I'm not completly sure that changing Python is the best place to add a countermeasure against a vulnerability. I don't want to slow down numpy for a web vulnerability. Because there are different use cases, a better compromise is maybe to add a runtime option to use a secure hash function, and keep the unsafe but fast hash function by default.

I propose we modify the string hash function like this:

https://gist.github.com/0a91e52efa74f61858b5

Always allocate 2**21 bytes just to workaround one specific kind of attack is not acceptable. I suppose that the maximum acceptable is 4096 bytes (or better 256 bytes).

Crytographic hash functions don't need random data, why would Python need 2 MB (!) for its hash function?

Victor



More information about the Python-Dev mailing list