[Python-Dev] Counting collisions for the win (original) (raw)

Victor Stinner victor.stinner at haypocalc.com
Fri Jan 20 01:48:53 CET 2012

Previous message: [Python-Dev] python build failed on mac
Next message: [Python-Dev] Counting collisions for the win
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

I'm working on the hash collision issue since 2 or 3 weeks. I evaluated all solutions and I think that I have now a good knowledge of the problem and how it should be solved. The major issue is to have a minor or no impact on applications (don't break backward compatibility). I saw three major solutions:

use a randomized hash
use two hashes, a randomized hash and the actual hash kept for backward compatibility
count collisions on dictionary lookup

Using a randomized hash does break a lot of tests (e.g. tests relying on the representation of a dictionary). The patch is huge, too big to backport it directly on stable versions. Using a randomized hash may also break (indirectly) real applications because the application output is also somehow "randomized". For example, in the Django test suite, the HTML output is different at each run. Web browsers may render the web page differently, or crash, or ... I don't think that Django would like to sort attributes of each HTML tag, just because we wanted to fix a vulnerability.

Randomized hash has also a major issue: if the attacker is able to compute the secret, (s)he can easily compute collisions and exploit the hash collision vulnerability again. I don't know exactly how complex it is to compute the secret, but our hash function is weak (it is far from being cryptographic, it is really simple to run it backward). If someone writes a fast function to compute the secret, we will go back to the same point.

IMO using two hashes has the same disavantages of the randomized hash solution, whereas it is more complex to implement.

The last solution is very simple: count collision and raise an exception if it hits a limit. The path is something like 10 lines whereas the randomized hash is more close to 500 lines, add a new file, change Visual Studio project file, etc. First I thaught that it would break more applications than the randomized hash, but I tried on Django: the test suite fails with a limit of 20 collisions, but not with a limit of 50 collisions, whereas the patch uses a limit of 1000 collisions. According to my basic tests, a limit of 35 collisions requires a dictionary with more than 10,000,000 integer keys to raise an error. I am not talking about the attack, but valid data.

More details about my tests on the Django test suite: http://bugs.python.org/issue13703#msg151620

I propose to solve the hash collision vulnerability by counting collisions because it does fix the vulnerability with a minor or no impact on applications or backward compatibility. I don't see why we should use a different fix for Python 3.3. If counting collisons solves the issue for stable versions, it is also enough for Python 3.3. We now know all issues of the randomized hash solution, and I think that there are more drawbacks than advantages. IMO the randomized hash is overkill to fix the hash collision issue.

I just have some requests on Marc Andre Lemburg patch:

the limit should be configurable: a new function in the sys module should be enough. It may be private (or replaced by an environment variable?) in stable versions
the set type should also be patched (I didn't check if it is vulnerable or not using the patch)
the patch has no test! (a class with a fixed hash should be enough to write a test)
the limit must be documented somwhere
the exception type should be different than KeyError

Victor

Previous message: [Python-Dev] python build failed on mac
Next message: [Python-Dev] Counting collisions for the win
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list