(original) (raw)

On Thu, Jan 19, 2012 at 4:48 PM, Victor Stinner <victor.stinner@haypocalc.com> wrote:

Hi,

I'm working on the hash collision issue since 2 or 3 weeks. I

evaluated all solutions and I think that I have now a good knowledge

of the problem and how it should be solved. The major issue is to have

a minor or no impact on applications (don't break backward

compatibility). I saw three major solutions:

�- use a randomized hash

�- use two hashes, a randomized hash and the actual hash kept for

backward compatibility

�- count collisions on dictionary lookup

Using a randomized hash does break a lot of tests (e.g. tests relying

on the representation of a dictionary). The patch is huge, too big to

backport it directly on stable versions. Using a randomized hash may

also break (indirectly) real applications because the application

output is also somehow "randomized". For example, in the Django test

suite, the HTML output is different at each run. Web browsers may

render the web page differently, or crash, or ... I don't think that

Django would like to sort attributes of each HTML tag, just because we

wanted to fix a vulnerability.

Randomized hash has also a major issue: if the attacker is able to

compute the secret, (s)he can easily compute collisions and exploit

the hash collision vulnerability again. I don't know exactly how

complex it is to compute the secret, but our hash function is weak (it

is far from being cryptographic, it is really simple to run it

backward). If someone writes a fast function to compute the secret, we

will go back to the same point.

IMO using two hashes has the same disavantages of the randomized hash

solution, whereas it is more complex to implement.

The last solution is very simple: count collision and raise an

exception if it hits a limit. The path is something like 10 lines

whereas the randomized hash is more close to 500 lines, add a new

file, change Visual Studio project file, etc. First I thaught that it

would break more applications than the randomized hash, but I tried on

Django: the test suite fails with a limit of 20 collisions, but not

with a limit of 50 collisions, whereas the patch uses a limit of 1000

collisions. According to my basic tests, a limit of 35 collisions

requires a dictionary with more than 10,000,000 integer keys to raise

an error. I am not talking about the attack, but valid data.

More details about my tests on the Django test suite:

http://bugs.python.org/issue13703#msg151620

--

I propose to solve the hash collision vulnerability by counting

collisions because it does fix the vulnerability with a minor or no

impact on applications or backward compatibility. I don't see why we

should use a different fix for Python 3.3. If counting collisons

solves the issue for stable versions, it is also enough for Python

3.3. We now know all issues of the randomized hash solution, and I

think that there are more drawbacks than advantages. IMO the

randomized hash is overkill to fix the hash collision issue.

�

I just have some requests on Marc Andre Lemburg patch:

�- the limit should be configurable: a new function in the sys module

should be enough. It may be private (or replaced by an environment

variable?) in stable versions

�- the set type should also be patched (I didn't check if it is

vulnerable or not using the patch)

�- the patch has no test! (a class with a fixed hash should be enough

to write a test)

�- the limit must be documented somwhere

�- the exception type should be different than KeyError

Victor

_______________________________________________

Python-Dev mailing list

Python-Dev@python.org

http://mail.python.org/mailman/listinfo/python-dev

Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org

--
--Guido van Rossum (python.org/\~guido)