[Python-Dev] Status of the fix for the hash collision vulnerability (original) (raw)
Guido van Rossum guido at python.org
Fri Jan 13 03:57:42 CET 2012
- Previous message: [Python-Dev] Status of the fix for the hash collision vulnerability
- Next message: [Python-Dev] Status of the fix for the hash collision vulnerability
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hm... I started out as a big fan of the randomized hash, but thinking more about it, I actually believe that the chances of some legitimate app having
1000 collisions are way smaller than the chances that somebody's code will break due to the variable hashing. In fact we know for a fact that the latter will break code, since it changes the order of items in a dict. This affects many tests written without this in mind, and I assume there will be some poor sap out there who uses Python's hash() function to address some external persistent hash table or some other external datastructure. How pathological the data needs to be before the collision counter triggers? I'd expect very pathological.
This is depending on how the counting is done (I didn't look at MAL's patch), and assuming that increasing the hash table size will generally reduce collisions if items collide but their hashes are different.
That said, even with collision counting I'd like a way to disable it without changing the code, e.g. a flag or environment variable.
--Guido
On Thu, Jan 12, 2012 at 5:24 PM, Victor Stinner < victor.stinner at haypocalc.com> wrote:
Many people proposed their own idea to fix the vulnerability, but only 3 wrote a patch:
- Glenn Linderman proposes to fix the vulnerability by adding a new "safe" dict type (only accepting string keys). His proof-of-concept (SafeDict.py) uses a secret of 64 random bits and uses it to compute the hash of a key. - Marc Andre Lemburg proposes to fix the vulnerability directly in dict (for any key type). The patch raises an exception if a lookup causes more than 1000 collisions. - I propose to fix the vulnerability only in the Unicode hash (not for other types). My patch adds a random secret initialized at startup (it can be disabled or fixed using an environment variable). -- I consider that Glenn's proposition is not applicable in practice because all applications and all libraries have to be patched to use the new "safe" dict type. Some people are concerned by possible regression introduced by Marc's proposition: his patch may raise an exception for legitimate data. My proposition tries to be "just enough" secure with a low (runtime performance) overhead. My patch becomes huge (and so backporting is more complex), whereas Marc's patch is very simple and so trivial to backport. -- It is still unclear to me if the fix should be enabled by default for Python < 3.3. Because the overhead (of my patch) is low, I would prefer to enable the fix by default, to protect everyone with a simple Python upgrade. I prefer to explain how to disable explicitly the randomized hash (PYTHONHASHSEED=0) (or how to fix application bugs) to people having troubles with randomized hash, instead of leaving the hole open by default. -- We might change hash() for types other than str, but it looks like web servers are only concerned by dict with string keys. We may use Paul's hash function if mine is not enough secure. My patch doesn't fix the DoS, it just make the attack more complex. The attacker cannot pregenerate data for an attack: (s)he has first to compute the hash secret, and then compute hash collisions using the secret. The hash secret is a least 64 bits long (128 bits on a 64 bit system). So I hope that computing collisions requires a lot of CPU time (is slow) to make the attack ineffective with today computers. -- I plan to write a nice patch for Python 3.3, then write a simpler patch for 3.1 and 3.2 (duplicate os.urandom code to keep it unchanged, maybe don't create a new random.c file, maybe don't touch the test suite while the patch breaks many tests), and finally write patches for Python 2.6 and 2.7. Details about my patch: - I tested it on Linux (32 and 64 bits) and Windows (Seven 64 bits) - a new PYTHONSEED environment variable allow to control the randomized hash: PYTHONSEED=0 disables completly the randomized hash (restore the previous behaviour), PYTHONSEED=value uses a fixed seed for processes sharing data and needind same hash values (multiprocessing users?) - no overhead on hash(str) - no startup overhead on Linux - startup overhead is 10% on Windows (see the issue, I propose another solution with a startup overhead of 1%) The patch is not done, some tests are still failing because of the randomized hash. -- FYI, PHP released a version 5.3.9 adding "maxinputvars directive to prevent attacks based on hash collisions (CVE-2011-4885)". Victor
Python-Dev mailing list Python-Dev at python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
-- --Guido van Rossum (python.org/~guido) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20120112/1d1f9275/attachment.html>
- Previous message: [Python-Dev] Status of the fix for the hash collision vulnerability
- Next message: [Python-Dev] Status of the fix for the hash collision vulnerability
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]