Message 151870 - Python tracker (original) (raw)
On Mon, Jan 23, 2012 at 1:32 PM, Dave Malcolm <report@bugs.python.org> wrote:
Dave Malcolm <dmalcolm@redhat.com> added the comment:
I'm attaching an attempt at backporting haypo's random-8.patch to 2.7
Changes relative to random-8.patch:
* The randomization is off by default, and must be enabled by setting a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string. (if so then, PYTHONHASHSEED also still works, if provided, in the same way as in haypo's patch)
* All of the various "Py_hash_t" become "long" again (Py_hash_t was added in 3.2: )
* I expanded the randomization from just PyUnicodeObject to also cover these types:
* PyStringObject
* PyBufferObject
The randomization does not cover numeric types: if we change the hash of int so that hash(i) no longer equals i, we also have to change it consistently for long, float, complex, decimal.Decimal and fractions.Fraction; however, there are 3rd-party numeric types that have their own hash implementation that mimics int.hash (see e.g. numpy)
As noted in http://bugs.python.org/issue13703#msg151063 and http://bugs.python.org/issue13703#msg151064, it's not possible to directly create a dict with integer keys via JSON or XML-RPC.
This seems like a tradeoff between the risk of attack via other means vs breakage induced by not having hash() == hash() for the various equivalent numerical representations in pre-existing code.
Exactly. I would NOT worry about hash repeatability for integers and complex data structures. It is not at the core of the common problem (maybe a couple application specific problems but not a general "all python web apps" severity problem).
Doing it for base byte string and unicode string like objects is sufficient. Good catch on doing it for buffer objects, I'd forgotten about those. ;) A big flaw with haypo's patch is that it only considers unicode instead of all byte-string-ish stuff. (the code in does that better).
* To support my expanded usage of the random secret, I moved:
PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret
from unicodeobject.h to object.h and renamed it to:
PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;
This also exposes it for usage by C extension modules, just in case they need it (Murphy's Law suggests we will need if we don't expose it). This is an extension of the API, but warranted, I feel. My plan for downstream RHEL is to add this explicitly to the RPM metadata as a "Provides" of the RPM providing libpython.so so that if something needs to use it, it can express a "Requires" on it; I assume that something similar is possible with .deb)
Exposing this is good. There is a hash table implementation within modules/expat/xmlparse.c that should probably use it as well.
* generalized test_unicode.HashTest to support the new env var and the additional types. In my version, get_hash takes a _repr string rather than an object, so that I can test it with a buffer(). Arguably the tests should thus be moved from test_unicode to somewhere else, but this location keeps things consistent with haypo's patch.
haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash, "hash_empty" seems to be misnamed
Lets move this to a better location in all patches. At this point haypo's patch is not done yet so relevant bits of what you are doing here is likely to be fed back into the eventual 3.3 tip patch.
-gps