Message 151870 - Python tracker (original) (raw)

On Mon, Jan 23, 2012 at 1:32 PM, Dave Malcolm <report@bugs.python.org> wrote:

Dave Malcolm <dmalcolm@redhat.com> added the comment:

I'm attaching an attempt at backporting haypo's random-8.patch to 2.7

Changes relative to random-8.patch:

  * The randomization is off by default, and must be enabled by setting     a new environment variable PYTHONHASHRANDOMIZATION to a non-empty string.     (if so then, PYTHONHASHSEED also still works, if provided, in the same     way as in haypo's patch)

  * All of the various "Py_hash_t" become "long" again (Py_hash_t was     added in 3.2: )

  * I expanded the randomization from just PyUnicodeObject to also cover     these types:

    * PyStringObject

    * PyBufferObject

    The randomization does not cover numeric types: if we change the hash of     int so that hash(i) no longer equals i, we also have to change it     consistently for long, float, complex, decimal.Decimal and     fractions.Fraction; however, there are 3rd-party numeric types that     have their own hash implementation that mimics int.hash (see     e.g. numpy)

    As noted in http://bugs.python.org/issue13703#msg151063 and     http://bugs.python.org/issue13703#msg151064, it's not possible     to directly create a dict with integer keys via JSON or XML-RPC.

    This seems like a tradeoff between the risk of attack via other means     vs breakage induced by not having hash() == hash() for the various     equivalent numerical representations in pre-existing code.

Exactly. I would NOT worry about hash repeatability for integers and complex data structures. It is not at the core of the common problem (maybe a couple application specific problems but not a general "all python web apps" severity problem).

Doing it for base byte string and unicode string like objects is sufficient. Good catch on doing it for buffer objects, I'd forgotten about those. ;) A big flaw with haypo's patch is that it only considers unicode instead of all byte-string-ish stuff. (the code in does that better).

  * To support my expanded usage of the random secret, I moved:

      PyAPI_DATA(_Py_unicode_hash_secret_t) _Py_unicode_hash_secret

    from unicodeobject.h to object.h and renamed it to:

      PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

    This also exposes it for usage by C extension modules, just in case     they need it (Murphy's Law suggests we will need if we don't expose     it).   This is an extension of the API, but warranted, I feel.  My     plan for downstream RHEL is to add this explicitly to the RPM metadata     as a "Provides" of the RPM providing libpython.so so that if something     needs to use it, it can express a "Requires" on it; I assume that     something similar is possible with .deb)

Exposing this is good. There is a hash table implementation within modules/expat/xmlparse.c that should probably use it as well.

  * generalized test_unicode.HashTest to support the new env var and the     additional types.  In my version, get_hash takes a _repr string rather     than an object, so that I can test it with a buffer().  Arguably the     tests should thus be moved from test_unicode to somewhere else, but this     location keeps things consistent with haypo's patch.

    haypo: in random-8.patch, within test_unicode.HashTest.test_null_hash,     "hash_empty" seems to be misnamed

Lets move this to a better location in all patches. At this point haypo's patch is not done yet so relevant bits of what you are doing here is likely to be fed back into the eventual 3.3 tip patch.

-gps