[Python-Dev] Hashes in Python3.5 for tuples and frozensets

Chris Angelico rosuav at gmail.com
Thu May 17 04:16:16 EDT 2018


On Thu, May 17, 2018 at 5:21 PM, Anthony Flury via Python-Dev <python-dev at python.org> wrote:

Victor, Thanks for the link, but to be honest it will just confuse people - neither the link nor the related bpo entries state that the fix is limited to strings. They simply talk about hash randomization - which in my opinion implies ALL hash algorithms; which is why I asked the question.

I am not sure how much should be exposed about the scope of security fixes but you can understand my (and others') confusion. I am aware that applications shouldn't make assumptions about the value of any given hash - apart from some simple assumptions based on hash value equality (i.e. if two objects have different hash values they can't be the same value).

The hash values of Python objects are calculated by the __hash__ method, so arbitrary objects can do what they like, including degenerate algorithms such as:

    class X:
        def __hash__(self):
            return 7

So it's impossible to randomize ALL hashes at the language level. Only str and bytes hashes are randomized, because they're the ones most likely to be exploitable - for instance, a web server will receive a query like "http://spam.example/target?a=1&b=2&c=3" and provide a dictionary {"a":1, "b":2, "c":3}. Similarly, a JSON decoder is always going to create string keys in its dictionaries (JSON objects). Do you know of any situation in which an attacker can provide the keys for a dict/set as integers?
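To see the difference between randomized and non-randomized hashes, here is a minimal sketch that starts two fresh interpreters with different PYTHONHASHSEED values and compares what they report. The helper name hashes_with_seed and the sample values are made up for illustration, and the behaviour described is what I'd expect from CPython 3.3 or later, not a language guarantee.

```python
import os
import subprocess
import sys


def hashes_with_seed(seed):
    """Run a fresh interpreter with PYTHONHASHSEED=seed and return its hashes.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    code = "print(hash('spam'), hash(12345), hash((1, 2, 3)))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    str_hash, int_hash, tuple_hash = (int(x) for x in result.stdout.split())
    return str_hash, int_hash, tuple_hash


first = hashes_with_seed(1)
second = hashes_with_seed(2)

# The str hash depends on the seed; the int and tuple-of-ints hashes do not
# (assumption: CPython's current implementation).
print("str hash differs:          ", first[0] != second[0])
print("int hash differs:          ", first[1] != second[1])
print("tuple-of-ints hash differs:", first[2] != second[2])
```

Note that a tuple containing strings would inherit the randomization, since its hash is built from the hashes of its elements.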

BTW: This question was prompted by a question on a social media platform about whether hash values are transferable across platforms. Everything I could find stated that after Python 3.3 ALL hash values were randomized - but that clearly isn't the case; and the original questioner identified that some hash values are randomized and others aren't.

That's actually immaterial. Even if the hashes weren't actually randomized, you shouldn't be making assumptions about anything specific in the hash, save that within one Python process, two equal values will have equal hashes (and therefore two objects with unequal hashes will not be equal).
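As a rough sketch of that one guarantee: equal values must hash equal within a process, but the converse does not hold. The Point class below is hypothetical, and the -1/-2 collision is a CPython implementation detail, not something to rely on.

```python
class Point:
    """Hypothetical value type, used only for illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares, so that
        # equal points always produce equal hashes.
        return hash((self.x, self.y))


a = Point(1, 2)
b = Point(1, 2)

# Equal values have equal hashes, so a set treats them as one element.
print(a == b, hash(a) == hash(b), len({a, b}))

# Built-in numeric types honour the same rule across types.
print(1 == 1.0 == True, hash(1) == hash(1.0) == hash(True))

# The converse fails: in CPython, -1 and -2 are unequal but share a hash.
same_hash_different_value = hash(-1) == hash(-2)
print(-1 == -2, same_hash_different_value)
```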

I did suggest strongly to the original questioner that relying on the same hash value across different platforms wasn't a clever solution - their original plan was to store hash values in a cross system database to enable quick retrieval of data (!!!). I did remind the OP that a hash value wasn't guaranteed to be unique anyway - and they might come across two different values with the same hash - and no way to distinguish between them if all they have is the hash. Hopefully their revised design will store the key, not the hash.

Uhh.... if you're using a database, let the database do the work of being a database. I don't know what this "cross system database" would be implemented in, but if it's a proper multi-user relational database engine like PostgreSQL, it's already going to have way better indexing than anything you'd do manually. I think there are WAY better solutions than worrying about Python's inbuilt hashing.

If you MUST hash your data for sharing and storage, the easiest solution is to just use a cryptographic hash straight out of hashlib.py.
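A minimal sketch of what that might look like, assuming the keys are text and SHA-256 is acceptable (the function name stable_digest and the choice of UTF-8 are my assumptions, not anything from the original question):

```python
import hashlib


def stable_digest(key: str) -> str:
    """Return a hex digest that is the same on every platform and in every process.

    Hypothetical helper: encodes the key as UTF-8 and hashes it with SHA-256.
    """
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


digest = stable_digest("spam")
print(len(digest), digest)

# Unlike hash("spam"), this does not change between interpreter runs.
print(stable_digest("spam") == digest)
```

Even so, storing the key itself alongside the digest avoids having to reason about collisions at all.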

ChrisA


