improve hash efficiency by directly using str/unicode hash by joernhees · Pull Request #746 · RDFLib/rdflib
During investigation of a performance issue in my graph pattern learner, I noticed (sadly not for the first time) that rdflib spends massive amounts of time hashing Identifiers. For example, profiling a minimal example showed that about 34 % of the execution time is spent in Identifier.__hash__.
This change makes hashing about 25 % more efficient for URIRefs and about 15 % for Literals.
After this change (nothing else changed), my code (which uses various sets, dicts, ...) sees an overall speedup of about 4×.
Before, hashing performed several string concatenations to build the fully qualified class name (fqn), then hashed that and XORed it with the str/unicode hash. This was done to avoid potential hash collisions between 'foo', URIRef('foo'), and Literal('foo').
However, those scenarios can be considered corner cases. Tested in worst-case collision scenarios, the new hashing performs very close to the old behavior, but clearly outperforms it in more typical ones.
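For context, the two hashing schemes compare roughly like this (a simplified sketch of the idea, not the exact diff; the real classes live in rdflib/term.py):

try:
    text = unicode  # Python 2
except NameError:
    text = str  # Python 3

class OldStyleIdentifier(text):
    # sketch of the previous behavior: rebuild the fully qualified class
    # name on every call, hash it, and XOR it with the plain string hash
    def __hash__(self):
        t = type(self)
        fqn = t.__module__ + '.' + t.__name__
        return hash(fqn) ^ text.__hash__(self)

class NewStyleIdentifier(text):
    # sketch of the new behavior: delegate directly to the str/unicode
    # hash, skipping the concatenations and the extra hash() call
    __hash__ = text.__hash__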
Test code for IPython:
from rdflib import URIRef, Literal
def test():
    # worst case collisions
    s = set()
    for i in range(100000):
        _s = u'foo:%d' % i
        s.add(_s)
        s.add(URIRef(_s))
        s.add(Literal(_s))
    assert len(s) == 300000
%timeit test()
"more natural ones:"
%timeit set(URIRef('asldkfjlsadkfsaldfj:%d' % i) for i in range(100000))
%timeit set(Literal('asldkfjlsadkfsaldfj%d' % i) for i in range(100000))
Results (in the order of the three %timeit calls above: worst-case collisions, URIRef, Literal):
Old:
1 loop, best of 3: 940 ms per loop
1 loop, best of 3: 334 ms per loop
1 loop, best of 3: 610 ms per loop
New:
1 loop, best of 3: 945 ms per loop
1 loop, best of 3: 250 ms per loop
1 loop, best of 3: 515 ms per loop
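Regarding the collision corner case: even if a plain string, a URIRef and a Literal with the same lexical value end up with colliding hashes, they remain distinct set/dict entries, because __eq__ still distinguishes the types (that is what the assert len(s) == 300000 above checks). Minimal illustration:

from rdflib import URIRef, Literal

# hash collisions only cost some extra probing; equality still keeps
# the plain string, the URIRef and the Literal apart
s = {u'foo', URIRef(u'foo'), Literal(u'foo')}
assert len(s) == 3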