improve hash efficiency by directly using str/unicode hash by joernhees · Pull Request #746 · RDFLib/rdflib

While investigating a performance issue in my graph pattern learner, I noticed (sadly not for the first time) that rdflib spends a massive amount of time hashing Identifiers. For example, below you can see that about 34% of the execution time of a minimal example is spent in Identifier.__hash__:
(screenshot: profiler output, 2017-05-25 22:43)

This change makes hashing about 25% more efficient for URIRefs and about 15% more efficient for Literals.

With this change applied (and nothing else changed), my code (which uses various sets, dicts, ...) sees a speedup of roughly 4x:
(screenshot: profiler output after the change, 2017-05-25 22:44)

Before this change, hashing performed several string concatenations to build the fully qualified class name, hashed that, and XORed the result with the plain str/unicode hash of the value. This was done to avoid potential hash collisions between 'foo', URIRef('foo') and Literal('foo').

However, those scenarios can be considered corner cases. In a worst-case collision scenario the new hashing performs very close to the old behavior, while clearly outperforming it in more typical ones.
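For illustration, here is a rough sketch of the two approaches (simplified, not the exact rdflib source; shown with str, on Python 2 the base type would be unicode):

    # rough sketch only, not the exact rdflib implementation
    class OldStyleIdentifier(str):
        def __hash__(self):
            # build the fully qualified class name, hash it, and XOR it
            # with the hash of the plain string value
            t = type(self)
            fqn = t.__module__ + "." + t.__name__
            return hash(fqn) ^ hash(str(self))

    class NewStyleIdentifier(str):
        # simply reuse the built-in string hash
        __hash__ = str.__hash__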

Test code for IPython:

from rdflib import URIRef, Literal

def test():
    # worst case collisions
    s = set()
    for i in range(100000):
        _s = u'foo:%d' % i
        s.add(_s)
        s.add(URIRef(_s))
        s.add(Literal(_s))
    assert len(s) == 300000

%timeit test()

"more natural ones:"
%timeit set(URIRef('asldkfjlsadkfsaldfj:%d' % i) for i in range(100000))
%timeit set(Literal('asldkfjlsadkfsaldfj%d' % i) for i in range(100000))

Results:

Old:
worst-case collisions: 1 loop, best of 3: 940 ms per loop
URIRef set:            1 loop, best of 3: 334 ms per loop
Literal set:           1 loop, best of 3: 610 ms per loop

New:
worst-case collisions: 1 loop, best of 3: 945 ms per loop
URIRef set:            1 loop, best of 3: 250 ms per loop
Literal set:           1 loop, best of 3: 515 ms per loop
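As a small sanity check of the collision concern (a minimal sketch using only the classes from the test above): even if the plain string, URIRef and Literal of the same value now end up in the same hash bucket, they remain unequal to each other, so sets and dicts still keep them apart. Collisions only add equality checks, they do not affect correctness, which is what the assert len(s) == 300000 in test() relies on.

    from rdflib import URIRef, Literal

    # same underlying string value for all three; with value-based hashing
    # they may share a hash bucket, but they are still pairwise unequal,
    # so the set keeps three distinct members
    s = {u'foo', URIRef(u'foo'), Literal(u'foo')}
    assert len(s) == 3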