incorrect canonical graph algorithm in rdflib.compare · Issue #385 · RDFLib/rdflib (original) (raw)

Take the graphs g.ttl

@prefix : <example:test> .
_:1 :p _:2 .
_:2 :p _:3 .
_:3 :p _:4 .
_:4 :p _:5 .
_:5 :p _:6 .
_:6 :p _:1 .

and h.ttl

@prefix : <example:test> .
_:01 :p _:02 .
_:02 :p _:03 .
_:03 :p _:01 .

_:11 :p _:12 .
_:12 :p _:13 .
_:13 :p _:11 .

Then compare them:

>>> from rdflib.compare import IsomorphicGraph
>>> g = IsomorphicGraph()
>>> h = IsomorphicGraph()
>>> g.parse('g.ttl', format='n3')
<Graph identifier=file:///home/urs/g.ttl (<class 'rdflib.graph.Graph'>)>
>>> h.parse('h.ttl', format='n3')
<Graph identifier=file:///home/urs/h.ttl (<class 'rdflib.graph.Graph'>)>
>>> g == h
True

It says that they are isomorphic, but they aren't.
The following method from rdflib.compare._TripleCanonicalizer gives a hint why it can't work:

    def _vhashtriple(self, triple, target_term, done):
        for i, term in enumerate(triple):
            if not isinstance(term, BNode):
                yield term
            elif done or (term == target_term):
                yield i
            else:
                yield self._canonicalize(term, done=True)

The last line says it all ;-)
This is definitely a polynomial time algorithm and therefore won't be able to canonicalize general RDF graphs.

By the way, rdflib/tools/graphisomorphism.py duplicates this code.

Since this bugs me right now, I will try to think of another algorithm.