incorrect canonical graph algorithm in rdflib.compare · Issue #385 · RDFLib/rdflib (original) (raw)
Take the graphs g.ttl
@prefix : <example:test> .
_:1 :p _:2 .
_:2 :p _:3 .
_:3 :p _:4 .
_:4 :p _:5 .
_:5 :p _:6 .
_:6 :p _:1 .
and h.ttl
@prefix : <example:test> .
_:01 :p _:02 .
_:02 :p _:03 .
_:03 :p _:01 .
_:11 :p _:12 .
_:12 :p _:13 .
_:13 :p _:11 .
Then compare them:
>>> from rdflib.compare import IsomorphicGraph
>>> g = IsomorphicGraph()
>>> h = IsomorphicGraph()
>>> g.parse('g.ttl', format='n3')
<Graph identifier=file:///home/urs/g.ttl (<class 'rdflib.graph.Graph'>)>
>>> h.parse('h.ttl', format='n3')
<Graph identifier=file:///home/urs/h.ttl (<class 'rdflib.graph.Graph'>)>
>>> g == h
True
It says that they are isomorphic, but they aren't.
The following method from rdflib.compare._TripleCanonicalizer
gives a hint why it can't work:
def _vhashtriple(self, triple, target_term, done):
for i, term in enumerate(triple):
if not isinstance(term, BNode):
yield term
elif done or (term == target_term):
yield i
else:
yield self._canonicalize(term, done=True)
The last line says it all ;-)
This is definitely a polynomial time algorithm and therefore won't be able to canonicalize general RDF graphs.
By the way, rdflib/tools/graphisomorphism.py
duplicates this code.
Since this bugs me right now, I will try to think of another algorithm.