RGDA1 graph canonicalization sometimes still collapses distinct BNodes · Issue #725 · RDFLib/rdflib

While evaluating my graph pattern learner, I'm currently trying to generate all possible (different) SPARQL BGPs of a given length (5 at the moment). With up to 11 variables, enumerating all of those graphs might be stretching it a bit, but I'm nearly there. To do this, however, I need canonical representations of the SPARQL BGPs. As discussed before (#483), I'm reducing SPARQL BGPs (and especially their variables) to RDF graphs with BNodes (see here if interested), then running RGDA1 on them and mapping the canonical BNode labels back to the SPARQL variables.
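A minimal sketch of the kind of reduction described above, to make the shape of the resulting graph concrete. It is not the actual gp_learner code: `reify_bgp` is a hypothetical helper, and terms are plain strings (`?x` for variables, `_:x` for blank nodes) instead of rdflib terms so that the snippet runs without dependencies. The example BGP is what the reified statements in the test case further down appear to encode.

```python
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'


def reify_bgp(patterns):
    """Reify each (s, p, o) triple pattern as an rdf:Statement.

    Hypothetical helper: every pattern gets its own statement blank node,
    and every variable ('?x') becomes a blank node ('_:x').
    """
    triples = []
    for i, pattern in enumerate(patterns):
        stmt = '_:stmt%d' % i
        for role, term in zip(('subject', 'predicate', 'object'), pattern):
            if term.startswith('?'):
                term = '_:' + term[1:]  # variable -> blank node
            triples.append((stmt, RDF_NS + role, term))
        triples.append((stmt, RDF_NS + 'type', RDF_NS + 'Statement'))
    return triples


TARGET = 'urn:gp_learner:fixed_var:target'
SOURCE = 'urn:gp_learner:fixed_var:source'

bgp = [
    (TARGET, '?v0', '?v2'),
    (TARGET, '?v0', '?v1'),
    (TARGET, '?v4', '?v5'),
    (TARGET, '?v0', SOURCE),
    (TARGET, '?v3', '?v1'),
]

reified = reify_bgp(bgp)
# Nodes are subjects and objects only, as in the stats reported in this issue.
nodes = {s for s, _, _ in reified} | {o for _, _, o in reified}
print('%d triples, %d nodes' % (len(reified), len(nodes)))
# prints "20 triples, 14 nodes"
```

Each pattern contributes four triples, so a BGP of length 5 yields 20 triples. A canonicalization that merges two of the variable blank nodes therefore changes the meaning of the BGP, even though the triple count stays the same.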

Similarly to #494, I noticed that I sometimes still lose nodes during this process. A minimal test case is below (a PR with a test will follow):

import rdflib
import rdflib.compare

g = rdflib.Graph()
g += [
    # statement 1: target ?v0 ?v2
    (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
     rdflib.term.BNode('v2')),
    (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
     rdflib.term.BNode('v0')),
    (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
    (rdflib.term.BNode('N0a76d42406b84fe4b8029d0a7fa04244'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
    # statement 2: target ?v0 ?v1
    (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
     rdflib.term.BNode('v1')),
    (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
     rdflib.term.BNode('v0')),
    (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
    (rdflib.term.BNode('N2f62af5936b94a8eb4b1e4bfa8e11d95'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
    # statement 3: target ?v4 ?v5
    (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
     rdflib.term.BNode('v5')),
    (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
     rdflib.term.BNode('v4')),
    (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
    (rdflib.term.BNode('N5ae541f93e1d4e5880450b1bdceb6404'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
    # statement 4: target ?v0 source
    (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:source')),
    (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
     rdflib.term.BNode('v0')),
    (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
    (rdflib.term.BNode('N86ac7ca781f546ae939b8963895f672e'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
    # statement 5: target ?v3 ?v1
    (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#object'),
     rdflib.term.BNode('v1')),
    (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate'),
     rdflib.term.BNode('v3')),
    (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#subject'),
     rdflib.term.URIRef(u'urn:gp_learner:fixed_var:target')),
    (rdflib.term.BNode('Nac82b883ca3849b5ab6820b7ac15e490'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
     rdflib.term.URIRef(u'http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement')),
]

cg = rdflib.compare.to_canonical_graph(g)

For g we get the following stats:

# Single-argument print(...) calls behave the same on Python 2 and 3.
print('graph length: %d, nodes: %d' % (len(g), len(g.all_nodes())))
print('triple_bnode degrees:')
for triple_bnode in g.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print(len(list(g.triples((triple_bnode, None, None)))))
print('all node out-degrees:')
print(sorted([len(list(g.triples((node, None, None)))) for node in g.all_nodes()]))
print('all node in-degrees:')
print(sorted([len(list(g.triples((None, None, node)))) for node in g.all_nodes()]))

output:

graph length: 20, nodes: 14
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 3, 5, 5]

For cg we get the following; note that one node has gone missing (13 instead of 14):

# Single-argument print(...) calls behave the same on Python 2 and 3.
print('graph length: %d, nodes: %d' % (len(cg), len(cg.all_nodes())))
print('triple_bnode degrees:')
for triple_bnode in cg.subjects(rdflib.RDF['type'], rdflib.RDF['Statement']):
    print(len(list(cg.triples((triple_bnode, None, None)))))
print('all node out-degrees:')
print(sorted([len(list(cg.triples((node, None, None)))) for node in cg.all_nodes()]))
print('all node in-degrees:')
print(sorted([len(list(cg.triples((None, None, node)))) for node in cg.all_nodes()]))

output:

graph length: 20, nodes: 13
triple_bnode degrees:
4
4
4
4
4
all node out-degrees:
[0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4]
all node in-degrees:
[0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 5, 5]
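Comparing the two in-degree lists, one 1 and the 2 have been replaced by a 3 while the triple count stays at 20. That is consistent with one of the in-degree-1 blank nodes having been merged into the in-degree-2 node (v1), though the stats alone do not say which one. A minimal sketch of the same check as a reusable helper follows. `degree_profile` and `lost_nodes` are hypothetical names, not part of rdflib. The helpers accept any iterable of hashable `(s, p, o)` tuples, which should include an rdflib `Graph`, assuming that iterating it yields triples. They count only subjects and objects as nodes, which is what `all_nodes()` appears to do.

```python
from collections import Counter


def degree_profile(triples):
    """Return (triple count, node count, sorted out-degrees, sorted in-degrees).

    Only subjects and objects count as nodes; predicates are ignored.
    """
    triples = set(triples)  # a graph is a set of triples
    out_deg = Counter(s for s, _, _ in triples)
    in_deg = Counter(o for _, _, o in triples)
    nodes = set(out_deg) | set(in_deg)
    return (
        len(triples),
        len(nodes),
        sorted(out_deg[n] for n in nodes),  # Counter gives 0 for missing keys
        sorted(in_deg[n] for n in nodes),
    )


def lost_nodes(before, after):
    """Number of nodes that disappeared between two graphs (0 if none)."""
    return degree_profile(before)[1] - degree_profile(after)[1]


# Toy example: the second graph merges blank node _:b into _:a.
original = [('_:s1', 'p', '_:a'), ('_:s2', 'p', '_:b')]
collapsed = [('_:s1', 'p', '_:a'), ('_:s2', 'p', '_:a')]

print(degree_profile(original))         # (2, 4, [0, 0, 1, 1], [0, 0, 1, 1])
print(lost_nodes(original, collapsed))  # 1
```

Equal profiles do not prove that two graphs are isomorphic, but differing profiles prove that they are not. A regression test could therefore assert that the profile of a graph is unchanged by canonicalization.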

@jimmccusker could you maybe have another look?