Canonical form of SPARQL Patterns · Issue #483 · RDFLib/rdflib

I'm currently running far more than 1M SPARQL queries as part of a machine learning algorithm. As this takes a while, I thought about caching the query results. The problem is that different SPARQL queries can use different variable names but be isomorphic otherwise. Example:

select * where { ?s foo:bar foo:bla }

is isomorphic to

select * where { ?s2 foo:bar foo:bla }

For quick cache lookups it would be cool to have a canonical form of a SPARQL pattern, much like what #441 (rdflib.compare.to_canonical_graph(g1)) provides for rdflib.Graph.
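To make the idea concrete, here is a minimal sketch of such a canonical form in plain Python, independent of rdflib. It is not how to_canonical_graph works: it simply tries every renaming of the variables and keeps the lexicographically smallest result, so it is only feasible for patterns with a handful of variables. The function name `canonical_pattern`, the representation of triple patterns as tuples of strings (variables start with `?`) and the `cache` dict are assumptions made for illustration.

```python
from itertools import permutations


def canonical_pattern(triples):
    """Return a hashable canonical form of a set of triple patterns.

    Terms are plain strings; anything starting with '?' is a variable.
    Brute force over all variable renamings: factorial in the number of
    variables, so only suitable for small patterns.
    """
    triples = set(triples)
    variables = sorted({t for triple in triples for t in triple
                        if t.startswith("?")})
    names = ["?v%d" % i for i in range(len(variables))]
    best = None
    for perm in permutations(names):
        renaming = dict(zip(variables, perm))
        # Rename the variables, then sort so triple order does not matter.
        candidate = tuple(sorted(
            {tuple(renaming.get(t, t) for t in triple) for triple in triples}
        ))
        if best is None or candidate < best:
            best = candidate
    return best


# Hypothetical cache keyed by the canonical form.
cache = {}
q1 = [("?s", "foo:bar", "foo:bla")]
q2 = [("?s2", "foo:bar", "foo:bla")]
cache[canonical_pattern(q1)] = "result of q1"  # placeholder for real results
hit = cache.get(canonical_pattern(q2))  # found, although the variable differs
```

Two patterns get the same key exactly when one can be turned into the other by renaming variables, regardless of the position a variable appears in. The sketch ignores that cached bindings would still carry the variable names of the query that produced them, so a real cache would also have to keep the renaming around.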

A SPARQL query's pattern part can be represented as an rdflib.Graph that contains Variables. Replacing the Variables with BNodes (using the variable name as the BNode id) gets pretty close to a graph that the to_canonical_graph algorithm could be run on, with one exception: BNodes can't be used as predicates (RDF Concepts).

As this is out of spec, I guess it's OK that this fails:

In [1]: from rdflib import *
INFO:rdflib:RDFLib Version: 4.2.1-dev

In [2]: from rdflib.compare import *

In [3]: g = Graph()

In [4]: g.add((BNode('v1'), BNode('v2'), URIRef('foo')))

In [5]: to_canonical_graph(g)

KeyError                                  Traceback (most recent call last)
[...]
/usr/local/lib/python2.7/site-packages/rdflib/compare.pyc in _canonicalize_bnodes(self, triple, labels)
    456         for term in triple:
    457             if isinstance(term, BNode):
--> 458                 yield BNode(value="cb%s" % labels[term])
    459             else:
    460                 yield term

KeyError: rdflib.term.BNode('v2')

Nevertheless, as this is quite close to a cool feature and graph canonicalization isn't exactly the easiest problem to think about: would it be possible to slightly adapt the RGDA1 algorithm to support BNodes in the predicate position as well, thereby also making it fit for SPARQL patterns? Maybe @jimmccusker has an idea on this?
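One workaround that might avoid touching the algorithm at all would be to reify the pattern first, so that every term, including the predicate, ends up in the object position of a fixed predicate. The sketch below only shows that encoding on plain string tuples. The helper name `reify_pattern`, the `_:stmt`/`_:var_` labels and the use of rdf:subject / rdf:predicate / rdf:object as the fixed predicates are assumptions for illustration. Whether to_canonical_graph then behaves as hoped on the resulting graph is untested here.

```python
def reify_pattern(triples):
    """Encode triple patterns so that variables never appear as predicates.

    Each pattern (s, p, o) becomes three triples hanging off a fresh
    statement node. Variables ('?name') are rewritten to blank node
    labels ('_:var_name'); all other terms are kept unchanged.
    """
    roles = ("rdf:subject", "rdf:predicate", "rdf:object")
    out = []
    for i, triple in enumerate(triples):
        stmt = "_:stmt%d" % i  # one blank node per triple pattern
        for role, term in zip(roles, triple):
            node = "_:var_" + term[1:] if term.startswith("?") else term
            out.append((stmt, role, node))
    return out


# The failing example from above: a variable in the predicate position.
reified = reify_pattern([("?v1", "?v2", "foo")])
```

In the reified form only the three fixed predicates occur in the predicate position, so the output is a legal RDF graph once the `_:` labels are turned into BNodes. The price is three triples per original triple pattern.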