add chunk serializer & tests by nicholascar · Pull Request #1968 · RDFLib/rdflib

@gjhiggins @ashleysommer @aucampia what do you think of the approach here?

Minimally, it should indicate clearly that it is restricted to Graph serialization because, as the test below shows, context information is not preserved:

```python
@pytest.mark.xfail(reason="Context information not preserved")
def test_chunking_of_conjunctivegraph():
    nquads = """
<http://example.org/alice> <http://purl.org/dc/terms/publisher> "Alice" .
<http://example.org/bob> <http://purl.org/dc/terms/publisher> "Bob" .
_:harry <http://purl.org/dc/terms/publisher> "Harry" .
_:harry <http://xmlns.com/foaf/0.1/name> "Harry" _:harry .
_:harry <http://xmlns.com/foaf/0.1/mbox> <mailto:harry@work.example.org> _:harry .
_:alice <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/alice> .
_:alice <http://xmlns.com/foaf/0.1/mbox> <mailto:alice@work.example.org> <http://example.org/alice> .
_:bob <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/mbox> <mailto:bob@oldcorp.example.org> <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/knows> _:alice <http://example.org/bob> ."""
    g = ConjunctiveGraph()
    g.parse(data=nquads, format="nquads")

    # make a temp dir to work with
    temp_dir_path = Path(tempfile.TemporaryDirectory().name)
    Path(temp_dir_path).mkdir()

    # serialize into chunks file with 100 triples each
    serialize_in_chunks(
        g, max_triples=100, file_name_stem="chunk_100", output_dir=temp_dir_path
    )

    # check, when a graph is made from the chunk files, it's isomorphic with original
    g2 = ConjunctiveGraph()
    for f in Path(temp_dir_path).glob("*.nt"):
        g2.parse(f, format="nt")

    assert len(list(g.contexts())) == len(list(g2.contexts()))
```
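For what it's worth, here is a minimal sketch of one way a chunker could keep the context term: N-Quads is line-based, so splitting on whole statement lines leaves each chunk a valid N-Quads document with the fourth term intact. This is not the PR's implementation. The names `chunk_statements` and `write_nquads_chunks`, the `.nq` file naming, and the assumption that the input has already been serialized as N-Quads text (e.g. via `g.serialize(format="nquads")`) are all mine, for illustration only.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def chunk_statements(lines: Iterable[str], max_statements: int) -> Iterator[List[str]]:
    """Yield lists of at most max_statements statement lines (hypothetical helper)."""
    if max_statements < 1:
        raise ValueError("max_statements must be at least 1")
    stripped = (line.strip() for line in lines)
    # Skip blank lines and comment lines; every remaining line is one whole
    # statement, so the optional fourth (context) term is never cut off.
    statements = (line for line in stripped if line and not line.startswith("#"))
    while True:
        chunk = list(islice(statements, max_statements))
        if not chunk:
            return
        yield chunk


def write_nquads_chunks(
    nquads: str, max_statements: int, file_name_stem: str, output_dir: Path
) -> List[Path]:
    """Write N-Quads text into numbered .nq chunk files (hypothetical helper)."""
    paths = []
    chunks = chunk_statements(nquads.splitlines(), max_statements)
    for i, chunk in enumerate(chunks, start=1):
        path = output_dir / f"{file_name_stem}_{i:06d}.nq"
        path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Chunk files written this way would be read back with `format="nquads"` rather than `format="nt"`, which is what lets the context term survive the round trip.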

The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.

RDFLib has traditionally been ambivalent about what's perceived as core vs non-core. Additional functionality appears to inevitably accrete, up to a point where it gets migrated out en masse into a separate package, the contents of which gradually become obsolete as they either fall out of use or are subsequently integrated into core library functionality.

Additional non-core functionality does have a regrettable tendency to languish in an untended and unkempt state. For instance, there's tools/graphisomorphism.py which is:

- currently broken (and has been since 2018)
- long obsolete, referring as it does to RDFa as a supported format and based on Sean B. Palmer's 2004 rdfdiff.py implementation
- subject to the same triples-only limitation
- obsoleted in functionality by both rdflib.compare.isomorphic and the weaker Graph.isomorphic()

Is it even worth bothering with a relatively trivial fix/update ...

```diff
diff --git a/rdflib/tools/graphisomorphism.py b/rdflib/tools/graphisomorphism.py
index 004b567b..75462eb9 100644
--- a/rdflib/tools/graphisomorphism.py
+++ b/rdflib/tools/graphisomorphism.py
@@ -27,6 +27,10 @@ class IsomorphicTestableGraph(Graph):
         """
         return hash(tuple(sorted(self.hashtriples())))
 
+    def __hash__(self):
+        # return hash(tuple(sorted(self.hashtriples())))
+        return self.internal_hash()
+
     def hashtriples(self):
         for triple in self:
             g = ((isinstance(t, BNode) and self.vhash(t)) or t for t in triple)
@@ -49,19 +53,19 @@ class IsomorphicTestableGraph(Graph):
             else:
                 yield self.vhash(triple[p], done=True)
 
-    def __eq__(self, G):
+    def __eq__(self, g):
         """Graph isomorphism testing."""
-        if not isinstance(G, IsomorphicTestableGraph):
+        if not isinstance(g, IsomorphicTestableGraph):
             return False
-        elif len(self) != len(G):
+        elif len(self) != len(g):
             return False
-        elif list.__eq__(list(self), list(G)):
+        elif list.__eq__(list(self), list(g)):
             return True  # @@
-        return self.internal_hash() == G.internal_hash()
+        return self.internal_hash() == g.internal_hash()
 
-    def __ne__(self, G):
+    def __ne__(self, g):
         """Negative graph isomorphism testing."""
-        return not self.__eq__(G)
+        return not self.__eq__(g)
 
 
 def main():
@@ -82,10 +86,10 @@ def main():
         default="xml",
         dest="inputFormat",
         metavar="RDF_FORMAT",
-        choices=["xml", "trix", "n3", "nt", "rdfa"],
+        choices=["xml", "n3", "nt", "turtle", "trix", "trig", "nquads", "json-ld", "hext"],
         help="The format of the RDF document(s) to compare"
-        + "One of 'xml','n3','trix', 'nt', "
-        + "or 'rdfa'.  The default is %default",
+        + "One of 'xml', 'turtle', 'n3', 'nt', 'trix', 'trig', 'nquads', 'json-ld'"
+        + "or 'hext'.  The default is %default",
     )
 
     (options, args) = op.parse_args()
```

when its appearance in tools is unlikely to persist for much longer?

There are a few non-core contributions in the closed PRs which I'm recruiting for preservation in the cookbook. I'm guessing that a command-line version of graphisomorphism will ultimately end up there.