feat: sort longturtle blank nodes by edmondchuc · Pull Request #2997 · RDFLib/rdflib

Summary of changes

Fixes #1890 - Sorting Turtle output?

This change makes Turtle data serialized with the longturtle serializer easier to diff in git. Previously, blank nodes were not sorted, so round-trips through RDFLib's turtle parser and longturtle serializer would move blank node objects around in the output, making it difficult to spot real changes in a git diff.

This PR fixes that by sorting the object values of a statement when those objects are blank nodes. Each blank node is sorted by taking its concise bounded description (CBD) graph, serializing it with longturtle, and comparing the resulting strings.
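The idea can be sketched without RDFLib. The snippet below is a simplified stand-in, not the code in this PR: triples are plain tuples of strings, blank nodes are assumed to be strings starting with `_:`, and instead of serializing the CBD to longturtle it builds a nested tuple of (predicate, value) pairs and sorts on its `repr`. The helper names `is_bnode`, `describe`, and `sorted_bnode_objects` are hypothetical.

```python
def is_bnode(term):
    # Assumption for this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def describe(triples, node, seen=frozenset()):
    """Build a description of `node` that does not mention blank node labels.

    Follows blank node objects recursively, roughly like a concise bounded
    description, and replaces each nested blank node by its own description.
    """
    if node in seen:  # guard against cycles between blank nodes
        return ()
    seen = seen | {node}
    pairs = []
    for s, p, o in triples:
        if s == node:
            value = describe(triples, o, seen) if is_bnode(o) else o
            pairs.append((p, value))
    # Values mix strings and tuples, so sort on repr to keep them comparable.
    return tuple(sorted(pairs, key=repr))


def sorted_bnode_objects(triples, subject, predicate):
    """Return the blank node objects of (subject, predicate), sorted by description."""
    objs = [o for s, p, o in triples if s == subject and p == predicate and is_bnode(o)]
    return sorted(objs, key=lambda o: repr(describe(triples, o)))


# Two graphs with the same content but swapped blank node labels.
g1 = [
    ("ex:a", "ex:knows", "_:b1"),
    ("ex:a", "ex:knows", "_:b2"),
    ("_:b1", "ex:name", '"Bob"'),
    ("_:b2", "ex:name", '"Alice"'),
]
g2 = [
    ("ex:a", "ex:knows", "_:b1"),
    ("ex:a", "ex:knows", "_:b2"),
    ("_:b1", "ex:name", '"Alice"'),
    ("_:b2", "ex:name", '"Bob"'),
]

order1 = sorted_bnode_objects(g1, "ex:a", "ex:knows")
order2 = sorted_bnode_objects(g2, "ex:a", "ex:knows")
```

In both graphs the node named "Alice" sorts before the node named "Bob", whatever label the parser gave it, which is the property that keeps the serialized output stable across round-trips.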

This adds an additional cost to the longturtle serializer, but I think the cost is worth it if we are after a deterministic output with blank nodes.

Note that this relies on the data having been parsed by the turtle parser and on how that parser assigns blank node identifiers. Identical serialization cannot be guaranteed for data added directly through the graph object; that would require implementing RDF Canonicalization so that the graph always gets the same blank node identifiers.

Once RDF Canonicalization is implemented, sorting by the blank node identifier directly will be enough to guarantee deterministic serialization, and the expensive CBD sorting can be removed.

Update: it looks like top-level blank nodes with no inbound relationships don't get sorted. I think this makes sense, since the sort is only applied to blank nodes in the object position. If we cared about sorting blank nodes in the subject position, we could apply the same kind of sort, based on the text serialization of the CBD, to the subject blank nodes.

    for subject in subjects_list:
        if self.isDone(subject):
            continue
        if firstTime:
            firstTime = False
        if self.statement(subject) and not firstTime:
            self.write("\n")

Fixes #2767 - Bug in longturtle serialization

This PR also fixes the missing trailing whitespace in the special case described by @mschiedon.

The longturtle serializer fails to emit a whitespace separator between a predicate and a list of objects if one of these objects is a blank node (and the blank node cannot be 'inlined', i.e. is used more than once)

This bug was previously fixed in PR #2700, but the fix was inadvertently reverted in PR #2731 when I was trying to satisfy a ruff linting rule in the test case. To get around the ruff linting rule, I've now moved the expected result of the test case into a text file.

Checklist