namespace.py fix compute_qname missing namespaces by tgbugs · Pull Request #649 · RDFLib/rdflib (original) (raw)


Conversation


tgbugs

The way compute_qname worked was to call split_uri
and then check for a match in self.store.prefix.
This produced incomplete behavior for namespaces that
end in LETTERS_ instead of / or #. This commit corrects
the behavior by iterating through name and testing
namespace + name[:i] to see if there is a matching prefix.
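The approach described in the commit message can be sketched as follows. This is a minimal, hypothetical helper, not the actual patch: the `prefixes` dict stands in for the store's prefix lookup, and the scan runs over the whole URI rather than just the `name` part for simplicity.

```python
def qname_by_scan(uri, prefixes):
    """Try each split point left to right and return (prefix, namespace,
    name) for the first candidate namespace registered in `prefixes`."""
    for i in range(1, len(uri) + 1):
        namespace, name = uri[:i], uri[i:]
        if namespace in prefixes:
            return prefixes[namespace], namespace, name
    return None  # no registered namespace matches this URI

# Hypothetical binding, mirroring the examples later in this thread.
prefixes = {'http://purl.obolibrary.org/obo/GENO_': 'GENO'}
print(qname_by_scan('http://purl.obolibrary.org/obo/GENO_0000385', prefixes))
# -> ('GENO', 'http://purl.obolibrary.org/obo/GENO_', '0000385')
```

Note that scanning shortest-first returns the first match, not the longest; the longest-match concern is exactly what comes up later in the review.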

@joernhees

tests fail... what spec are you referring to when saying that namespaces should not end in / or # ?

@tgbugs

Maybe I have incorrectly assumed that OWLAPI implements prefix handling according to the spec; this change was intended to match its behavior. To clarify about / and # endings: those are correct, and they remain correct with my changes; essentially this adds the ability to use arbitrary prefixes.

I will check again to see if this is in 4.2.2 and figure out why the tests are failing.

@satra

@tgbugs - i decided to fix this together with #632 in the turtle serializer itself. see here: #661

and for the full discussion between @gromgull and myself see #660

@tgbugs

@satra the solution in #661 looks good to me. Closing this. Thanks!

@tgbugs

I have reverted the change to NAME_START_CATEGORIES that was causing the build failures. Tests should pass now.

@satra

@tgbugs - with this change does the following work for you:

from rdflib import Graph, URIRef

graph = Graph()
graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
graph.bind('RO_has_phenotype',
           'http://purl.obolibrary.org/obo/RO_0002200')
graph.add((URIRef('http://example.org'),
           URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
           URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
output = [val for val in
          graph.serialize(format='turtle').decode().splitlines()
          if not val.startswith('@prefix')]
output = ' '.join(output)
assert 'RO_has_phenotype: ' in output
assert 'GENO:0000385' in output

@tgbugs

@satra No, that example fails, producing ' <http://example.org> ns1:RO_0002200 ns1:GENO_0000385 . '. From my experiments the issue arises because a call to graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/RO_0002200')) (which is buried in serialize) forces the creation of ('ns1', rdflib.term.URIRef('http://purl.obolibrary.org/obo/')). A call to graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/GENO_0000385')) does not cause this behavior, which leads me to believe it is because my code stops just short of the end (it assumes that users would not prefix an entire IRI). I think I can fix this and will look into it.

@tgbugs

@satra Fixed. Was stopping at len(name) instead of len(name) + 1. Your tests now pass. If we care about efficiency the slowest part of this seems to be the fact that I have to convert the concatenated name to a URIRef every time in order to check the prefixes.

I can pull in your changes to test/test_turtle_serialize.py to this pr if you want.

@satra

@tgbugs - great. could you please add the test below to: test_turtle_serialize.py

@gromgull and @joernhees - let's focus on this PR and i'll close the other one. any questions or considerations we are overlooking?

def test_turtle_namespace():
    graph = Graph()
    graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
    graph.bind('RO_has_phenotype',
               'http://purl.obolibrary.org/obo/RO_0002200')
    graph.add((URIRef('http://example.org'),
               URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
               URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
    output = [val for val in
              graph.serialize(format='turtle').decode().splitlines()
              if not val.startswith('@prefix')]
    output = ' '.join(output)
    assert 'RO_has_phenotype: ' in output
    assert 'GENO:0000385' in output

@tgbugs

I slightly modified the test case for this so that it forces us to match the longest prefix which is contained in an iri instead of the shortest. I'm pretty sure we can improve significantly on the performance here if needs be.

@tgbugs

Actual issue: if there is an exact prefix match then it doesn't check to see if there are longer matches. Will fix.

@satra

@gromgull - just a ping here to see if this can be merged. also is there a timeline for a release?

@gromgull

I hope to get a 4.2.2 out "quite soon", i.e. maybe before end of february. I'll review this before then!

gromgull

output_prefix = prefix
output_namespace = namespace
output_name = name
for i in xrange(1, len(name) + 1):


we really cannot have this loop here. This function gets called for every single URI in the store on serialisation.

You can have a loong URI and this will call the store for each substring in that URL. For some stores, this means network activity (sqlalchemy, etc).

If we really want to fix this, the namespace manager needs to index the available prefixes by length in init and then try the longest first. But even this means many "if uri starts with x" calls. Some sort of trie-data-structure may be useful here.
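The length-indexed alternative gromgull suggests can be sketched like this. The class and method names are hypothetical, and the bindings below are made-up examples; the point is that sorting once at init lets the most specific (longest) namespace win with plain `startswith` checks and no per-substring store calls.

```python
class PrefixIndex:
    """Sketch of a namespace index ordered longest-first at init time."""

    def __init__(self, bindings):
        # bindings: dict of prefix -> namespace URI.
        # Sort longest namespace first so '.../obo/GENO_' beats '.../obo/'.
        self._ordered = sorted(
            ((ns, prefix) for prefix, ns in bindings.items()),
            key=lambda item: len(item[0]), reverse=True)

    def split(self, uri):
        # One startswith per registered namespace, longest candidates first.
        for ns, prefix in self._ordered:
            if uri.startswith(ns):
                return prefix, ns, uri[len(ns):]
        return None

idx = PrefixIndex({'obo': 'http://purl.obolibrary.org/obo/',
                   'GENO': 'http://purl.obolibrary.org/obo/GENO_'})
print(idx.split('http://purl.obolibrary.org/obo/GENO_0000385'))
# -> ('GENO', 'http://purl.obolibrary.org/obo/GENO_', '0000385')
```

This is still linear in the number of registered namespaces, which is why a trie comes up later in the thread as the asymptotically better structure.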


Ok. This is what I expected we would have to do because this implementation produces worst case behavior all the time. Will take a look and see how it can be improved. Do we have any nice pathological cases that I could use for performance testing?


So my first round of performance testing on the 3 different implementations produced some weird results which I think are not representative of the spectrum of workloads. I serialized https://github.com/SciCrunch/NIF-Ontology/blob/master/ttl/NIF-Molecule.ttl 10 times using 3 different implementations and averaged the time spent calling compute_qname in namespace.py as measured by cProfile in Python 3.5; there were 246680 calls to compute_qname for each run.

This was a quick and dirty test and I am going to set up a more robust suite, since I don't believe that the implementation in question here is actually the fastest for all workloads, as the numbers above would seem to suggest.


Turns out string.split is very, very slow. New implementation using len and slicing gives 1.162s using the same method as above.
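For illustration, the kind of rewrite being described might look like this (assumed shapes, not the actual patch): once the matching namespace is known, the local name falls out of a single slice on the known length instead of a string split.

```python
namespace = 'http://purl.obolibrary.org/obo/GENO_'
uri = 'http://purl.obolibrary.org/obo/GENO_0000385'

# split-based: builds intermediate strings and a list
name_via_split = uri.split(namespace, 1)[1]

# slice-based: one length computation, one slice
name_via_slice = uri[len(namespace):]

assert name_via_split == name_via_slice == '0000385'
```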

joernhees

if prefix is None:
name = None
for lennamespace, prefix, namespace in self.ordered_prefixnames:
if namespace in uri:


this should be if uri.startswith(namespace):


My testing showed that startswith is slower, however if there are cases where a namespace might somehow appear somewhere other than at the start of a URI then we would need to use startswith to prevent false positives. Personally I haven't been able to come up with any cases where this might happen.
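One contrived case (an assumption, not taken from the thread) where a registered namespace appears mid-URI is a URI that embeds another URI in its query string; there `in` gives a false positive while `startswith` correctly rejects it:

```python
namespace = 'http://example.org/ns/'
uri = 'http://mirror.net/fetch?src=http://example.org/ns/thing'

print(namespace in uri)           # True  -- false positive for prefix matching
print(uri.startswith(namespace))  # False -- correct rejection
```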


yupp... correctness > speed

@joernhees

i'd leave the old method in as is and only if that doesn't find anything try your approach...

anyhow, we should probably overhaul the whole thing and use some trie / KMP

@tgbugs

Do you have any examples where the old method is faster than the new one?

@joernhees

The old method uses a dict for lookup (so O(1) per lookup), based on splits by URI parts, so (O(m) for m parts per URI). Overall O(m).

Your method goes through all namespaces (on avg. n/2), so uses O(n) for n registered namespaces.

The old method will scale better in light of many registered namespaces. Your method is likely to be faster if only few namespaces are registered.
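The "old method" being described can be made concrete with a minimal sketch (assumed shape, not the actual split_uri code): split the URI once at the last '#' or '/' and do a single O(1) dict probe on the resulting namespace.

```python
def split_uri_simple(uri):
    """Split a URI at its last '#' or '/' into (namespace, local name)."""
    for sep in ('#', '/'):
        idx = uri.rfind(sep)
        if idx >= 0:
            return uri[:idx + 1], uri[idx + 1:]
    raise ValueError("Can't split %r" % uri)

prefixes = {'http://purl.obolibrary.org/obo/': 'obo'}
ns, name = split_uri_simple('http://purl.obolibrary.org/obo/GENO_0000385')
print(prefixes.get(ns), name)
# -> obo GENO_0000385  (one dict probe, but it can never find a
#    'GENO_'-style namespace that does not end in '/' or '#')
```

This shows both sides of the trade-off under discussion: constant-time lookup per URI, at the cost of missing namespaces like http://purl.obolibrary.org/obo/GENO_.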

@tgbugs

I'm happy to try the old implementation first, but it can produce incorrect results in cases where a valid namespace is defined that does not terminate in members of NAME_START_CATEGORIES (e.g. https://github.com/tgbugs/rdflib/blob/master/test/test_turtle_serialize.py#L78 will fail), so it will take more changes to make sure we are actually getting the longest matching namespace. In addition, looking at the existing split_uri method, I wonder whether on average URIs are longer (m) or there are more prefixes defined (n). Clearly without knowing the empirical distributions of m and n we can't know for certain, but I suspect that on average m is going to be larger than n. Thoughts?

edit: My analysis is not quite right, because split_uri actually starts from the back of the URI, so m is determined by the average fragment length, which means that m is indeed smaller than n on average. However, this means we still need to deal with the prefix length issue, so I will take a look.

@nicholascar

OK, we're going through an rdflib management change and when that settles down and all roles are known - only a few days off - we'll be reviewing all open PRs and dealing with as many as we can. Sorry this has waited for so long but just give it one week more please! Nick

@tgbugs

I'm working on rebasing this PR so that it is easier to review and merge.


@tgbugs

test for longest substring check if split_uri produced an exact prefix

@tgbugs

squashed previous bad implementation attempts here

merged with rdflib/master resolving conflicts in favor of tgbugs

@tgbugs

Attempting to implement the 'faster' version of compute_qname revealed significant issues with the current split_uri implementation.

@tgbugs

The bug caused only the first of multiple urls with a common prefix to be inserted into the subtrie of a new prefix

@tgbugs

If Nd is not in namestart categories then we get nasty nondeterminism in serialization of hex strings.

@tgbugs

used a tuple for speed as well

@tgbugs

If we do not repopulate the trie with known namespaces then we will not be able to retrieve them with split_uri leading to major confusion. Added a test to catch this case.

@tgbugs

Fixed compute_qname's bad behavior where it would fail to split a URL and return nothing despite the fact that the URL itself was a valid namespace. compute_qname now behaves consistently, so that http://foo.com/ -> foo: if foo is defined as the prefix. Added tests to make sure this doesn't happen again.

@tgbugs

These tests make it possible to catch issues with xml serialization of qnames.

@tgbugs

The rdfxml serializer relied on namespace.py's compute_qname to enforce qname correctness. Changing the split characteristics to support N3 prefixes broke that assumption. This commit is a first pass at fixing the behavior. It may be worth renaming functions to make it clear that the current 'compute_qname' function is no longer actually computing a qname but an N3 prefix (or something like that).

@tgbugs

@nicholascar

Opening and closing to trigger Travis

This was referenced

Mar 16, 2020

@ghost mentioned this pull request

Dec 23, 2021