namespace.py fix compute_qname missing namespaces by tgbugs · Pull Request #649 · RDFLib/rdflib
Conversation
The way compute_qname worked was to call split_uri
and then check for a match in self.store.prefix.
This produced incomplete behavior for namespaces that
end in LETTERS_ instead of / or #. This commit corrects
this behavior by iterating through name and testing namespace + name[:i]
to see if there is a matching prefix.
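The approach in that commit message can be sketched roughly as follows. This is a simplified stand-in, not rdflib's actual code: `prefixes` here is a plain dict standing in for the `self.store.prefix` lookup, and the naive split at the last `/` or `#` stands in for rdflib's `split_uri`.

```python
def compute_qname_sketch(uri, prefixes):
    # naive split at the last '#' or '/' (stand-in for rdflib's split_uri)
    for sep in ('#', '/'):
        if sep in uri:
            idx = uri.rindex(sep) + 1
            namespace, name = uri[:idx], uri[idx:]
            break
    else:
        return None  # no separator at all, cannot split
    # try progressively longer candidate namespaces, keeping the longest
    # one that matches a registered prefix (handles GENO_-style namespaces)
    best = None
    for i in range(len(name) + 1):
        candidate = namespace + name[:i]
        if candidate in prefixes:
            best = (prefixes[candidate], candidate, name[i:])
    return best

prefixes = {'http://purl.obolibrary.org/obo/GENO_': 'GENO'}
print(compute_qname_sketch(
    'http://purl.obolibrary.org/obo/GENO_0000385', prefixes))
# ('GENO', 'http://purl.obolibrary.org/obo/GENO_', '0000385')
```

Note the loop runs to `len(name) + 1` so that a URI that is exactly a bound namespace can still match, which is precisely the off-by-one discussed later in this thread.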
tests fail... what spec are you referring to when saying that namespaces should not end in / or # ?
Maybe I have incorrectly assumed that OWLAPI implements prefix handling according to the spec; this change was intended to match its behavior. To clarify about / and # endings: those are correct, and they remain correct with my changes. Essentially this adds the ability to use arbitrary prefixes.
I will check again to see if this is in 4.2.2 and figure out why the tests are failing.
@tgbugs - i decided to fix this together with #632 in the turtle serializer itself. see here: #661
and for the full discussion between @gromgull and myself see #660
@satra the solution in #661 looks good to me. Closing this. Thanks!
I have reverted the change to NAME_START_CATEGORIES that was causing the build failures. Tests should pass now.
@tgbugs - with this change does the following work for you:

```python
graph = Graph()
graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
graph.bind('RO_has_phenotype',
           'http://purl.obolibrary.org/obo/RO_0002200')
graph.add((URIRef('http://example.org'),
           URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
           URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
output = [val for val in
          graph.serialize(format='turtle').decode().splitlines()
          if not val.startswith('@prefix')]
output = ' '.join(output)
assert 'RO_has_phenotype: ' in output
assert 'GENO:0000385' in output
```
@satra No, that example fails, producing `' <http://example.org> ns1:RO_0002200 ns1:GENO_0000385 . '`. From my experiments the issue arises because a call to `graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/RO_0002200'))` (which is buried in serialize) forces the creation of `('ns1', rdflib.term.URIRef('http://purl.obolibrary.org/obo/'))`. A call to `graph.namespace_manager.qname(URIRef('http://purl.obolibrary.org/obo/GENO_0000385'))` does not cause this behavior, which leads me to believe it is because my code stops just short of the end (it assumes that users would not prefix an entire IRI). I think I can fix this and will look into it.
@satra Fixed. It was stopping at `len(name)` instead of `len(name) + 1`. Your tests now pass. If we care about efficiency, the slowest part of this seems to be that I have to convert the concatenated name to a URIRef every time in order to check the prefixes. I can pull your changes to `test/test_turtle_serialize.py` into this PR if you want.
@tgbugs - great. could you please add the test below to `test_turtle_serialize.py`:
@gromgull and @joernhees - let's focus on this PR and i'll close the other one. any questions or considerations we are overlooking?
```python
def test_turtle_namespace():
    graph = Graph()
    graph.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
    graph.bind('RO_has_phenotype',
               'http://purl.obolibrary.org/obo/RO_0002200')
    graph.add((URIRef('http://example.org'),
               URIRef('http://purl.obolibrary.org/obo/RO_0002200'),
               URIRef('http://purl.obolibrary.org/obo/GENO_0000385')))
    output = [val for val in
              graph.serialize(format='turtle').decode().splitlines()
              if not val.startswith('@prefix')]
    output = ' '.join(output)
    assert 'RO_has_phenotype: ' in output
    assert 'GENO:0000385' in output
```
I slightly modified the test case for this so that it forces us to match the longest prefix contained in an IRI instead of the shortest. I'm pretty sure we can improve significantly on the performance here if need be.
Actual issue: if there is an exact prefix match then it doesn't check to see if there are longer matches. Will fix.
@gromgull - just a ping here to see if this can be merged. also is there a timeline for a release?
I hope to get a 4.2.2 out "quite soon", i.e. maybe before end of february. I'll review this before then!
```python
output_prefix = prefix
output_namespace = namespace
output_name = name
for i in xrange(1, len(name) + 1):
```
we really cannot have this loop here. This function gets called for every single URI in the store on serialisation.
You can have a long URI, and this will call the store for each substring of that URI. For some stores, this means network activity (sqlalchemy, etc.).
If we really want to fix this, the namespace manager needs to index the available prefixes by length in `__init__` and then try the longest first. But even this means many "does the URI start with x" calls. Some sort of trie data structure may be useful here.
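The longest-first idea suggested here can be sketched as below. This is an illustrative sketch, not rdflib's API: `PrefixIndex`, `bind`, and `longest_match` are names I've made up, and the trie variant would replace the length scan with a character walk.

```python
class PrefixIndex:
    """Index namespaces by length so lookups try the longest first."""

    def __init__(self):
        self._by_len = {}  # length -> {namespace: prefix}

    def bind(self, prefix, namespace):
        self._by_len.setdefault(len(namespace), {})[namespace] = prefix

    def longest_match(self, uri):
        # walk candidate lengths from longest to shortest; each probe is
        # a single O(1) dict lookup on a fixed-length slice of the URI
        for n in sorted(self._by_len, reverse=True):
            if n <= len(uri):
                prefix = self._by_len[n].get(uri[:n])
                if prefix is not None:
                    return (prefix, uri[:n], uri[n:])
        return None

idx = PrefixIndex()
idx.bind('obo', 'http://purl.obolibrary.org/obo/')
idx.bind('GENO', 'http://purl.obolibrary.org/obo/GENO_')
print(idx.longest_match('http://purl.obolibrary.org/obo/GENO_0000385'))
# ('GENO', 'http://purl.obolibrary.org/obo/GENO_', '0000385')
```

The first hit is guaranteed to be the longest match, so no loop over substrings of the URI (and no per-substring store call) is needed.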
Ok. This is what I expected we would have to do because this implementation produces worst case behavior all the time. Will take a look and see how it can be improved. Do we have any nice pathological cases that I could use for performance testing?
So my first round of performance testing on the 3 different implementations produced some weird results which I think are not representative of the spectrum of workloads. I serialized https://github.com/SciCrunch/NIF-Ontology/blob/master/ttl/NIF-Molecule.ttl 10 times using 3 different implementations and averaged the time spent calling compute_qname in namespace.py as measured by cProfile under Python 3.5; there were 246680 calls to compute_qname for each run.
- current mainline rdflib implementation of compute_qname: 1.2658s
- loop-over-uri implementation (current for this pull request) of compute_qname: 1.1765s
- loop-on-namespaces implementation of compute_qname: 1.2025s
This was a quick and dirty test and I am going to set up a more robust suite, since I don't believe that the implementation in question here is actually the fastest for all workloads, as the numbers above would seem to suggest.
Turns out `string.split` is very, very slow. A new implementation using `len` and slicing gives 1.162s using the same method as above.
```python
if prefix is None:
    name = None
    for lennamespace, prefix, namespace in self.ordered_prefixnames:
        if namespace in uri:
```
this should be `if uri.startswith(namespace):`
My testing showed that `startswith` is slower; however, if there are cases where a namespace might somehow appear somewhere other than at the start of a URI, then we would need to use `startswith` to prevent false positives. Personally I haven't been able to come up with any cases where this might happen.
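One such case does exist: a bound namespace can appear in the middle of an unrelated URI, for instance inside a query string. A minimal illustration (the redirect URI is a made-up example):

```python
namespace = 'http://purl.obolibrary.org/obo/'
uri = ('http://example.org/redirect'
       '?to=http://purl.obolibrary.org/obo/GENO_0000385')

# substring containment wrongly claims a match
print(namespace in uri)           # True
# startswith correctly rejects it
print(uri.startswith(namespace))  # False
```

So `namespace in uri` would treat this redirect URI as belonging to the OBO namespace, which supports the "correctness > speed" point below.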
yupp... correctness > speed
i'd leave the old method in as is and only if that doesn't find anything try your approach...
anyhow, we should probably overhaul the whole thing and use some trie / KMP
Do you have any examples where the old method is faster than the new one?
The old method uses a dict for lookup (so O(1) per lookup), based on splitting the URI into parts (so O(m) for m parts per URI). Overall O(m).
Your method goes through all namespaces (on average n/2), so uses O(n) for n registered namespaces.
The old method will scale better in light of many registered namespaces. Your method is likely to be faster if only a few namespaces are registered.
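A toy illustration of the old dict-based method described here. The names and the split rule are mine, not rdflib's: the real `split_uri` uses NAME_START_CATEGORIES rather than just `#` and `/`, but the shape is the same, a backward scan over the local name followed by a single hash lookup.

```python
def old_style_qname(uri, prefixes):
    # scan backwards over the local name to find the split point;
    # cost is proportional to the local-name length, not the number
    # of registered namespaces
    for i in range(len(uri) - 1, -1, -1):
        if uri[i] in '#/':
            ns, name = uri[:i + 1], uri[i + 1:]
            prefix = prefixes.get(ns)  # single O(1) dict probe
            # old behaviour: split once, succeed or fail; no retries
            return (prefix, ns, name) if prefix else None
    return None

prefixes = {'http://example.org/ns#': 'ex'}
print(old_style_qname('http://example.org/ns#thing', prefixes))
# ('ex', 'http://example.org/ns#', 'thing')
```

The weakness this PR targets is visible in the sketch: a namespace like `...obo/GENO_` never ends at a `#`/`/` boundary, so the single split can never find it.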
I'm happy to try the old implementation first, but it can produce incorrect results in cases where a valid namespace is defined that does not terminate in members of NAME_START_CATEGORIES (e.g. https://github.com/tgbugs/rdflib/blob/master/test/test_turtle_serialize.py#L78 will fail), so it will take more changes to make sure we are actually getting the longest matching namespace. In addition, looking at the existing split_uri method, I wonder whether on average URIs are longer (m) or there are more prefixes defined (n). Clearly, without knowing the empirical distributions of m and n we can't know for certain, but I suspect that on average m is going to be larger than n. Thoughts?
edit: My analysis is not quite right, because split_uri actually starts from the back of the URI, so m is actually determined by the average fragment length, which means that m is indeed smaller than n on average. However, this means we still need to deal with the prefix length issue, so I will take a look.
OK, we're going through an rdflib management change and when that settles down and all roles are known - only a few days off - we'll be reviewing all open PRs and dealing with as many as we can. Sorry this has waited for so long but just give it one week more please! Nick
I'm working on rebasing this PR so that it is easier to review and merge.
test for longest substring check if split_uri produced an exact prefix
squashed previous bad implementation attempts here
merged with rdflib/master resolving conflicts in favor of tgbugs
Attempting to implement the 'faster' version of compute_qname revealed significant issues with the current split_uri implementation.
The bug caused only the first of multiple urls with a common prefix to be inserted into the subtrie of a new prefix
If Nd is not in namestart categories then we get nasty nondeterminism in serialization of hex strings.
used a tuple for speed as well
If we do not repopulate the trie with known namespaces then we will not be able to retrieve them with split_uri leading to major confusion. Added a test to catch this case.
Fixed compute_qname bad behavior where it would fail to split a URL and return nothing despite the fact that the URL itself was a valid namespace. compute_qname now behaves consistently so that http://foo.com/ -> foo: if foo is defined as the prefix. Added tests to make sure this doesn't happen again.
These tests make it possible to catch issues with xml serialization of qnames.
The rdfxml serializer relied on namespace.py compute_qname to enforce qname correctness. Changing the split characteristics to support n3 prefixes broke that assumption. This commit is a first pass at fixing the behavior. It may be worth renaming functions to make it clear that the current 'compute_qname' function is no longer actually computing a qname but an n3 prefix (or something like that).
Opening and closing to trigger Travis
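The exact-namespace behavior described in the commit messages above (http://foo.com/ -> foo:) can be sketched as follows. This is a hypothetical stand-in, not rdflib's compute_qname; `qname_for` and the return shape `(prefix, namespace, name)` mirror the description in the thread.

```python
def qname_for(uri, prefixes):
    # exact match: the URI *is* a bound namespace, so the local name is
    # empty rather than the lookup failing outright (the bug being fixed)
    if uri in prefixes:
        return (prefixes[uri], uri, '')
    # otherwise use the longest registered namespace that prefixes the URI
    matches = [ns for ns in prefixes if uri.startswith(ns)]
    if not matches:
        return None
    ns = max(matches, key=len)
    return (prefixes[ns], ns, uri[len(ns):])

prefixes = {'http://foo.com/': 'foo'}
print(qname_for('http://foo.com/', prefixes))     # ('foo', 'http://foo.com/', '')
print(qname_for('http://foo.com/bar', prefixes))  # ('foo', 'http://foo.com/', 'bar')
```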