turtle & n3 parser don't correctly handle \uXXXX escapes · Issue #335 · RDFLib/rdflib

rdflib-4.0.1

According to http://www.w3.org/TeamSubmission/turtle/#sec-strings, Turtle strings and URIs can use \-escape sequences to represent Unicode code points, notably the \uXXXX and \UXXXXXXXX escapes.
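To make the expected behaviour concrete, here is a minimal sketch of that decoding step. It assumes Python 3, and `unescape_unicode` is a hypothetical helper written for illustration, not rdflib's actual parser code. It only handles the two Unicode escape forms and ignores the other escapes (such as `\\` or `\n`) that a full parser would also have to deal with.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_UNICODE_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')


def unescape_unicode(text):
    # Hypothetical helper: replace each Unicode escape with the
    # code point it names, leaving everything else untouched.
    def _replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return _UNICODE_ESCAPE.sub(_replace, text)


# The URI exactly as it is written in the Turtle source.
raw = r'http://vls.dbpedia.org/resource/B\u00EAesten_(ryk)'
decoded = unescape_unicode(raw)
print(decoded)  # http://vls.dbpedia.org/resource/Bêesten_(ryk)
```

With this kind of decoding applied, the resulting URI contains no backslash at all, which is what I would expect the parser to hand back.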

Parsing a file called test.ttl looking like this:

@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix dbpedia:    <http://dbpedia.org/resource/> .
@prefix ns147:  <http://wo.dbpedia.org/resource/> .
dbpedia:Animal  owl:sameAs  ns147:Dundat_yi ,
        <http://vls.dbpedia.org/resource/B\u00EAesten_(ryk)> .

will incorrectly "double" escape the \:

In [1]: import rdflib

In [2]: g = rdflib.ConjunctiveGraph()

In [3]: g.parse('test.ttl', format='turtle')
No handlers could be found for logger "rdflib.term"
Out[3]: <Graph identifier=file:///Users/joern/Downloads/test.ttl (<class 'rdflib.graph.Graph'>)>

In [4]: list(g)
Out[4]:
[(rdflib.term.URIRef(u'http://dbpedia.org/resource/Animal'),
  rdflib.term.URIRef(u'http://www.w3.org/2002/07/owl#sameAs'),
  rdflib.term.URIRef(u'http://vls.dbpedia.org/resource/B\\u00EAesten_(ryk)'))]

Trying to serialize this then fails:

In [5]: g.serialize(format='turtle')

Exception                                 Traceback (most recent call last)
in ()
----> 1 g.serialize(format='turtle')

/usr/local/lib/python2.7/site-packages/rdflib/graph.pyc in serialize(self, destination, format, base, encoding, **args)
    901         if destination is None:
    902             stream = BytesIO()
--> 903             serializer.serialize(stream, base=base, encoding=encoding, **args)
    904             return stream.getvalue()
    905         if hasattr(destination, "write"):

...

/usr/local/lib/python2.7/site-packages/rdflib/term.pyc in n3(self, namespace_manager)
    220     def n3(self, namespace_manager = None):
    221         if not _is_valid_uri(self):
--> 222             raise Exception('"%s" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?'%self)
    223
    224         if namespace_manager:

Exception: "http://vls.dbpedia.org/resource/B\u00EAesten_(ryk)" does not look like a valid URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?

For n3 there are two remarks: one under http://www.w3.org/TeamSubmission/n3/#subsets, which even points out that cwm (which our n3 parser is based on) has an option for this, and one under http://www.w3.org/TeamSubmission/n3/#sec-mediaReg, which clearly states that the \-escapes are OK as well.

Am I missing something?
If not I'll try to fix this in the n3 parser (which we use for turtle as well) and include the above file as a new test-case.