URLInputSource can be abused to retrieve arbitrary documents if used naïvely · Issue #1844 · RDFLib/rdflib

Discussed in #1543

Originally posted by alexdutton July 20, 2021
This is mostly related to rdflib-jsonld, but the dereferencing implementation is in rdflib, hence raising it here.

Scenario

If a web service takes POSTed JSON-LD data, e.g. as part of a Linked Data Notifications implementation, rdflib will attempt to resolve any URL in the @context. This can lead to:

Problem

rdflib provides no way to control how external references are resolved, nor a way to implement caching of external resources.

An implementor should be able to:

These things should either be possible directly, or there should be an obvious way to hook them in.

Resolution

A new Resolver base class should be added that takes responsibility for resolving external references and returning InputSource instances, probably encapsulating the create_input_source() behaviour in a resolve() method. There should be a default implementation, called e.g. DefaultResolver, that resolves everything. Maybe this resolver takes an instantiation parameter like resolve_schemes=('file', 'http', 'https'), so that it's easy to restrict or turn off dereferencing.
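To make the proposal concrete, here is a rough sketch of what such a resolver could look like. None of it exists in rdflib: Resolver, DefaultResolver, ResolutionForbidden and the opener parameter are all hypothetical names. The opener callable stands in for create_input_source() so that the sketch runs without rdflib or network access, and treating a reference with no scheme as "file" is my own assumption.

```python
from urllib.parse import urlsplit


class ResolutionForbidden(Exception):
    """Hypothetical error: the reference uses a scheme that is not allowed."""


class Resolver:
    """Hypothetical base class that turns a reference into an input source."""

    def resolve(self, location):
        raise NotImplementedError


class DefaultResolver(Resolver):
    """Resolves everything by default; schemes can be restricted on creation."""

    def __init__(self, resolve_schemes=("file", "http", "https"), opener=None):
        self.resolve_schemes = frozenset(s.lower() for s in resolve_schemes)
        # `opener` is a stand-in for create_input_source(); the default just
        # returns a tuple so the sketch stays runnable offline.
        self.opener = opener or (lambda location: ("input-source", location))

    def resolve(self, location):
        # Assumption: a reference without a scheme is a local path ("file").
        scheme = urlsplit(location).scheme.lower() or "file"
        if scheme not in self.resolve_schemes:
            raise ResolutionForbidden(
                f"scheme {scheme!r} is not allowed for {location!r}"
            )
        return self.opener(location)


# Only HTTPS may be dereferenced; file:// and http:// are refused.
https_only = DefaultResolver(resolve_schemes=("https",))
source = https_only.resolve("https://example.org/context.jsonld")
```

Passing resolve_schemes=() would turn dereferencing off entirely, which is probably what a service accepting untrusted JSON-LD wants as its starting point.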

An optional resolver argument should be added to Graph.parse(), so that implementors can override the default behaviour. This is then passed down to the Parser.parse() plugin implementation, defaulting to an instance of DefaultResolver if not specified.
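The flow I have in mind looks roughly like the toy below. It deliberately does not use rdflib: parse() stands in for Graph.parse(), ToyParser for a parser plugin, and AllowAllResolver for the proposed DefaultResolver. All of these names are made up for illustration, and the "document" is just a dict with an optional remote @context.

```python
class AllowAllResolver:
    """Stand-in for the proposed default resolver: resolves everything."""

    def resolve(self, location):
        return f"<source {location}>"


class DenyAllResolver:
    """Stand-in for a locked-down resolver an implementor might supply."""

    def resolve(self, location):
        raise PermissionError(f"dereferencing disabled: {location}")


class ToyParser:
    """Stand-in for a parser plugin; it never fetches anything itself."""

    def parse(self, source, sink, resolver):
        context = source.get("@context")
        if isinstance(context, str):
            # External references always go through the resolver.
            sink.append(resolver.resolve(context))
        sink.append(source.get("@id"))


def parse(source, sink, resolver=None):
    """Stand-in for Graph.parse(): default the resolver, then pass it down."""
    if resolver is None:
        resolver = AllowAllResolver()
    ToyParser().parse(source, sink, resolver)
    return sink


doc = {"@context": "http://example.org/context.jsonld", "@id": "urn:example:1"}
default_result = parse(doc, [])                  # remote context is resolved
try:
    parse(doc, [], resolver=DenyAllResolver())   # caller overrides the default
    denied = False
except PermissionError:
    denied = True
```

The point of the indirection is that the parser plugin never decides on its own whether a reference may be fetched; it only ever asks the resolver it was handed.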

Finally, rdflib-jsonld can be updated to use the resolver instead of create_input_source directly.

Maybe there should also be a way to install a global default resolver to easily implement these protections without having to track down every Graph.parse() call.
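One possible shape for the global hook is a module-level default with a setter, sketched below. Again, this is only an illustration: set_default_resolver(), get_default_resolver() and both resolver classes are hypothetical, and parse() is a stand-in for Graph.parse() rather than rdflib's real method.

```python
import threading


class PermissiveResolver:
    """Stand-in for the proposed DefaultResolver: resolves everything."""

    def resolve(self, location):
        return ("input-source", location)


class DenyAllResolver:
    """Refuses every external reference."""

    def resolve(self, location):
        raise PermissionError(f"dereferencing disabled: {location}")


_lock = threading.Lock()
_default_resolver = PermissiveResolver()


def set_default_resolver(resolver):
    """Install a process-wide default and return the previous one."""
    global _default_resolver
    with _lock:
        previous, _default_resolver = _default_resolver, resolver
    return previous


def get_default_resolver():
    return _default_resolver


def parse(location, resolver=None):
    """Stand-in for Graph.parse(): an explicit resolver wins over the global."""
    active = resolver if resolver is not None else get_default_resolver()
    return active.resolve(location)


# Lock down every parse() call in the process with a single line at startup.
previous = set_default_resolver(DenyAllResolver())
```

An explicit resolver argument still takes precedence, so a trusted code path can opt back in without loosening the global setting.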

Happy to put together a PR if/when an approach is agreed.