URLInputSource can be abused to retrieve arbitrary documents if used naïvely · Issue #1844 · RDFLib/rdflib

Discussed in #1543

Originally posted by alexdutton July 20, 2021
This is mostly related to rdflib-jsonld, but the dereferencing implementation is in rdflib, hence raising it here.

Scenario

If a web service takes POSTed JSON-LD data, e.g. as part of a Linked Data Notifications implementation, rdflib will attempt to resolve any URL in the @context. This can lead to:

attackers being able to probe internal networks, by having rdflib request potential non-public URLs
reflection attacks, if the same or slightly different URLs are repeated multiple times in the @context
resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted (admittedly an rdflib-jsonld issue)
denial of service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete
attackers being able to probe the local filesystem using file:// URLs
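
To make the scenario concrete, here is a minimal sketch of how a service might list the @context references carried by a POSTed document before handing it to a parser. It uses only the standard library and does not call rdflib; the helper name iter_context_refs, the payload, and the internal address 10.0.0.5 are all made up for illustration.

```python
import json
from typing import Any, Iterator
from urllib.parse import urlsplit


def _context_strings(value: Any) -> Iterator[str]:
    """Yield the string references in a single @context value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, str):
                yield item


def iter_context_refs(node: Any) -> Iterator[str]:
    """Yield every string-valued @context reference anywhere in a JSON document.

    Hypothetical helper, not part of rdflib or rdflib-jsonld.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "@context":
                yield from _context_strings(value)
            # Keep descending: nested nodes may carry their own @context.
            yield from iter_context_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_context_refs(item)


# A payload of the kind an attacker might POST (all URLs are illustrative).
payload = json.loads("""
{
  "@context": [
    "https://www.w3.org/ns/activitystreams",
    "http://10.0.0.5:8080/admin",
    "file:///etc/passwd"
  ],
  "@id": "https://example.org/notification/1",
  "object": {"@context": "http://10.0.0.5:8080/admin?retry=2"}
}
""")

refs = list(iter_context_refs(payload))
schemes = sorted({urlsplit(ref).scheme for ref in refs})
```

Each entry in refs is a location that a parser which dereferences contexts could be asked to fetch; here that covers a public URL, two variants of an internal URL, and a file:// path.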

Problem

rdflib provides no way to control how external references are resolved, nor a way to implement caching of external resources.

An implementor should be able to:

add URLs to a safelist, if e.g. they only expect certain JSON-LD contexts to be used
provide local copies of remote resources, to obviate needing to make HTTP requests
hook in a caching mechanism

These things should either be possible directly, or there should be an obvious way to hook them in.
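
As a rough illustration of those three capabilities, the sketch below wraps an arbitrary fetch function with a safelist, a table of local copies, and an in-memory cache. GuardedFetcher and its parameters are hypothetical names; nothing like this exists in rdflib today, which is the point of the issue.

```python
from typing import Callable, Dict, Iterable, Optional


class GuardedFetcher:
    """Hypothetical wrapper showing safelisting, local copies, and caching."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        safelist: Iterable[str] = (),
        local_copies: Optional[Dict[str, bytes]] = None,
    ) -> None:
        self._fetch = fetch
        self._safelist = frozenset(safelist)
        self._local = dict(local_copies or {})
        self._cache: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        # 1. Local copies are served without any network access.
        if url in self._local:
            return self._local[url]
        # 2. Anything not explicitly safelisted is refused.
        if url not in self._safelist:
            raise PermissionError(f"refusing to dereference {url!r}")
        # 3. Safelisted URLs are fetched once, then served from the cache.
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

The fetch callable is injected so that the policy can be exercised without touching the network; a real integration would need a hook inside rdflib to route context lookups through something like get().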

Resolution

A new Resolver base class should be added that takes responsibility for resolving external references and returning InputSource instances, probably by encapsulating the create_input_source() behaviour in a resolve() method. There should be a default implementation, called e.g. DefaultResolver, that resolves everything. This resolver could take an instantiation parameter like resolve_schemes=('file', 'http', 'https') so that it's easy to turn off dereferencing.
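
A minimal sketch of what that could look like follows. To stay self-contained it returns raw bytes rather than an InputSource, and the opener parameter and ResolutionForbidden exception are assumptions added for illustration; in rdflib itself resolve() would presumably delegate to create_input_source() instead.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable
from urllib.parse import urlsplit
from urllib.request import urlopen


class ResolutionForbidden(Exception):
    """Raised when a resolver refuses to dereference a location (hypothetical)."""


class Resolver(ABC):
    """Proposed base class: turns an external reference into readable content."""

    @abstractmethod
    def resolve(self, location: str) -> bytes:
        ...


def _read_url(location: str) -> bytes:
    # Stand-in for create_input_source(); reads the whole resource into memory.
    with urlopen(location, timeout=10) as response:
        return response.read()


class DefaultResolver(Resolver):
    """Resolves any location whose scheme is listed in resolve_schemes."""

    def __init__(
        self,
        resolve_schemes: Iterable[str] = ("file", "http", "https"),
        opener: Callable[[str], bytes] = _read_url,
    ) -> None:
        self.resolve_schemes = frozenset(s.lower() for s in resolve_schemes)
        self._opener = opener

    def resolve(self, location: str) -> bytes:
        scheme = urlsplit(location).scheme.lower()
        # Scheme-less (relative) references are rejected in this sketch;
        # a real implementation would need to decide how to treat them.
        if scheme not in self.resolve_schemes:
            raise ResolutionForbidden(f"scheme {scheme!r} not allowed: {location!r}")
        return self._opener(location)
```

With this shape, DefaultResolver(resolve_schemes=()) would turn off dereferencing entirely, and DefaultResolver(resolve_schemes=("https",)) would block file:// and plain http:// lookups while leaving https:// working.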

An optional resolver argument should be added to Graph.parse(), so that implementors can override the default behaviour. This is then passed down to the Parser.parse() plugin implementation, defaulting to an instance of DefaultResolver if not specified.

Finally, rdflib-jsonld can be updated to use the resolver instead of create_input_source directly.

Maybe there should also be a way to install a global default resolver to easily implement these protections without having to track down every Graph.parse() call.
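
One possible shape for the optional argument plus a global default is sketched below. The names set_default_resolver, get_default_resolver, and parse_with_resolver are hypothetical, and a resolver is modelled as a plain callable returning bytes to keep the example independent of rdflib; parse_with_resolver stands in for the lookup that Graph.parse() would perform.

```python
from typing import Callable, Optional
from urllib.request import urlopen

ResolverFn = Callable[[str], bytes]


def _resolve_everything(location: str) -> bytes:
    # Mirrors the "resolves everything" default described in the proposal.
    with urlopen(location, timeout=10) as response:
        return response.read()


_default_resolver: ResolverFn = _resolve_everything


def set_default_resolver(resolver: ResolverFn) -> None:
    """Install a process-wide default resolver."""
    global _default_resolver
    _default_resolver = resolver


def get_default_resolver() -> ResolverFn:
    return _default_resolver


def parse_with_resolver(location: str, resolver: Optional[ResolverFn] = None) -> bytes:
    """Use the explicit resolver if given, otherwise the global default."""
    active = resolver if resolver is not None else get_default_resolver()
    return active(location)
```

An application could then call set_default_resolver() once at start-up with a restrictive resolver, and individual call sites would only pass resolver= where they need different behaviour.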

Happy to put together a PR if/when an approach is agreed.