URLInputSource can be abused to retrieve arbitrary documents if used naïvely · Issue #1844 · RDFLib/rdflib
Discussed in #1543
Originally posted by alexdutton July 20, 2021
This is mostly related to rdflib-jsonld, but the dereferencing implementation is in rdflib, hence raising it here.
Scenario
If a web service takes POSTed JSON-LD data, e.g. as part of a Linked Data Notifications implementation, rdflib will attempt to resolve any URL in the `@context`. This can lead to:
- attackers being able to probe internal networks, by having rdflib request potential non-public URLs
- reflection attacks, if the same or slightly different URLs are repeated multiple times in the `@context`
- resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted (admittedly an rdflib-jsonld issue)
- denial of service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete
- attackers being able to probe the local filesystem using `file://` URLs
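To make the scenario concrete, here is a rough sketch of a hostile payload and a check that flags the dangerous entries in it. The helper names `context_urls` and `looks_risky` are made up for illustration and are not part of rdflib, and the check only catches IP literals: it does no DNS resolution, so a hostname that points at an internal address would slip through.

```python
import ipaddress
from urllib.parse import urlsplit


def context_urls(document):
    """Return the string entries of a JSON-LD document's top-level @context."""
    context = document.get("@context", [])
    if not isinstance(context, list):
        context = [context]
    # This sketch ignores inline context objects (dicts), which can
    # themselves nest further references.
    return [entry for entry in context if isinstance(entry, str)]


def looks_risky(url):
    """Flag non-HTTP(S) schemes and loopback/private/link-local IP literals."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True
    try:
        address = ipaddress.ip_address(parts.hostname or "")
    except ValueError:
        return False  # a hostname, not an IP literal; DNS is not checked here
    return address.is_private or address.is_loopback or address.is_link_local


# A payload an attacker might POST: one legitimate context, two probes.
payload = {
    "@context": [
        "https://www.w3.org/ns/activitystreams",
        "http://169.254.169.254/latest/meta-data/",
        "file:///etc/passwd",
    ],
    "@type": "Announce",
}
risky = [url for url in context_urls(payload) if looks_risky(url)]
```

Without some control over resolution, all three entries would be dereferenced when the payload is parsed.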
Problem
rdflib provides no way to control how external references are resolved, nor a way to implement caching of external resources.
An implementor should be able to:
- add URLs to a safelist, if e.g. they only expect certain JSON-LD contexts to be used
- provide local copies of remote resources, to obviate needing to make HTTP requests
- hook in a caching mechanism
These things should either be possible directly, or there should be an obvious way to hook them in.
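As a rough illustration of those three capabilities working together, something like the following would cover a safelist, local copies and a simple in-memory cache. `ContextLoader`, `ContextNotAllowed` and the injected `fetch` callable are hypothetical names for this sketch; nothing like them exists in rdflib today.

```python
class ContextNotAllowed(Exception):
    """Raised when a URL may not be loaded (hypothetical name)."""


class ContextLoader:
    """Hypothetical helper combining a safelist, local copies and a cache."""

    def __init__(self, safelist, local_copies=None, fetch=None):
        self.safelist = frozenset(safelist)
        self.local_copies = dict(local_copies or {})
        self.fetch = fetch  # callable(url) -> bytes, used only on a cache miss
        self.cache = {}

    def load(self, url):
        # 1. Safelist: refuse anything that is not explicitly expected.
        if url not in self.safelist:
            raise ContextNotAllowed(url)
        # 2. Local copy: no HTTP request needed at all.
        if url in self.local_copies:
            return self.local_copies[url]
        # 3. Cache: fetch at most once per URL.
        if url not in self.cache:
            if self.fetch is None:
                raise ContextNotAllowed(f"no fetcher configured for {url}")
            self.cache[url] = self.fetch(url)
        return self.cache[url]


# Usage with a fake fetcher, so the example performs no network I/O.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b'{"@context": {}}'


loader = ContextLoader(
    safelist={"https://example.org/a.jsonld", "https://example.org/b.jsonld"},
    local_copies={"https://example.org/a.jsonld": b'{"@context": {"name": "http://schema.org/name"}}'},
    fetch=fake_fetch,
)
```

Taking the fetcher as a parameter keeps the policy (what may be loaded) separate from the transport (how it is loaded), which is the kind of hook this issue is asking for.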
Resolution
A new `Resolver` base class should be added that takes responsibility for resolving external references and returning `InputSource` instances, probably encapsulating the `create_input_source()` behaviour in a `resolve()` method. There should be a default implementation that resolves everything, called e.g. `DefaultResolver`. Maybe this resolver has an instantiation parameter like `resolve_schemes=('file', 'http', 'https')` so it's easy to turn off dereferencing.
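A minimal sketch of what that could look like, to anchor the discussion. None of this exists in rdflib yet: `Resolver`, `DefaultResolver`, `resolve_schemes` and `resolve()` follow the proposal above, while `ResolutionError` and the `open()` hook are extra names invented for the sketch. The standard library's `xml.sax.xmlreader.InputSource` stands in for rdflib's `InputSource`, and `open()` deliberately performs no I/O.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlsplit
from xml.sax.xmlreader import InputSource  # stand-in for rdflib's InputSource


class ResolutionError(Exception):
    """Raised when a reference may not be dereferenced (hypothetical name)."""


class Resolver(ABC):
    @abstractmethod
    def resolve(self, location):
        """Return an InputSource for `location`, or raise ResolutionError."""


class DefaultResolver(Resolver):
    def __init__(self, resolve_schemes=("file", "http", "https")):
        self.resolve_schemes = frozenset(resolve_schemes)

    def resolve(self, location):
        scheme = urlsplit(location).scheme
        if scheme not in self.resolve_schemes:
            raise ResolutionError(f"scheme {scheme!r} is not allowed: {location}")
        return self.open(location)

    def open(self, location):
        # Assumption: a real implementation would delegate to rdflib's
        # create_input_source() here. The sketch only records the location.
        return InputSource(location)


# Turning off file:// dereferencing is then a one-line change.
web_only = DefaultResolver(resolve_schemes=("http", "https"))
source = web_only.resolve("https://www.w3.org/ns/activitystreams")
```

Passing an empty tuple for `resolve_schemes` would disable dereferencing entirely, and a subclass could override `resolve()` to add a safelist or a cache.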
An optional `resolver` argument should be added to `Graph.parse()`, so that implementors can override the default behaviour. This is then passed down to the `Parser.parse()` plugin implementation, defaulting to an instance of `DefaultResolver` if not specified.
Finally, rdflib-jsonld can be updated to use the `resolver` instead of calling `create_input_source` directly.
Maybe there should also be a way to install a global default resolver to easily implement these protections without having to track down every `Graph.parse()` call.
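One possible shape for the lookup order, purely as a sketch: an explicit `resolver` argument wins, then a process-wide default, then a resolve-everything fallback. Every name here (`set_default_resolver`, `get_resolver`, `AllowAll`, `DenyAll`) is hypothetical, and the `parse()` function is a stand-in that only shows which resolver would be selected; it is not `Graph.parse()`.

```python
_default_resolver = None  # module-level default; None means "resolve everything"


class AllowAll:
    """Stand-in resolver that permits every location."""

    def resolve(self, location):
        return location


class DenyAll:
    """Stand-in resolver that refuses every location."""

    def resolve(self, location):
        raise PermissionError(f"dereferencing is disabled: {location}")


def set_default_resolver(resolver):
    """Install `resolver` process-wide and return the previous one."""
    global _default_resolver
    previous, _default_resolver = _default_resolver, resolver
    return previous


def get_resolver(resolver=None):
    """Explicit argument wins, then the installed default, then AllowAll."""
    if resolver is not None:
        return resolver
    if _default_resolver is not None:
        return _default_resolver
    return AllowAll()


def parse(location, resolver=None):
    """Stand-in for Graph.parse(): only shows how the resolver is selected."""
    return get_resolver(resolver).resolve(location)
```

An application could call `set_default_resolver(DenyAll())` once at startup and then opt individual call sites back in by passing a more permissive `resolver`. Returning the previous default makes it easy to restore in tests.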
Happy to put together a PR if/when an approach is agreed.