Add ability to detect and mark ill-typed literals by ashleysommer · Pull Request #1773 · RDFLib/rdflib

Regarding naming: I'm using "ill-formed" because the term "well-formed" is used in the PySHACL spec document, in the context of describing a "well-formed literal". A Literal that is not "well-formed" is one where the lexical form does not match the datatype. Therefore, I concluded that a Literal that is not "well-formed" could be called "ill-formed". But maybe "ill-typed" is a better description.

EDIT: I just saw that indeed the language in the W3C SHACL test suite refers to these literals as "ill-formed".

Okay, happy with name then.

When I started this patch, my assumption was that if we cannot prove the Literal is ill-formed, then it must be well-formed. Thus ill_formed is False until proven to be True by the checks. But thinking about it since then, you're right, maybe it should be a three-valued property, with None being the default. If there are built-in tests, then it can be marked as True or False. Otherwise it is None, which means unknown.

This is actually even a bit more hairy now that I think about it. We have the following scenarios, as I interpret RDF 1.1 Concepts and Abstract Syntax / 3.3 Literals:

  1. The Literal's datatype is not a recognized datatype IRI: In this case the concept is not well defined, as it is only defined for recognized datatype IRIs.
    In this case I would say the reasonable value for Literal.ill_formed is None.
  2. The Literal's datatype is a recognized datatype IRI: In this case the value should ideally only ever be True or False, and I think it is reasonable to conclude that the Literal is NOT ill_typed unless we can show it is ill_typed. However, there may be cases we are not yet sure about, for which Literal.ill_formed could also be None.
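
To make the two scenarios concrete, here is a minimal stdlib-only sketch of the decision logic. It is not rdflib's implementation: the `RECOGNIZED` registry, the `_is_integer` and `_is_boolean` checkers, and the free function `ill_formed` are hypothetical names for illustration, and the lexical checks are deliberately simplified.

```python
from typing import Callable, Dict, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


def _is_integer(lexical: str) -> bool:
    # Simplified xsd:integer check: optional sign followed by ASCII digits.
    body = lexical[1:] if lexical[:1] in ("+", "-") else lexical
    return body.isascii() and body.isdigit()


def _is_boolean(lexical: str) -> bool:
    # The xsd:boolean lexical space is exactly these four strings.
    return lexical in {"true", "false", "1", "0"}


# Hypothetical registry: datatype IRI -> lexical-form checker.
# Only datatypes listed here count as "recognized" in this sketch.
RECOGNIZED: Dict[str, Callable[[str], bool]] = {
    XSD + "integer": _is_integer,
    XSD + "boolean": _is_boolean,
}


def ill_formed(lexical: str, datatype: Optional[str]) -> Optional[bool]:
    """Return True/False for recognized datatypes, None otherwise."""
    checker = RECOGNIZED.get(datatype) if datatype is not None else None
    if checker is None:
        # Scenario 1: not a recognized datatype IRI, so the answer is unknown.
        return None
    # Scenario 2: recognized datatype, so the checker decides.
    return not checker(lexical)
```

With this sketch, `ill_formed("abc", XSD + "integer")` is `True`, `ill_formed("42", XSD + "integer")` is `False`, and any datatype missing from the registry yields `None`.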

I made a PR to change ill_typed to be three-valued: ashleysommer#7

The logic it follows is in a docstring:

property ill_formed: Optional[bool]

For recognized datatype IRIs, this value will be True if the literal is ill formed, otherwise it will be False. Literal.value (i.e. the literal value) should always be defined if this property is False, but should not be considered reliable if this property is True.

If the literal’s datatype is not in the set of recognized datatype IRIs this value will be None.
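
One consequence of an Optional[bool] property is that callers should compare by identity rather than rely on truthiness, since `not flag` treats False (well-formed) and None (unknown) the same way. A small sketch of that pattern follows; `describe` is a hypothetical helper, not part of rdflib.

```python
from typing import Optional


def describe(ill_formed: Optional[bool]) -> str:
    # Identity checks keep False ("well-formed") distinct from None ("unknown").
    if ill_formed is True:
        return "ill-formed"
    if ill_formed is False:
        return "well-formed"
    return "unknown"


# A plain truthiness test cannot tell the last two cases apart:
# both `not False` and `not None` evaluate to True.
```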

Happy to make any further changes to it if you want, just let me know.