How We Identify Things (on the Semantic Web)

Status

This is ready for public consumption as a discussion piece. I'll try to keep updating it based on feedback I get at sandro@w3.org. You may want to cc: www-rdf-interest@w3.org.

The Problem

The semantic web works by sending around little statements which tell people and machines about the relationships between objects. The objects and even the relationships are identified by text strings, which often look a lot like web addresses. How do these strings correspond to the things actually being discussed in the statements? How do people (and possibly even machines) learn about this correspondence?

Techniques

There seem to be five main techniques, which I call "slash", "hash", "variables", "minting", and "TDB", as well as a few interesting less-understood ones. Each of these is a set of conventions for people to use in identifying things about which they are formally expressing knowledge. Some qualities are summarized in this table, and then they are discussed in more detail below.

| | Slash | Hash | Variables | Minting | TDB |
| --- | --- | --- | --- | --- | --- |
| Typical Syntax: | HTTP URL | HTTP URI-Reference (URL#fragment) | "blank" node in RDF graph | URN with no a priori semantics (tag, uuid) | TDB (thing-described-by) URN |
| Example: | http://.../Creator | ...22-rdf-syntax-ns#type | [contact:homePageAddress "http://www.w3.org"] | tag:w3.org,2001:rdf:type | tdb:2001:http://.../Creator |
| Denotation: | Either web content or other object | object which might be described in web content | object described in any content which uses the symbol | object | object which was described in web content at some point in time |
| Meaning can change: | Yes | Yes | Not unless other meanings change | Yes | No |
| Clickable: | Yes | Yes | No | No | No |
| Authority Pointer: | Yes | Yes | No | No | Optional |

In general, machines don't care which of these is being used, since they don't actually understand anything beyond the knowledge we are formally expressing. They just compare symbols for equality with other symbols they know -- they shouldn't recognize them from some other domain of interaction. To a person, the symbol tag:sandro@w3.org,2001:The_movie_called_Star_Wars may be vastly more evocative than uuid:0fc671c0-ae9a-11d5-989b-0050ba4812a6, but all a semantic web agent will know about either string is exactly what it was told.
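As a rough illustration of that point, here is a minimal Python sketch of an agent that holds statements as triples of opaque strings. The property name ex:title and the facts themselves are made up for the example; the only operation the agent performs on a symbol is an exact string comparison.

```python
TAG = "tag:sandro@w3.org,2001:The_movie_called_Star_Wars"
UUID = "uuid:0fc671c0-ae9a-11d5-989b-0050ba4812a6"

# Everything this toy agent has been told. "ex:title" is a hypothetical
# property, and both statements are invented for illustration.
STATEMENTS = {
    (TAG, "ex:title", "Star Wars"),
    (UUID, "ex:title", "Star Wars"),
}


def known_about(symbol):
    """Return the (predicate, object) pairs asserted about exactly this symbol."""
    return {(p, o) for (s, p, o) in STATEMENTS if s == symbol}


# The evocative symbol and the opaque one carry the same knowledge,
# because the agent was told the same thing about each.
print(known_about(TAG) == known_about(UUID))  # True

# A string that merely looks similar is a different symbol altogether.
print(known_about(TAG.lower()))  # set()
```

Nothing in the sketch reads the words inside the tag identifier, which is the sense in which the agent knows only what it was told.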

There are some exceptions, in that some techniques embed information in a standardized way. Slash, hash, and TDB all embed a URI which may be automatically usable to retrieve some content. Hash users generally expect the content will be some kind of authoritative or definitional information about the object. With all the techniques, relationships between objects and URIs with definitional content may be stated explicitly, but without embedded information the authority identification will be missing.
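The embedded information can be pulled out mechanically. The Python sketch below is one possible way to do it, not a defined algorithm: the function name embedded_uri is hypothetical, and the TDB layout tdb:<date>:<uri> is only inferred from the example in the table above.

```python
from typing import Optional
from urllib.parse import urldefrag


def embedded_uri(symbol: str) -> Optional[str]:
    """Return the retrievable URI embedded in a symbol, or None if there is none."""
    if symbol.startswith(("http://", "https://")):
        # Slash and Hash: dropping any fragment leaves the address to fetch.
        return urldefrag(symbol).url
    if symbol.startswith("tdb:"):
        # TDB, assuming the layout tdb:<date>:<uri> (an assumption of this sketch).
        _scheme, _, rest = symbol.partition(":")
        _date, sep, uri = rest.partition(":")
        return uri if sep and uri else None
    # Minted identifiers (tag:, uuid:) embed nothing that can be fetched.
    return None


print(embedded_uri("http://example.com/joe#dog"))
print(embedded_uri("tdb:2001:http://example.com/Creator"))
print(embedded_uri("tag:sandro@w3.org,2001-09-20:Taiko"))
```

The None result for the tag identifier corresponds to the missing authority pointer: any link to definitional content would have to be stated explicitly.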

The "Slash" Technique

We use http:// URIs as symbols denoting not just web pages, but people, places, books, etc. This may be the most natural approach for RDF, which started as a way to talk about web pages.

Pros:

Some things (web pages) have obvious identifiers (their URLs).

Cons:

People can get confused about whether the symbol denotes a web page or some other object. (The denotation can generally be disambiguated when the symbol is used in RDF by using the appropriate schema information.)
Symbols have embedded organizational information, which may not be appropriate. How do you create a symbol if you don't have a domain name? Maybe someone lets you use theirs, but does that give them additional authority?
The wording in the relevant RFCs suggests to some people that HTTP URIs cannot denote things like people & places.

See an IRC discussion of the subject

The "Hash" Technique

We use URI-References (URIs with fragment identifiers, like http://example.com/joe#dog) as symbols. The fragment indicates to a web client that it should do something special with a page (in a manner related to its media-type). This may help make it clear that the page itself is not being identified. If the media-type specifies a semantic web language, the identifier is strongly linked to additional formal knowledge.

Variation: to reduce possible confusion and collisions among media-types' uses of fragment identifiers, use a restricted syntax, like ...rdf-syntax-ns**#deref(type)**. This stops us from using the elegant resource="#foo" syntax, however.

Pros:

Ties very naturally into relative URIs, eg the information on the page can just say "#dog".

Cons:

It's still a web address, which can confuse people.
If the media-type is XML, the fragment has a different defined meaning.
The wording in the relevant RFCs might suggest to some people that unless the media-type is properly known and defined, URI-References cannot have meaning.
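The relative-URI point can be tried out with Python's standard urllib.parse module. This is only a sketch: the base address http://example.com/joe is the example used above, and whether anything is actually served there is not assumed.

```python
from urllib.parse import urldefrag, urljoin

# The page a statement appears on (example address from the text above).
base = "http://example.com/joe"

# On that page, the information can just say "#dog".
symbol = urljoin(base, "#dog")

# A web client splits the symbol again: it fetches the page, then
# interprets the fragment according to the page's media-type.
page, fragment = urldefrag(symbol)

print(symbol)
print(page)
print(fragment)
```

The split also shows where the confusion comes from: the part a client retrieves is still an ordinary web address.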

The "Variables" Technique

Use existential variables qualified with a uniqueProperty. In n3 one can write "[ foaf:mbox mailto:sandro@w3.org]", which identifies "the thing which has the mailbox sandro@w3.org", ie me.

Pros:

Very clear semantics, basically side-stepping the whole issue

Cons:

Without clever implementation techniques, this can be a lot slower to process (for machines and people!). (The variation solves this problem.)
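One simple implementation technique is to index descriptions by the value of the uniqueProperty, so that two anonymous descriptions with the same value are treated as describing the same thing. The Python sketch below assumes each description is a flat dictionary with single-valued properties; the ex: properties and the function name are hypothetical.

```python
def merge_by_unique_property(descriptions, unique_property):
    """Merge descriptions that share a value for a unique property.

    Returns a dict keyed by that value. Descriptions lacking the property
    cannot be identified this way and are skipped in this sketch.
    """
    merged = {}
    for description in descriptions:
        key = description.get(unique_property)
        if key is None:
            continue
        merged.setdefault(key, {}).update(description)
    return merged


# Two separate "[ foaf:mbox ... ]" style descriptions (illustrative data).
descriptions = [
    {"foaf:mbox": "mailto:sandro@w3.org", "ex:dogName": "Taiko"},
    {"foaf:mbox": "mailto:sandro@w3.org", "ex:wrote": "How We Identify Things"},
]

things = merge_by_unique_property(descriptions, "foaf:mbox")
print(len(things))
print(things["mailto:sandro@w3.org"])
```

With the index in place, finding "the thing which has the mailbox sandro@w3.org" is a single dictionary lookup rather than a search over every statement.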

The "Minting" Technique

Make up a new never-before-used identifier, using an algorithm like UUID or tag. Add statements as necessary to restrict and document its meaning.

Minting is very similar to using Variables. View minting as Skolemizing, and you realize the only difference, in an asserted RDF graph, is that you can optionally merge the existential nodes if you use minting. If the graph is being used with a different attitude (eg as a pattern in a query), the difference is greater -- Skolemizing loses the information about which terms were variables, so you need to manage that separately. Of course, maybe we should be managing it separately anyway...

Pros:

Easier to create than Hash/Slash

Cons:

Confusion over semantics of merged graphs. (Who gets to define the denotation?)
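To show how little is needed to mint an identifier, here is a short Python sketch. The helper names mint_uuid and mint_tag are hypothetical. Python's uuid module prints the urn:uuid: form rather than the bare uuid: prefix used in the example earlier, and the tag layout follows the tag examples in this document (authority, a comma, a date, a colon, then a name).

```python
import uuid
from datetime import date


def mint_uuid() -> str:
    """Mint a random identifier; Python renders it as 'urn:uuid:...'."""
    return uuid.uuid4().urn


def mint_tag(authority: str, day: date, name: str) -> str:
    """Mint a tag identifier from an authority, a date, and a local name."""
    return f"tag:{authority},{day.isoformat()}:{name}"


print(mint_uuid())
print(mint_tag("sandro@w3.org", date(2001, 9, 20), "Taiko"))
```

Neither function says anything about what the new symbol denotes. That still has to come from the statements added afterwards, which is where the question of who defines the denotation arises.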

The "Thing Described By" Technique

Use tdb URIs, which denote the thing described by the text available via some other (included) URI at a given point in time. Presumably media-type information could be used to distinguish formal descriptions from informal ones.

Pros:

Clear semantics
Addresses change over time

Cons:

How do you retrieve the historical information it names?

Definitions

One technique not well explored is to have the identifier be an object's definition. This is how one might interpret n3's [...] syntax, but not how it's implemented. The distinction from Variables comes in two places:

A definition is a "closed" formula, taken as complete and somehow more important than the formulas in which it might be used, and
The definition is actually a single text string, which can be compared on a character basis. So the defining logical formula must be encoded in some canonical style.

By using a "data" URL, this could look to software exactly like a long Hash-style identifier; the issues about the meaning of a "definition" might be the same as for the contents addressed by a Hash identifier's base URI.
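A very rough Python sketch of the idea follows. The canonical style chosen here (properties sorted by name, rendered in an n3-like bracket form, then percent-encoded into a text/plain data URL) is purely an assumption for illustration, and values containing quote characters are not handled.

```python
from urllib.parse import quote


def definition_id(properties):
    """Encode a definition as one canonical string inside a data: URL.

    Sorting the properties makes the result independent of the order in
    which they were written down (the canonical style is an assumption).
    """
    body = "; ".join(f'{name} "{value}"' for name, value in sorted(properties.items()))
    return "data:text/plain," + quote(f"[ {body} ]", safe="")


a = definition_id({"tag:name": "Taiko", "tag:authorityName": "sandro@w3.org"})
b = definition_id({"tag:authorityName": "sandro@w3.org", "tag:name": "Taiko"})

# Same definition, written in a different order, yields the same identifier.
print(a == b)  # True
```

Because the identifier is a single string, two agents can compare definitions character by character without parsing the formula.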

OIDs

An OID is an ISO standard "object identifier", with its denotation defined (but not necessarily published) by an identified authority.

Just Using Variables (Plus Boot-Strapping)

While many of these techniques can be used simultaneously, the possibilities for confusion get even greater. So it's worth noting that all the other techniques can be cleanly subsumed under the Variables technique, with a little use of one other technique, such as Hash or TDB. This may get us close to a best-of-all-worlds solution.

For example, here's a web page about one of my dogs, Taiko. With the Slash technique, I might just use that address as the identifier for my dog. Using Variables (with some way to bootstrap the predicate contact:homePageAddress), I might identify him with the n3 expression:

[ contact:homePageAddress "http://www.drum.org/~natasha/pets/taiko.html" ]

I could use a tag URI like tag:sandro@w3.org,2001-09-20:Taiko, or I could just say:

[ tag:authorityName "sandro@w3.org"; tag:authorityDate "2001-09-20"; tag:name "Taiko" ]

(Notice that none of those three properties is a uniqueProperty, but the combination of all three is.)
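The correspondence between the tag URI and the three properties can be written out mechanically. The Python sketch below splits a tag identifier of the shape used above into those properties and then treats the combination as a single key. The function names are hypothetical and only this simple shape is handled.

```python
def tag_to_properties(tag_uri):
    """Split tag:<authority>,<date>:<name> into the three bootstrapping properties."""
    scheme, _, rest = tag_uri.partition(":")
    if scheme != "tag":
        raise ValueError(f"not a tag URI: {tag_uri!r}")
    entity, _, name = rest.partition(":")
    authority, _, day = entity.partition(",")
    return {
        "tag:authorityName": authority,
        "tag:authorityDate": day,
        "tag:name": name,
    }


def composite_key(properties):
    """No single property is unique, but the combination of all three is."""
    return (
        properties["tag:authorityName"],
        properties["tag:authorityDate"],
        properties["tag:name"],
    )


taiko = tag_to_properties("tag:sandro@w3.org,2001-09-20:Taiko")
print(taiko)
print(composite_key(taiko))
```

Two descriptions would then be taken to identify the same thing only when all three values match.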

This approach can be used to the exclusion of all others, except that we need a way to name some boot-strapping properties (eg tag:authorityName). It also nicely allows for new approaches not yet thought of.

Use Cases

Let's start with some simple facts: Tim Berners-Lee, Director of the W3C, was born in 1955. Now try to figure out how we work with those facts using the various techniques described above.

Basic Identification

How do you identify the person, organization, year, and relationships?

Discovery & Verification

How might an agent learn these facts? If it had them, how might it attempt to prove/disprove them (or at least gather evidence)?

Disagreement

However you identified the W3C, there might be disagreement. (What is the W3C? Is it the 500+ members? Is it the team? Is it the union of parts of the host sites? Is it the union of the Working Groups? Who exactly created RDF M&S 1.0?) How do you approach these kinds of subtle disagreements in denotation?

The Facts Evolve

What happens when our set of facts changes due to our learning more information? (We're still assuming a monotonic system. I don't think other assumptions interact much with these issues.) Tim Berners-Lee, born in 1955, served as W3C director from 1994 to 2051, when he retired and was replaced by Aaron Swartz.

How would this information be encoded? How might discovery and verification be handled?

Historical Reconstruction

Years later, how do we prove which person was the W3C Director who approved RDF Model & Syntax Version 14 (in the year 2048)? What if we don't trust Aaron or the W3C any more?
