SoftWare Heritage persistent IDentifiers (SWHIDs)

version 1.6, last modified 2021-04-30

Overview#

You can point to objects present in the Software Heritage archive by means of SoftWare Heritage persistent IDentifiers, or SWHIDs for short, which are guaranteed to remain stable (persistent) over time. Their syntax, meaning, and usage are described below. Note that they are identifiers and not URLs, even though URL-based resolvers for SWHIDs are also available.

A SWHID consists of two separate parts: a mandatory core identifier that can point to any software artifact (or “object”) available in the Software Heritage archive, and an optional list of qualifiers that specify the context where the object is meant to be seen and can point to a subpart of the object itself.

Objects come in different types: contents, directories, revisions, releases, and snapshots.

Each object is identified by an intrinsic, type-specific object identifier that is embedded in its SWHID as described below. The intrinsic identifiers embedded in SWHIDs are strong cryptographic hashes computed on the entire set of object properties. Together, these identifiers form a Merkle structure, specifically a Merkle DAG.

See the Software Heritage data model for an overview of object types and how they are linked together. See swh.model.git_objects for details on how the intrinsic identifiers embedded in SWHIDs are computed.

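The optional qualifiers are of two kinds: context qualifiers (origin, visit, anchor, path), which carry information about the context where the object is meant to be seen, and a fragment qualifier (lines), which points to a subpart of the object.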

Syntax#

Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar:

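<identifier> ::= <identifier_core> [ <qualifiers> ] ;

<identifier_core> ::= "swh" ":" <scheme_version> ":" <object_type> ":" <object_id> ;
<scheme_version> ::= "1" ;
<object_type> ::=
    "snp"  (* snapshot *)
  | "rel"  (* release *)
  | "rev"  (* revision *)
  | "dir"  (* directory *)
  | "cnt"  (* content *)
  ;
<object_id> ::= 40 * <hex_digit> ;  (* intrinsic object id, as hex-encoded SHA1 *)
<dec_digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
<hex_digit> ::= <dec_digit> | "a" | "b" | "c" | "d" | "e" | "f" ;

<qualifiers> ::= ";" <qualifier> [ <qualifiers> ] ;
<qualifier> ::=
    <context_qualifier>
  | <fragment_qualifier>
  ;
<context_qualifier> ::=
    <origin_ctxt>
  | <visit_ctxt>
  | <anchor_ctxt>
  | <path_ctxt>
  ;
<origin_ctxt> ::= "origin" "=" <url_escaped> ;
<visit_ctxt> ::= "visit" "=" <identifier_core> ;
<anchor_ctxt> ::= "anchor" "=" <identifier_core> ;
<path_ctxt> ::= "path" "=" <path_absolute_escaped> ;
<fragment_qualifier> ::= "lines" "=" <line_number> ["-" <line_number>] ;
<line_number> ::= <dec_digit> + ;
<url_escaped> ::= (* RFC 3987 IRI *)
<path_absolute_escaped> ::= (* RFC 3987 absolute path *)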

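Where the value of the origin qualifier is an RFC 3987 IRI and the value of the path qualifier is an RFC 3987 absolute path; in either case, all occurrences of ; (and %, as required by the RFC) have been percent-encoded (as %3B and %25 respectively). Other characters can be percent-encoded as well, e.g., to improve the readability and/or embeddability of SWHIDs in other contexts.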
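As an illustration of the grammar, the following Python sketch validates a SWHID and splits it into its parts. The names SWHID_CORE_RE, LINES_RE, QUALIFIER_KEYS, and parse_swhid are hypothetical and not part of any Software Heritage package. The sketch checks the core identifier, the qualifier keys, and the syntax of visit, anchor, and lines values; it does not check that origin and path values are valid RFC 3987 IRIs or paths.

```python
import re
from urllib.parse import unquote

# Core identifier: "swh:1:<object_type>:<40 lowercase hex digits>".
SWHID_CORE_RE = re.compile(r"swh:1:(snp|rel|rev|dir|cnt):[0-9a-f]{40}")
# Value of the "lines" qualifier: one line number or a "start-end" range.
LINES_RE = re.compile(r"[0-9]+(-[0-9]+)?")
QUALIFIER_KEYS = {"origin", "visit", "anchor", "path", "lines"}


def parse_swhid(swhid: str) -> dict:
    """Split a SWHID into object type, object id, and qualifiers.

    Raises ValueError if the string does not follow the grammar.
    """
    core, *raw_qualifiers = swhid.split(";")
    if not SWHID_CORE_RE.fullmatch(core):
        raise ValueError(f"invalid core identifier: {core!r}")
    _, _, object_type, object_id = core.split(":")

    qualifiers = {}
    for raw in raw_qualifiers:
        key, sep, value = raw.partition("=")
        if not sep or key not in QUALIFIER_KEYS:
            raise ValueError(f"invalid qualifier: {raw!r}")
        if key in ("visit", "anchor") and not SWHID_CORE_RE.fullmatch(value):
            raise ValueError(f"{key} must be a core identifier: {value!r}")
        if key == "lines" and not LINES_RE.fullmatch(value):
            raise ValueError(f"invalid line range: {value!r}")
        if key in ("origin", "path"):
            value = unquote(value)  # undo percent-encoding such as %3B
        qualifiers[key] = value

    return {
        "object_type": object_type,
        "object_id": object_id,
        "qualifiers": qualifiers,
    }
```

Splitting on ; before anything else is safe because the grammar requires every literal ; inside a qualifier value to be percent-encoded.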

Semantics#

Core identifiers#

: is used as separator between the logical parts of core identifiers. The swh prefix makes explicit that these identifiers are related to SoftWare Heritage. 1 (<scheme_version>) is the current version of this identifier scheme. Future editions will use higher version numbers, possibly breaking backward compatibility, but without breaking the resolvability of SWHIDs that conform to previous versions of the scheme.

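A SWHID points to a single object, whose type is explicitly captured by <object_type>: snp for snapshots, rel for releases, rev for revisions, dir for directories, and cnt for contents.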

The actual object pointed to is identified by the intrinsic identifier <object_id>, which is a hex-encoded (using lowercase ASCII characters) SHA1 computed on the content and metadata of the object itself; see swh.model.git_objects for how it is computed for each object type.

Qualifiers#

; is used as separator between the core identifier and the optional qualifiers, as well as between qualifiers. Each qualifier is specified as a key/value pair, using = as a separator.

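The following context qualifiers are available: origin (the software origin where the object was found, as an IRI), visit (the core identifier of a snapshot corresponding to a visit of the repository containing the object), anchor (the core identifier of the node relative to which the path is given), and path (the absolute file path from the anchor node to the object).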

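The following fragment qualifier is available: lines, which gives a line number or a range of line numbers (start-end) of interest within the object.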

We recommend equipping identifiers meant to be shared with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the order given above, i.e., origin, visit, anchor, path, lines. Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.

Interoperability#

URI scheme#

The swh URI scheme is registered at IANA for SWHIDs. The present document constitutes the specification of this URI scheme.

Git compatibility#

SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the Git way of computing identifiers for its objects. The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type.

Note that Git compatibility is incidental and is not guaranteed to be maintained in future versions of this scheme (or Git).
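To make the Git compatibility concrete for content objects, here is a minimal Python sketch that computes a swh:1:cnt: identifier the way Git computes a blob identifier: the SHA1 of the header "blob <size>" followed by a NUL byte and the raw bytes. The function name content_swhid is hypothetical; for real use, the swh identify tool remains the reference.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the swh:1:cnt: identifier of a file content.

    Uses the Git blob convention: SHA1 over b"blob <size>\\0" + data.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
```

For the empty content, this yields swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the same value that Git assigns to an empty blob.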

Automatically fixing invalid SWHIDs#

User interfaces may fix invalid SWHIDs by lower-casing the <identifier_core> part of a SWHID if it contains upper-case letters because of user errors or limitations in software displaying SWHIDs.

However, implementations displaying or generating SWHIDs should not rely on this behavior, and must display or generate only valid SWHIDs when technically possible.

User interfaces should show an error when such an automatic fix occurs, so users have a chance to fix their SWHID before pasting it into another interface that does not perform the same corrections. This also makes it easier to understand issues when a case-sensitive qualifier has its casing altered.
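A user interface could implement this behavior along the following lines. This is a sketch only: the function name fix_swhid_case is hypothetical, and it lower-cases just the part before the first ;, leaving all qualifiers (which may be case-sensitive) untouched. The returned flag tells the caller whether a fix was applied, so that an error or warning can be shown.

```python
from typing import Tuple


def fix_swhid_case(swhid: str) -> Tuple[str, bool]:
    """Lower-case the core identifier of a SWHID.

    Returns the (possibly fixed) SWHID and a flag that is True when
    the input had to be changed.
    """
    core, sep, qualifiers = swhid.partition(";")
    fixed = core.lower() + sep + qualifiers
    return fixed, fixed != swhid
```

A caller would typically display the fixed identifier together with a message whenever the flag is True, rather than silently accepting the invalid input.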

Examples#

Core identifiers#

Identifiers with qualifiers#

Implementation#

Computing#

An important property of any SWHID is that its core identifier is intrinsic: it can be computed from the object itself, without having to rely on any third party. An implementation of SWHID that allows doing so locally is the swh identify tool, available from the swh.model Python package under the GPL license. This package can be installed via the pip package manager with the one-liner pip3 install swh.model[cli] on any machine with Python (at least version 3.7) and pip installed (on a Debian or Ubuntu system a simple apt install python3 python3-pip will suffice; see the general instructions for other platforms).

SWHIDs are also automatically computed by Software Heritage for all archived objects as part of its archival activity, and can be looked up via the project Web interface.

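This has various practical implications: an artifact obtained by resolving a SWHID can be verified by recomputing its core identifier and comparing it with the one in the SWHID, and the core identifier of an artifact can be computed before the artifact is archived.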

Choosing what type of SWHID to use#

swh:1:dir: SWHIDs are the most robust SWHIDs, as they can be recomputed from the simplest objects (a directory structure on a filesystem), even when all metadata is lost, without relying on the Software Heritage archive.

Therefore, we advise implementers and users to prefer this type of SWHID over swh:1:rev: and swh:1:rel: to reference source code artifacts.

However, since keeping the metadata is also important, you should add an anchor qualifier to swh:1:dir: SWHIDs whenever possible, so the metadata stored in the Software Heritage archive can be retrieved when needed.

This means, for example, that you should prefer swh:1:dir:a8eded6a2d062c998ba2dcc3dcb0ce68a4e15a58;anchor=swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f over swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f.
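To illustrate why a swh:1:dir: identifier can be recomputed from nothing more than a directory on disk, here is a Python sketch that hashes a directory tree recursively. It assumes the Git tree object layout (entries "<mode> <name>\0<20-byte id>", sorted by name with directories compared as if their name ended in /), in line with the Git compatibility described earlier. The function names and the exact permission handling are assumptions; the sketch hashes every entry it finds, including empty directories and any .git directory, so the swh identify tool remains the reference.

```python
import hashlib
import os
import stat


def _object_id(kind: str, payload: bytes) -> bytes:
    """SHA1 of a Git-style object: b"<kind> <size>\\0" + payload."""
    header = f"{kind} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).digest()


def _directory_id(path: str) -> bytes:
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        mode = os.lstat(full).st_mode
        name_bytes = os.fsencode(name)
        sort_key = name_bytes
        if stat.S_ISLNK(mode):
            # A symlink is hashed as a blob holding its target path.
            perms = b"120000"
            oid = _object_id("blob", os.fsencode(os.readlink(full)))
        elif stat.S_ISDIR(mode):
            perms = b"40000"
            oid = _directory_id(full)
            sort_key = name_bytes + b"/"
        elif stat.S_ISREG(mode):
            # Assumption: any execute bit marks the file as executable.
            perms = b"100755" if mode & 0o111 else b"100644"
            with open(full, "rb") as f:
                oid = _object_id("blob", f.read())
        else:
            continue  # sockets, devices, etc. are skipped
        entries.append((sort_key, perms + b" " + name_bytes + b"\x00" + oid))
    payload = b"".join(entry for _, entry in sorted(entries))
    return _object_id("tree", payload)


def directory_swhid(path: str) -> str:
    """Return a swh:1:dir: identifier for the directory at `path`."""
    return "swh:1:dir:" + _directory_id(path).hex()
```

Each directory identifier depends on the identifiers of its entries, which is the Merkle structure mentioned in the overview: changing a single byte in any file changes the identifier of every directory above it.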

Resolvers#

Software Heritage resolver#

SWHIDs can be resolved using the Software Heritage Web interface. In particular, the root endpoint / can be given a SWHID and will lead to the browsing page of the corresponding object, like this: https://archive.softwareheritage.org/<identifier>.

A dedicated /resolve endpoint of the Software Heritage Web API is also available to programmatically resolve SWHIDs; see: GET /api/1/resolve/(swhid)/.
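The two endpoints described above can be addressed from Python as sketched below. The helper names are hypothetical; the URL patterns are the ones given in this section. The resolve function performs a network request and returns the decoded JSON response as-is, without assuming any particular fields in it.

```python
import json
import urllib.request

ARCHIVE_URL = "https://archive.softwareheritage.org"


def browse_url(swhid: str) -> str:
    """URL of the browsing page for a SWHID (root endpoint)."""
    return f"{ARCHIVE_URL}/{swhid}"


def resolve_api_url(swhid: str) -> str:
    """URL of the Web API /resolve endpoint for a SWHID."""
    return f"{ARCHIVE_URL}/api/1/resolve/{swhid}/"


def resolve(swhid: str, timeout: float = 10.0) -> dict:
    """Query the /resolve endpoint and return the decoded JSON body.

    Raises urllib.error.HTTPError if the archive rejects the request.
    """
    with urllib.request.urlopen(resolve_api_url(swhid), timeout=timeout) as resp:
        return json.load(resp)
```

Keeping URL construction separate from the network call makes it possible to log or test the URLs without contacting the archive.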

Examples:

Third-party resolvers#

The following third-party resolvers support SWHID resolution:

Note that resolution via Identifiers.org currently only supports core identifiers due to syntactic incompatibilities with qualifiers.

Examples:

References#