Statically-linked C/C++ libraries

The goal is to explore the current situation of crates including statically linked C/C++ libraries and to start a discussion about ways to make it easier to import external code in crates in a secure and reliable manner.

Overview

To get an idea of the extent of this pattern, let's explore the content of crates.io by analyzing the crates with more than 100k downloads on 2022-08-07 (the top 4.7k crates; see the methodology for more details).

There are currently 70 native C/C++ libraries included as git submodules in 58 of the top 4.7k crates. Some of them are widely used, like libz-sys with 20M downloads and 46 reverse dependencies, or libgit2-sys with 11M downloads. Among these crates:

Two main patterns appear:

Note: This only covers the crates containing submodules, but sometimes the code is vendored directly into the repository, like freetype-sys which has a copy of freetype2 sources. In any case, the source becomes part of the crate uploaded to the registry.

Case studies

Let's have a closer look at a few representative crates.

mozjpeg-sys

curl-sys

openssl-src

Issues

A lot of widely-used crates include third-party libraries, with little consistency in how they do it. This causes problems in terms of:

Possible improvements

Just as -sys crates have an official definition in the cargo docs, with a set of recommended practices, a first step could be to write an RFC with similar guidelines for external source crates. This could build upon existing implementations and make it easy for libraries using different patterns to converge. It could then be improved by additional tooling or metadata.

Dedicated -src crates

Having dedicated crates (with the -src suffix for discoverability) seems to have quite a few advantages:

One obvious big drawback is the maintenance overhead.

Consistent feature-based configuration

Ideally there should be a recommended way (through features of -sys crates) to:

There is already a pre-RFC by @kornel to discuss this.
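To make the idea of feature-based configuration more concrete, here is a minimal sketch of how a -sys crate's build script could map features to a linking strategy. The feature names `vendored` and `static`, the `LinkStrategy` type and both functions are assumptions made for illustration. They are not a convention defined by this post or by the pre-RFC. The only Cargo behaviour relied upon is that build scripts see a `CARGO_FEATURE_<NAME>` environment variable for each enabled feature.

```rust
/// How the -sys crate obtains the native library (hypothetical type).
#[derive(Debug, PartialEq)]
enum LinkStrategy {
    /// Build the bundled upstream sources and link them statically.
    Vendored,
    /// Use a library already installed on the system.
    System { static_link: bool },
}

/// Decide the strategy from feature flags.
/// The lookup is passed in as a closure so the decision logic can be
/// exercised without touching the process environment.
fn link_strategy(feature_enabled: impl Fn(&str) -> bool) -> LinkStrategy {
    if feature_enabled("vendored") {
        // Assumed feature name: always build from the embedded sources.
        LinkStrategy::Vendored
    } else {
        // Assumed feature name: prefer static linking of the system library.
        LinkStrategy::System {
            static_link: feature_enabled("static"),
        }
    }
}

/// Lookup meant for use inside build.rs: Cargo exposes each enabled
/// feature as CARGO_FEATURE_<NAME>, upper-cased, with '-' replaced by '_'.
fn cargo_feature(name: &str) -> bool {
    let var = format!("CARGO_FEATURE_{}", name.to_uppercase().replace('-', "_"));
    std::env::var_os(var).is_some()
}
```

In a build script this would be called as `link_strategy(cargo_feature)`. Keeping the decision in one small function is one way to make the behaviour identical across -sys crates, which is the consistency this section asks for.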

Accurate metadata

License

The license of a crate should cover all files included in the crate archive, including external embedded files.

Using a dedicated crate makes this easier, because the licenses of the external code and of the -sys crate can then be documented separately.

Source identification

The other missing piece of information is a way to identify the included software, if possible in a machine-readable manner (CPE, SWID tags, PURL, etc.). It would make it possible to integrate properly with SBOMs, automate CVE detection, automate upstream version updates, etc.

Note that it would be possible to identify statically linked libraries at compile time already, but this does not work on sources only and does not provide a proper software identifier, just a library name.
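As an illustration of what machine-readable identifiers could look like, the sketch below formats a PURL and a CPE 2.3 string from a vendor, a name and a version. The `UpstreamSource` struct and its methods are hypothetical, as are the example values used with them. The sketch assumes the fields only contain plain lowercase alphanumerics and dots, so it skips the escaping that both specifications require for other characters.

```rust
/// Hypothetical description of the upstream software embedded in a crate.
struct UpstreamSource<'a> {
    vendor: &'a str,
    name: &'a str,
    version: &'a str,
}

impl UpstreamSource<'_> {
    /// Package URL using the "generic" type: pkg:generic/<name>@<version>.
    /// No percent-encoding is applied (assumes simple field values).
    fn purl(&self) -> String {
        format!("pkg:generic/{}@{}", self.name, self.version)
    }

    /// CPE 2.3 formatted string for an application ("a" part), with the
    /// seven fields after the version left as wildcards.
    fn cpe(&self) -> String {
        format!(
            "cpe:2.3:a:{}:{}:{}:*:*:*:*:*:*:*",
            self.vendor, self.name, self.version
        )
    }
}
```

With such identifiers attached to a crate, a tool could match the embedded library against a vulnerability database without guessing from the crate name.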

Versioning

Most existing -src crates use the SemVer build metadata to carry the upstream version. Build metadata is defined as a series of dot-separated identifiers using only ASCII alphanumerics and hyphens, and it is ignored when determining version precedence. Hence, the format is quite flexible, but it cannot be used for actual version comparisons (which need to rely on the base SemVer version).
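The build metadata convention can be sketched with a few standard-library-only helpers. The function names are hypothetical, and the version string `111.22.0+1.1.1q` used with them is only an illustrative value in the `<crate version>+<upstream version>` style, not something stated in this post.

```rust
/// Split a crate version such as "111.22.0+1.1.1q" into its base SemVer
/// part and the optional build metadata (here, the upstream version).
fn split_build_metadata(version: &str) -> (&str, Option<&str>) {
    match version.split_once('+') {
        Some((base, meta)) => (base, Some(meta)),
        None => (version, None),
    }
}

/// Check that build metadata is a series of non-empty, dot-separated
/// identifiers made only of ASCII alphanumerics and hyphens.
fn is_valid_build_metadata(meta: &str) -> bool {
    meta.split('.').all(|id| {
        !id.is_empty() && id.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    })
}

/// Two versions have the same precedence when their base parts are equal,
/// whatever their build metadata says.
fn same_precedence(a: &str, b: &str) -> bool {
    split_build_metadata(a).0 == split_build_metadata(b).0
}
```

For example, `same_precedence("111.22.0+1.1.1q", "111.22.0+1.1.1p")` is true: a change that only touches the build metadata is not an ordered upgrade, which is why the base version still has to be bumped for each upstream release.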

Using the upstream version directly as the crate version would cause some trouble:

The upstream version could also be stored as separate metadata (maybe as part of the source software identifier), but this would make it invisible in most use cases.

Source embedding

There are two ways:

Improved tooling

Some cargo-based tooling could learn to detect -src crates and implement special handling (extract upstream version, etc.), maybe using additional metadata.

It could also provide automation to alleviate the maintenance burden (automated PRs for upstream version updates, security advisories based on CVEs, etc.).

And now?

crates.io is a widely used repository of C/C++ libraries, providing a great experience for Rust developers who rely on them. But the current usage patterns have shortcomings, and are not a great fit for current software supply-chain security and traceability needs.

I'm particularly interested in feedback from maintainers of -sys and -src crates about how they handle upstream libraries, how this could be improved, and their opinion on the issues discussed here.

Potential next steps:

Creating a project group could help coordinate future work on this topic.

Thanks to @Shnatsel for feedback on the initial draft of this post, and to @tofay for feedback on software identifiers for SBOM.