[RFC] File system sandboxing in Clang/LLVM

TL;DR

This RFC proposes enabling a file system sandbox by default in developer builds. The sandbox enforces use of vfs::FileSystem instead of direct llvm::sys::fs and llvm::MemoryBuffer::getFile*() calls, which prevents new code from bypassing file system virtualization.

Background

In 2014, Clang gained the ability to virtualize file system reads through the vfs::FileSystem interface, mainly in support of the new -ivfsoverlay option. In 2018, the API was lifted from Clang into LLVM for use by other projects.

Until recently, many parts of Clang and LLVM were still using the llvm::sys::fs and llvm::MemoryBuffer::getFile*() APIs directly instead of going through the interface. This inconsistency means that different parts of the compiler see the file system differently.
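
To make the inconsistency concrete, here is a minimal standalone sketch in plain C++. It does not use the LLVM APIs; the FileSystem, RealFileSystem and OverlayFileSystem classes and the path are hypothetical stand-ins for a virtualized view and a direct read. A file that exists only in the overlay is visible through the interface and invisible to code that goes to disk directly.

```cpp
#include <cassert>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

namespace sketch {

// Hypothetical abstract view of the file system (not the LLVM interface).
class FileSystem {
public:
  virtual ~FileSystem() = default;
  // Returns true and fills Out if Path can be read through this view.
  virtual bool read(const std::string &Path, std::string &Out) const = 0;
};

// Reads straight from disk, like a direct low-level call would.
class RealFileSystem : public FileSystem {
public:
  bool read(const std::string &Path, std::string &Out) const override {
    std::ifstream In(Path, std::ios::binary);
    if (!In)
      return false;
    std::ostringstream Buffer;
    Buffer << In.rdbuf();
    Out = Buffer.str();
    return true;
  }
};

// Serves virtual files first and falls back to the underlying view.
class OverlayFileSystem : public FileSystem {
  const FileSystem &Underlying;
  std::map<std::string, std::string> Files;

public:
  explicit OverlayFileSystem(const FileSystem &Base) : Underlying(Base) {}

  void addFile(const std::string &Path, const std::string &Contents) {
    Files[Path] = Contents;
  }

  bool read(const std::string &Path, std::string &Out) const override {
    auto It = Files.find(Path);
    if (It != Files.end()) {
      Out = It->second;
      return true;
    }
    return Underlying.read(Path, Out);
  }
};

// Two parts of one program disagree about whether the same path exists.
void demo() {
  RealFileSystem Real;
  OverlayFileSystem Overlay(Real);
  const std::string Path = "/sandbox-sketch-nonexistent/virtual.h";
  Overlay.addFile(Path, "#define X 1\n");

  std::string ViaInterface, Direct;
  assert(Overlay.read(Path, ViaInterface)); // virtualized view finds the file
  assert(!Real.read(Path, Direct));         // direct read does not
}

} // namespace sketch
```

The sketch assumes the chosen path does not exist on the machine running it.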

Over the past couple of weeks, I’ve been working towards Clang and LLVM having a consistent view of the file system by adopting the vfs::FileSystem interface. Now, I would like to ensure that no new direct uses of llvm::sys::fs and llvm::MemoryBuffer::getFile*() make it into the compiler.

Proposal

Set up a sandbox for file system reads within Clang itself. The idea is that for developer builds, the compiler aborts whenever a direct use of the discouraged APIs occurs, forcing LLVM/Clang developers to use the vfs::FileSystem interface instead. This behavior can be turned off with the CMake option -DLLVM_ENABLE_IO_SANDBOX=NO if needed.

Motivation

My team at Apple is working on sound compilation caching in Clang. Our implementation relies on a fast scanning step that builds up a CAS database containing all the input files necessary for compilation. During the compilation itself, Clang loads the database and the contained file system snapshot is exposed via the vfs::FileSystem API.

Not using vfs::FileSystem consistently in the compiler breaks CAS-based caching. For example, in a distributed setting, circumventing the CAS database means going to the actual builder file system that wasn’t set up with the expected files, resulting in compilation failures or unexpected compiler behavior.

Implementation

My PR #165350 contains an initial implementation that calls reportFatalInternalError() whenever the discouraged APIs are used directly. This behavior is enabled by default for assert builds and can be controlled with the new CMake option -DLLVM_ENABLE_IO_SANDBOX.

The sandbox is currently only applied to clang -cc1 and clang -cc1as invocations; the Clang driver itself and other binaries remain unaffected for now.
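
The general shape of such a check can be sketched in standalone C++. This is an illustration of the idea rather than the code in the PR: all names below are hypothetical, the flag is assumed to be per-thread, and the sketch throws an exception where the real implementation would report a fatal internal error, so that the effect can be observed without aborting the process.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Per-thread switch. Off by default, so tools that never opt in are unaffected.
thread_local bool SandboxEnabled = false;

// Called first thing in every discouraged low-level entry point.
// The PR reports a fatal internal error here; this sketch throws instead.
void violationIfEnabled() {
  if (SandboxEnabled)
    throw std::logic_error("sandbox violation: direct file system access");
}

// Hypothetical stand-in for a discouraged API such as a direct file read.
std::string directRead(const std::string &Path) {
  violationIfEnabled();
  return "<contents of " + Path + ">";
}

// Hypothetical stand-in for a -cc1 entry point: the sandbox is on for the
// whole job and switched back off afterwards, even if the job throws.
// Returns true if the job finished without a violation.
bool runFrontendJob(const std::string &Input) {
  struct EnableForScope {
    EnableForScope() { SandboxEnabled = true; }
    ~EnableForScope() { SandboxEnabled = false; }
  } Guard;
  try {
    directRead(Input);
  } catch (const std::logic_error &) {
    return false;
  }
  return true;
}

void demo() {
  // Outside a frontend job (think: the driver), direct reads still work.
  assert(directRead("a.c") == "<contents of a.c>");
  // Inside a frontend job, the same call is flagged as a violation.
  assert(!runFrontendJob("a.c"));
  // The flag is switched off again once the job ends.
  assert(directRead("a.c") == "<contents of a.c>");
}

} // namespace sketch
```

Keeping the check inside the low-level entry point means every caller is covered without having to be modified, which is why only code that reaches those entry points while the flag is set is affected.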

By turning on the sandbox for local development and in pre-merge CI, we can prevent Clang/LLVM developers from introducing more non-virtualized file system reads into the codebase. Over time, we can work on eliminating the remaining violations that are not reachable by running the check-clang target.

Developer Impact

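When the sandbox is enabled (the default for assert builds), a direct call to one of the discouraged APIs from clang -cc1 or clang -cc1as aborts the compiler with a fatal internal error.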

This should not affect release builds or production use.

Bypass Mechanism

For specific areas where we are virtualizing compiler inputs in a different way (such as the module cache), and for incremental adoption, there is a bypass mechanism that disables the sandbox for the duration of a local scope.

This allows intentional direct file system access where necessary while keeping it explicit and auditable.

Example of sandbox bypass:

// This is a compiler-internal input/output, let's bypass the sandbox.
auto BypassSandbox = llvm::sys::sandbox::scopedDisable();
auto BufOrErr = llvm::MemoryBuffer::getFile(Path);
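
For readers wondering how a scoped bypass like this can behave when nested inside an enabled region, here is a minimal standalone sketch of an RAII guard that saves the current state and restores it on scope exit. It is an assumption about the general technique, not the PR's implementation: the ScopedRestore class, the Enabled flag and the helper functions are hypothetical and only mirror the naming of the example above.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical per-thread sandbox state.
thread_local bool Enabled = false;

// Sets the state for one scope and restores the previous value on exit.
class ScopedRestore {
  bool Saved;
  bool Active = true; // false once the guard has been moved from

public:
  explicit ScopedRestore(bool NewValue) : Saved(Enabled) { Enabled = NewValue; }

  // Movable so that factory functions can return it by value.
  ScopedRestore(ScopedRestore &&Other) noexcept
      : Saved(Other.Saved), Active(Other.Active) {
    Other.Active = false;
  }

  ScopedRestore(const ScopedRestore &) = delete;
  ScopedRestore &operator=(const ScopedRestore &) = delete;
  ScopedRestore &operator=(ScopedRestore &&) = delete;

  ~ScopedRestore() {
    if (Active)
      Enabled = Saved;
  }
};

ScopedRestore scopedEnable() { return ScopedRestore(true); }
ScopedRestore scopedDisable() { return ScopedRestore(false); }

bool isEnabled() { return Enabled; }

// Returns true if the bypass is off inside its scope and the sandbox is
// back on once that scope ends.
bool demoNestedBypass() {
  auto Sandbox = scopedEnable();
  bool InsideBypass;
  {
    auto BypassSandbox = scopedDisable();
    InsideBypass = isEnabled();
  }
  bool AfterBypass = isEnabled();
  return !InsideBypass && AfterBypass;
}

} // namespace sketch
```

Restoring the saved value, rather than unconditionally re-enabling, keeps the guard correct when it is used in code that may also run with the sandbox off.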

Current Status

Right now, the upstream repo with the sandbox enabled passes the pre-commit checks on all platforms. I was also able to bootstrap an Apple Clang toolchain and build a handful of internal projects with the sandbox enabled. This demonstrates that the sandbox does not break existing tested code paths in multiple build and test configurations.

Migration Path

  1. Phase 1 (this RFC): Enable sandbox in assert builds to prevent new violations.
  2. Phase 2: Identify and fix remaining violations in untested code paths.
  3. Phase 3: Add clang-tidy check for static verification that non-virtualized file system reads in Clang and LLVM only occur with the explicit bypass mechanism applied locally.
  4. Future: Extend sandbox to other tools beyond clang -cc1 and clang -cc1as.
  5. Future: Apply similar approach to file system writes (see below).

File System Writes

Clang has the same problem with compiler outputs / file system writes. Some parts of Clang use the new vfs::OutputBackend interface, but some still create raw_fd_ostream and similar directly. With our caching implementation, this means that replaying a cached compilation may not materialize all outputs.

This should be unified as well, but I consider that a separate follow-up project.

Alternatives Considered

I have considered building a file system sandbox outside of the compiler, but that would most likely result in platform-specific solutions that cannot be broadly applied during development upstream. I believe that building this infrastructure in the most cross-platform way and in the upstream repository provides the greatest benefit to the community.

Questions?

I welcome any feedback and questions on this approach. The key decision point is whether the community is comfortable with enabling the sandbox for assert builds by default, with the understanding that some compilations may fail due to sandbox violations that haven’t been fixed yet in untested code paths. The sandbox can always be disabled with -DLLVM_ENABLE_IO_SANDBOX=NO.

This RFC was accepted.