[RFC] Upstreaming ClangIR

Hey folks,

This RFC proposes upstreaming ClangIR: incorporating the llvm/clangir repo from LLVM’s incubator into the mainstream llvm/llvm-project.


Background

A little over a year ago, an RFC introducing ClangIR was published: a new higher-level IR for C/C++. It’s an MLIR-based C/C++ dialect for Clang, generated from the Clang AST, that can be lowered to other IRs. Check it out for more background and motivation, and see the FAQ below to understand what has changed since. The ClangIR page also contains general information, documentation and usage instructions.

A year of progress

Last October, the talk Evolution of ClangIR was presented at the LLVM Dev Meeting (the video should be available on LLVM’s YouTube channel soon). It explores some aspects of the design and some of the past year’s achievements. Given the progress and the community built around it, I believe that CIR is no longer ‘experimental’ in concept, and the working group (MLIR C/C++ frontend folks) now believes that the dialect and architecture are headed in the right direction.

The project also grew from two contributors in January 2023 to a total of nine by the end of 2023, with four currently active. I also expect that number to increase with upstreaming (see next section). Some of the achievements include:

Why upstream now?

Why is this a good moment in time to include ClangIR into llvm-project?

The project is currently getting contributions from some new interested parties (e.g. see the recent OpenACC RFC), and it’s more convenient for everyone involved if ClangIR collaboration happens directly upstream. Examples include OpenACC bits shared between Clang and Flang, and the SYCL/MLIR effort already using a fork from intel/llvm. It is also more appealing for some of the entities involved to contribute directly to upstream llvm-project instead of a project under the incubator: incorporation into their process becomes an easier step.

ClangIR is young enough to be actively redesigned. Its evolution so far has been driven by the lifetime checker and LLVM lowering, but there’s more to cover on C/C++ language extensions (GPUs, HPC, …), static analysis, debug info, sanitizers, etc. The project is also mature enough not to cause major breakage or churn in the rest of LLVM, and is of the quality one expects from the LLVM infrastructure.

Stakeholders

The conversation about upstreaming ClangIR has already started among interested parties. Here’s a list of community members who would rather see ClangIR upstreamed sooner rather than later:

If you are reading this, and I missed your project (or your support), please chime in!

Implementation Strategy

ClangIR’s development follows some guiding principles:

Source code

Most of the new code is in clang/lib/CIR, clang/include/clang/{CIR,CIRFrontendAction} and clang/test/CIR. Additional changes in the codebase include:

Compiler Flags

From clangir.org:

By passing -fclangir-enable to the clang driver, the compilation pipeline is modified and CIR gets emitted from Clang AST and then lowered to LLVM IR, backend, etc … To get CIR printed out of a compiler invocation the flag -emit-cir can be used to tell the compiler to stop right after CIR is produced.
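As a rough illustration of the two flags quoted above, here is a minimal shell sketch. It assumes a `clang` binary built with ClangIR support; the `CLANG` and `RUN_CIR` variables, the `cir_cmd` helper, the file names and the use of `-o` to name the output are all illustrative assumptions rather than anything taken from the ClangIR documentation. By default the script only prints the command, since a stock clang will reject these flags.

```shell
#!/bin/sh
# Sketch: ask clang to stop right after CIR is produced.
# Assumption: $CLANG points at a clang built with ClangIR enabled.
CLANG="${CLANG:-clang}"

# Hypothetical helper: print the command line for a given .c file.
# -fclangir-enable and -emit-cir come from the RFC text; -o is an assumption.
cir_cmd() {
  printf '%s -fclangir-enable -emit-cir %s -o %s\n' "$CLANG" "$1" "${1%.c}.cir"
}

# Create a tiny input file in a scratch directory.
workdir="$(mktemp -d)"
cat > "$workdir/add.c" <<'EOF'
int add(int a, int b) { return a + b; }
EOF

cmd="$(cir_cmd "$workdir/add.c")"
echo "$cmd"

# Only run the compiler when explicitly requested (RUN_CIR=1).
if [ "${RUN_CIR:-0}" = "1" ]; then
  $cmd && cat "$workdir/add.cir"
fi
```

Running it with `RUN_CIR=1 CLANG=/path/to/clangir/bin/clang` would, under these assumptions, print the generated CIR for `add`.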

ClangIR codegen (CIRGen) and passes are hidden behind flags:

Prefixing clangir in flag names has been our way to mark behavior as experimental, though alternatively these flags could be changed and prefixed with experimental - as done by similarly experimental past projects, e.g. the new pass manager.

Builds

Building ClangIR is optional and is enabled by setting the CMake flag CLANG_ENABLE_CIR. It works very similarly to existing flags like CLANG_ENABLE_ARCMT or CLANG_ENABLE_STATIC_ANALYZER.

Note that CIR test execution is also tied to overall CMake enablement, e.g. ninja check-clang-cir only works if the proper CMake setup is done.
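The build setup described above might look roughly like the following sketch. Only `CLANG_ENABLE_CIR`, the need for MLIR in `LLVM_ENABLE_PROJECTS`, and the `check-clang-cir` target come from this RFC; the source and build directory names, the Ninja generator and the Release build type are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: configure LLVM with ClangIR enabled, then run the CIR tests.
# Assumption: SRC_DIR is the llvm/ directory of an llvm-project checkout.
SRC_DIR="${SRC_DIR:-llvm-project/llvm}"
BUILD_DIR="${BUILD_DIR:-build-cir}"

# The flag named in the RFC; ClangIR is not built without it.
CIR_FLAG="-DCLANG_ENABLE_CIR=ON"

if [ -d "$SRC_DIR" ]; then
  # MLIR must be in the project list because ClangIR is an MLIR dialect.
  cmake -G Ninja -S "$SRC_DIR" -B "$BUILD_DIR" \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS="clang;mlir" \
    "$CIR_FLAG"
  # This target only exists when the CMake setup above enabled CIR.
  ninja -C "$BUILD_DIR" check-clang-cir
else
  echo "skipping: $SRC_DIR not found"
fi
```

Without the CMake flag, the `check-clang-cir` target is not expected to work, which matches the note above.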

Git strategy & Timeline

This is probably a more engaging discussion and I’d prefer to first focus on getting approval on the proposal before tackling this (maybe even on its own RFC). So unless this becomes somehow critical to the decision, perhaps best to wait for a follow up?

FAQ

Is there an easy way to play around with ClangIR?

Yes, Compiler Explorer to the rescue! See an example here: Compiler Explorer. Note that it’s still missing a proper setup with a more recent C++ standard library version, which is needed to play with coroutines and other more modern features.

To what extent has the current design of ClangIR changed since the initial RFC?

The initial design has changed in response to community feedback since then. The top three changes in ClangIR are:

How about the Kleckner criteria (build time footprint)?

Reid Kleckner (@rnk) raised some good questions regarding ClangIR’s compile time footprint. For the “C/C++ → CIR → LLVM” path, we have only been able to gather compile time numbers for the part of the SingleSource tests we’re able to build from the LLVM testsuite - results are noisy though. Unfortunately, it’s not a reliable performance comparison as many of these tests are too small.

For the “C/C++ → CIR → C++ lifetime analysis” path there’s currently no good proxy to compare against, especially given CIR codegen is only done for source files being analyzed (no CIRGen for definitions from headers, only declarations are emitted).

The honest answer is that we don’t have reliable numbers to show just yet. Though it’s also worth mentioning that there are possible compile time benefits unique to MLIR around function pass level parallelism.

How much longer does Clang’s build and testing get?

Time to build: The ClangIR-specific code added was in the noise compared to a build that also built both Clang and MLIR. However, the cost of building MLIR is pretty significant. Adding MLIR to the LLVM_ENABLE_PROJECTS list added, on average, ~45% build time overhead compared to just building Clang. (conf: 2x AMD, 166 cores, 224GB)

Time to run tests (assuming nothing else to build): ninja check-clang-cir reports in ~2s for release builds and ~6s for debug builds (~225 tests. conf: Apple M1 Max laptop, 64GB).

What’s the progress on static analysis?

The lifetime checker is the only current piece in that direction, and it does very simple analysis. It’s capable of catching low-hanging fruit in modern C++, mainly because the higher-level operations and the AST back references are really useful in helping the compiler understand C++. Over the past year we (a subset of the MLIR C/C++ frontend folks) had many discussions with, and guidance from, some of the experts in the community (such as Gabor Horvath, Dmytro Hrybenko and Artem Dergachev). Some open project ideas we’d like to see in the future include: teaching the dataflow analysis framework to use ClangIR, and implementing some of Clang’s CFG-based analyses (e.g., AnalysisBasedWarnings) as CIR passes (this would also be great for compile time evaluation).

Assuming there’s a large amount of code duplication between ClangIR generation (CIRGen) and traditional LLVM IR generation in CodeGen (IRGen), what are the expectations for maintainers? (For example, if someone fixes a bug in IRGen, should they also fix it in CIRGen?)

No. CIRGen follows the general skeleton of IRGen… However, there are no plans to merge both code generators. One area of improvement is about the sharing of AST queries done by both - there are duplicated helpers that gather information from types and other AST properties, and those should be shared. We currently track a bunch of these and plan to send a specific RFC in the future to discuss proper mechanisms to address them.

On the expectations for maintainers: none. If the developers of IRGen want to be helpful they can communicate the new gap, but nothing is required. We’ve been operating as a few people playing catch-up for years now, and we’re fine with that until the community decides it’s worth their time to keep up.

Acknowledgements

Thanks to everyone who contributed PRs, created issues and participated in the C/C++ MLIR frontend meetings. Special thanks to folks who contributed to the project in the past year: Nathan Lanza (@lanza), Vinicius Couto Espindola (@sitio-couto), Hongtao Hu (@htyu), David Olsen, Yury Gribov, Oleg Kamenkov, Henrich Lauko (@xheno), Jeremy Kun (@j2kun), Keyi Zhang, Sirui Mu (@Lancern), Roman Rusyaev (@rusyaev-roman), Zhou (@redbopo), Ivan Murashko (@ivanmurashko), Nikolas Klauser (@philnik) and Fabian Mara Cordero (@fabianmc).

:white_check_mark: RFC accepted in this message.