[GSoC 2024] On Demand Parsing in Clang (original) (raw)
Description of the project: Clang, like any C++ compiler, parses a sequence of characters as they appear, linearly. The linear character sequence is then turned into tokens and AST before lowering to machine code. In many cases the end-user code uses a small portion of the C++ entities from the entire translation unit but the user still pays the price for compiling all of the redundancies.
This project proposes to process the heavy compiling C++ entities upon using them rather than eagerly. This approach is already adopted in Clang’s CodeGen where it allows Clang to produce code only for what is being used. On demand compilation is expected to significantly reduce the compilation peak memory and improve the compile time for translation units which sparsely use their contents. In addition, that would have a significant impact on interactive C++ where header inclusion essentially becomes a no-op and entities will be only parsed on demand.
The Cling interpreter implements a very naive but efficient cross-translation unit lazy compilation optimization which scales across hundreds of libraries in the field of high-energy physics.
// A.h
#include <string>
#include <vector>
template <class T, class U = int> struct AStruct {
void doIt() { /*...*/ }
const char* data;
// ...
};
template<class T, class U = AStruct<T>>
inline void freeFunction() { /* ... */ }
inline void doit(unsigned N = 1) { /* ... */ }
// Main.cpp
#include "A.h"
int main() {
doit();
return 0;
}
This pathological example expands to 37253 lines of code to process. Cling builds an index (it calls it an autoloading map) where it contains only forward declarations of these C++ entities. Their size is 3000 lines of code. The index looks like:
// A.h.index
namespace std{inline namespace __1{template <class _Tp, class _Allocator> class __attribute__((annotate("$clingAutoload$vector"))) __attribute__((annotate("$clingAutoload$A.h"))) __vector_base;
}}
...
template <class T, class U = int> struct __attribute__((annotate("$clingAutoload$A.h"))) AStruct;
Upon requiring the complete type of an entity, Cling includes the relevant header file to get it. There are several trivial workarounds to deal with default arguments and default template arguments as they now appear on the forward declaration and then the definition. You can read more in [1].
Although the implementation could not be called a reference implementation, it shows that the Parser and the Preprocessor of Clang are relatively stateless and can be used to process character sequences which are not linear in their nature. In particular namespace-scope definitions are relatively easy to handle and it is not very difficult to return to namespace-scope when we lazily parse something. For other contexts such as local classes we will have lost some essential information such as name lookup tables for local entities. However, these cases are probably not very interesting as the lazy parsing granularity is probably worth doing only for top-level entities.
Such implementation can help with already existing issues in the standard such as CWG2335, under which the delayed portions of classes get parsed immediately when they’re first needed, if that first usage precedes the end of the class. That should give good motivation to upstream all the operations needed to return to an enclosing scope and parse something.
Implementation approach: Upon seeing a tag definition during parsing we could create a forward declaration, record the token sequence and mark it as a lazy definition. Later upon complete type request, we could re-position the parser to parse the definition body. We already skip some of the template specializations in a similar way [2, 3].
Another approach is every lazy parsed entity to record its token stream and change the Toks stored on LateParsedDeclarations to optionally refer to a subsequence of the externally-stored token sequence instead of storing its own sequence (or maybe change CachedTokens so it can do that transparently). One of the challenges would be that we currently modify the cached tokens list to append an “eof” token, but it should be possible to handle that in a different way.
In some cases, a class definition can affect its surrounding context in a few ways you’ll need to be careful about here:
struct X
appearing inside the class can introduce the nameX
into the enclosing context.static inline
declarations can introduce global variables with non-constant initializers that may have arbitrary side-effects.
For point (2), there’s a more general problem: parsing any expression can trigger a template instantiation of a class template that has a static data member with an initializer that has side-effects. Unlike the above two cases, I don’t think there’s any way we can correctly detect and handle such cases by some simple analysis of the token stream; actual semantic analysis is required to detect such cases. But perhaps if they happen only in code that is itself unused, it wouldn’t be terrible for Clang to have a language mode that doesn’t guarantee that such instantiations actually happen.
Alternative and more efficient implementation could be to make the lookup tables range based but we do not have even a prototype proving this could be a feasible approach.
Expected result:
- Design and implementation of on-demand compilation for non-templated functions
- Support non-templated structs and classes
- Run performance benchmarks on relevant codebases and prepare report
- Prepare a community RFC document
- [Stretch goal] Support templates
The successful candidate should commit to regular participation in weekly meetings, deliver presentations, and contribute blog posts as requested. Additionally, they should demonstrate the ability to navigate the community process with patience and understanding.
Further reading
[1] root/README/README.CXXMODULES.md at master · root-project/root · GitHub
[2] https://github.com/llvm/llvm-project/commit/b9fa99649bc99
[3] https://github.com/llvm/llvm-project/commit/0f192e89405ce
**Project size:**Large
Difficulty: Hard
Confirmed Mentor: Vassil Vassilev,Matheus Izvekov