Modernizing Compiler Design for Carbon's Toolchain
Chandler Carruth
@chandlerc1024
chandlerc@{google,gmail}.com
CppNow 2023
Traditional compiler design
Lexer: text → tokens
- Handles text: characters, encodings, unicode, whitespace, comments, …
- Generally a regular language, and implemented with finite automata
- Produces encapsulated tokens:
- Keywords
- Punctuation & operators
- Identifiers
- Custom lexed literals: numbers, strings
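The token-producing loop above can be sketched in a few lines. This is a minimal, hypothetical lexer, not from any real compiler; all names (`Lex`, `TokenKind`, the keyword set) are invented for illustration:

```cpp
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Minimal hypothetical lexer sketch: a single forward pass over the text,
// producing encapsulated tokens and discarding whitespace.
enum class TokenKind { Keyword, Identifier, Number, Punctuation };

struct Token {
  TokenKind kind;
  std::string text;
};

std::vector<Token> Lex(std::string_view source) {
  std::vector<Token> tokens;
  size_t i = 0;
  while (i < source.size()) {
    unsigned char c = source[i];
    if (std::isspace(c)) {
      ++i;  // Whitespace separates tokens but produces none.
    } else if (std::isalpha(c) || c == '_') {
      size_t start = i;
      while (i < source.size() &&
             (std::isalnum(static_cast<unsigned char>(source[i])) ||
              source[i] == '_')) {
        ++i;
      }
      std::string text(source.substr(start, i - start));
      // A tiny invented keyword set stands in for a real keyword table.
      TokenKind kind = (text == "var" || text == "return")
                           ? TokenKind::Keyword
                           : TokenKind::Identifier;
      tokens.push_back({kind, text});
    } else if (std::isdigit(c)) {
      size_t start = i;
      while (i < source.size() &&
             std::isdigit(static_cast<unsigned char>(source[i]))) {
        ++i;
      }
      tokens.push_back({TokenKind::Number,
                        std::string(source.substr(start, i - start))});
    } else {
      tokens.push_back({TokenKind::Punctuation, std::string(1, c)});
      ++i;
    }
  }
  return tokens;
}
```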
Parser: tokens → AST (Abstract Syntax Tree)
- Handles nesting, structure, and relationships between tokens
- Generally a context-free or context-sensitive language
- Formalized with grammars and/or forms of pushdown automata
- Common grammar categories: LL, LL(k), LR, LR(k), LALR, …
- Loads of theory in this space
- If not using formal grammar, often implemented with a recursive descent parser
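A recursive descent parser maps each nonterminal of the grammar onto one function, so grammar nesting becomes call-stack recursion. A toy sketch for a hypothetical grammar (`expr ::= term ('+' term)*`, `term ::= digit | '(' expr ')'`), with no error handling:

```cpp
#include <memory>
#include <string_view>

// AST node for the toy grammar: '+' for sums, 'n' for number leaves.
struct Node {
  char op;
  int value = 0;  // Set only for leaves.
  std::unique_ptr<Node> lhs, rhs;
};

struct Parser {
  std::string_view input;
  size_t pos = 0;

  // expr ::= term ('+' term)*
  std::unique_ptr<Node> ParseExpr() {
    auto node = ParseTerm();
    while (pos < input.size() && input[pos] == '+') {
      ++pos;  // Consume '+'.
      auto sum = std::make_unique<Node>();
      sum->op = '+';
      sum->lhs = std::move(node);
      sum->rhs = ParseTerm();
      node = std::move(sum);
    }
    return node;
  }

  // term ::= digit | '(' expr ')'
  std::unique_ptr<Node> ParseTerm() {
    if (input[pos] == '(') {
      ++pos;  // Consume '('.
      auto node = ParseExpr();  // Nesting in the grammar => recursion here.
      ++pos;  // Consume ')'.
      return node;
    }
    auto node = std::make_unique<Node>();
    node->op = 'n';
    node->value = input[pos] - '0';
    ++pos;
    return node;
  }
};

int Eval(const Node& n) {
  return n.op == 'n' ? n.value : Eval(*n.lhs) + Eval(*n.rhs);
}
```

Note the recursion in `ParseTerm` for parenthesized expressions; that call-stack dependence is exactly what becomes a problem later.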
Semantic analysis: AST → correct AST
- Handles name resolution, type checking, and validating the input program
- Simple languages mostly reject invalid ASTs from the parser
- More complex languages transform and expand the AST
- Implicit operations
- Instantiation or metaprogramming
- Most complex languages (C++) also have feedback to change the parser… 😡
- Should be the last stage to detect errors in the program
Lowering: AST → IR → … → machine code
- IR: Intermediate Representation
- Handles modeling and optimizing the execution of the program on a machine
- Often iterative, potentially with successively lower level IRs
- Eventually produces machine code that can run on the target machine
- Shouldn’t fail on a valid input
- This is its own whole system beyond the front-end phases (LLVM, etc.)
Historical influences on architecture
- Much of this was formalized around machines & languages from >50 years ago
- Semantic analysis and lowering were fairly straightforward
- Simple language rules (like C, or B)
- Minimal optimization in lowering
- No IRs, just AST → machine code (assembly)
- Focus on lexing, parsing, and ASTs
Imagined direct model
---------
| Lexer |
---------
Imagined direct model
--------- ----------
| Lexer | --> | Parser |
--------- ----------
---^--
Tokens
Imagined direct model
--------- ---------- -------------
| Lexer | --> | Parser | --> | Semantics |
--------- ---------- -------------
---^-- ----^-----
Tokens Parse Tree
Imagined direct model
--------- ---------- ------------- ------------
| Lexer | --> | Parser | --> | Semantics | --> | Lowering |
--------- ---------- ------------- ------------
---^-- ----^----- -^-
Tokens Parse Tree AST
Incremental & lazy parsing design
---------- ------------- ------------
| | -- One function -> | | | |
| Parser | | Semantics | | Lowering |
| | | | | |
---------- ------------- ------------
Incremental & lazy parsing design
---------- ------------- ------------
| | -- One function -> | | --> | |
| Parser | | Semantics | | Lowering |
| | | | | |
---------- ------------- ------------
Incremental & lazy parsing design
---------- ------------- ------------
| | -- One function -> | | --> | |
| Parser | -- Next function -> | Semantics | --> | Lowering |
| | | | | |
---------- ------------- ------------
Incremental & lazy parsing design
---------- ------------- ------------
| | -- One function -> | | --> | |
| Parser | -- Next function -> | Semantics | --> | Lowering |
| | -- Next function -> | | --> | |
---------- ------------- ------------
Incremental & lazy parsing design
--------- ---------- ------------- ------------
| | <- Next --- | | -- One function -> | | --> | |
| Lexer | | Parser | -- Next function -> | Semantics | --> | Lowering |
| | | | -- Next function -> | | --> | |
--------- ---------- ------------- ------------
Incremental & lazy parsing design
--------- ---------- ------------- ------------
| | <- Next --- | | -- One function -> | | --> | |
| Lexer | | Parser | -- Next function -> | Semantics | --> | Lowering |
| | -- Token -> | | -- Next function -> | | --> | |
--------- ---------- ------------- ------------
Designed around “locality” and streaming compilation
- All of this came from an idea of locality in the parsed code
- One token, one AST node at a time
- Enables streaming compilation
- Rooted in the needs of machines where the source code was often larger than the machine memory
ASTs further exacerbate the cache impact due to pervasive pointers for edges
And real world ASTs are … massive
#include <vector>
int test_sum(std::vector<int> data) {
  int result = 0;
  for (const auto& element : data) {
    result += element;
  }
  return result;
}
FunctionDecl 0x120d95460 <line:4:1, line:10:1> line:4:5 test_sum 'int (std::vector<int>)'
|-ParmVarDecl 0x120d95368 <col:14, col:31> col:31 used data 'std::vector<int>' destroyed
`-CompoundStmt 0x120dd5d50 <col:37, line:10:1>
  |-DeclStmt 0x120dcde28 <line:5:3, col:17>
  | `-VarDecl 0x120dcdda0 <col:3, col:16> col:7 used result 'int' cinit
  |   `-IntegerLiteral 0x120dcde08 'int' 0
  |-CXXForRangeStmt 0x120dd5c08 <line:6:3, line:8:3>
  | |-<<<NULL>>>
  | |-DeclStmt 0x120dce1d0
  | | `-VarDecl 0x120dcdf98 col:30 implicit used __range1 'std::vector<int> &' cinit
  | |   `-DeclRefExpr 0x120dcde40 'std::vector<int>' lvalue ParmVar 0x120d95368 'data' 'std::vector<int>'
  | |-DeclStmt 0x120dd2f90
  | | `-VarDecl 0x120dce268 col:28 implicit used __begin1 'iterator':'std::__wrap_iter<int *>' cinit
  | |   `-CXXMemberCallExpr 0x120dce408 'iterator':'std::__wrap_iter<int *>'
  | |     `-MemberExpr 0x120dce3d8 '<bound member function type>' .begin 0x120db6c60
  | |       `-DeclRefExpr 0x120dce1e8 'std::vector<int>' lvalue Var 0x120dcdf98 '__range1' 'std::vector<int> &'
  | |-DeclStmt 0x120dd2fa8
  | | `-VarDecl 0x120dce310 col:28 implicit used __end1 'iterator':'std::__wrap_iter<int *>' cinit
  | |   `-CXXMemberCallExpr 0x120dd2eb0 'iterator':'std::__wrap_iter<int *>'
  | |     `-MemberExpr 0x120dd2e80 '<bound member function type>' .end 0x120db6fd0
  | |       `-DeclRefExpr 0x120dce208 'std::vector<int>' lvalue Var 0x120dcdf98 '__range1' 'std::vector<int> &'
  | |-CXXOperatorCallExpr 0x120dd56c0 'bool' '!=' adl
  | | |-ImplicitCastExpr 0x120dd56a8 'bool (*)(const __wrap_iter<int *> &, const __wrap_iter<int *> &) noexcept'
  | | | `-DeclRefExpr 0x120dd40f8 'bool (const __wrap_iter<int *> &, const __wrap_iter<int *> &) noexcept' lvalue Function 0x120dd3460 'operator!='
  | | |-ImplicitCastExpr 0x120dd40c8 'const __wrap_iter<int *>':'const std::__wrap_iter<int *>' lvalue
  | | | `-DeclRefExpr 0x120dd2fc0 'iterator':'std::__wrap_iter<int *>' lvalue Var 0x120dce268 '__begin1'
  | | `-ImplicitCastExpr 0x120dd40e0 'const __wrap_iter<int *>':'const std::__wrap_iter<int *>' lvalue
  | |   `-DeclRefExpr 0x120dd2fe0 'iterator':'std::__wrap_iter<int *>' lvalue Var 0x120dce310 '__end1'
  | |-CXXOperatorCallExpr 0x120dd5860 '__wrap_iter<int *>':'std::__wrap_iter<int *>' lvalue '++'
  | | |-ImplicitCastExpr 0x120dd5848 '__wrap_iter<int *> &(*)() noexcept'
  | | | `-DeclRefExpr 0x120dd5718 '__wrap_iter<int *> &() noexcept' lvalue CXXMethod 0x120dd0528 'operator++'
  | | `-DeclRefExpr 0x120dd56f8 'iterator':'std::__wrap_iter<int *>' lvalue Var 0x120dce268 '__begin1'
  | |-DeclStmt 0x120dcdf38 <col:8, col:34>
  | | `-VarDecl 0x120dcded0 <col:8, col:28> col:20 used element 'int const &' cinit
  | |   `-ImplicitCastExpr 0x120dd5b98 'int const':'const int' lvalue
  | |     `-CXXOperatorCallExpr 0x120dd5a20 'int' lvalue '*'
  | |       |-ImplicitCastExpr 0x120dd5a08 'reference (*)() const noexcept'
  | |       | `-DeclRefExpr 0x120dd58f0 'reference () const noexcept' lvalue CXXMethod 0x120dd0070 'operator*'
  | |       `-ImplicitCastExpr 0x120dd58d8 'const std::__wrap_iter<int *>' lvalue
  | |         `-DeclRefExpr 0x120dd58b8 'iterator':'std::__wrap_iter<int *>' lvalue Var 0x120dce268 '__begin1'
  | `-CompoundStmt 0x120dd5cf0 <col:36, line:8:3>
  |   `-CompoundAssignOperator 0x120dd5cc0 <line:7:5, col:15> 'int' lvalue '+=' ComputeLHSTy='int' ComputeResultTy='int'
  |     |-DeclRefExpr 0x120dd5c68 'int' lvalue Var 0x120dcdda0 'result' 'int'
  |     `-ImplicitCastExpr 0x120dd5ca8 'int'
  |       `-DeclRefExpr 0x120dd5c88 'int const':'const int' lvalue Var 0x120dcded0 'element' 'int const &'
  `-ReturnStmt 0x120dd5d40 <line:9:3, col:10>
    `-ImplicitCastExpr 0x120dd5d28 'int'
      `-DeclRefExpr 0x120dd5d08 'int' lvalue Var 0x120dcdda0 'result' 'int'
Do we need a better approach?
Carbon’s compile-time goals
Carbon has a goal of fast compile times:
Software development iteration has a critical “edit, test, debug” cycle. Developers will use IDEs, editors, compilers, and other tools that need different levels of parsing. For small projects, raw parsing speed is essential; for large software systems, scalability of parsing is also necessary.
Not just about compiling to a binary…
- Need fast tooling:
- Formatting
- Jump-to-definition
- Refactoring tools and other IDE plugins
- Many of these need interactive levels of speed
distcc and other distributed compute approaches don’t apply
We set ourselves a challenge:
- Parse and lex at 10 million lines of code / second
- Semantic analysis at 1 million lines of code / second
- Lower to a binary at 0.1 million lines of code / second
Maybe…
Thinking about it the other way is eye-opening:
- Average budget of 100 ns per line to lex and parse
- Average budget of 1 μs per line for semantic analysis
- Let’s think about these in terms of our latency numbers…
Latency numbers table:
Operation | Time in ns | Time in ms
---|---:|---:
CPU cycle | 0.3 - 0.5 |
L1 cache reference | 1 |
Branch misprediction | 3 |
L2 cache reference | 4 |
Mutex lock/unlock | 17 |
Main memory reference | 100 |
SSD random read | 17,000 | 0.017
Read 1 MB sequentially from memory | 38,000 | 0.038
Read 1 MB sequentially from SSD | 622,000 | 0.622
Mapping those onto our budgets
- About 200-300 cycles, maybe 500 instructions, per line lexed & parsed
- Only one main memory access per line lexed & parsed (on average)!!!
- Can’t allocate memory per token… or line of tokens…
- So many approaches just stop being viable if you want to hit this
- Have to use every byte of every cache line accessed to have any hope
😱😱😱
Ok, so how are we going to do this?
Data-oriented compiler design
Data-oriented compiler design
- Based on data-oriented design popularized in the games industry
- Locality alone isn’t a good model for modern hardware
- All inputs likely in memory, no slow disk or memory in practice
- Little or no “interesting” computation involved
- Memory and cache bandwidth will almost always be the limiting factor
Simple and memory-dense rather than lazy
- Inputs all fit in memory, no underlying reason for laziness
- Emphasis on processing input near memory bandwidth speed
- Need resulting memory representation to be similarly dense as input
- Even for (very) large inputs, no reason to add complexity to eager approach
- 10k lines of code (< 40 bytes / line) easily fits into cache across wide range of hardware
- Streaming through cache at memory bandwidth seems easily acceptable speed
- Can focus on efficient, tight processing
Core data structure pattern:
- Primary homogeneous dense array packed with the most ubiquitous data
- Indexed access to allow most connections to use adjacency
- Avoids storage for these connections
- Side arrays for each different kind of secondary data needed
- Connected to primary array with compressed index (smaller than pointer)
- Flyweight handles wrap indices and provide keys to the APIs
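The pattern might be sketched like this. All of the names here (`Store`, `Node`, `NodeId`) are invented for illustration; the real toolchain's types are richer, but the shape is the same: a dense primary array of small bit-packed records, a side array for bulky secondary data, and 4-byte index handles instead of pointers:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Flyweight handle: a 4-byte index instead of an 8-byte pointer.
struct NodeId {
  int32_t index;
};

// Densely packed primary record; bit-fields use every bit of the 4 bytes.
struct Node {
  uint32_t kind : 8;       // Enumerated node kind.
  uint32_t has_error : 1;  // Bit-packed flag.
  uint32_t payload : 23;   // Compressed index into a kind-specific side array.
};

struct Store {
  std::vector<Node> nodes;               // Primary homogeneous dense array.
  std::vector<std::string> identifiers;  // Side array for secondary data.

  NodeId AddIdentifier(std::string text) {
    uint32_t side = static_cast<uint32_t>(identifiers.size());
    identifiers.push_back(std::move(text));
    nodes.push_back({/*kind=*/1, /*has_error=*/0, /*payload=*/side});
    return {static_cast<int32_t>(nodes.size() - 1)};
  }

  const std::string& GetIdentifier(NodeId id) const {
    return identifiers[nodes[id.index].payload];
  }
};
```

Because records are fixed-size and contiguous, "next node" is just `id.index + 1`: adjacency replaces stored edges, and no load of the current node is needed to find its neighbor.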
Advantages of this pattern:
- Can densely pack main array, almost every bit is used
- Differently shaped data factored into their own densely packed arrays
- Enumerated states or bit-pack flags to fully use every byte of cache
- Indexed access to the “next” node doesn’t depend on reading the current node
- Processor can pipeline these memory accesses
Let’s look at how this manifests at every layer
Data-oriented lexing!
Lexing directly fits the desired pattern
- Lex into a token buffer
- Dense array of tokens
- Side arrays of identifiers, strings, literals
- Expose a “token” as a flyweight index into the buffer
Lexing details: source locations
- Classical challenge is mapping tokens back to source locations
- Very expensive – one of the most optimized parts of Clang’s lexer
- Carbon instead has a token be the source location
- Directly encodes its location within the source
- Computes the location extents on demand
- Result is a single 32-bit token ID gives both token & location
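One plausible way to compute extents on demand (a sketch with invented names, not Carbon's actual code): each token record stores only its byte offset, and line/column are recovered by binary searching a table of line-start offsets built during lexing:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct TokenInfo {
  uint32_t byte_offset;  // Where the token starts in the source buffer.
};

struct LineCol {
  uint32_t line, column;  // Both 1-based.
};

struct TokenizedBuffer {
  std::vector<TokenInfo> tokens;
  std::vector<uint32_t> line_starts;  // Byte offset of each line's first char.

  // A 32-bit token ID identifies both the token and its location; the
  // human-readable form is only computed when a diagnostic needs it.
  LineCol GetLocation(int32_t token_id) const {
    uint32_t offset = tokens[token_id].byte_offset;
    // Find the last line start <= offset.
    auto it = std::upper_bound(line_starts.begin(), line_starts.end(), offset);
    uint32_t line = static_cast<uint32_t>(it - line_starts.begin());
    return {line, offset - line_starts[line - 1] + 1};
  }
};
```

The common case (no diagnostic) pays nothing; the rare case pays one `O(log lines)` search.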
Lexing details: balanced delimiters
- Another challenge in parsing is recovering from unbalanced delimiters
- To make the parser much simpler, lexer pre-computes balanced delimiters
- Also lets it synthesize tokens to re-balance when necessary
- Parser never has to handle unbalanced token streams
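A toy sketch of the idea for a single delimiter pair (invented representation, far simpler than the real lexer): stray closers are rewritten to an error marker, missing closers are synthesized, and every delimiter records the index of its match so the parser can skip a group in O(1):

```cpp
#include <vector>

struct Delim {
  char c;                   // '(' or ')', or '!' for a neutralized stray closer.
  int match = -1;           // Index of the matching delimiter, if any.
  bool synthesized = false; // True for closers the lexer invented.
};

std::vector<Delim> Balance(const std::vector<char>& in) {
  std::vector<Delim> out;
  std::vector<int> opens;  // Stack of indices of unmatched '('.
  for (char c : in) {
    if (c == '(') {
      opens.push_back(static_cast<int>(out.size()));
      out.push_back({c});
    } else if (c == ')') {
      if (opens.empty()) {
        out.push_back({'!'});  // Stray closer: neutralize it.
      } else {
        int open = opens.back();
        opens.pop_back();
        out.push_back({c, open});
        out[open].match = static_cast<int>(out.size()) - 1;
      }
    }
  }
  // Synthesize closers for any opens still unmatched at end of input.
  while (!opens.empty()) {
    int open = opens.back();
    opens.pop_back();
    out.push_back({')', open, /*synthesized=*/true});
    out[open].match = static_cast<int>(out.size()) - 1;
  }
  return out;
}
```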
Lexing implementation: a guided tour, live!
Data-oriented parsing!
Challenge: how to represent a tree
- Linearize the tree into an array based on expected iteration
- Makes the common traversal a linear walk
- Makes the hot edges between nodes be adjacency, breaking dependency chain
- Postorder traversal ends up being the most useful
- Force 1:1 correspondence to tokens
- Simpler allocation
- A single edge to a token is always sufficient, no ranges, etc.
- Leverage introducers to ensure the traversal “brackets” constructs
- Result ends up being essentially a stack machine for observing the parsed structure
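The stack-machine reading of a postorder array can be shown concretely. A hypothetical encoding (invented for this sketch): each node carries its token text and child count, leaves push onto a value stack, and interior nodes pop their children:

```cpp
#include <string>
#include <vector>

struct PostorderNode {
  std::string token;
  int child_count;
};

// Replays the postorder array against an explicit stack, rendering the tree
// in parenthesized prefix form. The traversal itself is one linear walk.
std::string Render(const std::vector<PostorderNode>& tree) {
  std::vector<std::string> stack;
  for (const auto& node : tree) {
    if (node.child_count == 0) {
      stack.push_back(node.token);  // Leaf: push.
      continue;
    }
    // Interior node: its children are the top child_count stack entries.
    std::string s = "(" + node.token;
    for (size_t i = stack.size() - node.child_count; i < stack.size(); ++i) {
      s += " " + stack[i];
    }
    stack.resize(stack.size() - node.child_count);
    stack.push_back(s + ")");
  }
  return stack.back();
}
```

For the `y + 1` subtree from the slides, the postorder array is `y`, `1`, `+` and the replay reconstructs `(+ y 1)` without ever chasing a pointer.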
var x: i32 = y + 1;
// TokenizedBuffer:
// --------
// 1) | var |
// --------
// 2) | x |
// --------
// 3) | : |
// --------
// 4) | i32 |
// --------
// 5) | = |
// --------
// 6) | y |
// --------
// 7) | + |
// --------
// 8) | 1 |
// --------
// 9) | ; |
// --------
var x : i32 = y + 1 ;
// TokenizedBuffer:
// --------
// 1) | var |
// --------
// 2) | x |
// --------
// 3) | : |
// --------
// 4) | i32 |
// --------
// 5) | = |
// --------
// 6) | y |
// --------
// 7) | + | ( y ) ( 1 )
// -------- \__ __/
// 8) | 1 | ( + )
// --------
// 9) | ; |
// --------
var x : i32 = y + 1 ;
var x : i32 = y + 1 ;
// TokenizedBuffer:
// --------
// 1) | var |
// --------
// 2) | x |
// --------
// 3) | : |
// --------
// 4) | i32 |
// --------
// 5) | = |
// --------
// 6) | y |
// --------
// 7) | + | ( x ) ( i32 ) ( y ) ( 1 )
// -------- \__ __/ \__ __/
// 8) | 1 | ( : ) ( + )
// --------
// 9) | ; |
// --------
var x : i32 = y + 1 ;
var x : i32 = y + 1 ;
// TokenizedBuffer:
// --------
// 1) | var |
// --------
// 2) | x |
// --------
// 3) | : |
// --------
// 4) | i32 |
// --------
// 5) | = |
// --------
// 6) | y |
// --------
// 7) | + | ( x ) ( i32 ) ( y ) ( 1 )
// -------- \__ __/ \__ __/
// 8) | 1 | ( var ) ( : ) ( = ) ( + )
// -------- \__________\___________\_________\______
// 9) | ; | ( ; )
// --------
var x : i32 = y + 1 ;
var x : i32 = y + 1 ;
// TokenizedBuffer:
// --------
// 1) | var |
// --------
// 2) | x |
// --------
// 3) | : |
// --------
// 4) | i32 |
// --------
// 5) | = |
// --------
// 6) | y |
// --------
// 7) | + | ( `<2>x` ) ( `<3>i32` ) ( `<6>y` ) ( `<7>1` )
// -------- \__ __/ \__ __/
// 8) | 1 | ( `<1>var` ) ( `<4>:` ) ( `<5>=` ) ( `<8>+` )
// -------- \__________\___________\_________\______
// 9) | ; | ( `<9>;` )
// --------
var x : i32 = y + 1 ;
var x : i32 = y + 1 ;
// TokenizedBuffer:
// --------
// 1) | var | ( var )
// -------- |
// 2) | x | | ( x )
// -------- | |
// 3) | : | | | ( i32 )
// -------- | \__ __/
// 4) | i32 | | ( : )
// -------- | |
// 5) | = | | | ( = )
// -------- | | |
// 6) | y | | | | ( y )
// -------- | | | |
// 7) | + | | | | | ( 1 )
// -------- | | | \__ __/
// 8) | 1 | | | | ( + )
// -------- \__________\___________\_________\______
// 9) | ; | ( ; )
// --------
var x : i32 = y + 1 ;
var x : i32 = y + 1 ;
// TokenizedBuffer: ParseTree:
// -------- --------
// 1) | var | ( `<1>var` ) | `<1>var` |
// -------- | --------
// 2) | x | | ( `<2>x` ) | `<2>x` |
// -------- | | --------
// 3) | : | | | ( `<3>i32` ) | `<3>i32` |
// -------- | \__ __/ --------
// 4) | i32 | | ( `<4>:` ) | `<4>:` |
// -------- | | --------
// 5) | = | | | ( `<5>=` ) | `<5>=` |
// -------- | | | --------
// 6) | y | | | | ( `<6>y` ) | `<6>y` |
// -------- | | | | --------
// 7) | + | | | | | ( `<7>1` ) | `<7>1` |
// -------- | | | \__ __/ --------
// 8) | 1 | | | | ( `<8>+` ) | `<8>+` |
// -------- \__________\___________\_________\______ --------
// 9) | ; | ( `<9>;` ) | `<9>;` |
// -------- --------
var x : i32 = y + 1 ;
Parse tree implementation: a guided tour, live!
How does the parser build the tree?
- Technically a recursive descent parser
- Faces a classic problem – deep recursion exhausting the call stack
- C++ and Clang have this problem as well
- Has led to a number of “exciting” tricks to work around it
- Especially problematic for library usage of the compiler
- Carbon creates a dedicated stack data structure for its parser
- Turns the parse into a “normal” state machine without recursive calls
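The flavor of that transformation can be seen in a toy example (not Carbon's actual machinery): an expression parser rewritten as a loop over an explicit operator stack, shunting-yard style. Nesting depth consumes heap, not call stack, so pathologically nested input cannot overflow the machine stack:

```cpp
#include <string>
#include <string_view>
#include <vector>

// Converts expressions over digits, '+', and parentheses to postorder,
// with no recursive calls: the explicit `ops` stack replaces the call stack.
std::string ToPostorder(std::string_view input) {
  std::string out;
  std::vector<char> ops;  // Heap-allocated stack of pending operators.
  for (char c : input) {
    if (c >= '0' && c <= '9') {
      out += c;        // Operands go straight to the postorder output.
    } else if (c == '+' || c == '(') {
      ops.push_back(c);  // Defer the operator until its operands are out.
    } else if (c == ')') {
      while (ops.back() != '(') {  // Flush operators back to the open paren.
        out += ops.back();
        ops.pop_back();
      }
      ops.pop_back();  // Discard the '(' itself.
    }
  }
  while (!ops.empty()) {
    out += ops.back();
    ops.pop_back();
  }
  return out;
}
```

A recursive descent parser given 100,000 nested parentheses makes 100,000 nested calls; this loop just grows a `std::vector` to 100,000 chars.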
Parser implementation: more live tour!
Data-oriented semantics!
Core idea: model semantics as an IR
IR here in the sense of a compiler’s Intermediate Representation
- Generally structured in blocks with an order of execution
For runtime code, this generates exactly the structure we want
- First, to type check and validate the semantics
- Then, to lower into a lower level IR like LLVM’s
We map other parts of semantic analysis into the IR space:
- Declaring names, types, etc., are compile time evaluation
- Type checking is evaluation, constant propagation, and then verifying
- Template instantiation similar to “specialization” in optimizing compilers
Imperative model for building all constructs in the language
Lends itself to well understood techniques for fast compile time:
- Interpreters
- Efficient streaming data structures
Some hope that this lends itself to implementing actual metaprogramming features
- But way too early to tell…
Really, this is very early and an area of active work
Live tiniest demo of Semantics IR
Built by walking the postorder parse tree
- Primary consumer driving the parse tree design
- Walk observes the parse structure
- Uses a stack data structure to provide any context or other details
- Efficiency hinges on a single pass over the parse tree
- This is difficult though
- Some places we defer subtrees: nested function bodies
- Other places use intermediate data structures
- Template instantiations are a fundamental challenge, but tolerable
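The single-pass walk can be illustrated with a deliberately tiny invented IR (far simpler than Carbon's SemIR): leaves push value ids onto the walk's context stack, and operator nodes pop their operands and append one instruction, so the IR comes out in execution order with no tree pointers anywhere:

```cpp
#include <string>
#include <vector>

struct Inst {
  std::string op;          // e.g. "const", "add".
  int lhs = -1, rhs = -1;  // Operand instruction indices; a constant in lhs.
};

struct ParseNode {
  std::string text;
  bool is_leaf;
};

// One linear pass over the postorder parse tree; `values` is the stack data
// structure providing context for each node as it is reached.
std::vector<Inst> BuildIr(const std::vector<ParseNode>& postorder) {
  std::vector<Inst> ir;
  std::vector<int> values;
  for (const auto& node : postorder) {
    if (node.is_leaf) {
      ir.push_back({"const", std::stoi(node.text)});
    } else {  // Assume a binary operator like "+" for this sketch.
      int rhs = values.back(); values.pop_back();
      int lhs = values.back(); values.pop_back();
      ir.push_back({"add", lhs, rhs});
    }
    values.push_back(static_cast<int>(ir.size()) - 1);
  }
  return ir;
}
```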
Gets more help from the language
- Single pass is easier if we emit things before they’re needed
- Carbon has a principle of information accumulation
- Toolchain leverages this to directly emit complete semantics
- Limiting necessary repeated passes to monomorphization & instantiation
Again, really early days on semantics.
More to come!
Data-oriented lowering!
Eh… not really…
Somewhat data-oriented lowering?
- Because of the semantics IR, some aspects of lowering will fall out:
- Traversing the IR to emit LLVM IR will be fast and efficient
- Other operations like high-level optimizations or constant evaluation similarly
- But LLVM itself is not especially data-oriented in its design
- Currently no plan to try to change this, but interesting area of future work
Ultimately, limited by LLVM today
- LLVM heavily uses sparse, pointer-based data structures
- Tends to be very inefficient, often >50% cycles stalled on data
- Despite heavily optimized data structures below the core IR
Also, we have a long way to go before this can be our focus!
Aside: testing the compiler
Basic testing follows usual patterns
- Easier to write focused unit tests due to simpler language
- No preprocessor in the lexer, etc.
- Serializing each layer into line-delimited JSON
  - Easy to do quick things with grep
  - Can use powerful tools like jq
- LLVM FileCheck-style testing throughout
Fuzz testing is a more interesting challenge
- Historically, C++ compilers have been ~impossible to fuzz test
- None were built with continuous fuzz testing
- Very large number of failures throughout the system
- Difficult to create interesting inputs
- Carbon decided to have fuzz testing from day one
- And to fuzz test each layer
Fuzz testing with more complex inputs
- Also developed a protobuf-based fuzzer
- Models the Carbon AST in protobuf messages
- Uses structural fuzz testing systems built on top of protobufs to synthesize interesting and complex ASTs
- Renders them into source as fuzzer inputs
Our goal: a compiler that does not crash
- May still have plenty of bugs, but they shouldn’t crash
- Users should always get a useful error message
- Only way we know to achieve this is to do it from the very beginning
Some key takeaways…
Re-think traditional compiler design
Challenge assumptions with aggressive goals
Compiler & implementation design should be a major input to language design
Complete semantic analysis and lowering
Interop with C++ through Clang (come see tomorrow’s keynote!)
Bundling and being able to access a complete Clang-based C++ toolchain
Modern error messages and diagnostics
Building a carbon-format tool on top of the parser
Generating arbitrary amounts of random Carbon source code
- Parameterized distribution of language constructs
- Parameterized distribution of name & comment lengths
- Able to replicate representative distributions or edge cases
Aggressive benchmarking & optimization of each layer
- Maybe even generating C++ code for comparative benchmarking
All part of Carbon’s ongoing efforts to
- Modernize compiler design and implementation
- And deliver radical improvements to compile times
Thank you!
Resources and more information:
- https://github.com/carbon-language/carbon-lang#getting-started
- https://github.com/carbon-language/carbon-lang/tree/trunk/toolchain
- Design doc linked here and from the toolchain tree above
- Toolchain channel of our Discord server