Overview of GCC’s internals — gcc-python-plugin 0.16 documentation (original) (raw)

To add a new compiler warning to GCC, it’s helpful to have a high-level understanding of how GCC works, so here’s the 10,000 foot view of how GCC turns source code into machine code.

The short version is that GCC applies a series of optimization passes to your code, gradually converting it from a high-level representation into machine code, via several different internal representations.

Each programming language supported by GCC has a “frontend”, which parses the source files.

For the case of C and C++, the preprocessor manipulates the code first before the frontend sees it. You can see the preprocessor output with the-E option.

Exactly what happens in each frontend varies by language: some language frontends emit language-specific trees, and some convert to a language-independent tree representation known as GENERIC. In any case, we eventually we reach a representation known as GIMPLE. The GIMPLE representation contains simplified operations, with temporary variables added as necessary to avoid nested sub-expressions.

For example, given this C code:

int main(int argc, char **argv) { int i;

printf("argc: %i\n", argc);

for (i = 0; i < argc; i++) { printf("argv[%i]: %s\n", argv[i]); }

helper_function();

return 0; }

we can see a dump of a C-like representation of the GIMPLE form by passing-fdump-tree-gimple to the command-line:

$ gcc -fdump-tree-gimple test.c $ cat test.c.004t.gimple

giving something like this:

main (int argc, char * * argv) { const char * restrict D.3258; long unsigned int D.3259; long unsigned int D.3260; char * * D.3261; char * D.3262; const char * restrict D.3263; int D.3264; int i;

D.3258 = (const char * restrict) &"argc: %i\n"[0]; printf (D.3258, argc); i = 0; goto <D.2050>; <D.2049>: D.3259 = (long unsigned int) i; D.3260 = D.3259 * 8; D.3261 = argv + D.3260; D.3262 = *D.3261; D.3263 = (const char * restrict) &"argv[%i]: %s\n"[0]; printf (D.3263, D.3262); i = i + 1; <D.2050>: if (i < argc) goto <D.2049>; else goto <D.2051>; <D.2051>: helper_function (); D.3264 = 0; return D.3264; }

It’s far easier to see the GIMPLE using:

./gcc-with-python examples/show-gimple.py test.c

which generates bitmaps showing the “control flow graph” of the functions in the file, with source on the left-hand side, and GIMPLE on the right-hand side:

image of a control flow graph in GIMPLE form

Each function is divided into “basic blocks”. Each basic block consists of a straight-line sequence of code with a single entrypoint and exit: all branching happens between basic blocks, not within them. The basic blocks form a “control flow graph” of basic blocks, linked together by edges. Each block can contain a list of gcc.Gimple statements.

You can work with this representation from Python using gcc.Cfg

Once the code is in GIMPLE form, GCC then attempts a series of optimizations on it.

Some of these optimizations are listed here:http://gcc.gnu.org/onlinedocs/gccint/Tree-SSA-passes.html

If you’re looking to add new compiler warnings, it’s probably best to hook your code into these early passes.

The GIMPLE representation actually has several forms:

You can tell what form a function is in by looking at the flags of the current pass. For example:

if ps.properties_provided & gcc.PROP_cfg:

...then this gcc.Function ought to have a gcc.Cfg:

do_something_with_cfg(fn.cfg)

if ps.properties_provided & gcc.PROP_ssa:

...then we have SSA data

do_something_with_ssa(fn)

Here’s our example function, after conversion to GIMPLE SSA:

./gcc-with-python examples/show-ssa.py test.c

image of a control flow graph in GIMPLE SSA form

You can see that the local variable i has been split into three versions:

As is normal with SSA, GCC inserts fake functions known as “PHI” at the start of basic blocks where needed in order to merge the multiple possible values of a variable. You can see one in our example at the top of the loop in block 4:

i_1 = PHI <i_4(2), i_11(3)>

where i_1 either gets the value of i_4, or of i_11, depending on whether we reach here via block 2 (at the start of the iteration) or block 3 (continuing the “for” loop).

After these optimizations passes are done, GCC converts the GIMPLE SSA representation into a lower-level representation known as Register Transfer Language (RTL). This is probably too low-level to be of interest to those seeking to add new compiler warnings: at this point it’s attempting to work with the available opcodes and registers on the target CPU with the aim of generating efficient machine code.

See http://gcc.gnu.org/onlinedocs/gccint/RTL.html

The RTL form uses the same Control Flow Graph machinery as the GIMPLE representation, but with RTL expressions within the basic blocks.

Once in RTL, GCC applies a series of further optimizations, before finally generating assembly language (which it submits to as, the GNU assembler):http://gcc.gnu.org/onlinedocs/gccint/RTL-passes.htmlYou can see the assembly language using the -S command line option.

$ ./gcc -S test.c $ cat test.s