Compiler Design Lab, Saarland University (original) (raw)

About

Automatic optimization strategies (e.g. "-O3") often do not produce the code that the programmer desires. This can be due to different reasons, e.g. the compiler is not allowed to do some transformations because of conservative analysis results. Another frequent case is that the preconditions of an optimization are invalidated by previous phases ("phase ordering problem").

In consequence, code blocks that rely on specific optimizations to reach maximum performance are transformed manually. This causes a variety of disadvantages such as significant time investment, error proneness, reduced maintainability and/or portability of the optimized code, or unwanted mutual reactions with other optimizations.

Therefore, it is desirable to have a language feature that allows to apply custom optimization strategies to specific code segments. Noise is a language extension that allows a programmer to create such optimization strategies with annotations. A prototypical Noise extension for C/C++ has been implemented in the Clang frontend of LLVM.

Noise is being developed as part of the ECOUSS project that is funded by the German Federal Ministry of Education and Research (BMBF).

Usage & Semantics

Annotations are easiest used through the makro NOISE(...) that is defined in the included header file noise.h. The desired optimizations in the desired order are supplied in its argument list as a string. Currently, the noise attribute can be attached to functions, loop statements, or compound statements:

Annotated Functions
The specified optimization strategy is applied to the entire body of the function. No other optimizations (such as those requested via command line options of the compiler) are performed on the function.
Annotated Compound Statements
The specified optimization strategy is applied to the entire compound statement. No other optimizations (such as those requested via command line options of the compiler) are performed on the compound statement, but on everything outside (unless there are nested noise segments).
Annotated Loops
The specified optimization strategy is applied to the entire loop. No other optimizations (such as those requested via command line options of the compiler) are performed on the loop, but on everything outside (unless there are nested noise segments).

Nested Annotations

If code segments are annotated at different levels, the corresponding optimization strategies are applied inside-out. Thus, an inner code segment is first optimized with its own custom strategy and then with all those of its enclosing annotated statements.

Available Optimizations

The current implementation allows to employ all transformations available in LLVM under the LLVM-internal names (see LLVM's Analysis and Transformation Passes)

Examples:

"constprop" - simple constant propagation
"sccp" - sparse conditional constant propagation
"reassociate" - reassociate commutative expressions
"die" - dead instruction elimination
"dse" - trivial dead store elimination
"dce" - dead code elimination
"adce" - aggressive dead code elimination
"instcombine" - combine redundant instructions
"gvn" - global value numbering for redundancy elimination
"bb-vectorize" - basic block vectorization
"loop-simplify" - loop canonicalization
"indvars" - canonicalize induction variables
"licm" - loop invariant code motion
"loop-unswitch" - loop unswitching
"loop-reduce" - loop strength reduction
"loop-rotate" - loop rotation
"verify" - module verifier

Additionally, we implemented the following special-purpose transformations:

Function Inlining ("inline" or "inline(name)").
With this transformation it is possible to force inlining of specific function calls without relying on the compiler's heuristics. This possibly allows additional optimization opportunities afterwards, e.g. transformations that would have to be inter-procedural before now can be applied locally. If "inline" is specified without function name, all calls inside the code segment are inlined (if their code is available).
Explicit Loop Unrolling ("unroll" or "unroll(N)").
We provide the possibility to both rely on LLVM's heuristics for unrolling or to force it. If N is supplied, unrolling the loop N times is forced. If N is not supplied, the phase itself decides whether and how the loop should be unrolled.
Loop Fusion.
Noise can fuse multiple loops within an annotated region to form a single loop that contains all necessary instructions. This can improve further optimizations and cache usage.
Code Dispatching
Create specialized variants of an annotated code region by introducing a dynamic dispatcher (case distinction on the specialized variable). This can uncover further optimization potential by exploiting knowledge about runtime values of a variable.
Loop Vectorization ("wfv" or "wfv(N)").
In addition to the LLVM-internal phases "bb-vectorize" and "loop-vectorize" we provide "wfv-vectorize", a wrapper around libWFV that can be used to vectorize data-parallel loops. The integration is currently in an early phase but already shows promising results for simple examples.

Preparatory Transformation Phases

To facilitate its usage, the NOISE makro includes a set of standard optimization phases that are scheduled before the user-defined ones. These phases build a good basis for most of the other phases by transforming the code to SSA form and normalizing it without significant changes:

"tbaa" - type based alias analysis
"basicaa" - basic alias analysis
"scalarrepl" - scalar replacement of aggregates (transform code to SSA form
"loop-simplify" - canonicalize loops
"indvars" - canonicalize induction variables
"early-cse" - simple common subexpression elimination

Fully User-Defined Optimization Strategies

For fully user-defined optimization strategies, the makro RAWNOISE(...) exists. A code segment annotated with the empty attribute RAWNOISE() is ignored by all optimization phases, except if the segment itself is nested in another annotated segment.

Examples

No optimization performed (despite -O3)

C++ source code:

RAWNOISE("") int testDisableOpts(int x) { if (x > 3) return x+4; else x += 4;

return x; }

Result:

define i32 @_Z15testDisableOptsi(i32 %x) nounwind { entry: %retval = alloca i32, align 4 %x.addr = alloca i32, align 4 store i32 %x, i32* %x.addr, align 4, !tbaa !0 %0 = load i32* %x.addr, align 4, !tbaa !0 %cmp = icmp sgt i32 %0, 3 %1 = load i32* %x.addr, align 4, !tbaa !0 %add = add nsw i32 %1, 4 br i1 %cmp, label %if.then, label %if.else

if.then: store i32 %add, i32* %retval br label %return

if.else: store i32 %add, i32* %x.addr, align 4, !tbaa !0 %2 = load i32* %x.addr, align 4, !tbaa !0 store i32 %2, i32* %retval br label %return

return: %3 = load i32* %retval ret i32 %3 }

Same code, standard optimizations

C++ source code:

NOISE("") int testStdOpts(int x) { if (x > 3) return x+4; else x += 4;

return x; }

Result:

define i32 @_Z11testStdOptsi(i32 %x) nounwind { entry: %add = add nsw i32 %x, 4 ret i32 %add }

Same code, SSA conversion + cfg simplification

C++ source code:

RAWNOISE("scalarrepl simplifycfg") int testSimple2(int x) { if (x > 3) return x+4; else x += 4;

return x; }

Result:

define i32 @_Z11testSimple2i(i32 %x) nounwind { entry: %add = add nsw i32 %x, 4 ret i32 %add }

Same code, cfg simplification + SSA conversion

C++ source code:

// Bad ordering results in missed optimization opportunities. RAWNOISE("simplifycfg scalarrepl") int testBadOrder(int x) { if (x > 3) return x+4; else x += 4;

return x; }

Result:

define i32 @_Z12testBadOrderi(i32 %x) nounwind { entry: %cmp = icmp sgt i32 %x, 3 %add = add nsw i32 %x, 4 br i1 %cmp, label %if.then, label %if.else

if.then: br label %return

if.else: br label %return

return: ret i32 %add }

Inlining

C++ source code:

attribute((noinline)) int testToInline(int i) { return i + 23; }

int testInlining1(int j) { int j2 = testToInline(j) - 7; // Not inlined. int j3; NOISE("inline") { j3 = testToInline(j) * 2; // Inlined. j3 += j2; // Not combined with inlined code. } return j3; }

Result:

define i32 @_Z13testInlining1i(i32 %j) nounwind { entry: %call = call i32 @_Z12testToInlinei(i32 %j) %sub = add nsw i32 %call, -7 %add.i.i = add nsw i32 %j, 23 %mul.i = mul nsw i32 %add.i.i, 2 %add.i = add nsw i32 %mul.i, %sub ret i32 %add.i }

Inlining + instcombine + constprop

C++ source code:

attribute((noinline)) int testToInline(int i) { return i + 23; }

int testInlining2(int j) { int j2 = testToInline(j) - 7; // Not inlined. int j3; NOISE("inline instcombine constprop") { j3 = testToInline(j) * 2; // Inlined. j3 += j2; // Combined with inlined code. } return j3; }

Result:

define i32 @_Z13testInlining2i(i32 %j) nounwind { entry: %call = call i32 @_Z12testToInlinei(i32 %j) %sub = add nsw i32 %call, -7 %add.i.i = shl i32 %j, 1 %mul.i = add i32 %add.i.i, 46 %add.i = add nsw i32 %mul.i, %sub ret i32 %add.i }

Inlining + loop invariant code motion + loop unrolling

C++ source code:

#include "noise.h"

int attribute((noinline)) testToInline(int i) { return i + 23; }

int testNoiseInlineUnroll(int x, int y) { NOISE("inline licm unroll(4)") for (int i=0; i<16; ++i) { int lic = y * y; x += lic; x += testToInline(y); x *= 2; }

return x; }

Result:

define i32 @_Z21testNoiseInlineUnrollii(i32 %x, i32 %y) nounwind { entry: %mul.i = mul nsw i32 %y, %y %add.i.i = add nsw i32 %y, 23 br label %for.body.i

for.body.i: %x.addr.02.i = phi i32 [ %x, %entry ], [ %mul2.3.i, %for.body.i ] %i.01.i = phi i32 [ 0, %entry ], [ %inc.3.i, %for.body.i ] %add.i = add nsw i32 %x.addr.02.i, %mul.i %add1.i = add nsw i32 %add.i, %add.i.i %mul2.i = mul nsw i32 %add1.i, 2 %inc.i = add nsw i32 %i.01.i, 1 %add.1.i = add nsw i32 %mul2.i, %mul.i %add1.1.i = add nsw i32 %add.1.i, %add.i.i %mul2.1.i = mul nsw i32 %add1.1.i, 2 %inc.1.i = add nsw i32 %inc.i, 1 %add.2.i = add nsw i32 %mul2.1.i, %mul.i %add1.2.i = add nsw i32 %add.2.i, %add.i.i %mul2.2.i = mul nsw i32 %add1.2.i, 2 %inc.2.i = add nsw i32 %inc.1.i, 1 %add.3.i = add nsw i32 %mul2.2.i, %mul.i %add1.3.i = add nsw i32 %add.3.i, %add.i.i %mul2.3.i = mul nsw i32 %add1.3.i, 2 %inc.3.i = add nsw i32 %inc.2.i, 1 %cmp.3.i = icmp slt i32 %inc.3.i, 16 br i1 %cmp.3.i, label %for.body.i, label %_Z21testNoiseInlineUnrollii_noise.entry.exit

_Z21testNoiseInlineUnrollii_noise.entry.exit: ret i32 %mul2.3.i }

Data-Parallel Loop Vectorization using WFV

C++ source code:

#include "noise.h" void testNoiseWFV(int x, int y, int* in, int* out) { NOISE("wfv licm") for (int i=0; i<32; ++i) { int lic = x * y; out[i] = in[i] + lic; } }

Result:

define void @Z13testNoiseWFViiPiS(i32 %x, i32 %y, i32* %in, i32* %out) nounwind { entry: %mul...i.i = mul nsw i32 %x, %y %0 = insertelement <4 x i32> undef, i32 %mul...i.i, i32 0 %1 = insertelement <4 x i32> %0, i32 %mul...i.i, i32 1 %2 = insertelement <4 x i32> %1, i32 %mul...i.i, i32 2 %3 = insertelement <4 x i32> %2, i32 %mul...i.i, i32 3 br label %for.cond.i

for.cond.i: %i.0.i = phi i32 [ 0, %entry ], [ %wfv.inc.i, %codeRepl.i ] %cmp.i = icmp slt i32 %i.0.i, 8 br i1 %cmp.i, label %codeRepl.i, label %_Z13testNoiseWFV2iiPiS__noise.entry.exit

codeRepl.i: %idxprom...i.i = sext i32 %i.0.i to i64 %4 = getelementptr i32* %in, i64 %idxprom...i.i %pktPtrCast.i.i = bitcast i32* %4 to <4 x i32>* %5 = load <4 x i32>* %pktPtrCast.i.i, align 16, !tbaa !0 %add...i.i = add nsw <4 x i32> %5, %3 %6 = getelementptr i32* %out, i64 %idxprom...i.i %pktPtrCast1.i.i = bitcast i32* %6 to <4 x i32>* store <4 x i32> %add...i.i, <4 x i32>* %pktPtrCast1.i.i, align 16, !tbaa !0 %inc.i = add nsw i32 %i.0.i, 1 %wfv.inc.i = add i32 %inc.i, 4 br label %for.cond.i

_Z13testNoiseWFViiPiS__noise.entry.exit: ret void }

Download

Download Noise from GitHub.

We are optimistic to eventually reintegrate our extension into the Clang trunk since it is minimally invasive and well capsulated.