RFC: Compound directive splitting in OpenMP (original) (raw)

September 16, 2024, 4:55pm 1

Introduction

This is a follow-up to [RFC] Compound constructs in OpenMP.

The prototype is a hybrid of the two options presented there.

Prototype #1: everything in a single commit
[clang][OpenMP] Prototype #1 of directive splitting by kparzysz · Pull Request #108855 · llvm/llvm-project (github.com)

Prototype #2: Same as prototype #1, but separated into two parts

Introducing AST nodes for opaque directives: [clang][OpenMP] Add AST nodes for opaque block/loop directives by kparzysz · Pull Request #109286 · llvm/llvm-project · GitHub
The rest of the code: [clang][OpenMP] Prototype #2 of directive splitting by kparzysz · Pull Request #109288 · llvm/llvm-project · GitHub

@alexey-bataev @erichkeane @Meinersbur @jdoerfert

Definitions (for clarity):

leaf directive — a directive without any proper constituents
compound directive — a directive that is either a combined or composite directive
combined directive — a directive that consists of two or more constituents, and which is equivalent to specifying the constituents in separate #pragmas.
composite directive — a directive that consists of two or more constituents, but whose semantics may not be equivalent to specifying its constitients separately.

Functionality:

For OpenMP constructs in function templates, create an opaque AST node with all clauses attached to it. There are two such placeholder nodes: one for loop-associated directives, and one for the remaining ones.
For non-templates (incuding template instantiations), expand the compound directive into a sequence of “unit” constructs (where unit is either a leaf or a composite constructs), where each original clause is attached to the unit construct(s) according to the OpenMP rules.

Example:
For the following C++ code (one of the clang tests)

struct S1 {
  double a;
  int r1(int n){
    int b = 1;
    #pragma omp target teams num_teams(n-b)
    {
      this->a = (double)b + 1.5;
    }
    return (int)a;
  }
};

int bar(int n){
  int a = 0;
  S1 S;
  a += S.r1(n);
  return a;
}

The AST before:

[...]
| |   |-OMPTargetTeamsDirective 0x5595997dd5a8 <line:7:5, col:44>
| |   | |-OMPNum_teamsClause 0x5595997ba888 <col:30, col:43>
| |   | | `-ImplicitCastExpr 0x5595997ba858 <col:41> 'int' <LValueToRValue>
| |   | |   `-DeclRefExpr 0x5595997ba838 <col:41> 'int' lvalue OMPCapturedExpr 0x5595997ba7d0 '.capture_expr.' 'int'
| |   | |-OMPFirstprivateClause 0x5595997dd4a0 <<invalid sloc>> <implicit>
| |   | | `-DeclRefExpr 0x5595997dcab8 <line:9:25> 'int' lvalue Var 0x5595997ba688 'b' 'int' refers_to_enclosing_variable_or_capture
| |   | |-OMPMapClause 0x5595997dd4e0 <<invalid sloc>> <implicit>
| |   | | `-MemberExpr 0x5595997dca88 <col:7, col:13> 'double' lvalue ->a 0x5595997ba440
| |   | |   `-CXXThisExpr 0x5595997dca78 <col:7> 'S1 *' this
| |   | `-CapturedStmt 0x5595997dd238 <line:8:5, line:10:5>
| |   |   |-CapturedDecl 0x5595997bab10 <<invalid sloc>> <invalid sloc> nothrow
| |   |   | |-CapturedStmt 0x5595997dcf98 <line:8:5, line:10:5>
| |   |   | | |-CapturedDecl 0x5595997bb000 <<invalid sloc>> <invalid sloc> nothrow
| |   |   | | | |-CapturedStmt 0x5595997dcd10 <line:8:5, line:10:5>
| |   |   | | | | |-CapturedDecl 0x5595997dc8a0 <<invalid sloc>> <invalid sloc> nothrow
| |   |   | | | | | |-CompoundStmt 0x5595997dcba8 <line:8:5, line:10:5>
| |   |   | | | | | | `-BinaryOperator 0x5595997dcb88 <line:9:7, col:29> 'double' lvalue '='
[...]

The AST after:

[...]
| |   | `-OMPTargetDirective 0x55dc1e6df030 <col:5, col:44>
| |   |   |-OMPFirstprivateClause 0x55dc1e6c8c50 <<invalid sloc>> <implicit>
| |   |   | `-DeclRefExpr 0x55dc1e6c84b8 <line:9:25> 'int' lvalue Var 0x55dc1e6a5c88 'b' 'int' refers_to_enclosing_variable_or_capture
| |   |   |-OMPMapClause 0x55dc1e6c8c90 <<invalid sloc>> <implicit>
| |   |   | `-MemberExpr 0x55dc1e6c8488 <col:7, col:13> 'double' lvalue ->a 0x55dc1e6a5a40
| |   |   |   `-CXXThisExpr 0x55dc1e6c8478 <col:7> 'S1 *' this
| |   |   `-CapturedStmt 0x55dc1e6defc0 <line:7:5, col:44>
| |   |     |-CapturedDecl 0x55dc1e6a6110 <<invalid sloc>> <invalid sloc> nothrow
| |   |     | |-CapturedStmt 0x55dc1e6def50 <col:5, col:44>
| |   |     | | |-CapturedDecl 0x55dc1e6a6600 <<invalid sloc>> <invalid sloc> nothrow
| |   |     | | | |-OMPTeamsDirective 0x55dc1e6def08 <col:5, col:44>
| |   |     | | | | |-OMPNum_teamsClause 0x55dc1e6a5e70 <col:30, col:43>
| |   |     | | | | | `-ImplicitCastExpr 0x55dc1e6a5e58 <col:41> 'int' <LValueToRValue>
| |   |     | | | | |   `-DeclRefExpr 0x55dc1e6a5e38 <col:41> 'int' lvalue Var 0x55dc1e6a5dd0 '.vale10' 'int' refers_to_enclosing_variable_or_capture
| |   |     | | | | |-OMPFirstprivateClause 0x55dc1e6c8798 <<invalid sloc>> <implicit>
| |   |     | | | | | `-DeclRefExpr 0x55dc1e6c84b8 <line:9:25> 'int' lvalue Var 0x55dc1e6a5c88 'b' 'int' refers_to_enclosing_variable_or_capture
| |   |     | | | | `-CapturedStmt 0x55dc1e6dee98 <line:8:5, line:10:5>
| |   |     | | | |   |-CapturedDecl 0x55dc1e6c82a0 <<invalid sloc>> <invalid sloc> nothrow
| |   |     | | | |   | |-CompoundStmt 0x55dc1e6c85a8 <line:8:5, line:10:5>
| |   |     | | | |   | | `-BinaryOperator 0x55dc1e6c8588 <line:9:7, col:29> 'double' lvalue '='
[...]

Rationale:

The OpenMP rules describing the meaning of clauses on a compound directive may depend on what variables are referenced in which clauses. Below is an example of such a rule:

If the parallel construct is among the constituent constructs and the list item is not also specified in the firstprivate clause, then the effect of the lastprivate clause is as if the shared clause with the same list item is applied to the parallel construct. [OpenMP spec v5.2 p340]
In a function template, the names of variables may not yet be resolved, which makes it impossible in general to determine whether two expressions refer to the same object or not.

The breakdown of the compound directive into constituents has the purpose of removing dependence on specific compound directive names in the clang source code. For example, it aims at eliminating situations like

case OMPD_target_teams_distribute_for_simd:
  ...

if (DKind == OMPD_target_teams_distribute_for_simd)
  ...

The goal there to make it possible to implement more combined directives with only minimal changes to the parser/sema code. The codegen should be able to generate functional code for it, although it may need to treat certain combinations as a whole to generate better performing code.

While the most fine-grained breakup of the compound directives would be to use leaf directives, this might interfere with the proper generation of code for the composite constructs, hence the minimal “unit” of directive representation is either a leaf or a composite directive.

How it happens

The bird’s eye view of the structure of the code can be summarized as:

The parser will call ActOnOpenMPRegionStart before every OpenMP construct, and ActOnOpenMPRegionEnd after every OpenMP construct.
ActOnOpenMPRegionEnd will perform the compound directive splitting (assume that it will include clause assignment as per OpenMP rules) and create AST for all necessary implicit clauses. These clauses will also be assigned to the correct unit constructs.
ActOnOpenMPExecutableDirective will create AST nodes for the unit directives.
If num_teams or thread_limit were present, create a CompoundStmt with the declarations/initializations of variables used on them, followed by the OpenMP construct. Rationale:

The functions ActOnOpenMPRegionStart and ActOnOpenMPRegionEnd manage scopes that serve as representations of capturing regions. The “Start” function opens them (i.e. pushes them on the FunctionScopes stack), and the “End” function closes them and creates the corresponding CapturedStmt.

Originally, all explicit clauses were created before capture regions were opened, and all implicit clauses were created afterwards. In other words, none of the variables in clause expressions were automatically captured. However, each clause had a capturing region associated with it (via getOpenMPCaptureRegionForClause), which is the region where the clause’s variables need to be captured. The code in ActOnOpenMPRegionEnd would go over the explicit clauses (i.e. the clauses that had been created by then) and capture the referenced variables in the scope that corresponds to the clause’s capturing region. The AST nodes for the clauses were attached to the top-level AST node that represented the (possibly compound) directive.

The new implementation moves the creation of implicit clauses from ActOnOpenMPExecutableDirective to ActOnOpenMPRegionEnd and performs the directive splitting in there as well. The role of getOpenMPCaptureRegionForClause has been replaced by the directive splitter, and the capture regions are determined by the clause’s location among the unit constructs. The reason to do all this in ActOnOpenMPRegionEnd is to be able to create proper variable captures (while the capturing scopes were still open).

The num_teams and thread_limit clauses are the only clauses whose expressions need to be evaluated in the context immediately enclosing the outermost leaf construct. To accomplish that evaluate them explicitly and assign to temporary variables.

What has changed

Every unit construct has its own capturing regions.
Clause pre-inits are essentially moot, except for num_teams and thread_limit.
DSAAttrChecker can now deal with open CapturedRegionScopeInfos. This is used to create implicit clauses prior to closing the capturing regions. Rationale:

Previously, a compound directive may have had an arbitrary number of capturing regions (each of which would eventually result in a CapturedStmt). When a directive is split, there must be a capturing region for each resulting unit construct.

In the old implementation all clauses were attached to the root AST node (which was the one representing the entire directive), and some clauses contained initialization code that would be emitted once the placement was determined in CodeGen.
In the new implementation the code can be emitted immediately when it’s encountered, since the clause is already placed in the correct location.

More on clause assignment

The actual splitting and clause assignment happens in frontend-agnostic code, which uses its own representation of directives and clauses. In the prototype code that representation is contained in the ext namespace (“ext” roughly referring to “external”).

The specific steps to split a directive in ActOnOpenMPRegionEnd are as follows:

Convert explicit clauses to “ext”.
Use DSAAttrChecker to analyze the body of the directive and create “ext” clauses for the indicated implicit clauses.
Split tie directive with the above clauses attached to it.
Iterate over unit constructs from the innermost to the outermost.
- Analyze the clauses assigned to the unit construct to see if any more implicit directives are needed. If so, consult the splitter about their placement, and add them to the required unit construct enclosing the current one. Rationale:

After creating the initial set of implicit clauses, and after splitting the original directive with the implicit clauses added, there may still be a need to create additional implicit directives. The motivating example came from “nvptx_SPMD_codegen.cpp”:

#pragma omp target teams distribute parallel for simd if(a)
  for (int i = 0; i < 10; ++i)
    ;

The initial analysis doesn’t indicate any implicit clauses to be created, however after splitting, we end up with the following assignment:

target if(a)
teams
distribute parallel for simd if(a)

Because of that, the variable a is now explicitly present inside of the target region, which necessitates (in this case) a firstprivate directive. This wouldn’t have been inferred by the splitting code, since it only implements clause assignment rules (which do not cover such cases). It may be argued that clang itself should be able to detect it (DSAAttrChecker), but at the moment it does not.

Edit: tagged more people
Edit: added links to prototype #2

Thanks for working on this and for the great write-up! I really like the expandable rationales.

When thinking about the problem, I was assuming it would be OMPCompoundBlockDirective and OMPCompoundLoopDirective subsuming all the compound and leaf directives. A bit flag member would indicate which leaf directives are part of it and decomposition would only take place at CodeGen. Looking at each bit, the already existing leaf lambdas such as emitCommonOMPParallelDirective would be added to the callback stack. Why did you decide to do the decomposition in the front-end instead? Is it because you would not need to change anything about already existing leaf constructs and therefore less risk to break things?

Decomposition in the frontend as some disadvantages. In addition to consistency of the LLVM-IR, you also need to ensure consistency of the AST after decomposition. The AST is also meant to faithfully represent the source input (in contrast to an IR) which decomposed constructs do not (e.g. combined constructs not distinguishable from written in two pragmas). Conceptually different AST in templates and non-templates, both need to be handled by e.g. code completion. More AST node types (the unit ones + the compound ones representing the same semantics). Many more captured statements potentially add overhead.

kparzysz October 9, 2024, 1:59pm 3

I wanted to make the changes as early as possible, so that the rest of the code can work on the new representation. In the beginning I looked at doing it in the parser, but that quickly turned out to the unfeasible.

The code generator doesn’t have a convenient representation to use for this. It takes AST and outputs LLVM IR. I didn’t want to create new AST nodes in codegen—if I do that, why not do it in the frontend.

With the changes in the frontend, the codegen should just work. At least that was the idea. It wouldn’t produce optimal code, but it should be functionally correct.

kparzysz October 9, 2024, 3:27pm 4

In the OpenMP call @Meinersbur talked more about the codegen approach: every (compound) OpenMP directive would be represented by the same generic AST node, and then the codegen would walk over the constituents, and generate code for each.
The downside of the AST approach are potential difficulties in unparsing the AST back into source (that ideally should match the original source code).

Michael, please correct this if I misunderstood anything.

I haven’t really explored the codegen approach in any detail, but there are some complications I see with this.

The frontend does some shortcuts at the moment regarding the creation of capture regions—some combined directives have fewer regions than the number of leaf constituents (even if the constituents would require a region if they were on their own). This relies on those directives being handled as a whole in the codegen. This means that either we need to change the codegen for these constituents to deal with a missing region, or that we create extra regions in the AST. My approach does the latter, but it seems like either one would require changes here.
Variables in OpenMP clauses need to be inserted in proper capture regions. In the AST approach, the clause assignment happens first, then the capture regions are updated before converting them into CapturedStmt nodes. This already happens now, but is based on getOpenMPCaptureRegionForClause, which is more of an approximation of which region each clause would belong to. The AST approach makes it explicit, and eliminates the need for that function. The codegen approach would keep it, and then perform the directive splitting. This would create a second source of information about which region a variable should be captured in—ideally they would match, but making changes to one could require updating the other.

This is what came to my mind right now.

jdoerfert October 10, 2024, 12:39am 5

I think splitting the pragmas early has benefits.
I believe the “unit” constructs idea is important, forcing only leaf constructs is not viable.

Arguably, we could keep the “opaque AST nodes” till the end, but given that the nodes we create already exist, I’m less inclined to that approach. To preserve the source representation one could string the unit constructs together (parent/child relationship) or simply mark everything that is implicit. The latter might be sufficient modulo reordering. I guess we can always keep the original spelling if that is required, or even use the special AST node we use in variant calls to represent the original callee and the chosen callee at the same time.

I think this has some nice benefits:

Tooling can see all the unit constructs explicitly, without having to deal with N combinations.
We can generate code on a per-unit construct basis, which should make things simpler.
The AST explicitly shows what is going on, e.g., what directives take which clauses. This is actually nice to understand what is going on.

And some drawbacks:

We need to ensure the original construct can be recreated, at least a semantically equivalent version of it. As mentioned, I believe that’s doable.
Codegen, or the optimizer, might need to deal with special sequences of constructs to ensure performance. distribute parallel for comes to mind, as that should result in the new device runtime call rather than 2 loops with an explicit parallel region. I am generally inclined to do this in the optimizer anyway.

Clause pre-inits are essentially moot, except for num_teams and thread_limit.

Can you elaborate on why these two are special? Other clauses on teams should be similar, no?

I’m not sure about this part:

which necessitates (in this case) a firstprivate directive.

Do we currently, and/or with this proposal, make all implicit captures explicit?

kparzysz October 14, 2024, 1:52pm 6

The OpenMP spec requires that all expressions on clauses are evaluated in the context of the immediately enclosing construct. That is, except two clauses: num_teams and thread_limit. Those are required to be evaluated in the context of the construct immediately enclosing the outermost leaf.

When we attach these clauses to directives, these directives may not be the outermost, so when we create type tmp = expression; for them, we still need to stash it “for later” in the pre-init statement. For the rest of the clauses we don’t need to do that, because they are already placed in the right context (i.e. we can evaluate the expressions right where the clause is located).

Correct me if I’m wrong, but the implicit/explicit captures is a concept that applies to the source code. In the AST they are all represented the same, i.e. they will show up in CapturedStmt regardless of how they ended up captured.

The example you’re referring to is

#pragma omp target teams distribute parallel for simd if(a)

When it’s broken up into unit constructs, we have

target
|-- if(a)   // this "a" is "evaluated" outside of "target"
`-- teams
    `-- distribute parallel for simd (composite)
        `-- if(a)   // this "a" is now inside of "target"

In order for the second if(a) to work, the a needs to be mapped somehow into the target region. Currently, the codegen operates on the entire compound directive, and takes care of that internally, but when we separate the constituents, the target directive needs the extra “firstprivate” clause for the codegen to handle it right.

kparzysz December 5, 2024, 9:53pm 7