Dialect for data locality/sharing specifiers/clauses in OpenMP, OpenACC, and do concurrent

Background

OpenMP, OpenACC, and do concurrent all provide utilities to control the locality/privatization/shareability of data items within the scopes of their constructs. OpenMP has the private, firstprivate, and lastprivate clauses to control how a data item is to be privatized within the scope of a construct. For do concurrent, the user can use the local and local_init locality specifiers to achieve a goal similar to that of those OpenMP clauses. There are some semantic differences between the OpenMP clauses and the do concurrent specifiers. For example, in the do concurrent case a privatized item is created/allocated for each iteration of a concurrent loop, while in the OpenMP case the same allocation might be used for all iterations of a chunk. However, on the syntactic level, these constructs are quite similar.
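To make the allocation difference concrete, here is a rough C++ model (not flang code) that only counts how many private copies each scheme would create. The function names are hypothetical, and the per-chunk behaviour is modelled as if an OpenMP implementation always reuses one copy per chunk, which the text above only says it might do.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: one fresh private copy per iteration, as described
// above for do concurrent's local specifier.
std::size_t allocationsPerIteration(std::size_t iterations) {
  std::size_t allocations = 0;
  for (std::size_t i = 0; i < iterations; ++i)
    ++allocations; // a new private copy for this iteration
  return allocations;
}

// Hypothetical model: one private copy shared by all iterations of a chunk.
// This assumes the OpenMP runtime reuses the allocation within a chunk.
std::size_t allocationsPerChunk(std::size_t iterations, std::size_t chunkSize) {
  assert(chunkSize > 0 && "a chunk must contain at least one iteration");
  std::size_t allocations = 0;
  for (std::size_t begin = 0; begin < iterations; begin += chunkSize)
    ++allocations; // iterations [begin, begin + chunkSize) reuse this copy
  return allocations;
}
```

With 10 iterations and a chunk size of 4, the first model creates 10 copies and the second creates 3 (chunks of 4, 4, and 2 iterations).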

A similar observation can be made for the reduction clause in OpenMP and reduce specifier in do concurrent constructs.

This RFC will mainly focus on OpenMP and do concurrent since I am not as familiar with OpenACC. However, the discussion should be extendable to OpenACC as well.

Current status in flang (and relevant MLIR dialects)

Each of the 3 programming models (OpenMP, OpenACC, and do concurrent) implements the above utilities on its own and, in some cases, differently from the other models.

Delayed privatization vs. early privatization

Over the past year, the OpenMP dialect implemented “delayed privatization”. With delayed privatization, privatization clauses are modeled in the IR and only lowered (or inlined) as late as possible in the pipeline, in particular, when MLIR is lowered to LLVM. For example, consider the following Fortran input:

  !$omp target private(simple_var)
    simple_var = 10
  !$omp end target

When delayed privatization is enabled (which is currently the default for most OpenMP constructs), flang emits a separate operation to encapsulate the privatization logic:

  omp.private {type = private} @_QFtarget_simpleEsimple_var_private_i32 : i32

and links this op to the relevant OpenMP construct by referencing its symbol:

  omp.target private(@_QFtarget_simpleEsimple_var_private_i32 %2#0 -> %arg0 : !fir.ref<i32>) {
    .... use %arg0 within the construct's scope ....
  }

Note that such delayed privatizers become more complex when they model firstprivate (they must also model the copying logic) or when they handle more complex data types (e.g. they must clean up allocatables).
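The symbol-based linkage and the optional copy logic can be sketched in plain C++. This is only a toy model under stated assumptions: `Privatizer`, `SymbolTable`, and `materializePrivateCopy` are hypothetical names, not MLIR or flang APIs, and the private copy is zero-initialized purely to keep the sketch deterministic (a real private variable starts out uninitialized).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for an omp.private-like op; not the MLIR API.
struct Privatizer {
  bool isFirstprivate;
  // Optional "copy region": initializes the private copy from the original.
  // Left empty for plain private.
  std::function<void(const int &original, int &privateCopy)> copy;
};

// Privatizers live apart from the constructs that use them and are found
// by symbol name, mirroring how omp.target references the omp.private op.
using SymbolTable = std::map<std::string, Privatizer>;

// Late "inlining": resolve the symbol and materialize the private copy only
// at the point where the construct is finally lowered.
int materializePrivateCopy(const SymbolTable &symbols,
                           const std::string &symbol, const int &original) {
  const Privatizer &privatizer = symbols.at(symbol);
  int privateCopy = 0; // assumption: zeroed only for determinism here
  if (privatizer.isFirstprivate && privatizer.copy)
    privatizer.copy(original, privateCopy);
  return privateCopy;
}
```

Until `materializePrivateCopy` runs, the construct carries only a symbol name, which is the property that keeps the privatization logic separate from the parent construct in the IR.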

This is not a new or a unique idea since OpenACC also has a similar way of modeling privatization through its acc.private.recipe operation which has very similar syntax, semantics, and usage to omp.private.

The opposite of delayed privatization is early/eager privatization. In this case, instead of modelling the privatization logic in a separate op, we inline that logic early within the construct on which the privatization is specified. The obvious downside of this is that the logic of privatization and that of the parent construct are intermingled, reducing debuggability within the compiler’s pipeline. At the moment, do concurrent locality specifiers are still modeled using early privatization.

Modeling reduction

For reduction, the 3 programming models have their own separate but very similar approaches as well. For example, OpenMP has the omp.declare_reduction op while OpenACC has the acc.reduction.recipe op. do concurrent does not use a separate op but models reductions using attributes that store the reduction operation, e.g. reduce(#fir.reduce_attr<add> -> %sum : ....).

Proposal

This RFC proposes starting a new separate dialect to model privatization/locality as well as reduction clauses/specifiers across OpenMP, OpenACC, and do concurrent. In particular, such a dialect would contain the following ops as a start:

  1. One operation that merges both omp.private and acc.private.recipe. The same op can then be used to model delayed privatization for do concurrent’s local and local_init specifiers.
  2. One operation that merges both omp.declare_reduction and acc.reduction.recipe. The same op can then be used for do concurrent’s reduce specifier as well.

Proof of concept

To provide a more concrete idea of what this looks like, a proof-of-concept was implemented here. This PoC does not create a new dialect but rather reuses OpenMP table-gen constructs for modeling do concurrent’s local and local_init specifiers. The PoC is divided into a number of commits, each of which is self-contained and focused on a specific part of the pipeline, e.g. there is a commit for lowering from PFT to MLIR, a commit for parsing and printing, a commit for lowering between relevant MLIR constructs, etc. Tests are also included to showcase the resulting MLIR.

Productizing the PoC

As mentioned, the PoC does not actually start a new dialect but rather reuses some of the OpenMP table-gen records in the FIR dialect. The next steps towards productizing this PoC might be the following:

  1. Moving the used OpenMP records to the new dialect. In particular, moving the OpenMP_Clause, OpenMP_PrivateClauseSkip and OpenMP_PrivateClause records and generalizing their names as appropriate.
  2. Using the generalized OpenMP_PrivateClause in both the OpenMP and FIR dialects similar to what the PoC currently does.
  3. Using the generalized OpenMP_PrivateClause in the OpenACC dialect.
  4. Doing a similar round of changes for reductions.

Questions

Any feedback on the above is, of course, welcome. However, a few questions to start:

  1. Are there any expected blockers to having such a dialect? I might be missing intricate details specific to the relevant programming models. Therefore, I am interested to know if there are any major issues having shared constructs/table-gen records across the 3 dialects.
  2. If this new dialect is a reasonable idea, any suggestions for naming the dialect as well as the private/local-related records?

Thanks @bob.belcher for writing this RFC.

Please treat the following only as a suggestion.

An alternative approach would be for the OpenMP private/reduction clauses and the do-concurrent locality clauses to require FIR (or CIR) to have a clone operation. The clone operation can be similar to the omp.privateclause operation which specifies how to allocate a clone, initialize it, copy from another variable, and finalize/destroy if applicable. The fir.clone operation can then be used by OpenMP, do concurrent and OpenACC. At the LLVM level, hopefully it can be modelled as a function operation.

tblah April 30, 2025, 10:41am 3

Thanks for the RFC.

How do you plan to handle the semantic differences between OpenMP and do concurrent? Are there others that we must be careful of? (OpenACC too but I know less about that).

If we are going to the effort of creating new operations etc, I would like to also take this opportunity to unify the private and reduction declaration operations. In OpenMP at least, reduction is no more different from privatization than private is from firstprivate. I would like to keep the design of the reduction declaration operation and add the reduction combiner region as another optional region (similar to the copy region).

Do you have any plans about how you would manage this transition? This would be a big change requiring a lot of coordination.

Kiran, thanks for your suggestion. I think making requirements like that of the lowerings using the OpenMP dialect would be a step backwards from the generality we currently support. In particular I think it would be difficult to keep the current lowerings from affine. I think it might also be challenging to express which clone operation to use inside of the reduction clause without reverting to just having a region containing arbitrary mlir (as we do already).

If we assume that we will continue to represent initialization as a region containing mlir operations, then whether there are several operations or just one is orthogonal to this RFC.

+1

This would not be very different from each lowering filling an omp.privateClause Op. Rather than filling such an operation, we will require the presence of a similar operation (which could even be the omp.privateClause op) using an interface or a trait. So this will be a generalisation rather than a restriction.

Do you mean the scf.parallel to OpenMP conversion?

Thanks both for your comments.

I think a separate dialect might be better since we wouldn’t have to leak fir or cir ops/records into OpenMP and OpenACC dialects. Do you think having a separate dialect for this would be an overkill?

This is for the lowering logic to handle. On the syntactic level, private and local would be represented similarly. When the compiler then lowers the clause/specifier, it will scope the allocations appropriately. For example, when lowering fir.do_concurrent to fir.do_loop .... unordered, a private/local copy will be allocated for each iteration. However, when lowering fir.do_concurrent to omp parallel do, for example, the local specifier will be “transferred” as a private clause to the newly created parallel do construct, and the MLIR to LLVM lowering will handle this new clause just like it does now. So on a semantic level, the privatizer op will be interpreted according to the construct/parent op to which it is attached.
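As a rough sketch of "interpreted according to the parent op", the following C++ models the decision as a small function. All names are hypothetical and none of this is flang code. Mapping local_init to firstprivate is my assumption, extrapolated from the local-to-private case described above.

```cpp
#include <cassert>
#include <string>

// Hypothetical enums for this sketch only.
enum class LocalitySpecifier { Local, LocalInit };
enum class LoweringTarget { UnorderedDoLoop, OmpParallelDo };

struct LoweredPrivatization {
  // OpenMP clause the specifier is transferred to; empty when the target is
  // not an OpenMP construct.
  std::string ompClause;
  // True when the lowering itself allocates a copy in every iteration.
  bool allocatePerIteration;
};

// The same specifier is lowered differently depending on the target op.
LoweredPrivatization lowerSpecifier(LocalitySpecifier specifier,
                                    LoweringTarget target) {
  if (target == LoweringTarget::UnorderedDoLoop)
    return {"", true}; // copy allocated inside the loop body
  // Assumption: local -> private and local_init -> firstprivate; the
  // existing OpenMP lowering then decides where the allocation lives.
  return {specifier == LocalitySpecifier::Local ? "private" : "firstprivate",
          false};
}
```

The point of the sketch is that the specifier itself carries no allocation scope; only the pair (specifier, target construct) determines it.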

Sounds reasonable. Was this discussed before somewhere? This as well is a big effort and might require a discussion/RFC on its own. I am not very familiar with reductions though so I might be overcomplicating this.

Nothing concrete yet. I am hoping we can choose a direction out of this RFC and then discuss the implementation plan afterwards. The lazy part of my brain is tempting me to duplicate what we have in OpenMP in do concurrent first, so that we can move forward with do concurrent locality specifiers as quickly as possible, and then deduplicate later. The advantages would be keeping the stable parts of the compiler stable (i.e. OpenMP and OpenACC stuff) and completing the implementation of do concurrent to OpenMP mapping more quickly. But I think we have to think this through more carefully.

tblah April 30, 2025, 3:10pm 6

Thanks

Okay that makes sense. This allocation inside of loop bodies will be another case where making allocation implicit in the omp.private operation becomes useful (so we can avoid stack allocations in the loop body). I think unless we get input from an OpenACC expert, that should not be included in any unification because these semantic differences could remain quite subtle.

Not publicly. I have discussed that idea in private previously with you and others. I haven’t made an RFC for it because I have not had the time to be able to commit to doing the work myself.

OpenMP reduction and privatization share the same lowering code for generating the init and dealloc regions. One difference is that reduction has an alloc region (which in practice just contains an alloca); it shouldn’t be hard to make the allocation implicit as in omp.private (which I think is a better design because it better constrains the contents of the alloc region). The other difference is that there is a combiner region implementing the reduction itself; this could be optional in the same way that the firstprivate copy region is optional (depending on the type of the private op).
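A toy C++ model of that shape, with the copy and combiner as optional members, might look like the following. The names `DataSharingRecipe` and `runWithRecipe` are hypothetical and do not correspond to any existing op. Allocation is implicit (a local variable), and the dealloc region is omitted because the sketch only handles int.

```cpp
#include <cassert>
#include <functional>

// Hypothetical unified recipe; names are illustrative, not existing MLIR ops.
struct DataSharingRecipe {
  // Init region: value of a freshly created private copy.
  std::function<int()> init;
  // Optional copy region (firstprivate-like): derive the copy from the original.
  std::function<int(int original)> copy;
  // Optional combiner region (reduction-like): fold a private copy into the result.
  std::function<int(int accumulated, int privateCopy)> combiner;
};

// Runs `body` once per worker on its own private copy. The allocation is
// implicit, and the combiner is applied only when the recipe provides one.
int runWithRecipe(const DataSharingRecipe &recipe, int original, int workers,
                  const std::function<void(int &privateCopy, int worker)> &body) {
  int result = original;
  for (int worker = 0; worker < workers; ++worker) {
    int privateCopy = recipe.copy ? recipe.copy(original) : recipe.init();
    body(privateCopy, worker);
    if (recipe.combiner)
      result = recipe.combiner(result, privateCopy);
  }
  return result;
}
```

A recipe with only `init` behaves like private, adding `copy` gives firstprivate, and adding `combiner` gives a reduction, so the three differ only in which optional members are present.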

We also support passing scalar variables by value instead of by reference, which complicates the reduction operation slightly. Personally I think this adds more complexity without any measured benefit (in theory it might improve performance but I was never able to find a case that demonstrated this), so I would be happy to see that get dropped. But if you wanted to continue with that feature I don’t think it conflicts with this design.

Unifying privatization and reduction is definitely not required to accomplish your RFC and could be completed separately. I was just thinking that it wouldn’t be much work to add this on if you need to change all of the mlir->llvm codegen code anyway (in fact it might make it easier by enabling more sharing between reduction and privatization).

klausler April 30, 2025, 5:03pm 7

This would apply to implicitly localized variables in do concurrent as well, yes?

do concurrent (j=1:n)
  tmp = ...
  ...
  ... = tmp
end do

Ok, I think I have a more concrete plan to go forward without having to postpone implementing the feature we actually care about (which is a more mature implementation of do concurrent and its mapping to OpenMP). Here is a potential plan:

  1. A transitional stage where we duplicate the private-related table-gen records to fir. So OpenMP_PrivateClause will be duplicated to fir_LocalSpecifier and PrivateClauseOp will be duplicated to LocalSpecifierOp. The advantage of doing this is that we move forward faster with do concurrent mapping, without having to wait for the major refactoring needed to unify the private and reduction operations and to create a new dialect (or a new op somewhere shared between OpenMP, fir, and possibly OpenACC). This transitional solution will also be cleaner than reusing the OpenMP stuff in fir as the PoC currently does.
  2. Start refactoring in the OpenMP dialect by combining private and reduction operations into one op. This new combined op will be the base for the shared op between fir and OpenMP (and OpenACC) later on to model private/local/reduce clauses/specifiers.
  3. Promote the new OpenMP private/reduce op to a shared location so that it can be used at least for fir. Whether this is a separate dialect, lives in fir, or some other solution is still open for discussion. If the dialect idea is overkill, I think we can scrap it and have some shared interfaces similar to what we have under the OpenACCMPCommon directory.

Ah, thanks for bringing that up. At the moment, this does not happen on the level of the fir.do_concurrent.loop op. However, when converting do concurrent (to OpenMP) in DoConcurrentConversion.cpp, the pass collects “loop-local” values and privatizes them (see flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp on main in llvm/llvm-project on GitHub). I think this can be improved though by modelling implicitly localized variables on the do concurrent level.