[RFC] OpenMP Offload to a device from the accelerator (i.e. nested target directive)

Hi @jdoerfert,
Thanks for sharing your suggestion, and sorry for the late response. We had an internal discussion about this topic.
We've looked into @kai_plociennik's approach that you mentioned; starting from it could be an option.
We'd also like to continue this discussion here or in a meeting. That would be very helpful for us 🙂

The main purpose of our suggestion is to design a general interface for nested offloading that is not tightly coupled to a specific architecture.
That is, the interface should work for various types of 'host' (CPU, GPU, or any other possible host) and 'target' (likewise CPU, GPU, …).
To make this possible, we propose two discussion points: the programming model and the (runtime) offload interface.

For a general programming model, our first option is to use OpenMP's existing target directive in nested form (perhaps as Kai suggested).

#pragma omp target
{
        #pragma omp target parallel for
        for (int i = 0; i < 10; i++) a[i] = 10;
}

The second option is to introduce a new directive, or a new clause on the target directive, to mark the nested offloading target.

#pragma omp target
{
        #pragma omp ntarget parallel for // [new directive] nested target
        for (int i = 0; i < 10; i++) a[i] = 10;
}

or

#pragma omp target
{
        #pragma omp target parallel for nested // [new clause] nested target
        for (int i = 0; i < 10; i++) a[i] = 10;
}

We think the first option provides the more general programming interface for nested offloading.
In that case we can also support various types of hosts and offloading targets without any special hints in the code.
The mapping between each code region and its host/offload target could then be decided by a compile option or by the compiler itself.
WDYT?

The second discussion point is the design of a unified API for nested offload.
As the LLVM/Offload project progresses, we also want to provide a common API that is independent of the host/target types when doing nested offload.

Extending libomptarget is one option, for compatibility.
However, differences in host-device functionality seem to be a problem for reusing the existing libomptarget:
its operations are executed on the CPU, and some of those operations may not work on accelerators.

Since it seems difficult to provide all the functionality of the existing libomptarget (due to hardware limitations), it would be better to provide a common API suited to accelerators.
For example, we could expose only the functions essential for offload, such as kernel launch and data movement.

Providing such a common API seems reasonable to us regardless of the type of accelerator, but as the LLVM/Offload project progresses, we wonder whether the characteristics of specific accelerators, or other practical issues, would make a common API difficult to provide.

Thanks,
Sincerely,
Youngjoo Ko