Question about GPU Dialect Async Tokens in MLIR
I have a question regarding the implementation of asynchronous tokens in the GPU dialect of MLIR. Specifically, I’m wondering about synchronization when my IR contains parts that cannot be offloaded to GPU execution.
Consider this example:
%339 = gpu.wait async
%341 = gpu.launch_func async [%339] @module_1::@kernel_1
%343 = gpu.launch_func async [%341] @module_1::@kernel_2
scf.for %arg1 = %c0 to %c512 step %c1 {
}
%356 = gpu.wait async [%343]
%345 = gpu.launch_func async [%356] @module_2::@kernel_1
%347 = gpu.launch_func async [%345] @module_2::@kernel_2
gpu.wait [%347]
My specific question is: Is it valid to use gpu.wait async for synchronization when there are parts of my IR that cannot be offloaded to GPU execution (like the empty scf.for loop in this example)? Or must I use the blocking gpu.wait operation to explicitly synchronize in such cases?
In the example above, I have a sequence of asynchronous GPU operations with a CPU-side scf.for loop in the middle. After the loop, I'm using %356 = gpu.wait async [%343] to create a new token that depends on the previous GPU work. Is this approach correct, or should I be using the blocking gpu.wait before continuing with subsequent GPU operations after CPU code?
Thank you for your insights!
It’s not clear to me what you’re trying to achieve with this: %356 = gpu.wait async [%343].
I believe it is not actually doing anything; the doc states:
If the op contains the `async` keyword, it returns a new async token which
is synchronized with the op arguments. This new token is merely a shortcut
to the argument list, and one could replace the uses of the result with the
arguments for the same effect. The async version of this op is primarily
used to make each async token have a single use during lowering and
thereby make forks in async execution explicit.
That is: “wait async” is only useful when you have multiple tokens to “group” together.
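For example (a minimal sketch with placeholder module and kernel names, and with grid/block sizes and arguments elided as in your example), grouping two independent tokens looks like this:

%t0 = gpu.wait async
// Two independent kernels forked off the same token.
%t1 = gpu.launch_func async [%t0] @mod_a::@kernel_a
%t2 = gpu.launch_func async [%t0] @mod_b::@kernel_b
// "wait async" groups the two tokens into a single one...
%t3 = gpu.wait async [%t1, %t2]
%t4 = gpu.launch_func async [%t3] @mod_c::@kernel_c
// ...which is equivalent to passing both tokens directly:
// %t4 = gpu.launch_func async [%t1, %t2] @mod_c::@kernel_c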
lilil April 14, 2025, 5:20am 3
I apologize, I may not have clearly expressed my question.
First, in the IR example above, there are four kernels and one scf.for, where the third kernel depends on both the scf.for and the first two kernels.
Since scf.for cannot be offloaded to the GPU, this would be considered heterogeneous computing, right?
If I want to use the GPU dialect’s asynchronous mode in this IR example, according to the dependencies, I should insert a synchronization point before the third kernel to ensure that previous computations have been completed.
Here’s where I’m confused - since part of the code can’t be offloaded to the GPU, I’m uncertain about synchronization. According to my understanding, if I insert “%356 = gpu.wait [%343]” after the scf.for, it should work, correct? What if I change it to “%356 = gpu.wait async [%343]” - would this meet the requirements? Would it wait for the first two kernels and the scf.for to finish execution?
Or is it unrelated to whether gpu.wait is in asynchronous mode, and I just need to organize the IR order so that scf.for appears before the third kernel in the code?
gpu.wait without async would not return a token, but I'm confused about why you want to insert a wait in the first place.
It seems you didn't really follow my previous answer quoting the documentation for this operation: in this form it has zero effect on the program execution; whether it is there or not does not change anything.
Yes, this is enough, you don’t need extra synchronization.
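Concretely, a minimal sketch of your example without the redundant intermediate wait would be (same names as in your post, with grid/block sizes and arguments still elided):

%339 = gpu.wait async
%341 = gpu.launch_func async [%339] @module_1::@kernel_1
%343 = gpu.launch_func async [%341] @module_1::@kernel_2
// The host executes the loop synchronously before issuing the next launch.
scf.for %arg1 = %c0 to %c512 step %c1 {
}
// %343 already carries the dependency on the first two kernels.
%345 = gpu.launch_func async [%343] @module_2::@kernel_1
%347 = gpu.launch_func async [%345] @module_2::@kernel_2
// Blocking wait for the whole chain at the end.
gpu.wait [%347]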
lilil April 15, 2025, 6:42am 5
I understand your point now. I previously misunderstood this concept. Having a gpu.wait async that takes a single token doesn't make any sense - it's equivalent to directly passing the async token to the subsequent gpu operation that needs it. Its main purpose is for situations where multiple tokens need to be grouped as dependencies, right?
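In other words (a small sketch with hypothetical placeholder names):

%t1 = gpu.launch_func async [%t0] @mod::@kernel_a
%t2 = gpu.wait async [%t1]          // redundant: %t2 just forwards the dependency on %t1
%t3 = gpu.launch_func async [%t2] @mod::@kernel_b
// equivalent to: %t3 = gpu.launch_func async [%t1] @mod::@kernel_b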
Thank you for your help.