Why Do Same-Named Kernels via gpu.launch_func Execute Concurrently but Different Kernels Execute Serially?

March 13, 2025

I’ve encountered an interesting behavior with gpu.launch_func operations in MLIR, and I’m trying to understand the underlying mechanism:

When I use gpu.launch_func to launch the same kernel twice with async execution, the two launches execute concurrently (after modifying mgpuModuleLoad in CudaRuntimeWrappers.cpp).

When I use gpu.launch_func to launch two different kernels with async execution, they execute serially, even though there are no dependencies between them and no API calls between the two launches that would cause implicit synchronization.
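For reference, the IR shape I am describing is roughly the following (the module, kernel, and SSA value names here are made up for illustration; the real IR uses my own kernels):

```mlir
// Two independent async launches of *different* kernels. Neither token
// depends on the other, so in principle nothing should order them.
%t0 = gpu.launch_func async @kernels::@kernel_a
          blocks in (%c256, %c1, %c1) threads in (%c256, %c1, %c1)
          args(%buf0 : memref<?xf32>)
%t1 = gpu.launch_func async @kernels::@kernel_b
          blocks in (%c256, %c1, %c1) threads in (%c256, %c1, %c1)
          args(%buf1 : memref<?xf32>)
gpu.wait [%t0, %t1]
```

In the same-kernel case the IR is identical except that both launches target the same symbol (e.g. @kernels::@kernel_a twice).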
When launching the same kernel twice, Nsight Systems shows concurrent execution:

[Nsight Systems timeline: the two launches overlap]

When launching two different kernels, Nsight Systems shows serial execution:

[Nsight Systems timeline: the two launches run back-to-back]

I’ve examined the MLIR JIT compilation pipeline, including the CudaRuntimeWrappers.cpp implementation and the translation mechanism that converts gpu.launch_func operations to mgpuLaunchKernel calls.

I haven’t found anything that would explain this difference in behavior. If there are no API calls between the kernel launches, shouldn’t they execute concurrently regardless of whether they are the same kernel or different kernels?
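As a sanity check outside of MLIR, I would expect a minimal standalone CUDA program (my own sketch, not derived from the MLIR runtime wrappers) that launches two different kernels on two separate non-default streams to show them overlapping in Nsight Systems, since separate streams impose no ordering between the launches:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Two distinct kernels that each do enough work to overlap visibly.
__global__ void kernelA(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (int k = 0; k < 10000; ++k)
    if (i < n) x[i] += 1.0f;
}

__global__ void kernelB(float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (int k = 0; k < 10000; ++k)
    if (i < n) y[i] *= 1.0001f;
}

int main() {
  const int n = 1 << 20;
  float *x, *y;
  cudaMalloc(&x, n * sizeof(float));
  cudaMalloc(&y, n * sizeof(float));

  // Separate non-default streams: no implicit ordering between them.
  cudaStream_t s0, s1;
  cudaStreamCreate(&s0);
  cudaStreamCreate(&s1);

  kernelA<<<n / 256, 256, 0, s0>>>(x, n);
  kernelB<<<n / 256, 256, 0, s1>>>(y, n);

  cudaDeviceSynchronize();
  printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaStreamDestroy(s0);
  cudaStreamDestroy(s1);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

If this overlaps but the MLIR-generated launches do not, that would point at how the lowered runtime calls assign streams rather than at CUDA itself.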

Am I missing something in my analysis, or could this be a limitation of CUDA itself?

Any help would be greatly appreciated!