How to handle host-side global data automatically when lowering to GPU with MLIR?
January 15, 2025, 7:28am 1
When using MLIR to generate GPU code, I noticed that a lot of host-side global data is passed to gpu.launch_func as input parameters, which seems to cause some issues. (Someone kindly answered this question for me before.)
After examining examples in the test directory, I found that this issue appears to be resolved only by manually adding data transfers from host side to device side. Is there a way to avoid manually modifying auto-generated IR, perhaps through some passes or other mechanisms?
[Additional context that might help: I’m looking for a more automated approach to handle host-to-device data transfers when lowering MLIR to GPU, rather than having to modify the IR manually. Has anyone encountered similar issues or developed passes to handle this automatically?]
Let me know if you’d like me to clarify or expand on any part of this question.
lilil January 15, 2025, 8:46am 2
Additionally, it seems that besides global data, a value created with memref.alloc cannot be passed directly as an input parameter to gpu.launch_func either. Doing so results in errors similar to the following:
'cuStreamSynchronize(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuStreamDestroy(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
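For reference, the manual workaround mentioned above (staging the host buffer through a device allocation) might look roughly like the sketch below. The names @kernels and @scale, the buffer shape, and the launch dimensions are hypothetical and only for illustration. Depending on the lowering pipeline, gpu.alloc and gpu.memcpy may also need to be in their async form (for example after running the gpu-async-region pass) before they can be converted to runtime calls.

```mlir
module attributes {gpu.container_module} {
  func.func @main() {
    %c1 = arith.constant 1 : index
    %c10 = arith.constant 10 : index

    // Host-only buffer: the device cannot dereference this pointer directly.
    %host = memref.alloc() : memref<10xf32>

    // Device buffer plus an explicit host-to-device copy.
    %dev = gpu.alloc () : memref<10xf32>
    gpu.memcpy %dev, %host : memref<10xf32>, memref<10xf32>

    // Pass the device buffer, not the host buffer, to the kernel.
    gpu.launch_func @kernels::@scale
        blocks in (%c1, %c1, %c1) threads in (%c10, %c1, %c1)
        args(%dev : memref<10xf32>)

    // Copy the result back so the host can read it.
    gpu.memcpy %host, %dev : memref<10xf32>, memref<10xf32>

    gpu.dealloc %dev : memref<10xf32>
    memref.dealloc %host : memref<10xf32>
    return
  }

  gpu.module @kernels {
    // Hypothetical kernel: doubles each element in place.
    gpu.func @scale(%buf: memref<10xf32>) kernel {
      %tid = gpu.thread_id x
      %two = arith.constant 2.0 : f32
      %v = memref.load %buf[%tid] : memref<10xf32>
      %r = arith.mulf %v, %two : f32
      memref.store %r, %buf[%tid] : memref<10xf32>
      gpu.return
    }
  }
}
```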
grypp January 15, 2025, 12:39pm 3
Managing GPU-CPU data transfers automatically might require a significant amount of compiler work, and sometimes, it’s not even possible without full visibility into the entire program. It depends on what you want to support. my5cents
One way of handling copies automatically is to use the vendor's unified memory or managed memory solutions. Unified memory might require specific systems, but managed memory is widely available.
In MLIR, you can allocate memref data using %memref = gpu.alloc host_shared () : memref<10xf32> and enable GPU-CPU read-write access. The underlying driver and hardware will manage the virtual pages and handle data transfers automatically.
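A minimal sketch of how the host_shared allocation could replace explicit copies is shown below. The module name @kernels, the kernel @scale, and the sizes are assumptions made for illustration, and the sketch assumes a runtime and target that support managed memory. The launch here is synchronous (no async token), so the host reads the buffer only after the kernel has finished.

```mlir
module attributes {gpu.container_module} {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c10 = arith.constant 10 : index
    %one = arith.constant 1.0 : f32

    // One allocation that both host and device can read and write.
    %buf = gpu.alloc host_shared () : memref<10xf32>

    // The host initializes the buffer directly; no gpu.memcpy is needed.
    scf.for %i = %c0 to %c10 step %c1 {
      memref.store %one, %buf[%i] : memref<10xf32>
    }

    // The same memref is passed straight to the kernel.
    gpu.launch_func @kernels::@scale
        blocks in (%c1, %c1, %c1) threads in (%c10, %c1, %c1)
        args(%buf : memref<10xf32>)

    // The host can read the result from the same buffer.
    %v = memref.load %buf[%c0] : memref<10xf32>

    gpu.dealloc %buf : memref<10xf32>
    return
  }

  gpu.module @kernels {
    // Hypothetical kernel: doubles each element in place.
    gpu.func @scale(%arg0: memref<10xf32>) kernel {
      %tid = gpu.thread_id x
      %two = arith.constant 2.0 : f32
      %x = memref.load %arg0[%tid] : memref<10xf32>
      %r = arith.mulf %x, %two : f32
      memref.store %r, %arg0[%tid] : memref<10xf32>
      gpu.return
    }
  }
}
```

This only covers buffers that the IR allocates itself. Host-side globals or memref.alloc results produced by earlier passes would still need to be rewritten to use such an allocation, or copied into one.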