Alternative to copy host memory to device with stride and offset using gpu.memcpy
November 28, 2025, 5:55pm
```mlir
gpu.memcpy %memref_3, %arg0 : memref<1x10xf32, 1>, memref<1x10xf32, strided<[?, ?], offset: ?>>
%memref_4 = gpu.alloc () : memref<10x5xf32, 1>
gpu.memcpy %memref_4, %arg1 : memref<10x5xf32, 1>, memref<10x5xf32, strided<[?, ?], offset: ?>>
```
I am trying to copy all of the function's host-memory inputs (which have dynamic strides and offsets) to GPU memory. When lowering to the LLVM dialect, the conversion introduced `builtin.unrealized_conversion_cast` ops:
```mlir
%memref_0 = gpu.alloc () : memref<1x10xf32, 1>
%58 = builtin.unrealized_conversion_cast %memref_0 : memref<1x10xf32, 1> to !llvm.struct<(ptr<1>, ptr<1>, i64, array<2 x i64>, array<2 x i64>)>
gpu.memcpy %memref_0, %31 : memref<1x10xf32, 1>, memref<1x10xf32, strided<[?, ?], offset: ?>>
%memref_1 = gpu.alloc () : memref<10x5xf32, 1>
%59 = builtin.unrealized_conversion_cast %memref_1 : memref<10x5xf32, 1> to !llvm.struct<(ptr<1>, ptr<1>, i64, array<2 x i64>, array<2 x i64>)>
gpu.memcpy %memref_1, %23 : memref<10x5xf32, 1>, memref<10x5xf32, strided<[?, ?], offset: ?>>
```
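For context, my understanding is that `builtin.unrealized_conversion_cast` ops are an intermediate artifact of partial type conversion: they are expected to cancel out in pairs once every producer and consumer has been converted, and upstream `mlir-opt` provides a `reconcile-unrealized-casts` pass to erase the leftovers. A pipeline sketch (pass names assumed from upstream `mlir-opt`; the exact ordering may differ in your setup):

```
mlir-opt input.mlir \
  --gpu-to-llvm \
  --reconcile-unrealized-casts
```

If casts survive `reconcile-unrealized-casts`, that usually signals a genuine type mismatch left behind by the lowering rather than a cleanup problem.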
Since device copies generally require contiguous data, what is the standard way in MLIR to copy from a host memref with dynamic strides and offset to a contiguous device buffer?
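One workaround I am considering (a sketch, not confirmed as the idiomatic approach): stage the strided host data through a contiguous host buffer with `memref.copy`, which accepts non-identity layouts, and only then issue a contiguous-to-contiguous `gpu.memcpy`:

```mlir
// Hypothetical staging scheme: make the strided host data contiguous
// first, then do a contiguous-to-contiguous host-to-device copy.
func.func @stage_and_copy(%arg0: memref<1x10xf32, strided<[?, ?], offset: ?>>) {
  // Contiguous host staging buffer (identity layout).
  %host = memref.alloc() : memref<1x10xf32>
  // memref.copy handles a strided source layout.
  memref.copy %arg0, %host : memref<1x10xf32, strided<[?, ?], offset: ?>> to memref<1x10xf32>
  // Device buffer in GPU address space 1.
  %dev = gpu.alloc () : memref<1x10xf32, 1>
  // Both sides are now contiguous.
  gpu.memcpy %dev, %host : memref<1x10xf32, 1>, memref<1x10xf32>
  memref.dealloc %host : memref<1x10xf32>
  return
}
```

This costs an extra host-side copy per argument, so I would be glad to hear if there is a way to express the strided copy directly.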