Alternative to copying host memory to device with stride and offset using gpu.memcpy

November 28, 2025, 5:55pm


    gpu.memcpy  %memref_3, %arg0 : memref<1x10xf32, 1>, memref<1x10xf32, strided<[?, ?], offset: ?>>
    %memref_4 = gpu.alloc  () : memref<10x5xf32, 1>
    gpu.memcpy  %memref_4, %arg1 : memref<10x5xf32, 1>, memref<10x5xf32, strided<[?, ?], offset: ?>>

I tried to copy all of the function's host-memory inputs to GPU memory. When lowering to the LLVM dialect, the conversion introduced builtin.unrealized_conversion_cast ops:

    %memref_0 = gpu.alloc  () : memref<1x10xf32, 1>
    %58 = builtin.unrealized_conversion_cast %memref_0 : memref<1x10xf32, 1> to !llvm.struct<(ptr<1>, ptr<1>, i64, array<2 x i64>, array<2 x i64>)>
    gpu.memcpy  %memref_0, %31 : memref<1x10xf32, 1>, memref<1x10xf32, strided<[?, ?], offset: ?>>
    %memref_1 = gpu.alloc  () : memref<10x5xf32, 1>
    %59 = builtin.unrealized_conversion_cast %memref_1 : memref<10x5xf32, 1> to !llvm.struct<(ptr<1>, ptr<1>, i64, array<2 x i64>, array<2 x i64>)>
    gpu.memcpy  %memref_1, %23 : memref<10x5xf32, 1>, memref<10x5xf32, strided<[?, ?], offset: ?>>

Since GPU transfers generally require contiguous data, what is the standard way to handle strided inputs with dynamic offsets like these in MLIR?
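
One workaround I'm considering (an unverified sketch): stage each strided input into a contiguous, identity-layout host buffer with memref.copy, so that gpu.memcpy only ever sees memrefs with a static identity layout. For the first input, this would look roughly like:

    // Sketch (unverified): materialize the strided input into a
    // contiguous host staging buffer before the device copy.
    %staging = memref.alloc() : memref<1x10xf32>
    memref.copy %arg0, %staging : memref<1x10xf32, strided<[?, ?], offset: ?>> to memref<1x10xf32>
    %memref_3 = gpu.alloc () : memref<1x10xf32, 1>
    gpu.memcpy %memref_3, %staging : memref<1x10xf32, 1>, memref<1x10xf32>
    memref.dealloc %staging : memref<1x10xf32>

The extra host-side copy is obviously not free, so I'd be interested to hear whether there is a way to avoid it, e.g. by having the gpu.memcpy lowering handle strided sources directly.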