Assembling and linking AMD GPU kernel code (original) (raw)

Not sure if this belongs in LLVM discussion or gem5, but I’ll start here:

Goal

I’m trying to experiment with AMD GPU kernel GCN assembly code to run on the gem5 simulator. I’d like to be able to write and compile HIP code where the CPU host code is written in C/C++, but the kernel itself is written in GCN assembly code. Eventually, I’d like to make slight modifications to the GCN ISA in LLVM.

My Setup

I’m running the current stable build of gem5 using the gem5 docker image. I am using hipcc in the corresponding docker container to build a small benchmark to run in gem5, which in turn invokes clang under the hood. To start, I am using the following HIP code, which I save as square.cpp:

#include <stdio.h>
#include "hip/hip_runtime.h"

#define CHECK(cmd) \
{\
    hipError_t error  = cmd;\
    if (error != hipSuccess) { \
      fprintf(stderr, "error: '%s'(%d) at %s:%d\n", hipGetErrorString(error), error,__FILE__, __LINE__); \
    exit(EXIT_FAILURE);\
    }\
}

/*
 * Square each element in the array A and write to array C.
 */
__global__ void
vector_square(int *C_d, const int *A_d, size_t N)
{
    size_t offset = (hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x);
    size_t stride = hipBlockDim_x * hipGridDim_x ;

    for (size_t i=offset; i<N; i+=stride) {
        C_d[i] = A_d[i] * A_d[i];
    }
}

int main(int argc, char *argv[])
{
    int *A_d, *C_d;
    int *A_h, *C_h;
    size_t N = 100;
    size_t Nbytes = N * sizeof(int);
    static int device = 0;
    CHECK(hipSetDevice(device));
    hipDeviceProp_t props;
    CHECK(hipGetDeviceProperties(&props, device/*deviceID*/));
    printf ("info: running on device %s\n", props.name);
    #ifdef __HIP_PLATFORM_HCC__
      printf ("info: architecture on AMD GPU device is: %d\n",props.gcnArch);
    #endif
    printf ("info: allocate host mem (%6.2f MB)\n", 2*Nbytes/1024.0/1024.0);
    A_h = (int*)malloc(Nbytes);
    CHECK(A_h == 0 ? hipErrorMemoryAllocation : hipSuccess );
    C_h = (int*)malloc(Nbytes);
    CHECK(C_h == 0 ? hipErrorMemoryAllocation : hipSuccess );
    for (size_t i=0; i<N; i++)
    {
        A_h[i] = 1 + i;
    }

    printf ("info: allocate device mem (%6.2f MB)\n", 2*Nbytes/1024.0/1024.0);
    CHECK(hipMalloc(&A_d, Nbytes));
    CHECK(hipMalloc(&C_d, Nbytes));

    printf ("info: copy Host2Device\n");
    CHECK ( hipMemcpy(A_d, A_h, Nbytes, hipMemcpyHostToDevice));

    const unsigned blocks = 512;
    const unsigned threadsPerBlock = 256;

    printf ("info: launch 'vector_square' kernel\n");
    hipLaunchKernelGGL(vector_square, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);

    printf ("info: copy Device2Host\n");
    CHECK ( hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost));
}

I can use the -save-temps flag to generate assembly code, which gives me x86 host code and GCN device code.

I want to experiment with GPU kernel assembly code by adding and/or removing instructions, recompiling it, and testing the binary in the simulator.

What I Have Tried

For now, I’m having trouble with linking with the host code. I’m working with the sample HIP program above, square.cpp. I also created the file square.host.cpp that contains the host code and only a definition of the kernel using extern.

Based on documentation here, I’ve been using the following procedure:

# Compile square.cpp
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 square.cpp -o square

# Assemble generated assembly code
mv square-hip-amdgcn-amd-amdhsa-gfx801.s square.kernel.s
sed -i "s/\.amdgcn_target/#.amdgcn_target/" square.kernel.s # comment out target to prevent error
/opt/rocm/llvm/bin/clang++ -x assembler -target amdgcn--amdhsa -mcpu=gfx801 -c square.kernel.s -o square.kernel.o

# Re-compile host code with extern definition
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 -c square.host.cpp -o square.host.o

# Link host and kernel code
/opt/rocm/bin/hipcc -save-temps -O1 --amdgpu-target=gfx801 square.host.o square.kernel.o -lm -o square

At the linking step, I get the following error:

/usr/bin/ld: square.kernel.o: Relocations in generic ELF (EM: 224)
/usr/bin/ld: square.kernel.o: error adding symbols: file in wrong format
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)

Is there something I’m doing wrong or should do differently with the compiler, assembler, or linker steps? My guess is that I’m doing something wrong in the assembler step even though I don’t get an error.

Is there some other way to achieve my goal of compiling a HIP program where the kernel is written in assembly?