[llvm-dev] Redux sync intrinsics issue (original) (raw)

Dustin Favorite via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 6 07:50:47 PDT 2021

Previous message: [llvm-dev] Element atomic vector stores: just do it?
Next message: [llvm-dev] Need help with compiler-rt/test/fuzzer/cross_over_uniform_dist.test
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi all,

redux sync intrinsics are not working as expected.

clang/test/redux-builtins.cu has usage as

out = __nvvm_redux_sync_add(val, 0xFF);

out is the write location for the warp, val is the thread's contributed value, and 0xFF is the mask for a fully active warp.

So far all usage of this builtin has resulted in an Illegal instruction. This is an nvcc application using the nvcc builtin to reduce across a warp:

#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <cuda_profiler_api.h> #include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h>

inline device unsigned warpReduceRedux(unsigned val) { return __reduce_add_sync(0xFF, val); }

global void reduceKernel(unsigned in, unsigned out, int N) { unsigned sum = in[threadIdx.x]; sum = warpReduceRedux(sum); if (threadIdx.x == 0) out[0] = sum; }

int main() { const int SIZE = 32; const int ARRAY_BYTES = SIZE * sizeof(unsigned);

// generate the input array on the host
unsigned h_in[SIZE];
unsigned sum = 0.0f;
for (int i = 0; i < SIZE; i++) {
    h_in[i] = i;
    sum += h_in[i];
}

// declare GPU memory pointers
unsigned * d_in, *d_out;

// allocate GPU memory
cudaMalloc((void **)&d_in, ARRAY_BYTES);
cudaMalloc((void **)&d_out, sizeof(unsigned));

// transfer the input array to the GPU
cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

// offload to device
reduceKernel<<<1, SIZE>>>(d_in, d_out, SIZE);

// copy back the sum from GPU
unsigned h_out;
cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
printf("%u\n", h_out);

}

cuda-memcheck is clear and has verifiable output. The same application, substituting the nvcc builtin for the clang one then building with clang:

inline device unsigned warpReduceRedux(unsigned val) { return __nvvm_redux_sync_add(val, 0xFF); }

compiles but does not pass cuda-memcheck and does not provide the correct output:

========= CUDA-MEMCHECK ========= Illegal Instruction ========= at 0x00000cf0 in reduceKernel(unsigned int*, unsigned int*, int) ========= by thread (0,0,0) in block (0,0,0)

What is the usage for these? I've also attached the PTX emitted by these apps in case there's a backend issue to be found. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210806/bc8b118c/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: clang.ptx Type: image/ptx Size: 7038 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210806/bc8b118c/attachment.bin> -------------- next part -------------- A non-text attachment was scrubbed... Name: nvcc.ptx Type: image/ptx Size: 944 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210806/bc8b118c/attachment-0001.bin>

Previous message: [llvm-dev] Element atomic vector stores: just do it?
Next message: [llvm-dev] Need help with compiler-rt/test/fuzzer/cross_over_uniform_dist.test
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the llvm-dev mailing list