ReduceSum error

Hi,

I get this error on an NVIDIA V100.

2025-08-07 18:03:20.734516997 [E:onnxruntime:Default, cuda_call.cc:123 CudaCall] CUDNN failure 5000: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=e7ff6c9ad68c ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/reduction/reduction_ops.cc ; line=778 ; expr=cudnnReduceTensor(GetCudnnHandle(ctx), reduce_desc, indices_cuda.get(), indices_bytes, workspace_cuda.get(), workspace_bytes, &one, input_tensor, temp_X.get(), &zero, output_tensor, temp_Y.get());

2025-08-07 18:03:20.734593929 [E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running ReduceSum node. Name:‘/ReduceSum_1’ Status Message: CUDNN failure 5000: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=e7ff6c9ad68c ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/reduction/reduction_ops.cc ; line=778 ; expr=cudnnReduceTensor(GetCudnnHandle(ctx), reduce_desc, indices_cuda.get(), indices_bytes, workspace_cuda.get(), workspace_bytes, &one, input_tensor, temp_X.get(), &zero, output_tensor, temp_Y.get());

How can I fix or debug it?

I use an NVIDIA V100. nvidia-smi reports: NVIDIA-SMI 535.183.01, Driver Version: 535.183.01, CUDA Version: 12.9

Thanks,

Gerald

Hi,

Can you please check the following:

Version Compatibility: Ensure that your cuDNN version is compatible with your CUDA version (12.9 in your case, as reported by nvidia-smi) and the NVIDIA driver version (535.183.01). You can check the compatibility matrix on the NVIDIA website.
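To collect the version numbers to compare against the matrix, something like the following might help. It is only a sketch: it assumes cuDNN is visible to the system loader (a cuDNN installed through pip wheels will probably not be found this way), and the helper names `cudnn_version` and `onnxruntime_info` are made up for this example. The cuDNN number is returned raw because its encoding differs between major releases.

```python
import ctypes
import ctypes.util


def cudnn_version():
    """Return the raw integer from cudnnGetVersion(), or None if cuDNN cannot be loaded."""
    name = ctypes.util.find_library("cudnn")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        lib.cudnnGetVersion.restype = ctypes.c_size_t
        return int(lib.cudnnGetVersion())
    except (OSError, AttributeError):
        # Library could not be loaded or does not export the symbol.
        return None


def onnxruntime_info():
    """Return (version, available providers), or None if onnxruntime is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    return ort.__version__, ort.get_available_providers()


if __name__ == "__main__":
    print("cuDNN version (raw):", cudnn_version())
    print("onnxruntime:", onnxruntime_info())
```

If `CUDAExecutionProvider` is missing from the provider list, the installed onnxruntime build is probably not the GPU one, which is worth ruling out first.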

Installation and Setup: Verify that cuDNN is correctly installed and configured for your system. You might need to reinstall cuDNN or update your CUDA toolkit to ensure compatibility.

Driver Updates: Although your driver version seems up-to-date, it’s always a good idea to check for any newer versions that might resolve the issue.

Environment Variables: Ensure that your environment variables (like CUDA_HOME, PATH, and LD_LIBRARY_PATH) are correctly set to point to your CUDA and cuDNN installations.
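A quick way to sanity-check those variables is to look at what they actually point to. The sketch below is an assumption about how you might do that on Linux: the function names `find_libs` and `report` are invented for illustration, and it only inspects the directories listed in LD_LIBRARY_PATH, not the system loader cache.

```python
import os
import shutil
from pathlib import Path


def find_libs(pattern, search_path):
    """Return files matching pattern in each directory of a PATH-style string."""
    hits = []
    for entry in search_path.split(os.pathsep):
        directory = Path(entry)
        if entry and directory.is_dir():
            hits.extend(sorted(str(p) for p in directory.glob(pattern)))
    return hits


def report(env=os.environ):
    """Summarize where CUDA_HOME, PATH and LD_LIBRARY_PATH point."""
    cuda_home = env.get("CUDA_HOME", "")
    ld_path = env.get("LD_LIBRARY_PATH", "")
    return {
        "CUDA_HOME": cuda_home or None,
        "cuda_home_exists": bool(cuda_home) and Path(cuda_home).is_dir(),
        # Location of nvcc on PATH, or None if it is not found.
        "nvcc": shutil.which("nvcc", path=env.get("PATH", "")),
        "cudnn_libs": find_libs("libcudnn*.so*", ld_path),
        "cudart_libs": find_libs("libcudart*.so*", ld_path),
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

More than one cuDNN or CUDA runtime version showing up in the output is a common sign that the wrong library is being picked up at load time.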

Debugging: To debug, you can try running your application with the CUDA_LAUNCH_BLOCKING=1 environment variable set. This makes kernel launches synchronous, which can help identify which specific kernel launch is causing the issue.

cuDNN Logging: Enable cuDNN logging by setting the CUDNN_LOGDEST_DBG and CUDNN_LOGINFO_DBG environment variables. This can provide more detailed information about what’s going wrong.
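The two debugging steps above can be combined by launching the failing script in a child process with the variables already set, so they are in place before CUDA and cuDNN are loaded. This is a rough sketch under some assumptions: `infer.py` and `cudnn.log` are placeholder names, the helpers `debug_env` and `run_with_debug` are invented for this example, and the cuDNN variable names follow the cuDNN 8.x convention (check the documentation for your cuDNN release, as the logging variables may differ).

```python
import os
import subprocess
import sys


def debug_env(log_path="cudnn.log", base=None):
    """Return a copy of the environment with CUDA/cuDNN debugging variables set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"      # make kernel launches synchronous
    env["CUDNN_LOGINFO_DBG"] = "1"         # enable cuDNN info logging (8.x naming)
    env["CUDNN_LOGDEST_DBG"] = log_path    # log file; "stdout" or "stderr" also work
    return env


def run_with_debug(script, log_path="cudnn.log"):
    """Run a Python script in a child process with the debugging variables set."""
    result = subprocess.run([sys.executable, script], env=debug_env(log_path))
    return result.returncode


if __name__ == "__main__":
    # "infer.py" is a placeholder for the script that triggers the ReduceSum error.
    sys.exit(run_with_debug("infer.py"))
```

The last cudnnReduceTensor entry in the log before the failure should show the tensor descriptors and shapes passed to cuDNN, which is useful to include in a bug report.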