Help Needed with Optimizing CUDA Kernel Performance for Deep Learning Inference

Hi everyone,

I’m currently optimizing a deep learning inference pipeline in CUDA and have hit a roadblock with kernel performance. I’m running a custom model on an NVIDIA RTX 4090; accuracy is fine, but inference time is significantly higher than expected, particularly in the layers where I’ve implemented custom CUDA kernels.

I’ve profiled the code with Nsight Systems and Nsight Compute. A few of my kernels take noticeably longer than I’d expect for the amount of work they do, and I suspect the cause is inefficient memory access patterns and a lack of coalescing. I’ve tried reorganizing the memory layout and staging data through shared memory where possible, but the gains have been marginal.
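
To make the coalescing issue concrete, here’s a simplified sketch of the kind of restructuring I’ve been trying (illustrative only, not my actual kernel; it’s essentially the standard shared-memory transpose pattern, with TILE_DIM/BLOCK_ROWS as placeholder tile sizes):

    // Naive version: reads from in[] are coalesced, but writes to out[] are
    // strided by 'height', so consecutive threads in a warp hit addresses
    // far apart and the stores are uncoalesced.
    #define TILE_DIM 32
    #define BLOCK_ROWS 8

    __global__ void transpose_naive(const float* in, float* out,
                                    int width, int height)
    {
        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
            if (x < width && (y + j) < height)
                out[x * height + (y + j)] = in[(y + j) * width + x];
        }
    }

    // Reworked version: stage a tile in shared memory so that both the global
    // load and the global store are coalesced. The +1 padding avoids shared
    // memory bank conflicts on the transposed read.
    __global__ void transpose_tiled(const float* in, float* out,
                                    int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
            if (x < width && (y + j) < height)
                tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];
        }
        __syncthreads();

        x = blockIdx.y * TILE_DIM + threadIdx.x;   // swap block offsets
        y = blockIdx.x * TILE_DIM + threadIdx.y;
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
            if (x < height && (y + j) < width)
                out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
        }
    }

    // Launched with dim3 block(TILE_DIM, BLOCK_ROWS) and a grid covering
    // width/TILE_DIM x height/TILE_DIM tiles.

This is the direction I’ve been going; as mentioned, the gains so far have been marginal, so I’m wondering what else to look at.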

A few specific questions I’d love help with:

  1. What are some reliable best practices for optimizing memory access in custom CUDA kernels, especially when working with batched inputs? (I’ve sketched my current indexing scheme below this list for reference.)
  2. Are there any recommended CUDA libraries or tools that help streamline inference optimization (besides cuDNN, which I’m already using for other parts)?
  3. How do you decide when it’s worth rewriting parts of the model logic in CUDA vs. relying on existing high-level frameworks?
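
For reference on question 1, this is roughly the shape my batched indexing takes right now (simplified and renamed; bias_relu_batched and the flat [N, C, H, W] layout are stand-ins for the real op):

    #include <cstdint>

    // Activations stored contiguously as [batch][channels][height][width],
    // flattened into one 1-D buffer. A grid-stride loop keeps consecutive
    // threads on consecutive addresses, so global loads/stores stay coalesced
    // regardless of batch size.
    __global__ void bias_relu_batched(const float* __restrict__ in,
                                      const float* __restrict__ bias,
                                      float* __restrict__ out,
                                      int64_t total,     // N * C * H * W
                                      int channels,
                                      int64_t plane)     // H * W
    {
        for (int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
             i < total;
             i += (int64_t)gridDim.x * blockDim.x) {
            int c = (int)((i / plane) % channels);   // recover channel index
            float v = in[i] + bias[c];
            out[i] = v > 0.0f ? v : 0.0f;            // ReLU
        }
    }

If there are better conventions for laying out batched tensors than this flat indexing, that’s exactly the kind of guidance I’m after.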

I’d really appreciate any advice, examples, or resources you can share. If needed, I can share snippets of my current kernel code too.

Thanks in advance!