How to debug CUDA in Python (using CuPy to call kernels, not a library file)

September 25, 2025, 1:14pm 1

I use CuPy to call CUDA kernels, but I don’t know how to debug the CUDA code. Here is my wrapper file:

wrapper.py

import math
from pathlib import Path

import cupy as cp
import numpy as np

with open(Path(__file__).parents[1] / 'cuda' / 'lcx_projector_kernels.cu', 'r', encoding='utf-8') as f:
    lines = f.read()
    compute_systemG_kernel = cp.RawKernel(lines, 'compute_systemG_kernel')


def compute_systemG_lm(xstart,
                       xend,
                       img_origin,
                       voxsize,
                       sysG,
                       nLORs,
                       img_dim,
                       tofbin_width,
                       sigma_tof,
                       tofcenter_offset,
                       nsigmas,
                       tofbin,
                       threadsperblock):

    compute_systemG_kernel(
        (math.ceil(nLORs / threadsperblock), ), (threadsperblock, ),
        (xstart.ravel(), xend.ravel(),
         cp.asarray(img_origin), cp.asarray(voxsize), sysG,
         np.int64(nLORs), cp.asarray(img_dim),
         np.float32(tofbin_width), cp.asarray(sigma_tof).ravel(),
         cp.asarray(tofcenter_offset).ravel(), np.float32(nsigmas),
         tofbin)
    )

There are many guides online for debugging CUDA code that has been built into a library file, but CuPy does not use a library file; it passes the parameters directly to the CUDA kernels. Does anybody know how to debug CUDA code in my situation?

I have tried the following configuration, but breakpoints only stop in the Python code, not in the CUDA code.

launch.json

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "train.py",
      "type": "debugpy",
      "request": "launch",
      "program": "/home/fanghaodu/code/LMPDnet/train.py",
      "console": "integratedTerminal",
      "justMyCode": false,
      "env": {
        "PYTHONPATH": "${workspaceFolder}"
      }
    },
    {
      "name": "CUDA GDB",
      "type": "cuda-gdb",
      "request": "launch",
      "program": "/home/fanghaodu/.conda/envs/cu128/bin/python", // which python
      "args": ["${file}"],
      "debuggerPath": "/opt/apps/cuda-12.8/bin/cuda-gdb" // which cuda-gdb
    }
  ],
  "compounds": [
    {
      "name": "Python and CUDA",
      "configurations": ["train.py", "CUDA GDB"]
    }
  ]
}

I’ve heard that cuda-gdb can debug CUDA files, but that it requires building a library file first and binding it with pybind11, which is troublesome. I wish there were a more convenient way.

Please help me, thanks in advance.

mmason September 25, 2025, 5:01pm 2

Thank you for posting this question.

To clarify: are you trying to debug the (host) Python code using VSCode’s debugpy launch type and the CUDA C/C++ code at the same time? If so, that’s not supported today. You need to pick between the standalone Python debugger for the host Python code and the cuda-gdb debugger for the CUDA C/C++ code; you can’t use both at the same time.

What you can do is debug the CUDA C/C++ parts of your program with cuda-gdb, and use the pdb debugger module to debug the host Python code. pdb is a command-line Python debugger, unlike the graphical debugpy integration in VSCode.
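As a rough sketch of how the two can coexist (not a tested recipe for this project): the interpreter is started under cuda-gdb, for example with `cuda-gdb --args python train.py`, and the host code contains a guarded `breakpoint()` just before the kernel launch so that pdb can take over the terminal at that point. The `DEBUG_HOST` environment variable and the `maybe_break` helper are hypothetical names used only for illustration.

```python
import os


def maybe_break(flag="DEBUG_HOST"):
    """Enter the Python debugger if the (hypothetical) flag variable is "1".

    Returns True when the breakpoint hook was called, False otherwise.
    """
    if os.environ.get(flag) == "1":
        # breakpoint() calls sys.breakpointhook(), which starts pdb by default.
        # Setting PYTHONBREAKPOINT=0 turns the call into a no-op.
        breakpoint()
        return True
    return False


if __name__ == "__main__":
    # Placed right before the kernel launch in the host code, e.g. in wrapper.py.
    stopped = maybe_break()
    print("host breakpoint requested:", stopped)
```

Guarding the call with an environment variable keeps the script usable in normal training runs; the host breakpoint only triggers when it is explicitly requested.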

Can you provide a copy of lcx_projector_kernels.cu (or at least the part that wrapper.py uses)? If so, I can put together an example debug session demonstrating how to use cuda-gdb and pdb to debug at the same time.

Thanks for your reply! Here is lcx_projector_kernels.cu; it is fairly simple:

extern "C" __device__ unsigned char ray_cube_intersection_cuda(float orig0, 
                                                               float orig1, 
                                                               float orig2,
                                                               float bounds0_min, 
                                                               float bounds1_min,
                                                               float bounds2_min,
                                                               float bounds0_max,
                                                               float bounds1_max,
                                                               float bounds2_max,
                                                               float rdir0, 
                                                               float rdir1,
                                                               float rdir2,
                                                               float* t1, 
                                                               float* t2){ 
  
  unsigned char intersec = 1;
  ...//Some calculations
  return(intersec);
}


extern "C" __global__ void compute_systemG_kernel(float *xstart, 
                                                      float *xend, 
                                                      float *img_origin, 
                                                      float *voxsize, 
                                                      float *sysG,
                                                      long long nlors, 
                                                      int *img_dim,
                                                      float tofbin_width,
                                                      float *sigma_tof,
                                                      float *tofcenter_offset,
                                                      float n_sigmas,
                                                      short *tof_bin)
{
  long long i = blockDim.x * blockIdx.x + threadIdx.x;
  ...//Some calculations on sysG
}
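Because the launch happens from Python, the dtype of every argument has to match the kernel signature above, and CuPy does not validate arguments passed to a raw kernel. A mismatch (for example an int64 array passed where `int *img_dim` is expected) tends to show up only as wrong values inside the kernel. The following is a minimal sketch of a pre-launch check; it compares plain dtype names so it runs without a GPU, and `EXPECTED_DTYPES` and `check_kernel_dtypes` are hypothetical names, not part of CuPy.

```python
# Expected dtype of each compute_systemG_kernel argument, read off the C
# signature: float/float* -> float32, int* -> int32, short* -> int16,
# long long -> int64.
EXPECTED_DTYPES = {
    "xstart": "float32",
    "xend": "float32",
    "img_origin": "float32",
    "voxsize": "float32",
    "sysG": "float32",
    "nlors": "int64",
    "img_dim": "int32",
    "tofbin_width": "float32",
    "sigma_tof": "float32",
    "tofcenter_offset": "float32",
    "n_sigmas": "float32",
    "tof_bin": "int16",
}


def check_kernel_dtypes(actual, expected=EXPECTED_DTYPES):
    """Return a list of mismatch messages (empty if everything matches).

    `actual` maps argument name -> dtype name, e.g. str(arr.dtype).
    """
    problems = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"{name}: missing")
        elif got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


if __name__ == "__main__":
    # Simulate img_dim arriving as int64 instead of int32.
    actual = dict(EXPECTED_DTYPES, img_dim="int64")
    for line in check_kernel_dtypes(actual):
        print(line)
```

In the wrapper, the `actual` mapping could be built from the arrays right before the launch, for instance `{"img_dim": str(cp.asarray(img_dim).dtype), ...}`.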

My train.py calls compute_systemG_lm in wrapper.py, which then launches the CUDA kernel. Debugging the Python code is not my main goal; what I want to debug is the CUDA code. The entry point to the CUDA code is in Python, though, and that is my problem right now.
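Since the kernel source is compiled by CuPy at runtime, there is no separate build step where debug flags could be added. One option is the `options` argument of `cp.RawKernel`. The sketch below assumes that `-G` (device debug information) is forwarded to the compiler unchanged; `debug_options` and `load_kernel` are hypothetical helper names, and CuPy is imported lazily so the option helper can be tried without a GPU.

```python
from pathlib import Path


def debug_options(debug, base=()):
    """Return compiler options, appending -G when debug is True."""
    options = tuple(base)
    if debug and "-G" not in options:
        options += ("-G",)
    return options


def load_kernel(cu_path, kernel_name, debug=False):
    """Compile one kernel from a .cu file, optionally with debug info."""
    import cupy as cp  # lazy import: requires CuPy and a CUDA device

    source = Path(cu_path).read_text(encoding="utf-8")
    return cp.RawKernel(source, kernel_name, options=debug_options(debug))


if __name__ == "__main__":
    print(debug_options(True))
    print(debug_options(False, ("--std=c++14",)))
```

`-G` turns off device code optimizations, so a kernel compiled this way runs noticeably slower and is best limited to debugging sessions.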

In addition, I would still like to know how to debug in VSCode in this situation, preferably with a GUI rather than the command line. I am not familiar with setting breakpoints on the command line and would rather click with the mouse, and I think a GUI makes variable values easier to inspect. If it has to be on the command line, could you give an example? Thanks a lot.
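For the command-line route, breakpoints can be written into a command file once instead of being typed at the prompt. The following is a rough sketch, assuming the kernel was compiled with device debug information and that cuda-gdb accepts gdb's standard `-x` and `--args` options. The file name `debug.gdb` and the helper names are hypothetical.

```python
import sys
from pathlib import Path


def gdb_commands(kernel_name):
    """Return cuda-gdb commands that stop at the first launch of a kernel."""
    return [
        # The kernel symbol is unknown until the module is loaded at runtime,
        # so allow the breakpoint to stay pending.
        "set breakpoint pending on",
        f"break {kernel_name}",
        "run",
        # Executed once the breakpoint is hit:
        "info cuda kernels",       # list kernels currently on the device
        "cuda block 0 thread 0",   # switch focus to one specific thread
        "info locals",             # show local variables of that thread
    ]


def write_command_file(path, kernel_name):
    """Write the commands to a file that can be passed to cuda-gdb with -x."""
    path = Path(path)
    path.write_text("\n".join(gdb_commands(kernel_name)) + "\n", encoding="utf-8")
    return path


def launch_argv(command_file, script, python=sys.executable):
    """Return the argv list for running a Python script under cuda-gdb."""
    return ["cuda-gdb", "-x", str(command_file), "--args", python, script]


if __name__ == "__main__":
    cmd_file = write_command_file("debug.gdb", "compute_systemG_kernel")
    print(" ".join(launch_argv(cmd_file, "train.py")))
```

After the scripted commands have run, the session stays at the `(cuda-gdb)` prompt, where `next`, `print <variable>` and `continue` work as in regular gdb.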