Kernels from Element-Wise Loops
The simplest case of CUDA® kernel creation is from MATLAB® functions that contain scalarized, element-wise math operations. When element-wise operations are enclosed within a for-loop body, concurrent CUDA threads can be invoked to compute each loop iteration in parallel. Because CUDA threads do not execute in any particular order and are independent of each other, it is essential that the iterations in your for-loop do not depend on the results of other iterations.
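For example (an illustrative sketch, not part of the shipped example), the first loop below is safe to parallelize because each iteration depends only on its own index, while the second is not, because iteration k reads the result of iteration k-1:

x = rand(1,8);
y = zeros(size(x));

% Safe: each iteration is independent of the others
for k = 1:numel(x)
    y(k) = x(k)^2;
end

% Not safe: iteration k depends on the result of iteration k-1
for k = 2:numel(x)
    y(k) = y(k-1) + x(k);
end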
Element-Wise Math Example
This example shows how to create CUDA kernels from functions that contain element-wise math operations. Suppose that you want to square each element of a matrix x and scale it by a factor of 1/(i+j), where i and j are the row and column indexes. You can implement this example as a MATLAB function.
function [y] = myFun(x)
y = zeros(size(x));
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end
Preparing myFun for Code Generation
The first statement, zeros(size(x)), in the myFun function initializes the result vector y to zeros. For CUDA code generation, pre-allocate memory for y without incurring the overhead of initializing the memory to zeros. Replace this line with y = coder.nullcopy(zeros(size(x))).
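In other words, the pre-allocation changes as follows (a before/after sketch of that single line):

% Before: generated code initializes the memory to zeros
y = zeros(size(x));

% After: memory is allocated but not initialized
y = coder.nullcopy(zeros(size(x)));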
To create CUDA kernels from loops, GPU Coder™ provides the coder.gpu.kernel pragma. This pragma provides a way for you to specify kernel launch parameters, such as thread and block sizes; if you do not specify them, GPU Coder determines the kernel bounds based on the loop bounds and input size. Specifying this pragma overrides all parallel-loop analysis, so use it only when you know that the loop is safe to parallelize. Because the myFun example is simple and does not require specification of the kernel launch parameters, you can use the coder.gpu.kernelfun pragma instead to generate CUDA kernels automatically.
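For illustration, a minimal sketch of what explicit control might look like (the function name myFunExplicit and the [512 1 1] grid and block dimensions are illustrative assumptions, not values this example requires):

function y = myFunExplicit(x) %#codegen
y = coder.nullcopy(zeros(size(x)));
% coder.gpu.kernel(B,T) applies to the loop that immediately follows;
% B and T are [x y z] grid and block dimensions. Use this only when
% you know the loop is safe to parallelize.
coder.gpu.kernel([512 1 1], [512 1 1]);
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end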
With these modifications, the original myFun function is suitable for code generation.
function [y] = myFun(x) %#codegen
y = coder.nullcopy(zeros(size(x)));
coder.gpu.kernelfun();
for i = 1:size(x,1)
    for j = 1:size(x,2)
        y(i,j) = (x(i,j)^2)/(i+j);
    end
end
end
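For example, assuming a 512-by-512 double input (the size used in the generated code shown below), you might generate code from the MATLAB command line like this (a minimal sketch; the build target and input size are assumptions):

cfg = coder.gpuConfig('lib');                 % or 'mex', 'dll', 'exe'
codegen -config cfg -args {zeros(512)} myFun  % zeros(512) is a 512-by-512 double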
Generated CUDA Code
When you generate CUDA code by using the GPU Coder app or from the command line, GPU Coder creates a single kernel that performs the squaring and scaling operations. The following is a snippet of the myFun_kernel1 kernel code.
static __global__ __launch_bounds__(512, 1) void myFun_kernel1(const real_T *x,
    real_T *y)
{
    ...
    /* Map the linear thread index to the row and column indices (i,j) */
    threadId = ((((gridDim.x * gridDim.y * blockIdx.z + gridDim.x * blockIdx.y)
                  + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z)
                 + threadIdx.z * blockDim.x * blockDim.y)
                + threadIdx.y * blockDim.x) + threadIdx.x;
    i = (int32_T)(threadId / 512U);
    j = (int32_T)(threadId - (uint32_T)i * 512U);

    /* Bounds guard; the + 2.0 term accounts for the one-based MATLAB
       indices, (i+1) + (j+1) */
    if ((!(j >= 512)) && (!(i >= 512))) {
        y[i + (j << 9)] = x[i + (j << 9)] * x[i + (j << 9)] /
            ((real_T)(i + j) + 2.0);
    }
}
The following is a snippet of the main myFun function. Before calling myFun_kernel1, there is a single cudaMemcpy call that transfers the matrix x from the host (x) to the device (gpu_x). The kernel launches with 512 blocks of 512 threads per block, consistent with the size of the 512-by-512 input matrix. A second cudaMemcpy call copies the result of the computation back to the host.
/* 2097152 bytes = 512 x 512 doubles (8 bytes each) */
cudaMemcpy((void *)gpu_x, (void *)x, 2097152ULL, cudaMemcpyHostToDevice);
myFun_kernel1<<<dim3(512U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_x, gpu_y);
cudaMemcpy((void *)y, (void *)gpu_y, 2097152ULL, cudaMemcpyDeviceToHost);
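To sanity-check the result, one option is to build a MEX target instead (coder.gpuConfig('mex')) and compare against the original MATLAB function; codegen names the generated MEX function myFun_mex by default (a sketch under those assumptions):

x = rand(512);
y_ref = myFun(x);               % MATLAB reference result
y_gpu = myFun_mex(x);           % generated GPU MEX
max(abs(y_ref(:) - y_gpu(:)))   % expect a value at or near zero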
Limitations
- If the loop bounds are of an unsigned data type, the code generator may add conditional checks to determine whether the loop bounds are valid. These conditional checks may limit optimizations that the software performs and may introduce reduction kernels that degrade performance; see the sketch below.
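For example (a hypothetical sketch; the function and its behavior are illustrative, not taken from the shipped example), a loop bound supplied as a uint32 input can trigger such checks; where your algorithm allows, using a signed or double loop bound avoids them:

function y = halveFirstN(x, n) %#codegen
% n is assumed to be uint32 here; the unsigned loop bound may cause
% the code generator to insert validity checks around the loop
coder.gpu.kernelfun();
y = x;
for k = 1:n
    y(k) = x(k)/2;
end
end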
See Also
coder.gpu.kernel | coder.gpu.kernelfun | gpucoder.matrixMatrixKernel | coder.gpu.constantMemory | gpucoder.stencilKernel