coder.gpu.nokernel - Pragma to disable kernel creation for loops - MATLAB (original) (raw)
Pragma to disable kernel creation for loops
Syntax
Description
coder.gpu.nokernel()
is a loop level pragma that when placed immediately before a for loop prevents the code generator from generating CUDA® kernels for the statements within the loop. This pragma does not require any input parameters.
This function is a code generation function. It has no effect in MATLAB®.
Examples
Generate CUDA Code for a Simple Nested Loop
This example shows how to use the nokernel
pragma in a function and prevent the code generator from generating CUDA kernels for the statements within the loop
In one file, write the entry-point function nestedLoop
that accepts two vector inputs A,B
of size 32x512
. The function has two nested for
-loops of different iteration lengths, one for operating along the column and one for operating along the row. The first nested loop computes the sum of the two vector inputs while the second nested loop scales the sum by a factor of three.
function [C] = nestedLoop(A, B) G = zeros(32, 512); C = zeros(32, 512);
coder.gpu.kernelfun();
% This nested loop will be fused
for i = 1:32
for j = 1:512
G(i,j) = A(1,j) + B(1,j);
end
end
coder.gpu.nokernel();
for i = 1:32
for j = 1:512
C(i,j) = G(i,j) * 3;
end
end
end
Use the codegen function to generate CUDA MEX function.
cfg = coder.gpuConfig('mex'); cfg.GenerateReport = true; codegen -config cfg -args {ones(1,512,'double'),ones(1,512,'double')} nestedLoop
GPU Coder creates two kernels: nestedLoop_kernel1
to perform the computation G(i,j) = A(1,j) + B(1,j);
of the first nested loop andnestedLoop_kernel2
kernel to perform the computationC(i,j) = G(i,j) * 3;
of the second nested loop. The second kernel is created for the inner loop of the second nested loop. The noKernel
pragma is applicable only to the loop that immediately follows the statement. Snippets of the generated kernels are shown.
static global launch_bounds(512, 1) void nestedLoop_kernel1(const real_T B[512], const real_T A[512], real_T G[16384]) { uint32_T threadId; ... if (i < 32) { G[i + (j << 5)] = A[j] + B[j]; } } static global launch_bounds(512, 1) void nestedLoop_kernel2(real_T G [16384], int32_T i, real_T C[16384]) { uint32_T threadId; ...; if (j < 512) { C[i + (j << 5)] = G[i + (j << 5)] * 3.0; }
A snippet of the main function shows that the code generator has fused the first nested loop as indicated by the kernel launch parameters. As mentioned earlier, the outer loop of the second nested loop is the one that is not mapped to a kernel. Hence the code generator places a for-loop
statement just before the call to the second CUDA kernel nestedLoop_kernel2
.
void nestedLoop(const real_T A[512], const real_T B[512], real_T C[16384]) { int32_T i; ... // These two loops will be fused cudaMemcpy(gpu_B, (void *)&B[0], 4096UL, cudaMemcpyHostToDevice); cudaMemcpy(gpu_A, (void *)&A[0], 4096UL, cudaMemcpyHostToDevice); nestedLoop_kernel1<<<dim3(32U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B, *gpu_A, * gpu_G); for (i = 0; i < 32; i++) { nestedLoop_kernel2<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_G, i, *gpu_C); C_dirtyOnGpu = true; } ... cudaFree(*gpu_C); }
Version History
Introduced in R2019a
See Also
Apps
Functions
- codegen | coder.gpu.kernel | gpucoder.stencilKernel | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort