coder.gpu.nokernel - Pragma to disable kernel creation for loops - MATLAB (original) (raw)

Pragma to disable kernel creation for loops

Syntax

Description

coder.gpu.nokernel() is a loop level pragma that when placed immediately before a for loop prevents the code generator from generating CUDA® kernels for the statements within the loop. This pragma does not require any input parameters.

This function is a code generation function. It has no effect in MATLAB®.

example

Examples

collapse all

Generate CUDA Code for a Simple Nested Loop

This example shows how to use the nokernel pragma in a function and prevent the code generator from generating CUDA kernels for the statements within the loop

In one file, write the entry-point function nestedLoop that accepts two vector inputs A,B of size 32x512. The function has two nested for-loops of different iteration lengths, one for operating along the column and one for operating along the row. The first nested loop computes the sum of the two vector inputs while the second nested loop scales the sum by a factor of three.

function [C] = nestedLoop(A, B) G = zeros(32, 512); C = zeros(32, 512);

coder.gpu.kernelfun();
% This nested loop will be fused
for i = 1:32
   for j = 1:512
       G(i,j) = A(1,j) + B(1,j);
   end
end

coder.gpu.nokernel();  
for i = 1:32
   for j = 1:512
       C(i,j) = G(i,j) * 3;
   end
end    

end

Use the codegen function to generate CUDA MEX function.

cfg = coder.gpuConfig('mex'); cfg.GenerateReport = true; codegen -config cfg -args {ones(1,512,'double'),ones(1,512,'double')} nestedLoop

GPU Coder creates two kernels: nestedLoop_kernel1 to perform the computation G(i,j) = A(1,j) + B(1,j); of the first nested loop andnestedLoop_kernel2 kernel to perform the computationC(i,j) = G(i,j) * 3; of the second nested loop. The second kernel is created for the inner loop of the second nested loop. The noKernel pragma is applicable only to the loop that immediately follows the statement. Snippets of the generated kernels are shown.

static global launch_bounds(512, 1) void nestedLoop_kernel1(const real_T B[512], const real_T A[512], real_T G[16384]) { uint32_T threadId; ... if (i < 32) { G[i + (j << 5)] = A[j] + B[j]; } } static global launch_bounds(512, 1) void nestedLoop_kernel2(real_T G [16384], int32_T i, real_T C[16384]) { uint32_T threadId; ...; if (j < 512) { C[i + (j << 5)] = G[i + (j << 5)] * 3.0; }

A snippet of the main function shows that the code generator has fused the first nested loop as indicated by the kernel launch parameters. As mentioned earlier, the outer loop of the second nested loop is the one that is not mapped to a kernel. Hence the code generator places a for-loop statement just before the call to the second CUDA kernel nestedLoop_kernel2.

void nestedLoop(const real_T A[512], const real_T B[512], real_T C[16384]) { int32_T i; ... // These two loops will be fused cudaMemcpy(gpu_B, (void *)&B[0], 4096UL, cudaMemcpyHostToDevice); cudaMemcpy(gpu_A, (void *)&A[0], 4096UL, cudaMemcpyHostToDevice); nestedLoop_kernel1<<<dim3(32U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_B, *gpu_A, * gpu_G); for (i = 0; i < 32; i++) { nestedLoop_kernel2<<<dim3(1U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_G, i, *gpu_C); C_dirtyOnGpu = true; } ... cudaFree(*gpu_C); }

Version History

Introduced in R2019a

See Also

Apps

Functions

Objects

Topics