coder.gpu.kernel
Pragma that maps for-loops to GPU kernels
Syntax
coder.gpu.kernel()
coder.gpu.kernel(B,T)
coder.gpu.kernel(B,T,M,name)
Description
coder.gpu.kernel() is a loop-level pragma that you must place immediately before a for-loop. This pragma generates a kernel and computes the launch parameters from the loop parameters.
The coder.gpu.kernel pragma overrides all parallel loop analysis checks. This override allows GPU Coder™ to parallelize loops in situations where parallel loop analysis cannot prove that all iterations are independent. Consider using coder.gpu.kernelfun to parallelize loops in functions that pass the parallel loop analysis check.
Note
Using the coder.gpu.kernel pragma before a for-loop that contains reductions is not recommended.
coder.gpu.kernel(B,T) generates a kernel with the dimensions specified by B and T. B, of the form [Bx,By,Bz], is an array that defines the number of blocks in the grid along dimensions x and y (z is not used). T, of the form [Tx,Ty,Tz], is an array that defines the number of threads in the block along dimensions x, y, and z.
A value of -1 for B and T indicates that GPU Coder must infer the grid and block dimensions automatically. The coder.gpu.kernel pragma generates errors for invalid grid and block dimensions.
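As an illustration of the array form of B and T, here is a minimal sketch. The function name squareVector and the 1-by-4096 input size are assumptions made for this example and do not come from the reference text; the dimensions are chosen so that 16 blocks of 256 threads cover the 4096 iterations.

```matlab
function out = squareVector(x) %#codegen
% squareVector is a hypothetical example function (assumed 1-by-4096 input).
out = zeros(size(x));

% Request 16 blocks along x and 256 threads per block along x.
% 16 * 256 = 4096, one thread per loop iteration for the assumed input size.
coder.gpu.kernel([16,1,1],[256,1,1]);
for i = 1:size(x,2)
    out(i) = x(i)^2;
end
end
```

Because the pragma has no effect in MATLAB, calling squareVector directly in MATLAB should simply run the loop; the grid and block dimensions only matter in the generated code.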
coder.gpu.kernel(B,T,M,name) specifies optional arguments M and name. M is a positive integer that specifies the minimum number of blocks per streaming multiprocessor. Increasing M can reduce the register usage within a kernel and improve kernel occupancy. A value of -1 for M indicates that GPU Coder must use the default value of 1. name is a character array that allows you to customize the name of the generated kernel.
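The four-argument form might be used as in the following sketch. The function scaleVector and the kernel name 'scaleVectorKernel' are hypothetical and chosen only for illustration; how the custom name appears in the generated code is not shown here.

```matlab
function out = scaleVector(x) %#codegen
% scaleVector is a hypothetical example function, not part of the reference page.
out = zeros(size(x));

% B = -1   : let GPU Coder infer the number of blocks
% T = 256  : 256 threads per block
% M = -1   : use the default minimum number of blocks per multiprocessor
% name     : character array used to customize the generated kernel name
coder.gpu.kernel(-1, 256, -1, 'scaleVectorKernel');
for i = 1:size(x,2)
    out(i) = 2*x(i);
end
end
```

Passing -1 for B and M keeps the defaults for those arguments while still letting you set the thread count and the kernel name.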
This function is a code generation function. It has no effect in MATLAB®.
Examples
This example shows how to use the coder.gpu.kernel pragma to generate a CUDA® kernel.
Create a function named multiplyVectors that performs element-wise multiplication on two 1-by-1024 input vectors, a and b. The function contains a for-loop that multiplies the elements of the vectors.
function out = multiplyVectors(a,b) %#codegen
out = zeros(size(a));

for i=1:size(a,2)
    out(i) = a(i)*b(i);
end
end
To generate a kernel from the for-loop, add the coder.gpu.kernel pragma before the for-loop. To compute the kernel launch parameters from the loop parameters, specify the coder.gpu.kernel pragma without input arguments.
function out = multiplyVectors(a,b) %#codegen
out = zeros(size(a));

coder.gpu.kernel();
for i=1:size(a,2)
    out(i) = a(i)*b(i);
end
end
Use the codegen command to generate code from multiplyVectors. The generated code contains a kernel named multiplyVectors_kernel1.
cfg = coder.gpuConfig("mex");
a = ones([1 1024]);
b = ones([1 1024]);
codegen -config cfg -args {a,b} -report multiplyVectors
This example shows how to use the coder.gpu.kernel pragma to generate a CUDA kernel and specify the launch parameters.
Create a function named addVectors that accepts two 1-by-4096 inputs, x and y. The function has one for-loop that adds x and y.
function out = addVectors(x,y) %#codegen
out = zeros(size(x));

for i=1:size(x,2)
    out(i) = x(i)+y(i);
end
end
To create a kernel, place the coder.gpu.kernel pragma immediately before the vector addition loop. To automatically determine the number of blocks, specify the number of blocks as -1, and specify 128 threads per block.
function out = addVectors(x,y) %#codegen
out = zeros(size(x));

coder.gpu.kernel(-1,128);
for i=1:size(x,2)
    out(i) = x(i)+y(i);
end
end
Use the codegen command to generate CUDA code.
cfg = coder.gpuConfig("mex");
x = ones([1 4096]);
y = ones([1 4096]);
codegen -config cfg -args {x,y} -report addVectors
The generated code contains a kernel named addVectors_kernel1. The kernel launches with 32 blocks and 128 threads per block.
addVectors_kernel1<<<dim3(32U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x,
*gpu_out);
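The block count in this launch is consistent with dividing the number of loop iterations by the number of threads per block and rounding up. The following is a small arithmetic sketch under that assumption; the exact rule GPU Coder applies when it infers B is not stated here, and the variable names are illustrative only.

```matlab
% Assumed rule: one thread per loop iteration, rounded up to whole blocks.
numIterations   = 4096;   % loop trip count for the 1-by-4096 inputs
threadsPerBlock = 128;    % value passed as T in coder.gpu.kernel(-1,128)

numBlocks = ceil(numIterations / threadsPerBlock);
fprintf("%d blocks of %d threads\n", numBlocks, threadsPerBlock);
% prints "32 blocks of 128 threads"
```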
You can use variables or expressions when specifying the kernel launch parameters. For example, you can add an input argument named T to the addVectors function and specify T as the number of threads by using coder.gpu.kernel.
function out = addVectors(x,y,T) %#codegen
out = zeros(size(x));

coder.gpu.kernel(1,T);
for i=1:size(x,2)
    out(i) = x(i)+y(i);
end
end
Use the codegen function to generate CUDA code. The generated code uses the input variable T to determine the number of threads for each block.
cfg = coder.gpuConfig("dll");
x = ones([1 4096]);
y = ones([1 4096]);
T = 512;
codegen -config cfg -args {x,y,T} -report addVectors
Version History
Introduced in R2017b
See Also
Apps
Functions
- codegen | coder.gpu.kernelfun | stencilfun | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort | coder.gpu.nokernel