coder.gpu.kernel - Pragma that maps for-loops to GPU kernels - MATLAB
Pragma that maps for-loops to GPU kernels
Syntax
coder.gpu.kernel()
coder.gpu.kernel(B,T)
coder.gpu.kernel(B,T,M,name)
Description
coder.gpu.kernel() is a loop-level pragma that you must place immediately before a for-loop. This pragma generates a kernel with the dimensions computed from the loop parameters.
Note
The coder.gpu.kernel pragma overrides all parallel loop analysis checks that the software performs. Use coder.gpu.kernelfun first before using the coder.gpu.kernel pragma.
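As a minimal sketch of that order of operations, assuming a hypothetical entry-point function named addVectors (it is not part of the original example), you might first let coder.gpu.kernelfun analyze the whole function, and reach for coder.gpu.kernel only if a loop that you know to be safe is still not mapped to a kernel:

```matlab
function vout = addVectors(x, y) %#codegen
% addVectors is a hypothetical entry point used only for illustration.

% Function-level pragma: ask GPU Coder to analyze the loops in this
% function and map the ones that pass its parallel loop analysis to kernels.
coder.gpu.kernelfun();

vout = zeros(size(x));
for i = 1:numel(x)
    vout(i) = x(i) + y(i);   % each iteration touches only element i
end
end
```

Because no loop-level pragma is used here, the parallel loop analysis checks stay in force.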
Note
Using the coder.gpu.kernel pragma for for-loops that contain reductions is not recommended.
coder.gpu.kernel(B,T) generates a kernel with the dimensions specified by B and T. B = [Bx,By,Bz] is an array that defines the number of blocks in the grid along dimensions x and y (z is not used). T = [Tx,Ty,Tz] is an array that defines the number of threads in the block along dimensions x, y, and z.
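For illustration, the array forms of B and T could be written out explicitly as below. This is a sketch only: addFirst1024 is a hypothetical function name, and the sizes are chosen so that 8 blocks of 128 threads cover the 1024 loop iterations.

```matlab
function vout = addFirst1024(x, y) %#codegen
% Hypothetical function; assumes x and y have at least 1024 elements.
vout = zeros(size(x));

% B = [Bx,By,Bz] = [8,1,1]:   8 blocks along x
% T = [Tx,Ty,Tz] = [128,1,1]: 128 threads per block along x
% 8 * 128 = 1024 threads, one per loop iteration.
coder.gpu.kernel([8, 1, 1], [128, 1, 1]);
for i = 1:1024
    vout(i) = x(i) + y(i);
end
end
```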
A value of -1 for B and T indicates that GPU Coder™ must infer the grid and block dimensions automatically. The coder.gpu.kernel pragma generates errors for invalid grid and block dimensions.
coder.gpu.kernel(B,T,M,name) specifies optional arguments M and name. M is a positive integer that specifies the minimum number of blocks per streaming multiprocessor. Increasing M can reduce the register usage within a kernel and improve kernel occupancy. A value of -1 for M indicates that GPU Coder must use the default value of 1. name is a character array that allows you to customize the name of the generated kernel.
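A four-argument call might look like the following sketch. The function name scaleVector and the kernel name 'scaleVectorKernel' are hypothetical, chosen only to show where each argument goes.

```matlab
function vout = scaleVector(x, scale) %#codegen
% Hypothetical function used only to illustrate the M and name arguments.
vout = zeros(size(x));

% B = -1   : let GPU Coder infer the grid dimensions
% T = 128  : 128 threads per block
% M = -1   : use the default minimum number of blocks per multiprocessor
% name     : character array used to customize the generated kernel name
coder.gpu.kernel(-1, 128, -1, 'scaleVectorKernel');
for i = 1:numel(x)
    vout(i) = x(i) * scale;
end
end
```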
Specifying the kernel pragma overrides all parallel loop analysis checks. This override allows loops to be parallelized in situations where parallel loop analysis cannot prove that all iterations are independent of each other. Ensure that the loop is safe to parallelize.
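To make the safety requirement concrete, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from one iteration to the next. The function dependenceDemo is hypothetical and is not taken from the original examples.

```matlab
function [a, b] = dependenceDemo(x) %#codegen
% Hypothetical function contrasting a safe and an unsafe loop.
n = numel(x);
a = zeros(size(x));
b = zeros(size(x));

% Safe: iteration i reads x(i) and writes only a(i), so the iterations
% can run in any order or all at once.
coder.gpu.kernel();
for i = 1:n
    a(i) = 2 * x(i);
end

% Not safe to force into a kernel: iteration i needs b(i-1), which is
% produced by the previous iteration. No kernel pragma is placed here.
if n > 0
    b(1) = x(1);
end
for i = 2:n
    b(i) = b(i-1) + x(i);
end
end
```

Placing coder.gpu.kernel before the second loop would skip the checks that would otherwise flag this dependence.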
This function is a code generation function. It has no effect in MATLAB®.
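One practical consequence is that a function containing the pragma can be checked in plain MATLAB before generating code. The script below is a sketch under two assumptions: GPU Coder is installed so that coder.gpu.kernel is on the path, and addWithPragma is a hypothetical local function defined at the end of the script.

```matlab
% Run the pragma-annotated function directly in MATLAB and compare the
% result with the vectorized reference.
x = 1:8;
y = 2 * x;
vout = addWithPragma(x, y);
matchesReference = isequal(vout, x + y);
disp(matchesReference);

function vout = addWithPragma(x, y)
% Hypothetical helper; the pragma is ignored outside code generation.
vout = zeros(size(x));
coder.gpu.kernel();
for i = 1:numel(x)
    vout(i) = x(i) + y(i);
end
end
```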
Examples
Generate CUDA Code for MATLAB Function
This example shows how to use the kernel pragma in a function and generate CUDA® code.
In one file, write the entry-point function scalars that accepts two vector inputs, x and y, of size 1x4096 and one scalar input, scale. The function has two for-loops of different iteration lengths, one for vector addition and one for finding the cumulative sum. Place the coder.gpu.kernel(-1,128) pragma immediately before the vector addition loop. This pragma creates a kernel that defaults to the number of blocks and allocates 128 threads per block. Place the coder.gpu.kernel() pragma immediately before the cumulative summation loop to generate a kernel with the dimensions computed from the loop parameters.
function [vout, sout1] = scalars(x,y,scale)
    sout1 = 0;
    vout = zeros(size(x));

    coder.gpu.kernel(-1,128);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end

    coder.gpu.kernel();
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;
    end
end
Use the codegen function to generate the CUDA MEX function.
codegen -config coder.gpuConfig('mex')...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)}...
    -report scalars
GPU Coder creates two kernels: scalars_kernel1 for the vector addition and scalars_kernel2 for the cumulative sum. No kernel is needed to initialize sout1=0.
scalars_kernel1<<<dim3(8U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x, *gpu_vout);
scalars_kernel2<<<dim3(4U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(scale, *gpu_x, gpu_sout1);
The scalars_kernel1 kernel has 8 blocks with 128 threads per block for a total of 1024 threads, one for adding each element. The scalars_kernel2 kernel has 4 blocks with 1024 threads per block, resulting in a total of 4096 threads.
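The thread totals follow from multiplying the grid and block dimensions. The short script below reproduces that arithmetic from the two launch configurations in the generated code. The last line reflects an assumption rather than documented behavior: that the inferred block count for the first kernel equals the iteration count divided by the threads per block, rounded up, which is consistent with the 8 blocks shown.

```matlab
% Launch configurations copied from the generated kernel calls.
grid1 = [8 1 1];   block1 = [128 1 1];    % scalars_kernel1
grid2 = [4 1 1];   block2 = [1024 1 1];   % scalars_kernel2

% Total threads = (blocks in grid) * (threads per block).
threads1 = prod(grid1) * prod(block1);    % 8 * 128
threads2 = prod(grid2) * prod(block2);    % 4 * 1024

fprintf('scalars_kernel1: %d threads\n', threads1);
fprintf('scalars_kernel2: %d threads\n', threads2);

% Assumed relationship, consistent with the output above:
% blocks = ceil(iterations / threadsPerBlock).
blocksNeeded = ceil(1024 / 128);
```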
You can use variables or expressions when specifying the kernel dimensions. For example, you can rewrite the scalars entry-point function so that the grid and block dimensions are specified at compile time.
function [vout, sout1] = scalars(x,y,scale, a)
    sout1 = 0;
    vout = zeros(size(x));

    coder.gpu.kernel(1,a);
    for i=1:1024
        vout(i) = x(i) + y(i);
    end

    coder.gpu.kernelfun();
    for i=1:length(x)
        sout1 = (x(i)*scale) + sout1;
    end
end
Use the codegen function to generate the CUDA MEX function.
codegen -config coder.gpuConfig('mex')...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),20,1024}...
    -report scalars
Version History
Introduced in R2017b
See Also
Functions
- codegen | coder.gpu.kernelfun | gpucoder.stencilKernel | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort | coder.gpu.nokernel