coder.gpu.kernelfun - Pragma that maps function to GPU kernels - MATLAB
Pragma that maps function to GPU kernels
Syntax
coder.gpu.kernelfun()
Description
coder.gpu.kernelfun() is a global-level pragma that attempts to map all the computation within the function it resides in onto the GPU. Loops within this function are parallelized into GPU kernels only if they pass the parallel-loop analysis check. This analysis tries to prove that the loop iterations are independent of each other. In addition, the code generator does not create GPU kernels for simple loops, such as those with few iterations or those that perform minimal computation. Parallelizing such loops can decrease code performance because of the overhead of kernel creation and memory transfer between the GPU and the CPU.
The kernelfun pragma does not require any input parameters. It generates kernels whose dimensions are computed automatically based on the loop parameters.
This function is a code generation function. It has no effect in MATLAB®.
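As a minimal sketch of how the pragma is typically placed (the function and variable names below are illustrative and not part of this reference page), put coder.gpu.kernelfun as the first statement in the body of the entry-point function; loops with independent iterations below it become candidates for GPU kernels:

function B = scaleVector(A, k)      % illustrative entry-point function
    coder.gpu.kernelfun;            % map the computation in this function to the GPU
    B = zeros(size(A));             % preallocate the output
    for i = 1:numel(A)              % iterations are independent of each other,
        B(i) = k * A(i);            % so this loop is a candidate for a GPU kernel
    end
end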
Examples
Generate CUDA Code for MATLAB Function
This example shows how to use the kernelfun pragma in a function and generate CUDA® code.
In one file, write the entry-point function scalars that accepts two vector inputs x,y of size 1x4096 and one scalar input scale. The function has two for-loops of different iteration lengths, one for vector addition and one for finding the cumulative sum. Place the coder.gpu.kernelfun() pragma within the scalars function.
function [vout, sout1] = scalars(x,y,scale)
    coder.gpu.kernelfun;
    sout1 = 0;
    vout = coder.nullcopy(zeros(1,1024));
    for i=1:1024
        vout(i) = x(i) + y(i);
    end
    for i=1:4096
        sout1 = (x(i)*scale) + sout1;
    end
end
Use the codegen function to generate a CUDA MEX function.
codegen -config coder.gpuConfig('mex') ...
    -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)} ...
    -report scalars
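Equivalently (a sketch; the variable name cfg is arbitrary), you can store the configuration object in a variable first, which makes it easier to adjust GPU-specific settings before generating code:

cfg = coder.gpuConfig('mex');   % MEX configuration object with GPU code generation enabled
codegen -config cfg -args {ones(1,4096,'double'),ones(1,4096,'double'),coder.typeof(0)} -report scalars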
GPU Coder creates kernels for vector addition and cumulative summation.
scalars_kernel1<<<dim3(8U, 1U, 1U), dim3(128U, 1U, 1U)>>>(*gpu_y, *gpu_x, *gpu_vout);
scalars_kernel2<<<dim3(4U, 1U, 1U), dim3(1024U, 1U, 1U)>>>(scale, *gpu_x, gpu_sout1);
The kernel for vector addition has a total of 1024 threads, one for adding each element. Similarly, the kernel for cumulative summation has a total of 4096 threads.
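The thread counts follow from the launch configurations shown above: the total number of threads in each kernel is the product of its grid and block dimensions.

scalars_kernel1: 8 blocks x 128 threads per block  = 1024 threads
scalars_kernel2: 4 blocks x 1024 threads per block = 4096 threads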
Version History
Introduced in R2017b
See Also
Functions
- codegen | coder.gpu.kernel | gpucoder.stencilKernel | coder.gpu.constantMemory | gpucoder.reduce | gpucoder.sort | coder.gpu.nokernel