CUDAKernel - Kernel executable on GPU - MATLAB
parallel.gpu.CUDAKernel
Description
A CUDAKernel object represents a CUDA kernel that can execute on a GPU. You create the kernel from CU and PTX files. For an example of how to create and use a CUDAKernel object, see Run CUDA or PTX Code on GPU.
Note
You cannot save or load CUDAKernel objects.
Creation
Syntax
Description
kern = parallel.gpu.CUDAKernel(ptxFile,cuFile) creates a CUDAKernel object using the PTX code ptxFile and the CUDA® source file cuFile. The PTX file must contain only a single entry point.
Use feval with kern as an input to execute the CUDA kernel on the GPU. For information on executing your kernel object, see Run a CUDAKernel.
kern = parallel.gpu.CUDAKernel(ptxFile,cuFile,func) creates a CUDAKernel object for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.
kern = parallel.gpu.CUDAKernel(ptxFile,cProto) creates a CUDAKernel object using the PTX file ptxFile and the C prototype cProto. cProto is the C function prototype for the kernel call that kern represents. The PTX file must contain only a single entry point.
kern = parallel.gpu.CUDAKernel(ptxFile,cProto,func) creates a CUDAKernel object from a PTX file and C prototype for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.
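For illustration, here is a minimal sketch of the typical workflow, assuming a compiled PTX file simpleEx.ptx and its source simpleEx.cu that define a kernel with the prototype float *, float, int (as in the examples below):

% Create the kernel object from the PTX and CU files.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");

% Configure the launch before running the kernel.
kern.ThreadBlockSize = [256 1 1];
kern.GridSize = [4 1 1];

% feval runs the kernel on the current GPU and returns outputs as gpuArray objects.
v = gpuArray(single(zeros(1024,1)));
result = feval(kern,v,1,1024);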
Input Arguments
ptxFile - Name of PTX file or PTX code
Name of a PTX file, or PTX code. You can provide the name of a PTX file, or pass its contents as a string.
Example: "simpleEx.ptx"
Data Types: char | string
cuFile - Name of CUDA source file
Name of a CUDA source file, specified as a character vector. The function examines the CUDA source file to find the function prototype for the CUDA kernel that is defined in the PTX code. The CUDA source file must contain a kernel definition starting with '__global__'.
Example: "simpleEx.cu"
Data Types: char | string
func - Function entry point
Function entry point, specified as a character vector. func must unambiguously define the appropriate entry point in the PTX file.
Note
The parallel.gpu.CUDAKernel function searches for the specified entry point in the PTX file and matches any substring occurrences. Therefore, do not name any of your entry points as substrings of any others.
Example: "add1"
Data Types: char | string
cProto - C prototype for the kernel call
C prototype for the kernel call, specified as a character vector. Specify multiple input arguments separated by commas.
Example: "float *,float,int"
Data Types: char | string
Properties
ThreadBlockSize - Size of block of threads on the kernel
Size of a block of threads on the kernel, specified as a vector of positive integers of length 1, 2, or 3 (because thread blocks can be up to 3-dimensional). The product of the elements of ThreadBlockSize must not exceed the MaxThreadsPerBlock value for this kernel, and no element of ThreadBlockSize can exceed the corresponding element of the [GPUDevice](parallel.gpu.gpudevice.html) property MaxThreadBlockSize.
Example: [8 8 8]
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
MaxThreadsPerBlock - Maximum number of threads permissible in a single block
This property is read-only.
Maximum number of threads permissible in a single block for this CUDA kernel. The product of the elements of ThreadBlockSize must not exceed this value.
Example: 1024
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
GridSize - Size of grid of thread blocks
Size of the grid of thread blocks, specified as an integer vector of length 3. This is effectively the number of thread blocks that the GPU launches independently. No element of this vector can exceed the corresponding element of the MaxGridSize property of the GPUDevice object.
Example: [977 1 1]
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
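As an illustration, for a kernel that processes one element per thread over a vector of N elements, you might size the launch like this (a sketch; the block size of 256 is an arbitrary choice that must not exceed kern.MaxThreadsPerBlock):

N = 250000;                          % number of elements to process (example value)
kern.ThreadBlockSize = [256 1 1];    % 256 threads per block
kern.GridSize = [ceil(N/256) 1 1];   % enough blocks to cover all N elements, here [977 1 1]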
SharedMemorySize - Amount of dynamic shared memory per thread block
Amount of dynamic shared memory (in bytes) that each thread block can use. Each thread block has an available shared memory region. This memory is shared with registers on the multiprocessors. SharedMemorySize must not exceed the MaxShmemPerBlock property of the GPUDevice object.
As with all memory, this region must be allocated before the kernel is launched. It is common for the size of this shared memory region to be tied to the size of the thread block. Setting this value on the kernel ensures that each thread in a block can access this available shared memory region.
Example: 16000
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
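For example, if each thread in a block uses one double of dynamically allocated shared memory, you might tie the value to the thread block size (a sketch, assuming the kernel declares an extern __shared__ array):

kern.ThreadBlockSize = [500 1 1];
kern.SharedMemorySize = 8 * prod(kern.ThreadBlockSize);   % 8 bytes per double, one per thread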
EntryPoint - Entry point name in the PTX code
This property is read-only.
Name of the entry point in the PTX code that the kernel calls.
Example: "_Z13returnPointerPKfPy"
Data Types: char | string
MaxNumLHSArguments - Maximum number of left-hand side arguments
This property is read-only.
Maximum number of left-hand side arguments that the kernel supports. This value cannot be greater than the number of right-hand side arguments, and it is smaller if any of the inputs are constant or scalar.
Example: 1
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
NumRHSArguments - Required number of right-hand side arguments
This property is read-only.
Number of right-hand side arguments needed to call the kernel. Each input must define either the scalar value of an input, the elements for a vector input/output, or the size of an output argument.
Example: 5
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
ArgumentTypes - Expected MATLAB data types of the kernel inputs
This property is read-only.
Cell array of character vectors of length NumRHSArguments. Each character vector indicates the expected MATLAB® data type for that input by specifying a numeric type such as uint8, single, or double, followed by the word scalar or vector to indicate whether the argument is passed by value (scalar) or by reference (vector). In addition, if an argument is only an input to the kernel, it is prefixed by in; if it is an input/output, it is prefixed by inout. This information lets you decide how to call the kernel efficiently with both MATLAB arrays and gpuArray objects, and shows which of the kernel inputs are treated as outputs.
Example: {'inout double vector'} {'in double vector'} {'in double vector'} {'in uint32 scalar'} {'in uint32 scalar'}
Data Types: cell
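For example, for the addToVector kernel shown in the examples below, inspecting these properties might look like this (the displayed values are illustrative):

kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");
kern.NumRHSArguments      % expected: 3
kern.MaxNumLHSArguments   % expected: 1
kern.ArgumentTypes        % expected: {'inout single vector'} {'in single scalar'} {'in int32 scalar'}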
Object Functions
feval - Evaluate kernel on GPU
setConstantMemory - Set some constant memory on GPU
existsOnGPU - Determine if gpuArray or CUDAKernel is available on GPU
Examples
This example shows how to create a CUDAKernel object using a PTX file and a CU file, or using a PTX file and the function prototype.
The CUDA source file simpleEx.cu contains the following code:
/*
 * Add a constant to a vector.
 */
__global__ void addToVector(float * pi, float c, int vecLen) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vecLen) {
        pi[idx] += c;
    }
}
Compile the CU file into a PTX file using mexcuda with the -ptx option.
mexcuda -ptx simpleEx.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.
Create a CUDA kernel using the PTX file and the CU file.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");
Create a CUDA kernel using the PTX file and the function prototype of the addToVector function.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","float *,float,int");
Both of the preceding statements return a kernel object that you can use to call the addToVector CUDA kernel.
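For instance, you could then run the kernel with feval (a sketch; the block and grid sizes here are illustrative choices):

N = 1024;
v = gpuArray(single(1:N));           % vector to modify on the GPU
kern.ThreadBlockSize = [256 1 1];
kern.GridSize = [ceil(N/256) 1 1];
result = feval(kern,v,10,N);         % adds the constant 10 to every element of v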
This example shows how to create a CUDAKernel object from a PTX file with more than one entry point.
Suppose your CU file, myfun.cu, contains a function add1 for adding two doubles together and a function add2 for adding two vectors together.
__global__ void add1( double * a, double b ) {
    *a += b;
}
__global__ void add2( double * v1, const double * v2 ) {
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
Compile the CU file into a PTX file using mexcuda with the -ptx option.
mexcuda -ptx myfun.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.
The PTX file contains two entry points corresponding to the add1 and add2 functions. When your PTX code contains multiple entry points, you must specify an entry point when creating your kernel.
Create a kernel for adding two doubles together, specifying the entry point add1.
k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","add1");
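You could then call the kernel with feval, for example (a sketch; add1 adds the scalar b to the value pointed to by a):

result = feval(k,0,7)   % returns 7 as a gpuArray, because add1 computes *a += b

Similarly, to create a kernel for adding two vectors, specify the entry point add2:

k2 = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","add2");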
Version History
Introduced in R2010b