CUDAKernel - Kernel executable on GPU - MATLAB
parallel.gpu.CUDAKernel
Description
A CUDAKernel object represents a CUDA kernel that can execute on a GPU. You create the kernel from CU and PTX files. For an example of how to create and use a CUDAKernel object, see Run CUDA or PTX Code on GPU.
Note
You cannot save or load CUDAKernel objects.
Creation
Syntax
Description
kern = parallel.gpu.CUDAKernel(ptxFile,cuFile) creates a CUDAKernel object using the PTX code ptxFile and the CUDA® source file cuFile. The PTX file must contain only a single entry point.
Use feval with kern as an input to execute the CUDA kernel on the GPU. For information on executing your kernel object, see Run a CUDAKernel.
kern = parallel.gpu.CUDAKernel(ptxFile,cuFile,func) creates a CUDAKernel object for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.
kern = parallel.gpu.CUDAKernel(ptxFile,cProto) creates a CUDAKernel object using the PTX file ptxFile and the C prototype cProto. cProto is the C function prototype for the kernel call that kern represents. The PTX file must contain only a single entry point.
kern = parallel.gpu.CUDAKernel(ptxFile,cProto,func) creates a CUDAKernel object from a PTX file and C prototype for the function entry point defined by func. func must unambiguously define the appropriate kernel entry point in the PTX file.
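For illustration, here is a minimal sketch of the typical workflow, assuming a compiled PTX file simpleEx.ptx and its source simpleEx.cu that define a kernel with the prototype float *, float, int (as in the examples below):

% Create the kernel object from the PTX and CU files.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");

% Configure the launch before running the kernel.
kern.ThreadBlockSize = [256 1 1];
kern.GridSize = [4 1 1];

% feval runs the kernel on the current GPU and returns outputs as gpuArray objects.
v = gpuArray(single(zeros(1024,1)));
result = feval(kern,v,1,1024);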
Input Arguments
ptxFile - Name of PTX file or PTX code
Name of a PTX file, or PTX code. You can provide the name of a PTX file, or pass its contents as a string.
Example: "simpleEx.ptx"
Data Types: char | string
cuFile - Name of CUDA source file
Name of a CUDA source file, specified as a character vector. The function examines the CUDA source file to find the function prototype for the CUDA kernel that is defined in the PTX code. The CUDA source file must contain a kernel definition starting with '__global__'.
Example: "simpleEx.cu"
Data Types: char | string
func - Function entry point
Function entry point, specified as a character vector. func must unambiguously define the appropriate entry point in the PTX file.
Note
The parallel.gpu.CUDAKernel function searches for the specified entry point in the PTX file and matches any substring occurrences. Therefore, do not name any of your entry points as substrings of any others.
Example: "add1"
Data Types: char | string
cProto - C prototype for the kernel call
C prototype for the kernel call, specified as a character vector. Specify multiple input arguments separated by commas.
Example: "float *,float,int"
Data Types: char | string
Properties
ThreadBlockSize - Size of block of threads on the kernel
Size of a block of threads on the kernel, specified as a vector of positive integers of length 1, 2, or 3 (because thread blocks can be up to 3-dimensional). The product of the elements of ThreadBlockSize must not exceed the MaxThreadsPerBlock value for this kernel, and no element of ThreadBlockSize can exceed the corresponding element of the [GPUDevice](parallel.gpu.gpudevice.html) property MaxThreadBlockSize.
Example: [8 8 8]
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
MaxThreadsPerBlock - Maximum number of threads permissible in a single block
This property is read-only.
Maximum number of threads permissible in a single block for this CUDA kernel. The product of the elements of ThreadBlockSize must not exceed this value.
Example: 1024
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
GridSize - Size of grid of thread blocks
Size of the grid of thread blocks, specified as an integer vector of length 3. This is effectively the number of thread blocks that the GPU launches independently. No element of this vector can exceed the corresponding element of the MaxGridSize property of the GPUDevice object.
Example: [977 1 1]
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
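As an illustration, for a kernel that processes one element per thread over a vector of N elements, you might size the launch like this (a sketch; the block size of 256 is an arbitrary choice that must not exceed kern.MaxThreadsPerBlock):

N = 250000;                          % number of elements to process (example value)
kern.ThreadBlockSize = [256 1 1];    % 256 threads per block
kern.GridSize = [ceil(N/256) 1 1];   % enough blocks to cover all N elements, here [977 1 1]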
SharedMemorySize - Amount of dynamic shared memory per thread block
Amount of dynamic shared memory (in bytes) that each thread block can use. Each thread block has an available shared memory region. This memory is shared with registers on the multiprocessors. SharedMemorySize must not exceed the MaxShmemPerBlock property of the GPUDevice object.
As with all memory, this region must be allocated before the kernel is launched. It is common for the size of this shared memory region to be tied to the size of the thread block. Setting this value on the kernel ensures that each thread in a block can access this available shared memory region.
Example: 16000
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
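For example, if each thread in a block uses one double of dynamically allocated shared memory, you might tie the value to the thread block size (a sketch, assuming the kernel declares an extern __shared__ array):

kern.ThreadBlockSize = [500 1 1];
kern.SharedMemorySize = 8 * prod(kern.ThreadBlockSize);   % 8 bytes per double, one per thread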
EntryPoint - Entry point name in the PTX code
This property is read-only.
Name of the entry point in the PTX code that the kernel calls.
Example: "_Z13returnPointerPKfPy"
Data Types: char | string
MaxNumLHSArguments - Maximum number of left-hand side arguments
This property is read-only.
Maximum number of left-hand side arguments that the kernel supports. This value cannot be greater than the number of right-hand side arguments, and it is smaller if any of the inputs are constant or scalar.
Example: 1
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
NumRHSArguments - Required number of right-hand side arguments
This property is read-only.
Number of right-hand side arguments needed to call the kernel. Each input must define either the scalar value of an input, the elements for a vector input/output, or the size of an output argument.
Example: 5
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
ArgumentTypes - Expected MATLAB data types of the kernel inputs
This property is read-only.
Cell array of character vectors of length NumRHSArguments. Each character vector indicates the expected MATLAB® data type for that input by specifying a numeric type such as uint8, single, or double, followed by the word scalar or vector to indicate whether the argument is passed by value (scalar) or by reference (vector). In addition, if an argument is only an input to the kernel, it is prefixed by in; if it is an input/output, it is prefixed by inout. This information lets you decide how to call the kernel efficiently with both MATLAB arrays and gpuArray objects, and shows which of the kernel inputs are treated as outputs.
Example: {'inout double vector'} {'in double vector'} {'in double vector'} {'in uint32 scalar'} {'in uint32 scalar'}
Data Types: cell
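For example, for the addToVector kernel shown in the examples below, inspecting these properties might look like this (the displayed values are illustrative):

kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");
kern.NumRHSArguments      % expected: 3
kern.MaxNumLHSArguments   % expected: 1
kern.ArgumentTypes        % expected: {'inout single vector'} {'in single scalar'} {'in int32 scalar'}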
Object Functions
feval - Evaluate kernel on GPU
setConstantMemory - Set some constant memory on GPU
existsOnGPU - Determine if gpuArray or CUDAKernel is available on GPU
Examples
This example shows how to create a CUDAKernel object using a PTX file and a CU file, or using a PTX file and the function prototype.
The CUDA source file simpleEx.cu contains the following code:
/*
 * Add a constant to a vector.
 */
__global__ void addToVector(float * pi, float c, int vecLen) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < vecLen) {
        pi[idx] += c;
    }
}
Compile the CU file into a PTX file using mexcuda with the -ptx option.
mexcuda -ptx simpleEx.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.
Create a CUDA kernel using the PTX file and the CU file.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","simpleEx.cu");
Create a CUDA kernel using the PTX file and the function prototype of the addToVector function.
kern = parallel.gpu.CUDAKernel("simpleEx.ptx","float *,float,int");
Both of the preceding statements return a kernel object that you can use to call the addToVector CUDA kernel.
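For instance, you could then run the kernel with feval (a sketch; the block and grid sizes here are illustrative choices):

N = 1024;
v = gpuArray(single(1:N));           % vector to modify on the GPU
kern.ThreadBlockSize = [256 1 1];
kern.GridSize = [ceil(N/256) 1 1];
result = feval(kern,v,10,N);         % adds the constant 10 to every element of v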
This example shows how to create a CUDAKernel object from a PTX file with more than one entry point.
Suppose your CU file, myfun.cu, contains a function add1 for adding two doubles together and a function add2 for adding two vectors together.
__global__ void add1( double * a, double b ) {
    *a += b;
}
__global__ void add2( double * v1, const double * v2 ) {
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
Compile the CU file into a PTX file using mexcuda with the -ptx option.
mexcuda -ptx myfun.cu
Building with 'NVIDIA CUDA Compiler'.
MEX completed successfully.
The PTX file contains two entry points corresponding to the add1 and add2 functions. When your PTX code contains multiple entry points, you must specify an entry point when creating your kernel.
Create a kernel for adding two doubles together, specifying the entry point add1.
k = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","add1");
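You could then call the kernel with feval, for example (a sketch; add1 adds the scalar b to the value pointed to by a):

result = feval(k,0,7)   % returns 7 as a gpuArray, because add1 computes *a += b

Similarly, to create a kernel for adding two vectors, specify the entry point add2:

k2 = parallel.gpu.CUDAKernel("myfun.ptx","myfun.cu","add2");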
Version History
Introduced in R2010b