gpucoder.stridedMatrixMultiply - Optimized GPU implementation of strided and batched matrix multiply operation - MATLAB ([original](https://in.mathworks.com/help/gpucoder/ref/gpucoder.stridedmatrixmultiply.html))
Optimized GPU implementation of strided and batched matrix multiply operation
Syntax
Description
D = gpucoder.stridedMatrixMultiply(A,B)
performs strided matrix-matrix multiplication of a batch of matrices. The input matrices A and B for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. The gpucoder.stridedMatrixMultiply function performs matrix-matrix multiplication of the form:

D = αAB

where α is a scalar multiplication factor, and A, B, and D are matrices with dimensions m-by-k, k-by-n, and m-by-n, respectively. You can optionally transpose or Hermitian-conjugate A and B. By default, α is set to one and the matrices are not transposed. To specify a different scalar multiplication factor and perform transpose operations on the input matrices, use the Name,Value pair arguments.
All the batches passed to the gpucoder.stridedMatrixMultiply function must be uniform. That is, all instances must have the same dimensions m, n, and k.
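The per-instance semantics can be modeled outside MATLAB. The following NumPy sketch (an illustration under stated assumptions, not generated code) computes each batch instance independently, matching D(:,:,i) = α·A(:,:,i)·B(:,:,i) for a uniform batch:

```python
import numpy as np

def strided_matrix_multiply(A, B, alpha=1.0):
    """Reference model of the strided batched multiply: A is m-by-k-by-p,
    B is k-by-n-by-p, and each batch instance i computes
    D[:, :, i] = alpha * A[:, :, i] @ B[:, :, i]."""
    m, k, p = A.shape
    k2, n, p2 = B.shape
    # Uniform batch: every instance shares the same m, k, n.
    assert k == k2 and p == p2, "all batch instances must share m, k, n"
    D = np.empty((m, n, p))
    # Each instance sits at a fixed element offset (m*k for A, k*n for B)
    # from the previous one, which slicing along the last axis models here.
    for i in range(p):
        D[:, :, i] = alpha * (A[:, :, i] @ B[:, :, i])
    return D

# Mirrors the sizes used in the example below: 5-by-4 times 4-by-5, batch of 100.
A = np.random.rand(5, 4, 100)
B = np.random.rand(4, 5, 100)
D = strided_matrix_multiply(A, B, alpha=0.3)
```

On the GPU, the generated code replaces this loop with a single `cublas<t>gemmStridedBatched` call; the loop form above only models the result.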
___ = gpucoder.stridedMatrixMultiply(___,Name,Value) performs the strided batched matrix multiply operation by using the options specified by one or more Name,Value pair arguments.
Examples
Perform a simple batched matrix-matrix multiplication and use the gpucoder.stridedMatrixMultiply function to generate CUDA® code that calls the corresponding cublas<t>gemmStridedBatched APIs.
In one file, write an entry-point function myStridedMatMul that accepts matrix inputs A and B. Because the input matrices are not transposed, use the 'nn' option.
function [D] = myStridedMatMul(A,B,alpha)
    D = gpucoder.stridedMatrixMultiply(A,B,'alpha',alpha, ...
        'transpose','nn');
end
To create a type for a matrix of doubles for use in code generation, use the coder.newtype function.

A = coder.newtype('double',[5 4 100],[0 0]);
B = coder.newtype('double',[4 5 100],[0 0]);
alpha = 0.3;
inputs = {A,B,alpha};
To generate a CUDA library, use the codegen function.

cfg = coder.gpuConfig('lib');
cfg.GpuConfig.EnableCUBLAS = true;
cfg.GpuConfig.EnableCUSOLVER = true;
cfg.GenerateReport = true;
codegen -config cfg -args inputs myStridedMatMul
The generated CUDA code contains kernels myStridedMatMul_kernelNN for initializing the input and output matrices. The code also contains the cublasDgemmStridedBatched API calls to the cuBLAS library. The following code is a snippet of the generated code.
//
// File: myStridedMatMul.cu
//
...
void myStridedMatMul(const double A_data[], const int A_size[3],
                     const double B_data[], const int B_size[3], double alpha,
                     double D_data[], int D_size[3])
{
  double alpha1;
  ...
  beta1 = 0.0;
  cudaMemcpy(gpu_alpha1, &alpha1, 8ULL, cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_A_data, (void *)A_data,
             A_size[0] * A_size[1] * A_size[2] * sizeof(double),
             cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_B_data, (void *)B_data,
             B_size[0] * B_size[1] * B_size[2] * sizeof(double),
             cudaMemcpyHostToDevice);
  cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
  if (D_data_dirtyOnCpu) {
    cudaMemcpy(gpu_D_data, &D_data[0], 25 * D_size[2] * sizeof(double),
               cudaMemcpyHostToDevice);
  }
  if (batchDimsA[2] >= batchDimsB[2]) {
    if (batchDimsA[2] >= 1) {
      ntilecols = batchDimsA[2];
    } else {
      ntilecols = 1;
    }
  } else {
    ntilecols = batchDimsB[2];
  }
  cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
                            5, 5, 4, (double *)gpu_alpha1,
                            (double *)&gpu_A_data[0], 5, strideA,
                            (double *)&gpu_B_data[0], 4, strideB,
                            (double *)gpu_beta1, (double *)&gpu_D_data[0], 5,
                            25, ntilecols);
  cudaMemcpy(&D_data[0], gpu_D_data, 25 * D_size[2] * sizeof(double),
             cudaMemcpyDeviceToHost);
  ...
}
Input Arguments
Operands, specified as vectors or matrices. gpucoder.stridedMatrixMultiply multiplies along the first two dimensions.
Data Types: double | single | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Complex Number Support: Yes
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: D = gpucoder.stridedMatrixMultiply(A,B,'alpha',0.3,'transpose','CC');
alpha — Value of the scalar used for multiplication with A. The default value is one.
transpose — Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C').
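As a rough sketch of these option codes (using NumPy for illustration; the `op` helper below is hypothetical, not part of any GPU Coder API), each character selects the operation applied to one operand before the multiply:

```python
import numpy as np

def op(M, code):
    """Hypothetical helper mapping one transpose-option character to the
    operation applied to an operand before multiplication."""
    if code in ('N', 'n'):   # normal: use the matrix as-is
        return M
    if code in ('T', 't'):   # plain transpose
        return M.T
    if code in ('C', 'c'):   # complex conjugate (Hermitian) transpose
        return M.conj().T
    raise ValueError("code must be one of 'N', 'T', 'C'")

A = np.array([[1 + 2j, 3], [4, 5 - 1j]])
B = np.array([[2, 0], [1, 1j]])

# 'transpose','CC' with 'alpha',0.3 models D = 0.3 * op(A,'C') @ op(B,'C')
D = 0.3 * (op(A, 'C') @ op(B, 'C'))
```

The two characters apply to A and B respectively, so 'nt' would leave A unchanged and transpose B.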
Output Arguments
Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.
Version History
Introduced in R2020a
See Also
Apps
Functions
- codegen | coder.gpu.kernel | coder.gpu.kernelfun | gpucoder.stridedMatrixMultiplyAdd | gpucoder.batchedMatrixMultiply | gpucoder.batchedMatrixMultiplyAdd