gpucoder.stridedMatrixMultiplyAdd

Optimized GPU implementation of strided, batched matrix multiply with add operation

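Syntax

D = gpucoder.stridedMatrixMultiplyAdd(A,B,C)

___ = gpucoder.stridedMatrixMultiplyAdd(___,Name,Value)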

Description

D = gpucoder.stridedMatrixMultiplyAdd(A,B,C) performs strided matrix-matrix multiplication and addition on a batch of matrices. The input matrices A, B, and C for each instance of the batch are located at fixed address offsets from their addresses in the previous instance. The gpucoder.stridedMatrixMultiplyAdd function performs matrix-matrix multiplication of the form:

$$D = \alpha AB + \beta C$$

where α and β are scalar multiplication factors, and A, B, C, and D are matrices with dimensions m-by-k, k-by-n, m-by-n, and m-by-n, respectively. A and B can optionally be transposed or Hermitian-conjugated. By default, α and β are set to one and the matrices are not transposed. To specify a different scalar multiplication factor or to perform transpose operations on the input matrices, use the Name,Value pair arguments.

All the batches passed to the gpucoder.stridedMatrixMultiplyAdd function must be uniform. That is, all instances must have the same dimensions m, n, and k.
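The computation can be sketched in plain MATLAB as a loop over the batch. This is only a reference for the arithmetic, not a description of the GPU implementation. It assumes, as in the example on this page, that the batch is stored along the third array dimension. The sizes, the test data, and the values of alpha and beta are illustrative.

```matlab
% Reference sketch of the batched multiply-add, page by page.
% Assumption: each batch instance is one page along the third dimension.
m = 12; k = 14; n = 16; batchCount = 10;   % illustrative sizes
alpha = 0.3; beta = 0.6;                   % illustrative scale factors

% Deterministic test data so the result is reproducible.
A = reshape(1:m*k*batchCount, m, k, batchCount) / 1000;
B = reshape(1:k*n*batchCount, k, n, batchCount) / 1000;
C = ones(m, n, batchCount);

D = zeros(m, n, batchCount);
for p = 1:batchCount
    % Every page uses the same m, n, and k, which is the uniformity requirement.
    D(:,:,p) = alpha * A(:,:,p) * B(:,:,p) + beta * C(:,:,p);
end

disp(size(D))
```

Because every page has the same shape, the loop body never changes size, which is what allows a single strided, batched library call to replace it.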

___ = gpucoder.stridedMatrixMultiplyAdd(___,Name,Value) performs the batched matrix multiply and add operation by using the options specified by one or more Name,Value pair arguments.


Examples


Perform a simple batched matrix-matrix multiplication with add, and use the gpucoder.stridedMatrixMultiplyAdd function to generate CUDA® code that calls the corresponding cublas<t>gemmStridedBatched APIs.

In one file, write an entry-point function myStridedMatMulAdd that accepts matrix inputs A, B, and C. Because the input matrices are not transposed, use the 'nn' option.

    function [D] = myStridedMatMulAdd(A,B,C,alpha,beta)

    [D] = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',alpha, ...
        'beta',beta,'transpose','nn');

    end

To create a type for a matrix of doubles for use in code generation, use the coder.newtype function.

    A = coder.newtype('double',[12 14 10],[0 0 0]);
    B = coder.newtype('double',[14 16 10],[0 0 0]);
    C = coder.newtype('double',[12 16 10],[0 0 0]);
    alpha = 0.3;
    beta = 0.6;
    inputs = {A,B,C,alpha,beta};

To generate a CUDA library, use the codegen function.

    cfg = coder.gpuConfig('lib');
    cfg.GpuConfig.EnableCUBLAS = true;
    cfg.GpuConfig.EnableCUSOLVER = true;
    cfg.GenerateReport = true;
    codegen -config cfg -args inputs myStridedMatMulAdd

The generated CUDA code contains kernels named myStridedMatMulAdd_kernelNN for initializing the input and output matrices. The code also contains the cublasDgemmStridedBatched API calls to the cuBLAS library. The following code is a snippet of the generated code.

    //
    // File: myStridedMatMulAdd.cu
    ...

    void myStridedMatMulAdd(const double A[1680], const double B[2240],
                            const double C[1920], double alpha, double beta,
                            double D[1920])
    {
      double alpha1;
      ...
      alpha1 = alpha;
      beta1 = beta;
      cudaMemcpy(gpu_C, (void *)&C[0], 15360ULL, cudaMemcpyHostToDevice);
      myStridedMatMulAdd_kernel1<<<dim3(4U, 1U, 1U), dim3(512U, 1U, 1U)>>>(*gpu_C, *gpu_D);
      cudaMemcpy(gpu_alpha1, &alpha1, 8ULL, cudaMemcpyHostToDevice);
      cudaMemcpy(gpu_A, (void *)&A[0], 13440ULL, cudaMemcpyHostToDevice);
      cudaMemcpy(gpu_B, (void *)&B[0], 17920ULL, cudaMemcpyHostToDevice);
      cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
      cublasDgemmStridedBatched(getCublasGlobalHandle(), CUBLAS_OP_N, CUBLAS_OP_N,
                                12, 16, 14, (double *)gpu_alpha1,
                                (double *)&(*gpu_A)[0], 12, 168,
                                (double *)&(*gpu_B)[0], 14, 224,
                                (double *)gpu_beta1,
                                (double *)&(*gpu_D)[0], 12, 192, 10);
      cudaMemcpy(&D[0], gpu_D, 15360ULL, cudaMemcpyDeviceToHost);
      ...
    }
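The numeric arguments in the cublasDgemmStridedBatched call and the cudaMemcpy sizes follow from the input dimensions. The short script below recomputes them, assuming column-major storage with no padding, so that each leading dimension equals the number of rows and each stride equals the element count of one matrix. The variable names are chosen for this sketch only.

```matlab
% Recompute the sizes that appear in the generated code above.
m = 12; k = 14; n = 16; batchCount = 10;
bytesPerDouble = 8;

lda = m;  ldb = k;  ldd = m;    % leading dimensions: 12, 14, 12
strideA = m * k;                % 168 elements between consecutive A matrices
strideB = k * n;                % 224 elements between consecutive B matrices
strideD = m * n;                % 192 elements between consecutive D matrices

bytesA = strideA * batchCount * bytesPerDouble;   % 13440, the A cudaMemcpy size
bytesB = strideB * batchCount * bytesPerDouble;   % 17920, the B cudaMemcpy size
bytesD = strideD * batchCount * bytesPerDouble;   % 15360, the C and D cudaMemcpy size

% Linear (1-based) index of element (r,c) of page p in a column-major array.
r = 3; c = 5; p = 4;
idx = r + (c - 1) * lda + (p - 1) * strideA;

fprintf('strides: %d %d %d\n', strideA, strideB, strideD);
fprintf('bytes:   %d %d %d\n', bytesA, bytesB, bytesD);
```

The index formula is the "fixed address offset" idea in concrete form: moving from one batch instance to the next always adds the same stride.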

Input Arguments


A, B, C: Operands, specified as vectors or matrices. The number of columns in A must be equal to the number of rows in B. The number of rows in A must be equal to the number of rows in C. The number of columns in B must be equal to the number of columns in C.
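These size rules can be checked by comparing the array sizes directly before generating code. The sketch below uses the sizes from the example on this page. The batch-count check assumes the batch is stored along the third dimension, and the variable names are illustrative.

```matlab
% Size compatibility check for the untransposed case.
szA = [12 14 10];   % m-by-k-by-batch
szB = [14 16 10];   % k-by-n-by-batch
szC = [12 16 10];   % m-by-n-by-batch

innerOk = szA(2) == szB(1);                       % columns of A match rows of B
rowsOk  = szA(1) == szC(1);                       % rows of A match rows of C
colsOk  = szB(2) == szC(2);                       % columns of B match columns of C
batchOk = szA(3) == szB(3) && szB(3) == szC(3);   % same batch count everywhere

compatible = innerOk && rowsOk && colsOk && batchOk;
szD = [szA(1) szB(2) szA(3)];                     % expected size of the output D
disp(compatible)
```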

Name-Value Arguments


Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: D = gpucoder.stridedMatrixMultiplyAdd(A,B,C,'alpha',0.3,'beta',0.6,'transpose','CC');

alpha: Value of the scalar used for multiplication with A. The default value is one.

beta: Value of the scalar used for multiplication with C. The default value is one.

transpose: Character vector or string composed of two characters, indicating the operation performed on the matrices A and B prior to matrix multiplication. Possible values are normal ('N'), transposed ('T'), or complex conjugate transpose ('C').
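To illustrate how a two-character value can be read, the sketch below maps each character to an operation and applies it before an ordinary matrix product. This is plain MATLAB for a single matrix pair, not the GPU implementation. The lookup struct `ops` and the sample matrices are made up for the illustration, and the sketch assumes that the first character applies to A and the second to B.

```matlab
% Map each option character to the operation it names.
ops = struct('N', @(X) X, ...     % normal: use the matrix as is
             'T', @(X) X.', ...   % transpose without conjugation
             'C', @(X) X');       % complex conjugate transpose

A = [1+2i 3; 4 5-1i; 0 2];   % 3-by-2 complex sample matrix
B = [1 2; 3 4i; 5 6];        % 3-by-2 sample matrix
opt = 'CN';                  % conjugate-transpose A, leave B unchanged

opA = ops.(upper(opt(1)));   % assumed: first character applies to A
opB = ops.(upper(opt(2)));   % assumed: second character applies to B

P = opA(A) * opB(B);         % (2-by-3) * (3-by-2) gives a 2-by-2 product
disp(size(P))
```

For real-valued inputs, 'T' and 'C' give the same result, so the distinction matters only for complex data.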

Output Arguments


D: Product, returned as a scalar, vector, or matrix. Array D has the same number of rows as input A and the same number of columns as input B.

Version History

Introduced in R2020a
