cuSOLVER Example - MATLAB & Simulink
This example solves the system of linear equations Ax = B for x by using the cuSOLVER library. The matrices A and B must have the same number of rows. If A is a scalar, then A\B is equivalent to A.\B. If A is a square n-by-n matrix and B is a matrix with n rows, then x = A\B is a solution to the equation A*x = B, if it exists. The MATLAB® implementation of backslash is:
function [x] = backslash(A,b)
    if (isscalar(A))
        x = coder.nullcopy(zeros(size(b)));
    else
        x = coder.nullcopy(zeros(size(A,2),size(b,2)));
    end
    x = A\b;
end
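As a rough illustration of these semantics (not the generated code), the two branches can be sketched in plain C: when A is a scalar, A\B reduces to element-wise division, and for a small square A the system A*x = b can be solved directly. The function names backslash_scalar and backslash_2x2 are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Scalar case: A\B is equivalent to A.\B (element-wise division by A). */
static void backslash_scalar(double a, const double *b, double *x, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        x[i] = b[i] / a;
    }
}

/* Square 2-by-2 case: solve A*x = b directly (Cramer's rule).
 * A is row-major: [A[0] A[1]; A[2] A[3]]. Returns -1 if A is singular. */
static int backslash_2x2(const double A[4], const double b[2], double x[2]) {
    double det = A[0] * A[3] - A[1] * A[2];
    if (fabs(det) < 1e-12) {
        return -1;  /* no unique solution */
    }
    x[0] = (b[0] * A[3] - b[1] * A[1]) / det;
    x[1] = (A[0] * b[1] - A[2] * b[0]) / det;
    return 0;
}
```

For larger square systems the solve goes through an LU factorization instead, which is what the generated cuSOLVER calls below perform.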
Prepare backslash for Kernel Creation
GPU Coder™ requires no special pragma to generate calls to libraries. As before, there are two ways to generate CUDA® kernels: coder.gpu.kernelfun and coder.gpu.kernel. In this example, we use the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified backslash function is:
function [x] = backslash(A,b) %#codegen
    if (isscalar(A))
        x = coder.nullcopy(zeros(size(b)));
    else
        x = coder.nullcopy(zeros(size(A,2),size(b,2)));
    end
    coder.gpu.kernelfun()
    x = A\b;
end
Note
The input data must meet a minimum size for math operators and functions to be replaced with cuSOLVER library implementations. The minimum threshold is 128 elements.
Generated CUDA Code
When you generate CUDA code, GPU Coder creates function calls to initialize the cuSOLVER library, perform mldivide operations, and release hardware resources that the cuSOLVER library uses. A snippet of the generated CUDA code is:
cusolverEnsureInitialization();
/* Copyright 2017 The MathWorks, Inc. */
cudaMemcpy(b_gpu_A, A, 1152UL, cudaMemcpyHostToDevice);
backslash_kernel1<<<dim3(1U, 1U, 1U), dim3(160U, 1U, 1U)>>>(b_gpu_A, gpu_A);
cudaMemcpy(b_A, gpu_A, 1152UL, cudaMemcpyDeviceToHost);
cusolverDnDgetrf_bufferSize(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
&cusolverWorkspaceReq);
cusolverWorkspaceTypeSize = 8;
cusolverInitWorkspace();
cudaMemcpy(gpu_A, b_A, 1152UL, cudaMemcpyHostToDevice);
cusolverDnDgetrf(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12, (real_T *)
cusolverWorkspaceBuff, &gpu_ipiv_t[0], gpu_info_t);
A_dirtyOnGpu = true;
cudaMemcpy(&info_t, gpu_info_t, 4UL, cudaMemcpyDeviceToHost);
To initialize the cuSOLVER library and create a handle to the cuSOLVER library context, the function cusolverEnsureInitialization() calls the cusolverDnCreate() cuSOLVER API. It allocates hardware resources on the host and device.
static void cusolverEnsureInitialization(void)
{
  if (cusolverGlobalHandle == NULL) {
    cusolverDnCreate(&cusolverGlobalHandle);
  }
}
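The pattern here is lazy one-time initialization of a global handle: the expensive create call runs only on first use, and later calls are no-ops. A minimal plain-C sketch of the same pattern, with a stand-in SolverHandle type in place of the real cusolverDnHandle_t (all names hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cusolverDnHandle_t: any opaque context object. */
typedef struct { int initialized; } SolverHandle;

static SolverHandle *globalHandle = NULL;
static SolverHandle storage;
static int createCalls = 0;  /* counts how often the expensive create runs */

/* Stand-in for cusolverDnCreate(): set up the context. */
static void solverCreate(SolverHandle **h) {
    storage.initialized = 1;
    *h = &storage;
    ++createCalls;
}

/* Mirrors cusolverEnsureInitialization(): create the handle on first use only. */
static void ensureInitialization(void) {
    if (globalHandle == NULL) {
        solverCreate(&globalHandle);
    }
}
```

Note that, as sketched, this is not thread-safe; the generated code calls it from a single host thread.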
backslash_kernel1 zero pads the matrix A. This kernel is launched with a single block of 160 threads.
static __global__ __launch_bounds__(160, 1) void backslash_kernel1(
    const real_T *A, real_T *b_A)
{
  int32_T threadId;
  threadId = (int32_T)(((gridDim.x * gridDim.y * blockIdx.z + gridDim.x *
    blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
    (int32_T)((threadIdx.z * blockDim.x * blockDim.y + threadIdx.y *
    blockDim.x) + threadIdx.x));
  if (!(threadId >= 144)) {
    /* Copyright 2017 The MathWorks, Inc. */
    b_A[threadId] = A[threadId];
  }
}
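The long threadId expression flattens the 3-D grid and block coordinates into one linear index: blocks are linearized over (x, y, z), then threads within a block. A host-side C sketch of that computation (the Dim3 struct and function name are illustrative stand-ins, not CUDA types):

```c
#include <assert.h>

/* Stand-in for CUDA's built-in dim3/uint3 coordinate triples. */
typedef struct { unsigned x, y, z; } Dim3;

/* The flattened global thread index used by the generated kernel. */
static int linear_thread_id(Dim3 gridDim, Dim3 blockDim,
                            Dim3 blockIdx, Dim3 threadIdx) {
    int blockId = (int)((gridDim.x * gridDim.y * blockIdx.z
                        + gridDim.x * blockIdx.y) + blockIdx.x);
    int threadInBlock = (int)((threadIdx.z * blockDim.x * blockDim.y
                              + threadIdx.y * blockDim.x) + threadIdx.x);
    return blockId * (int)(blockDim.x * blockDim.y * blockDim.z) + threadInBlock;
}
```

With the single-block launch above (grid 1x1x1, block 160x1x1), the index is simply threadIdx.x, and the guard `threadId >= 144` leaves the 16 surplus threads idle.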
Calls to cudaMemcpy transfer the matrix A from the host to the device. The function cusolverDnDgetrf computes the LU factorization of an m×n matrix:
P*A = L*U
where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.
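For reference, the factorization that cusolverDnDgetrf performs can be sketched on the CPU as an in-place LU decomposition with partial pivoting. This is a simplified illustration, not the cuSOLVER implementation; dgetrf_ref is a hypothetical name, and the pivot indices here are 0-based, whereas LAPACK and cuSOLVER return 1-based pivots.

```c
#include <assert.h>
#include <math.h>

/* In-place LU factorization with partial pivoting, P*A = L*U.
 * On exit, A holds U in its upper triangle and the unit-diagonal L below it;
 * ipiv[j] records the row swapped with row j at step j (0-based).
 * Column-major storage with leading dimension lda, as in cuSOLVER.
 * Returns 0 on success, or j+1 if column j has a zero pivot. */
static int dgetrf_ref(int n, double *A, int lda, int *ipiv) {
    for (int j = 0; j < n; ++j) {
        /* Partial pivoting: largest entry in column j at or below row j. */
        int p = j;
        for (int i = j + 1; i < n; ++i) {
            if (fabs(A[i + j * lda]) > fabs(A[p + j * lda])) p = i;
        }
        ipiv[j] = p;
        if (A[p + j * lda] == 0.0) return j + 1;  /* singular */
        if (p != j) {  /* swap full rows j and p */
            for (int k = 0; k < n; ++k) {
                double t = A[j + k * lda];
                A[j + k * lda] = A[p + k * lda];
                A[p + k * lda] = t;
            }
        }
        /* Eliminate below the pivot, storing the multipliers (L) in place. */
        for (int i = j + 1; i < n; ++i) {
            double m = A[i + j * lda] / A[j + j * lda];
            A[i + j * lda] = m;
            for (int k = j + 1; k < n; ++k) {
                A[i + k * lda] -= m * A[j + k * lda];
            }
        }
    }
    return 0;
}
```

The workspace query (cusolverDnDgetrf_bufferSize) and info flag in the generated code correspond to the scratch buffer and the singularity return code of this routine.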
cuSOLVER Standalone Code
For functions like qr that have only partial support in cuSOLVER, GPU Coder uses the LAPACK library where necessary. For MEX functions, the code generator uses the LAPACK library that is included with MATLAB. For standalone code, the code generator uses the LAPACK library that you specify. To specify the LAPACK library:
- At the command line, define your own coder.LAPACKCallback class containing the LAPACK library information and assign it to the CustomLAPACKCallback property of the code configuration object.
- In the GPU Coder app, set the Custom LAPACK library callback parameter to your LAPACK library.
For example, to generate a standalone executable, you can use the following code generation script. Here, myLAPACK is the name of the custom coder.LAPACKCallback class containing the LAPACK library information.
cfg = coder.gpuConfig('exe');
cfg.CustomLAPACKCallback = 'myLAPACK';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
classdef myLAPACK < coder.LAPACKCallback
    methods (Static)
        function hn = getHeaderFilename()
            hn = 'lapacke.h';
        end
        function updateBuildInfo(buildInfo, buildctx)
            [~,linkLibExt] = buildctx.getStdLibInfo();
            cudaPath = getenv('CUDA_PATH');
            libPath = 'lib\x64';
            buildInfo.addIncludePaths(fullfile(cudaPath,'include'));
            libName = 'cusolver';
            libPath = fullfile(cudaPath,libPath);
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            lapackLocation = 'C:\LAPACK\win64'; % specify path to LAPACK libraries
            includePath = fullfile(lapackLocation,'include');
            buildInfo.addIncludePaths(includePath);
            libPath = fullfile(lapackLocation,'lib');
            libName = 'mllapack';
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            buildInfo.addDefines('HAVE_LAPACK_CONFIG_H');
            buildInfo.addDefines('LAPACK_COMPLEX_STRUCTURE');
        end
    end
end
For more information, see Speed Up Linear Algebra in Generated Standalone Code by Using LAPACK Calls.