cuSOLVER Example

This example solves the system of linear equations Ax = B for x by using the cuSOLVER library. The matrices A and B must have the same number of rows. If A is a scalar, then A\B is equivalent to A.\B. If A is a square n-by-n matrix and B is a matrix with n rows, then x = A\B is a solution to the equation A*x = B, if it exists. The MATLAB® implementation of backslash is:

function [x] = backslash(A,b)
if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end

x = A\b;

end

Prepare backslash for Kernel Creation

GPU Coder™ requires no special pragma to generate calls to libraries. There are two ways to generate CUDA® kernels — coder.gpu.kernelfun and coder.gpu.kernel. In this example, we use the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified backslash function is:

function [x] = backslash(A,b) %#codegen

if (isscalar(A))
    x = coder.nullcopy(zeros(size(b)));
else
    x = coder.nullcopy(zeros(size(A,2),size(b,2)));
end

coder.gpu.kernelfun()
x = A\b;

end

Note

A minimum size is required on the input data for replacing math operators and functions with cuSOLVER library implementations. The minimum threshold is 128 elements. (The 12-by-12 matrix in this example has 144 elements, which is above the threshold.)

Generated CUDA Code

When you generate CUDA code, GPU Coder creates function calls to initialize the cuSOLVER library, perform mldivide operations, and release hardware resources that the cuSOLVER library uses. A snippet of the generated CUDA code is:

cusolverEnsureInitialization();

/* Copyright 2017 The MathWorks, Inc. */
cudaMemcpy(b_gpu_A, A, 1152UL, cudaMemcpyHostToDevice);
backslash_kernel1<<<dim3(1U, 1U, 1U), dim3(160U, 1U, 1U)>>>(b_gpu_A, gpu_A);
cudaMemcpy(b_A, gpu_A, 1152UL, cudaMemcpyDeviceToHost);
cusolverDnDgetrf_bufferSize(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
                            &cusolverWorkspaceReq);
cusolverWorkspaceTypeSize = 8;
cusolverInitWorkspace();
cudaMemcpy(gpu_A, b_A, 1152UL, cudaMemcpyHostToDevice);
cusolverDnDgetrf(cusolverGlobalHandle, 12, 12, &gpu_A[0], 12,
                 (real_T *)cusolverWorkspaceBuff, &gpu_ipiv_t[0], gpu_info_t);
A_dirtyOnGpu = true;
cudaMemcpy(&info_t, gpu_info_t, 4UL, cudaMemcpyDeviceToHost);

To initialize the cuSOLVER library and create a handle to the cuSOLVER library context, the function cusolverEnsureInitialization() calls the cusolverDnCreate() cuSOLVER API. It allocates hardware resources on the host and device.

static void cusolverEnsureInitialization(void)
{
  if (cusolverGlobalHandle == NULL) {
    cusolverDnCreate(&cusolverGlobalHandle);
  }
}

backslash_kernel1 zero pads the matrix A. This kernel is launched with a single block of 160 threads.

static __global__ __launch_bounds__(160, 1) void backslash_kernel1(const real_T *A,
  real_T *b_A)
{
  int32_T threadId;
  threadId = (int32_T)(((gridDim.x * gridDim.y * blockIdx.z + gridDim.x *
    blockIdx.y) + blockIdx.x) * (blockDim.x * blockDim.y * blockDim.z) +
    (int32_T)((threadIdx.z * blockDim.x * blockDim.y + threadIdx.y * blockDim.x)
    + threadIdx.x));
  if (!(threadId >= 144)) {
    /* Copyright 2017 The MathWorks, Inc. */
    b_A[threadId] = A[threadId];
  }
}

Calls to cudaMemcpy transfer the matrix A from the host to the device. The function cusolverDnDgetrf computes the LU factorization of an m×n matrix:

P*A = L*U

where A is an m×n matrix, P is a permutation matrix, L is a lower triangular matrix with unit diagonal, and U is an upper triangular matrix.

cuSOLVER Standalone Code

For functions like qr that have only partial support in cuSOLVER, GPU Coder uses the LAPACK library where necessary. For MEX functions, the code generator uses the LAPACK library that is included with MATLAB. For standalone code, the code generator uses the LAPACK library that you specify through a custom coder.LAPACKCallback class.

For example, to generate a standalone executable, you can use the following code generation script. Here myLAPACK is the name of the custom coder.LAPACKCallback class containing the LAPACK library information.

cfg = coder.gpuConfig('exe');
cfg.CustomLAPACKCallback = 'myLAPACK';
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';

classdef myLAPACK < coder.LAPACKCallback
    methods (Static)
        function hn = getHeaderFilename()
            hn = 'lapacke.h';
        end
        function updateBuildInfo(buildInfo, buildctx)
            [~,linkLibExt] = buildctx.getStdLibInfo();
            cudaPath = getenv('CUDA_PATH');
            libPath = 'lib\x64';

            buildInfo.addIncludePaths(fullfile(cudaPath,'include'));
            libName = 'cusolver';
            libPath = fullfile(cudaPath,libPath);
            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);

            lapackLocation = 'C:\LAPACK\win64'; % specify path to LAPACK libraries

            includePath = fullfile(lapackLocation,'include');
            buildInfo.addIncludePaths(includePath);
            libPath = fullfile(lapackLocation,'lib');
            libName = 'mllapack';

            buildInfo.addLinkObjects([libName linkLibExt], libPath, ...
                '', true, true);
            buildInfo.addDefines('HAVE_LAPACK_CONFIG_H');
            buildInfo.addDefines('LAPACK_COMPLEX_STRUCTURE');
        end
    end
end

For more information, see Speed Up Linear Algebra in Generated Standalone Code by Using LAPACK Calls.