Adaptive thread distributions for SpMV on a GPU

Abstract

We present a simple auto-tuning method to improve the performance of sparse matrix-vector multiplication (SpMV) on a GPU. The rows of the sparse matrix, stored in CSR format, are sorted in increasing order of the number of nonzero elements per row and partitioned into several ranges. A number of GPU threads per row (TPR) is then assigned to each range of rows to balance the workload across the GPU threads. Tests show that the method performs well relative to the NVIDIA sparse package for most of the matrices tested. The auto-tuning approach is easy to implement and the tuning process is fast. Unlike some other approaches to this problem, it does not require converting the matrix into several different formats and trying them one by one to find the best one.
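The sort, partition, and TPR-assignment steps described above can be illustrated with a small host-side sketch in Python. This is not the authors' implementation: the function names, the equal-sized ranges, and the rule used in `pick_tpr` (the smallest power of two at or above the mean row length, capped at an assumed warp size of 32) are all hypothetical stand-ins for the paper's auto-tuning step, which the abstract does not specify. The inner loop only emulates how TPR threads would stride across one CSR row; a real version would run as a GPU kernel.

```python
def row_lengths(row_ptr):
    """Number of nonzeros in each row of a CSR matrix."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def sort_rows_by_nnz(row_ptr):
    """Row indices ordered by increasing nonzero count (stable sort)."""
    lengths = row_lengths(row_ptr)
    return sorted(range(len(lengths)), key=lambda r: lengths[r])


def partition(order, num_ranges):
    """Split the sorted row order into contiguous, near-equal ranges.

    Equal-sized ranges are an assumption made for this sketch.
    """
    n = len(order)
    bounds = [n * k // num_ranges for k in range(num_ranges + 1)]
    return [order[bounds[k]:bounds[k + 1]] for k in range(num_ranges)]


def pick_tpr(rows, lengths, max_tpr=32):
    """Hypothetical heuristic: smallest power of two >= mean row length.

    Stands in for the tuning step; capped at max_tpr (assumed warp size).
    """
    if not rows:
        return 1
    mean = sum(lengths[r] for r in rows) / len(rows)
    tpr = 1
    while tpr < mean and tpr < max_tpr:
        tpr *= 2
    return tpr


def spmv_by_ranges(row_ptr, col_idx, vals, x, num_ranges=4):
    """Compute y = A @ x, processing each row range with its own TPR."""
    lengths = row_lengths(row_ptr)
    order = sort_rows_by_nnz(row_ptr)
    y = [0.0] * len(lengths)
    plan = []
    for rows in partition(order, num_ranges):
        tpr = pick_tpr(rows, lengths)
        plan.append((rows, tpr))
        for r in rows:
            start, end = row_ptr[r], row_ptr[r + 1]
            # Emulated thread t handles entries start+t, start+t+tpr, ...
            partial = [0.0] * tpr
            for t in range(tpr):
                for j in range(start + t, end, tpr):
                    partial[t] += vals[j] * x[col_idx[j]]
            # Stands in for the per-row reduction across the tpr threads.
            y[r] = sum(partial)
    return y, plan
```

Calling `spmv_by_ranges(row_ptr, col_idx, vals, x, num_ranges=2)` returns the product vector in the original row order together with the list of `(rows, tpr)` pairs, so short rows and long rows can be seen receiving different thread counts. Because the result is written back by original row index, the CSR arrays themselves are never reordered or converted.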

W. Gropp hasn't uploaded this paper.
