Stream-K, a new and more general way to perform split-K decomposition. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
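For reference, a minimal sketch of how Stream-K can be selected through the device-level `GemmUniversal` API under CUTLASS 2.x conventions; the data types, tile shapes, and Sm80 target are illustrative choices, and `GemmStreamK` is just a local alias:

```c++
#include "cutlass/gemm/device/gemm_universal.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle_streamk.h"
#include "cutlass/epilogue/thread/linear_combination.h"

using GemmStreamK = cutlass::gemm::device::GemmUniversal<
    cutlass::half_t, cutlass::layout::RowMajor,      // A
    cutlass::half_t, cutlass::layout::ColumnMajor,   // B
    cutlass::half_t, cutlass::layout::RowMajor,      // C / D
    float,                                           // accumulator
    cutlass::arch::OpClassTensorOp,                  // use Tensor Cores
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 32>,          // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,            // warp tile
    cutlass::gemm::GemmShape<16, 8, 16>,             // MMA instruction
    cutlass::epilogue::thread::LinearCombination<
        cutlass::half_t, 8, float, float>,
    // Selecting this swizzle is what turns on the Stream-K work decomposition;
    // everything else above is an ordinary GemmUniversal configuration.
    cutlass::gemm::threadblock::ThreadblockSwizzleStreamK>;
```

Because Stream-K balances work across SMs regardless of how the problem shape divides into tiles, a single tile configuration like the one above covers many problem sizes that would otherwise each need their own profiled tile shape.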
Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both variants run in a single kernel.
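To make the batched-versus-grouped distinction concrete, here is a hypothetical host-side sketch (not the attention kernel's actual interface; `make_qk_problem_sizes` is an invented name) that builds one GEMM problem shape per batch entry for the Q x K^T product. With fixed sequence lengths every entry would share one shape, so a batched GEMM suffices; with variable lengths, each entry becomes its own problem, which is what grouped GEMM consumes:

```c++
#include <vector>
#include "cutlass/gemm_coord.h"

std::vector<cutlass::gemm::GemmCoord> make_qk_problem_sizes(
    const std::vector<int>& seq_lens, int head_dim) {
  std::vector<cutlass::gemm::GemmCoord> sizes;
  for (int len : seq_lens) {
    // Q (len x head_dim) times K^T (head_dim x len) -> scores (len x len)
    sizes.push_back(cutlass::gemm::GemmCoord(len, len, head_dim));
  }
  return sizes;
}
```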
Dual GEMM, which fuses A x B and A x C into one kernel. The two GEMMs share the A operand and have no producer-consumer dependency.
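A naive host-side reference for the semantics (illustration only, not the CUTLASS kernel; `dual_gemm_reference` is a hypothetical name): both products read the same A, and neither depends on the other, which is what lets a fused kernel load each A tile once and feed both accumulators.

```c++
#include <vector>

void dual_gemm_reference(int M, int N, int K,
                         const std::vector<float>& A,  // M x K, row-major
                         const std::vector<float>& B,  // K x N, row-major
                         const std::vector<float>& C,  // K x N, row-major
                         std::vector<float>& D0,       // M x N = A x B
                         std::vector<float>& D1) {     // M x N = A x C
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc0 = 0.f, acc1 = 0.f;
      for (int k = 0; k < K; ++k) {
        float a = A[m * K + k];  // each A element feeds both products
        acc0 += a * B[k * N + n];
        acc1 += a * C[k * N + n];
      }
      D0[m * N + n] = acc0;
      D1[m * N + n] = acc1;
    }
  }
}
```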
BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.
ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. The B and output matrices remain dense. The block size can be arbitrary.
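As a sketch of the semantics, the following naive host code multiplies an ELL-described block-sparse A by a dense B (illustration only, not the CUTLASS kernel; the storage layout, the function name, and the use of a negative index as a padding sentinel are assumptions). In the ELL format, each block row stores a fixed number of blocks plus an index matrix giving each stored block's block-column:

```c++
#include <vector>

void ell_block_sparse_gemm_reference(
    int block_rows,        // number of block rows in A
    int blocks_per_row,    // fixed ELL width: stored blocks per block row
    int bs,                // block size (square bs x bs blocks)
    int N,                 // number of columns in dense B and D
    const std::vector<float>& A_blocks,  // block_rows * blocks_per_row * bs * bs
    const std::vector<int>&   ell_col,   // block_rows * blocks_per_row
    const std::vector<float>& B,         // (num_block_cols * bs) x N, row-major
    std::vector<float>& D) {             // (block_rows * bs) x N, pre-zeroed
  for (int br = 0; br < block_rows; ++br) {
    for (int e = 0; e < blocks_per_row; ++e) {
      int bc = ell_col[br * blocks_per_row + e];
      if (bc < 0) continue;  // padding entry: this ELL slot holds no block
      const float* blk = &A_blocks[(br * blocks_per_row + e) * bs * bs];
      for (int i = 0; i < bs; ++i) {
        for (int n = 0; n < N; ++n) {
          float acc = 0.f;
          for (int j = 0; j < bs; ++j)
            acc += blk[i * bs + j] * B[(bc * bs + j) * N + n];
          D[(br * bs + i) * N + n] += acc;  // accumulate across stored blocks
        }
      }
    }
  }
}
```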
Optimized Group Conv for SingleGroup mode, which requires that the number of output channels per group be a multiple of the threadblock tile N.
kOptimized - uses direct convolution to compute instead of implicit GEMM.
  * The restrictions are: 1) the input channel count, output channel count, and group count must each be a multiple of (128 / sizeof(input element)); 2) the input filter size must match the template parameter configuration.
kFixedStrideDilation - puts the stride and dilation into template parameters to further improve performance. In this mode, the kernel keeps some inputs persistent in registers to squeeze out more performance, so large filter/stride/dilation values are not recommended.
  * The restrictions are: 1) the input channel count, output channel count, and group count must each be a multiple of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation must match the template parameter configuration.
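Both modes share the first restriction. A minimal sketch of that alignment check, derived directly from the restrictions above (`conv_channel_alignment_ok` is a hypothetical helper, not a CUTLASS API):

```c++
// Channel and group counts must each cover a whole 128-byte span of elements.
template <typename Element>
bool conv_channel_alignment_ok(int input_channels, int output_channels,
                               int groups) {
  constexpr int multiple = 128 / int(sizeof(Element));  // e.g. 64 for a 2-byte type
  return input_channels % multiple == 0 &&
         output_channels % multiple == 0 &&
         groups % multiple == 0;
}
```

The second restriction in each mode is enforced at compile time by construction, since the filter size (and, for kFixedStrideDilation, the stride and dilation) are template parameters.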
Scripts to fuse multiple back-to-back GEMMs. The implementation was discussed in a GTC'22 Spring talk.
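For contrast with the dual GEMM above, a naive reference for the back-to-back case (`back_to_back_gemm_reference` is a hypothetical name): here the second GEMM consumes the first one's output, so the point of the fusion is to keep that intermediate on-chip rather than writing it to and re-reading it from global memory.

```c++
#include <vector>

void back_to_back_gemm_reference(int M, int N0, int N1, int K,
                                 const std::vector<float>& A,   // M x K
                                 const std::vector<float>& B0,  // K x N0
                                 const std::vector<float>& B1,  // N0 x N1
                                 std::vector<float>& D1) {      // M x N1
  // Intermediate that a fused kernel would keep on-chip instead of storing.
  std::vector<float> D0(static_cast<size_t>(M) * N0, 0.f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N0; ++n)
      for (int k = 0; k < K; ++k)
        D0[m * N0 + n] += A[m * K + k] * B0[k * N0 + n];
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N1; ++n) {
      float acc = 0.f;
      for (int k = 0; k < N0; ++k)
        acc += D0[m * N0 + k] * B1[k * N1 + n];  // consumes the first GEMM's output
      D1[m * N1 + n] = acc;
    }
}
```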