Release CUTLASS 2.11 · NVIDIA/cutlass (original) (raw)

2.11.0 (2022-11-19)

Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
Fused multi-head attention Kernel. It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel.
Dual GEMM, which can fuse A x B and A x C into one kernel. Two GEMMs has no producer-consumer dependency.
Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
BLAS3 functions with Hoppers new double precision matrix multiplication instructions.
ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of A matrix. B and output matrices are still dense. The block size can be arbitary.
Optimized Group Conv for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.
Optimized DepthWise Conv. Two new modes are added
- kOptimized - use direct conv to compute instead of implicit GEMM.
  * The restrictions are: 1) input ,output channel and group number should be multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration.
- kFixedStrideDilation - which puts stride and dilation into templates to further improve the performance. In this mode, kernel persistents some inputs into register to squeeze more performance, so large filter/stride/dilation is not recommanded.
  * The restrictions are: 1) input, output channel and group number should be multiple of (128 / sizeof(input element)). 2) input filter size, stride, dilation should same as the template parameter configuration.
Scripts to fuse multiple back-to-back GEMM. Its implementation was discussed in a GTC'22 Spring talk.
FP8 data type definition and conversion routines.
Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
Deprecation announcement: CUTLASS plans to deprecate the following:
- Maxwell and Pascal GPU architectures
- Ubuntu 16.04
- CUDA 10.2