Bibtek download is not available in the pre-proceeding
Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang
N:M sparsity is becoming increasingly popular due to its promise to achieve both high model accuracy and computational efficiency for deep learning. However, the real-world benefit of N:M sparsity is limited as there is a lack of dedicated GPU kernel implementations for general N:M sparsity with various sparsity ratios. In this work, we present nmSPARSE, a library of efficient GPU kernels for two fundamental operations in neural networks with N:M sparse weights: sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). By leveraging the intrinsic balance characteristic of N:M sparsity, nmSPARSE kernels rearrange irregular computation and scattered memory accesses in sparse matrix multiplication into hardware-aligned regular computation and conflict-free memory accesses at runtime. Evaluated on NVIDIA A100 GPU, nmSPARSE kernels achieve up to 5.2× speedup on SpMV and 6.0× speedup on SpMM over the fastest baseline. End-to-end studies on transformer models demonstrate that using nmSPARSE outperforms other baselines.