Part of Proceedings of Machine Learning and Systems 5 (MLSys 2023)
Saurav Muralidharan
Deep neural networks are often highly over-parameterized, and weight pruning or sparsification can be an effective method for reducing both their memory footprints and inference latencies. Among existing pruning strategies, unstructured or fine-grained pruning typically achieves the highest compression ratios and lowest task errors; unfortunately, such irregular and non-uniform sparsity leads to significant load imbalance and consequently degraded performance on parallel architectures. Recent attempts to accelerate unstructured sparsity on GPUs have focused on the 90-99% sparsity regime, where most modern DNNs have been shown to lose considerable accuracy. In this paper, we introduce the uniform sparsity pattern, which ensures a constant number of non-zero values per row of the sparse matrix and thus lends itself well to efficient, load-balanced execution on modern parallel architectures. Uniform sparsity achieves useful speedups in both the moderate (50-90%) and high (90%+) sparsity regimes and performs similarly to unstructured sparsity in terms of accuracy. We describe how uniform sparsity is induced on DNN weights and present optimized kernels that accelerate uniform sparsity on GPUs. We evaluate uniform sparsity on a range of real-world networks and synthetic data, and demonstrate mean performance improvements of up to 62% over the NVIDIA cuSPARSE library at iso-accuracy settings.
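To make the pattern concrete, the following is a minimal NumPy sketch of one plausible way to induce uniform sparsity via per-row magnitude pruning, keeping the same number of largest-magnitude weights in every row; it is not the paper's implementation, and the function name `prune_uniform` and the exact selection rule are illustrative assumptions.

```python
import numpy as np

def prune_uniform(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Illustrative sketch (not the paper's method): zero out all but the
    k largest-magnitude entries in each row, so every row of the result
    has an identical non-zero count."""
    rows, cols = weights.shape
    k = max(1, int(round(cols * (1.0 - sparsity))))  # non-zeros kept per row
    pruned = np.zeros_like(weights)
    # Column indices of the k largest-magnitude entries in each row.
    topk = np.argpartition(np.abs(weights), cols - k, axis=1)[:, cols - k:]
    # Copy only those entries into the otherwise-zero output matrix.
    np.put_along_axis(pruned, topk,
                      np.take_along_axis(weights, topk, axis=1), axis=1)
    return pruned

# Example: 75% uniform sparsity on a random 4x8 weight matrix
# leaves exactly 2 non-zero values in every row.
W = np.random.randn(4, 8)
W_sparse = prune_uniform(W, sparsity=0.75)
assert all((row != 0).sum() == 2 for row in W_sparse)
```

Because every row carries the same number of non-zeros, work can be divided evenly across GPU threads or warps, which is the load-balance property the abstract highlights.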