ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware

Part of Proceedings of Machine Learning and Systems 4 pre-proceedings (MLSys 2022)

Paper

Bibtek download is not available in the pre-proceeding


Authors

Jaeyeon Won, Jeyeon Si, Sam Son, Tae Jun Ham, Jae W. Lee

Abstract

Recent progress in quantization techniques has demonstrated the feasibility of sub-8-bit quantization with a negligible end-to-end accuracy drop. However, today’s commodity hardware such as CPUs and GPUs is still suboptimal in executing these sub-8-bit quantized networks as its SIMD instructions only support the granularity of 8 bits or wider. This paper presents ULPPACK, a software technique to accelerate those ultra low-precision networks via effective operand packing. The key idea of ULPPACK is to pack multiple low-precision (<8 bits) operands densely into a single wide (16 bits) register and perform multiple narrow multiply-accumulate (MAC) operations with a single wide multiply. We introduce two effective packing schemes with different tradeoffs as well as optimizations to amortize the overhead of shifting and masking the output partial sum. Our evaluation of ULPPACK with a 512x512x512 GEMM kernel demonstrates substantial performance gains over state-of-the-art low-precision linear algebra libraries with a speedup of 2.1x, 1.8x, and 2.7x for 3-bit weights/activations (W3A3) over Google’s GEMMLOWP, Facebook’s QNNPACK, and an optimized bit-serial implementation, respectively. For end-to-end evaluation on PyTorch with seven 3-bit quantized convolutional neural networks (CNNs), ULPPACK achieves geomean speedups of 3.9x and 1.5x over the baseline 32-bit floating-point (FP32) and QNNPACK, respectively.