Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving

Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference

Bibtex Paper

Authors

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligentchatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently useGPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to furtherspeed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity.However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage thecapabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance.To maximize LLMs’ serving throughput, we introduce Atom, a low-bit quantization method that achieves highthroughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by usinglow-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracyby applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73×compared to the FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.