SLoRA: Scalable Serving of Thousands of LoRA Adapters

Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference

Authors

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph Gonzalez, Ion Stoica

Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present SLoRA, a system designed for the scalable serving of many LoRA adapters. SLoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, SLoRA proposes a unified memory pool. This memory pool uses a unified paging mechanism to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, SLoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for batched LoRA computation. Collectively, these features enable SLoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), SLoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, SLoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
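
The unified paging idea in the abstract can be illustrated with a small sketch. It is a toy model in plain Python and not SLoRA's implementation: the class `UnifiedPagePool`, its methods, and the tags are hypothetical names. It also assumes that one page holds a single hidden-size vector, so a sequence of S tokens occupies S pages and a rank-r adapter occupies r pages; that granularity is an assumption of the sketch and is not stated in the abstract.

```python
class UnifiedPagePool:
    """Toy pool of fixed-size pages shared by KV cache and adapter weights.

    Hypothetical illustration only; the real system manages GPU memory.
    """

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))  # indices of unused pages
        self.owner = {}                           # page index -> owner tag

    def alloc(self, tag, n):
        """Reserve n pages for `tag`; the pages need not be contiguous."""
        if n > len(self.free_pages):
            raise MemoryError(f"need {n} pages, only {len(self.free_pages)} free")
        pages = [self.free_pages.pop() for _ in range(n)]
        for p in pages:
            self.owner[p] = tag
        return pages

    def free(self, pages):
        """Return pages to the pool so any later request can reuse them."""
        for p in pages:
            del self.owner[p]
            self.free_pages.append(p)

    def num_free(self):
        return len(self.free_pages)


pool = UnifiedPagePool(num_pages=16)

# Assumed granularity: one page per token (KV cache) or per unit of rank.
kv_a = pool.alloc("kv:req-a", 5)           # KV cache for a 5-token sequence
lora_8 = pool.alloc("adapter:rank-8", 8)   # weights of a rank-8 adapter

# The request finishes, and its pages go back to the shared pool.
pool.free(kv_a)

# A rank-4 adapter now fits into pages that previously held KV cache.
lora_4 = pool.alloc("adapter:rank-4", 4)
```

Because both kinds of tensor draw from the same set of equally sized pages, space released by a finished request is immediately usable by an adapter of any rank, which is one way a shared pool can limit fragmentation.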