ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs

Part of Proceedings of Machine Learning and Systems 5 pre-proceedings (MLSys 2023)




Guyue Huang, Yang Bai, Liu Liu, Yuke Wang, Bei Yu, Yufei Ding, Yuan Xie


Pipelining between data loading and computation is a critical tensor program optimization for GPUs. To unleash the full performance of the latest GPUs, multi-stage pipelining must be optimized jointly across the GPU's multi-level buffer hierarchy. Existing frameworks rely on hand-written libraries such as cuBLAS to perform pipelining optimization, an approach that does not extend to new operators and does not compose with prior tensor compiler optimizations. This paper presents ALCOP, the first framework that is compiler-native and fully supports multi-stage, multi-level pipelining. ALCOP overcomes three critical obstacles in generating code for pipelining: detecting pipelining-applicable buffers, transforming programs for multi-level, multi-stage pipelining, and searching schedule parameters efficiently by incorporating static analysis. Experiments show that ALCOP generates programs with an average speedup of 1.23× (up to 1.73×) over vanilla TVM. On end-to-end models, ALCOP improves upon TVM by up to 1.18× and upon XLA by up to 1.64×. In addition, our performance model significantly improves the efficiency of schedule tuning: it finds schedules that reach 99% of the performance given by exhaustive search while requiring 40× fewer trials.
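To make the benefit of overlapping loads with computation concrete, the following is a minimal toy timeline model, not ALCOP's performance model or its program transformation. The function name `pipelined_time` and its parameters are hypothetical, and the model assumes fixed per-tile load and compute times, one load in flight at a time, and `stages` buffer slots at a single buffer level.

```python
def pipelined_time(num_tiles, load_time, compute_time, stages):
    """Toy estimate of total time to load and compute `num_tiles` tiles.

    Assumptions (illustrative only): loads are issued one after another,
    computes run one after another, and a load may start only when a
    buffer slot is free, i.e. once the tile that used the slot `stages`
    iterations earlier has finished computing.
    """
    load_done = []
    compute_done = []
    for i in range(num_tiles):
        # The load unit is busy until the previous load has finished.
        prev_load = load_done[i - 1] if i >= 1 else 0
        # The buffer slot is busy until tile i - stages has been consumed.
        slot_free = compute_done[i - stages] if i >= stages else 0
        load_start = max(prev_load, slot_free)
        load_done.append(load_start + load_time)

        # Compute needs its own data and the previous compute to be done.
        prev_compute = compute_done[i - 1] if i >= 1 else 0
        compute_start = max(load_done[i], prev_compute)
        compute_done.append(compute_start + compute_time)
    return compute_done[-1]


# One buffer slot: every load waits for the previous compute (no overlap).
serial = pipelined_time(num_tiles=4, load_time=2, compute_time=3, stages=1)
# Two buffer slots: the next load overlaps with the current compute.
overlapped = pipelined_time(num_tiles=4, load_time=2, compute_time=3, stages=2)
print(serial, overlapped)
```

In this toy setting a single buffer slot serializes every load behind the preceding compute, while a second slot lets the load of the next tile proceed during the current compute, so only the first load remains exposed. Real GPU kernels differ in that load latency varies and several loads can be in flight at once across more than one buffer level, which this sketch deliberately does not model.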