PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the Public Cloud

Part of Proceedings of Machine Learning and Systems 2 (MLSys 2020)


Authors

Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze

Abstract

Training deep learning models has become an important workload on the public cloud. Scaling cloud-based distributed training faces unique challenges from the hierarchical network topology of the datacenter and the dynamic nature of the multi-tenant environment. Timely training of deep learning models requires effective use of topology-induced locality in the datacenter network. This work proposes PLink, an optimized communication library that probes the physical network, generates and executes a hierarchical aggregation plan fitted to the discovered locality, and evolves that plan as network conditions change. PLink requires no support from cloud providers and operates out-of-the-box on unmodified public clouds. PLink serves as a direct plug-in to many training frameworks, delivering up to 2.3x better end-to-end training throughput for popular DL models on Azure and EC2 compared to the state of the art.
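The two mechanisms the abstract names, probing the network to discover locality and aggregating hierarchically along it, can be illustrated with a short sketch. The Python below is a minimal, hypothetical example, not PLink's actual API: names such as probe_latency, group_by_locality, and hierarchical_aggregate are invented for illustration, scalar floats stand in for gradient tensors, and the probe is simulated by a rack-prefix heuristic rather than real RTT measurements.

```python
# Hypothetical sketch of locality-aware hierarchical aggregation.
# None of these names come from PLink; they only illustrate the idea.
import random

def probe_latency(a: str, b: str) -> float:
    """Stand-in for a real network probe (e.g., RTT in ms).
    Hosts sharing a rack prefix are assumed 'close'; others 'far'."""
    return 0.1 if a.split("-")[0] == b.split("-")[0] else 1.0

def group_by_locality(hosts, threshold=0.5):
    """Greedily cluster hosts whose pairwise latency is below threshold."""
    groups = []
    for h in hosts:
        for g in groups:
            if all(probe_latency(h, other) < threshold for other in g):
                g.append(h)
                break
        else:
            groups.append([h])
    return groups

def hierarchical_aggregate(gradients, groups):
    """Two-level aggregation: reduce within each locality group first,
    then reduce the per-group partial sums across groups."""
    partials = []
    for g in groups:
        # Intra-group reduction stays on fast local links.
        partials.append(sum(gradients[h] for h in g))
    # Only one aggregated value per group crosses the slower links.
    return sum(partials)

hosts = ["rackA-0", "rackA-1", "rackB-0", "rackB-1"]
gradients = {h: random.random() for h in hosts}  # scalar stand-ins for tensors
groups = group_by_locality(hosts)
total = hierarchical_aggregate(gradients, groups)
assert abs(total - sum(gradients.values())) < 1e-9
print(groups, total)
```

The design point this toy example captures is that after intra-group reduction, only one flow per group traverses the slower cross-group links, which is where the topology-induced locality benefit comes from.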