In-network Aggregation for Shared Machine Learning Clusters

Part of Proceedings of Machine Learning and Systems 3 (MLSys 2021)

Bibtex Paper

Authors

Nadeen Gebara, Manya Ghobadi, Paolo Costa

Abstract

We present PANAMA, a network architecture for machine learning (ML) workloads on shared clusters where a variety of training jobs co-exist.PANAMA consists of two key components: (i) an efficient in-network hardware accelerator designed to accelerate large data-parallel training transfers; and (ii) a lightweight congestion control protocol to enable fair sharing of network resources across different flows. Our congestion control protocol exploits the unique communication pattern in training to ensure large in-network aggregation transfers do not negatively impact short latency-sensitive flows. To evaluate the feasibility of PANAMA, we build an FPGA-based prototype with 10 Gbps transceivers and show that our hardware datapath achieves line-rate aggregation. Our large-scale simulations demonstrate that PANAMA improves the mean and 99%-tile completion time of latency-sensitive short flows by a factor of 2–4.5 while reducing the average training time of large jobs by a factor of 1.25.