Minsik Cho, Ulrich Finkler, David Kung, Hillery Hunter
As deep neural networks become more complex and input datasets grow larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Enabling distributed deep learning at a massive scale is therefore critical, since it offers the potential to reduce the training time from weeks to hours. In this paper, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce-scatter and all-gather operations to exploit the trade-off between latency and bandwidth and to adapt to a variety of network configurations. As a result, each individual operation can be mapped to a different network fabric and take advantage of the best-performing library for that fabric. We integrated BlueConnect into Caffe2 and demonstrated that it significantly advances the state of the art in large-scale deep learning, reducing communication overhead by 87\% compared with prior art for ResNet-50 training on 192 GPUs.
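The decomposition described above can be illustrated with a minimal sketch. It simulates the collectives on plain Python lists and is not the BlueConnect implementation or its API. The function names (`reduce_scatter`, `all_gather`, `all_reduce`, `hierarchical_all_reduce`) and the two-level topology (GPUs within a node, nodes connected by a second fabric) are assumptions made for illustration only.

```python
from typing import List

Buffer = List[float]


def reduce_scatter(buffers: List[Buffer]) -> List[Buffer]:
    """Worker r ends up with the element-wise sum of chunk r only."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "buffer length must be divisible by the worker count"
    chunk = n // p
    return [
        [sum(buf[r * chunk + i] for buf in buffers) for i in range(chunk)]
        for r in range(p)
    ]


def all_gather(chunks: List[Buffer]) -> List[Buffer]:
    """Every worker ends up with the concatenation of all workers' chunks."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


def all_reduce(buffers: List[Buffer]) -> List[Buffer]:
    """Flat all-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(buffers))


def hierarchical_all_reduce(buffers: List[Buffer], gpus_per_node: int) -> List[Buffer]:
    """Two-level all-reduce (assumed topology): each stage could, in
    principle, run on a different fabric with a different library."""
    nodes = [
        buffers[i:i + gpus_per_node]
        for i in range(0, len(buffers), gpus_per_node)
    ]
    # Stage 1: reduce-scatter among the GPUs inside each node.
    scattered = [reduce_scatter(node) for node in nodes]
    # Stage 2: GPUs with the same local rank all-reduce their chunk across
    # nodes; the per-rank groups are independent and could run in parallel.
    for g in range(gpus_per_node):
        peers = [scattered[k][g] for k in range(len(nodes))]
        reduced = all_reduce(peers)
        for k in range(len(nodes)):
            scattered[k][g] = reduced[k]
    # Stage 3: all-gather inside each node to rebuild the full buffer.
    result: List[Buffer] = []
    for node_chunks in scattered:
        result.extend(all_gather(node_chunks))
    return result
```

In this sketch the hierarchical version produces the same result as the flat one, while each stage moves only a fraction of the buffer over any single level of the assumed topology.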