{"title": "BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy", "book": "Proceedings of Machine Learning and Systems", "page_first": 241, "page_last": 251, "abstract": "As deep neural networks get more complex and input datasets get larger, it can take days or\neven weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a\nmassive scale is critical, since it offers the potential to reduce the training\ntime from weeks to hours. In this paper, we present BlueConnect, an\nefficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms.\nBlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce-scatter and all-gather operations\nto exploit the trade-off between\nlatency and bandwidth, and adapt to a variety of network configurations. Therefore, each individual operation can be mapped\nto a different network fabric and take advantage of the best performing library for that fabric.\nWe integrated BlueConnect into Caffe2, and demonstrated that BlueConnect significantly\npushes the state-of-the-art in large-scale deep learning\nby reducing communication overhead by 87\\% on 192 GPUs for Resnet-50 training over prior arts.", "full_text": "BLUECONNECT: DECOMPOSING ALL-REDUCE FOR DEEP LEARNING ON HETEROGENEOUS NETWORK HIERARCHY\r\n\r\nMinsik Cho 1, Ulrich Finkler 2, David Kung 2, Hillery Hunter 2\r\n\r\n1 IBM Systems, Austin, Texas, USA. 2 IBM T. J. Watson Research Center, Yorktown Heights, New York, USA. Correspondence to: Minsik Cho <minsikcho@us.ibm.com>.\r\n\r\nProceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\n\r\nABSTRACT\r\n\r\nAs deep neural networks get more complex and input datasets get larger, it can take days or even weeks to train a deep neural network to the desired accuracy. Therefore, enabling distributed deep learning at a massive scale is critical, since it offers the potential to reduce the training time from weeks to hours. In this paper, we present BlueConnect, an efficient communication library for distributed deep learning that is highly optimized for popular GPU-based platforms. BlueConnect decomposes a single all-reduce operation into a large number of parallelizable reduce-scatter and all-gather operations to exploit the trade-off between latency and bandwidth, and adapt to a variety of network configurations. Therefore, each individual operation can be mapped to a different network fabric and take advantage of the best performing implementation for the corresponding fabric. According to our experimental results on two system configurations, BlueConnect can outperform the leading industrial communication library by a wide margin, and the BlueConnect-integrated Caffe2 can significantly reduce synchronization overhead by 87% on 192 GPUs for Resnet-50 training over prior schemes.\r\n\r\n1 INTRODUCTION\r\n\r\nDeep learning has become the de-facto technique for an increasing number of cognitive applications, including vision, speech, and language translation (Amodei et al., 2015; Ioffe & Szegedy, 2015; Jia et al., 2014). The success is driven by the availability of an enormous volume of data and advances in deep neural networks, which in turn make deep learning one of the most computationally demanding AI applications (Amodei et al., 2015; Chen et al., 2016; Krizhevsky et al., 2012). Hardware accelerators such as GPU/TPU and their accompanying software stacks have provided a significant amount of speedup (Jouppi et al., 2017; NVidia, 2017b). However, deep neural network training for speech and vision can still take days and even weeks. Therefore, parallelization by distributing the deep learning training to many (upwards of hundreds) GPUs over a cluster or on a cloud environment is critical to cut the training time from weeks to hours and minutes (Goyal et al., 2017; Iandola et al., 2015; Jia et al., 2018; You et al., 2017a;b).\r\n\r\nDistributed deep learning is challenging because as the number of learners (or GPUs) increases, the computation time decreases while the amount of communication stays constant (Goyal et al., 2017; Uber, 2017; You et al., 2017a), resulting in unfavorable computation to communication ratios, and thus diminished returns on more learners. One can increase the computational workload with a large mini-batch size in stochastic gradient descent (SGD) (i.e., weak scaling) and/or decrease the communication overhead. However, it is known that a large mini-batch beyond a certain point can degrade training quality (Balles et al., 2016; Keskar et al., 2016; Krizhevsky, 2014), not to mention that mini-batch size is limited by the GPU memory capacity in practice. Therefore, in addition to enabling deep learning with large mini-batch sizes (Goyal et al., 2017; Jia et al., 2018; You et al., 2017a;b), it is crucial to develop a fully optimized communication mechanism tuned for deep learning for massive scale-out that can a) maximize the bandwidth utilization in popular deep learning environments like GPU-based cluster/cloud, and b) minimize the communication latency, which grows linearly with the number of learners (Sridharan et al., 2018).\r\n\r\nIn this paper, we report the performance of an efficient communication library for deep learning, BlueConnect, that provides a highly efficient all-reduce algorithm for SGD, an integral part in modern deep learning frameworks (Abadi et al., 2016; Chen et al., 2015; Facebook, a;b; Goyal et al., 2017; Jia et al., 2014; Niitani et al., 2017; NVidia, 2017a; Seide & Agarwal, 2016). The key idea in BlueConnect is to decompose one all-reduce operation into a series of reduce-scatter and all-gather patterns in a topology-aware fashion, which enables large-scale deep learning with reduced communication overhead. Our technical contributions include:\r\n\r\nTable 1. 
Notations.\r\n\r\n  Symbol    Description\r\n  α         non-zero latency time per transfer at each network switch\r\n  w_i       bandwidth (unit/sec) of network switch type i\r\n  s_i.j     network switch instance j with bandwidth w_i\r\n  W         set of network bandwidths, {w_i | ∃i ∈ Z_≥0}\r\n  S         set of switch instances, {s_i.j | w_i ∈ W, ∃j ∈ Z_≥0}\r\n  P         set of learners\r\n  N         size of the gradient in units\r\n  c(r, w)   set of learners that perform reduce-scatter and all-gather over bandwidth w with learner r\r\n\r\n  • BlueConnect adapts to the hierarchy of communication bandwidths by leveraging topology-awareness, so that it fully utilizes the heterogeneous network architecture in popular deep learning platforms (IBM, 2017a; NVidia, a).\r\n\r\n  • Through topology-aware decomposition, BlueConnect also minimizes the communication latency overhead, the critical bottleneck in large-scale deep learning.\r\n\r\n  • For each decomposed piece, BlueConnect can mix-and-match various reduce-scatter and all-gather implementations/algorithms over different network fabrics to maximize network utilization.\r\n\r\nThe rest of the paper is organized as follows. We present preliminaries in Section 2. Section 3 discusses our proposed algorithm, BlueConnect. Experimental results are in Section 4, followed by the conclusion in Section 5.\r\n\r\n2 PRELIMINARIES\r\n\r\n2.1 Prior Art\r\n\r\nTo enable large scale distributed deep learning with hundreds of GPUs under popular data-parallelism (Amodei et al., 2015; Goyal et al., 2017; You et al., 2017a), the batch size must be in the thousands, since GPU utilization and the compute to communication ratio are low for a single digit batch size per GPU for typical neural networks. Since a large batch size in deep learning may cause poor convergence, there have been recent efforts to mitigate such convergence and generalization issues (Goyal et al., 2017; Keskar et al., 2016; You et al., 2017a). (Goyal et al., 2017) proposed a linear learning rate scaling rule and performed learning rate warm-up from a small/safe value to the larger target value in the early training phase, and then resorted to the usual step-wise descent. For gradient synchronization, (Goyal et al., 2017) leveraged a deep learning communication library (Facebook, b) to demonstrate that Resnet-50 (He et al., 2015) can be trained in one hour over 32 DGX-1 systems (256 GPUs). Recently, (Jia et al., 2018) demonstrated that the same Resnet-50 can be trained in four minutes over 1024 GPUs with low precision (i.e., FP16).\r\n\r\nWhile this is an impressive result and there exist numerous communication algorithms for distributed computing platforms (Almási et al., 2005; Baidu, 2017; Jia et al., 2018; Thakur et al., 2005), they are not necessarily customized and optimized for large scale distributed deep learning (Amodei et al., 2015): a) most existing techniques were developed for a homogeneous environment, while deep learning will be increasingly deployed on a heterogeneous environment like cloud, b) the traffic generated by deep learning is highly bursty and extremely large (i.e., 100MB - 1GB), while existing techniques have been optimized for relatively small and frequent exchanges, and c) most existing algorithms are not tuned for new network fabrics (e.g., NVLink (NVLink, 2017)). As future GPUs/accelerators double their performance each generation, the gradient synchronization in SGD will become a considerable bottleneck in large-scale deep learning (Keuper, 2016). Hence, there is great demand for an efficient communication technique for deep learning that addresses the three issues mentioned above.\r\n\r\nOther approaches to reduce communication overhead in deep learning are largely based on approximation of fully synchronous SGD (Wang & Joshi, 2018), including asynchronous SGD (ASGD), where each learner can subscribe to the updated weights from a parameter server asynchronously (i.e., removing the synchronization barrier and suppressing bursty traffic) (Niu et al., 2011; Zhang et al., 2015a;b), and decentralized SGD, where each learner communicates only with a subset of all learners (Lian et al., 2017). It is shown that such approximated or stochastic methods can improve the scalability of distributed deep learning but at a cost of potentially reduced accuracy and unstable convergence (Chen et al., 2016).\r\n\r\n2.2 Notations and Basic Performance Models\r\n\r\nNotations used in this paper are listed in Table 1. We ignore arithmetic operation time, as it is trivially cheap in deep learning on GPUs. Then, for a given data size n, a learner count p, and a bandwidth w, the performance of a ring-based communication pattern can be expressed as follows (Thakur et al., 2005):\r\n\r\n    T_r(p, n, α, w) = (p − 1)α + ((p − 1)/p)(n/w)    (1)\r\n\r\nWhen the latency between any two nodes is uniform and p is a power-of-two number, one can use recursive halving/doubling to obtain the same result with smaller latency, which can be expressed as follows (Thakur et al., 2005):\r\n\r\n    T_c(p, n, α, w) = lg(p)α + ((p − 1)/p)(n/w)    (2)\r\n\r\nFigure 2. Two-level all-reduce (Jia et al., 2018) where only 
             0                 .             0\r\n                   0      A1      S      B1         0     C1      S      D1\r\n                   S w0             w0              Sw0             w0             master learners are active in the 2nd step.\r\n                          A2             B2               C2             D2\r\n                     w0             w0               w0             w0\r\n                                                                                   Thakur et al., 2005) can be expressed with the following\r\n               Figure 1. 4 nodes with 12 learners on heterogeneous network ar-     performance model:\r\n               chitecture connected hierarchically.\r\n                                                                                                                              N\r\n                                                                                       T        =2(|P|\u22121){\u03b1+                 |P|         }    (5)\r\n                                                                                         one lvl                     min           {w }\r\n               Bycombining Eq. (1) and (2), we de\ufb01ne the following to                                                     0\u2264i<|W|     i\r\n               get the best of both:                                                            =2Tr(|P|,N,\u03b1,minW)                            (6)\r\n                                      \u001a                          q\r\n               Tr/c(p,n,\u03b1,w) =           Tc(p,n,\u03b1,w)       p = 2 ,q \u2208 Z (3)        where there are 2(|P| \u2212 1) iterations in Eq. 
(5), and each\r\n                                         Tr(p,n,\u03b1,w) otherwise                     iteration needs to transfer N data over w ,w ,...,w\r\n                                                                                                               |P|             0    1      |W|\u22121\r\n               Based on the ring and recursive communication pat-                  in the worst case (e.g., marked with dotted arrows from\r\n                                                                                   A to D in Fig. 1).          Although one-level ring-based\r\n               terns, we can compute the communication performance of                0       2\r\n               broadcastorreduceasfollows(Thakuretal.,2005):                       all-reduce has been widely used for traditional high-\r\n                                                                                   performance computing, it is not quite suitable for large-\r\n                  Tbcast(p,n,\u03b1,w) = Treduce(p,n,\u03b1,w)                               scale deep learning for two reasons:\r\n                                     =Tc(p,n,\u03b1,w)+Tr(p,n,\u03b1,w) (4)\r\n               Since our focus is on heterogeneous network architec-                  \u2022 A node with multiple GPUs (up to 16 GPUs per\r\n                                                                                        node (Amazon)) may have multiple learners inside\r\n               ture (Dichev & Lastovetsky, 2014), we extend the homoge-                 and increases |P| fast, which would rapidly increase\r\n               neousmodel(Thakuretal.,2005)byusingdifferentwi. For                      the latency of deep learning communication (i.e., a\r\n               example, we assume a typical hierarchically built cluster                large multiplier to \u03b1).\r\n               over tree-like heterogeneous network architecture (NVidia,\r\n               b) as in Fig. 
1 where 12 learners (P = {A ,B ,C ,D |\u2200i \u2208\r\n                                                           i   i   i    i\r\n               {0,1,2}}) are connected through heterogeneous network                  \u2022 Since deep learning typically runs on a heterogeneous\r\n               switches in S = {s             , s      , s  }. Regarding the            network topology (e.g., Fig. 1), the performance of\r\n                                    0.{0,1,2,3} 1.{0,1}   2.0\r\n               example in Fig 1, s0.\u2217 can represent an intra-node network               one-level approach is gated by the slowest bandwidth\r\n               like NVLink around 32GB/s per lane, while s1.\u2217 and s2.0                  along the path (i.e., minW), not fully utilizing other\r\n               mayrepresent inter-node switches for 100Gbps In\ufb01niBand.                  fast networks fabrics.\r\n               In such cases, w1 and w2 would be 100Gbps and 200Gbps\r\n               respectively to ideally match the total uplink bandwidth            Toaddress this problem, a two-level approach is used in the\r\n               from all the hanging nodes (e.g., fat-tree (Al-Fares et al.,        state-of-the-art deep learning softwares (Facebook, b; Jia\r\n               2008; NVidia, b)).                                                  et al., 2018; NVidia, 2017a) shown in Fig. 2. In the \ufb01rst step,\r\n               2.3   All-Reduce for Distributed SGD                                the gradients are reduced to the master learner on each node.\r\n                                                                                   Then, a small-scale one-level ring-based all-reduce is\r\n               Thekeycommunicationpattern used in SGD synchroniza-                 appliedamongthemasterlearnersonly. 
Finally,thegradient\r\n               tion in deep learning is all-reduce (Amodeietal., 2015;             in the master learners is locally broadcast back to the other\r\n               Baidu, 2017) which is popularly implemented with ring-              learners within the same node, synchronizing all the learners\r\n               based reduce scatterorall gather(Thakuretal.,                       in the training task. When |P| is decomposed into two\r\n               2005). Based on Eq.(1,4), the synchronization costs of              learner counts such as p (the number of learners within\r\n                                                                                                              0\r\n               prior arts in deep learning can be computed. For exam-              each node) and p (the number of master learners) like\r\n                                                                                                      1\r\n               ple of one-level ring-based all-reduce in (Baidu, 2017;             |P| = p p , the performance of such a two-level scheme\r\n                                                                                            0 1\r\n                                                           BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy\r\n                             can be formally expressed as follows (Jia et al., 2018):                                                                                                                              All-Reduce\r\n                                    T                 =T                  (p ,N,\u03b1,w )*reducetomaster*\r\n                                       two lvl                reduce           0                   0                                                                                              Reduce-Scatter                  All-Gather\r\n                                                            +T         
       (p ,N,\u03b1,w ) *bcastfrommaster*\r\n                                                                    bcast         0                    0\r\n                                                            +2T (p ,N,\u03b1, min {w })\r\n                                                                       r/c       1                0\u2264i<|W|                i                                           Reduce-Scatter               Reduce-Scatter                     All-Gather                   All-Gather\r\n                                                                                                                                                                     Reduce-Scatter               Reduce-Scatter                     All-Gather                   All-Gather\r\n                                                      =2T (p ,N,\u03b1,w )+2T (p ,N,\u03b1,w )\r\n                                                                c      0                   0               r      0                    0                             Reduce-Scatter               Reduce-Scatter                     All-Gather                   All-Gather\r\n                                                                                P                                                                                    Reduce-Scatter               Reduce-Scatter                     All-Gather                   All-Gather\r\n                                                            +2T ( ,N,\u03b1,minW)                                                                   (7)                          .                            .                                .                            .\r\n                                                                       r/c p                                                                                                .         P/p0               .         P/pi                   .         P/pi               .         
P/p0\r\n                                                                                  0\r\n                             Although we can trivially show that Eq. (7) has smaller                                                                          Figure 3. All-reduce can be decomposed into multiple stages of\r\n                             latency overhead than Eq. (6), it would still suffer from the                                                                     parallelizable reduce-scatter and all-gather operations.\r\n                             following three limitations:\r\n                                  \u2022 Latency overhead can be large when p \u226a |P|.                                                                                granularity nor \ufb02exibility suf\ufb01cient enough to utilize the\r\n                                                                                                                      0\r\n                                  \u2022 Performance is still gated by minW.                                                                                        underlying hardware and the highly optimized implemen-\r\n                                                                                                                                                               tations (i.e., ones offered by the hardware vendors) ef\ufb01-\r\n                                  \u2022 Manylearners stay idle during the 2nd step, leading to                                                                     ciently. We, however, found that the reduce-scatter\r\n                                       bandwidth under-utilization.                                                                                            
and all-gather can be further decomposed into mul-\r\n                                                                                                                                                               tiple stages of parallelizable reduce-scatter and\r\n                                  \u2022 reduce/broadcastatthe\ufb01rststepandthelaststep                                                                                all-gather operations in some symmetric cases. In\r\n                                       is expensive.                                                                                                           detail, Fig. 3 shows that all-reduce can be \ufb01rst broken\r\n                                                                                                                                                               into one reduce-scatter followed by all-gather\r\n                             OurproposedBlueConnectinSection3addressestheselimi-                                                                               (the arrows indicate dependency). However, additional de-\r\n                             tations with a novel topology-awareschemeasinSection3.2                                                                           composition is possible if the following integer factorization\r\n                             based on the all-reduce decomposition in Section 3.1.                                                                             
3 BLUECONNECT

In this section, we introduce BlueConnect, a communication library for deep learning, with detailed examples. The main goal of BlueConnect is to greatly reduce the communication/synchronization overhead of massive scale-out deep learning through topology-aware all-reduce. In contrast to the prior art in Section 2, BlueConnect relies on a series of multiple, concurrent reduce-scatter and all-gather operations and generates traffic patterns optimized for heterogeneous network topologies, leveraging the full network capacity. We assume a tree topology for the network architecture for illustration purposes, but BlueConnect is flexible enough to be mapped to other architectures, including mesh/torus networks (Almási et al., 2005) (see Section 3.3). Section 3.1 focuses on all-reduce decomposition, and Section 3.2 formally presents BlueConnect, with the performance model given in Section 3.3.

3.1 All-Reduce Decomposition

BlueConnect decomposes all-reduce to fit a heterogeneous network hierarchy and to increase hardware utilization. One well-known way of decomposing all-reduce is to use reduce-scatter followed by all-gather, both of which are commonly implemented with the ring scheme. Such crude decomposition has neither

exists:

$$|P| = p_0 p_1 p_2 \cdots p_k = \prod_{i<k} p_i \qquad (p_i \in \mathbb{N},\ p_i > 1) \tag{8}$$

Then, the reduce-scatter can be further decomposed into $k - 1$ stages of bundled reduce-scatter operations, where the $i$-th stage has $|P|/p_i$ concurrently launchable reduce-scatter operations over different subsets of learners. The all-gather can be decomposed in the same way, but its stages have a backward dependency. If all-reduce is performed based on the proposed decomposition, every learner participates in one of the reduce-scatter or all-gather operations at any moment or stage (unlike the two-step approach). The strengths of the proposed decomposition are two-fold:

• Decomposition offers enough granularity and flexibility to map operations to the underlying network elements and implementations.
• Higher parallelism at each stage can increase bandwidth utilization (Sivakumar et al., 2000; Yildirim et al., 2016).

BlueConnect is essentially based on the proposed all-reduce decomposition in order to better exploit the network topology and the corresponding software stacks.
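To make the staged decomposition concrete, the following Python sketch enumerates the groups of learners at each stage for a 12-learner factorization 3 × 2 × 2, then simulates the reduce-scatter stages followed by the all-gather stages in reverse order. The mixed-radix rank layout (factor 0 varying fastest), the function names `stage_groups` and `all_reduce`, and the in-memory lists standing in for learners are assumptions made for illustration; this is a sketch of the idea, not the BlueConnect implementation.

```python
from math import prod


def stage_groups(factors):
    """Return, per stage i, the groups of ranks that differ only in digit i
    of a mixed-radix rank layout (digit 0 varies fastest; assumed layout)."""
    total = prod(factors)
    stages, stride = [], 1
    for p in factors:
        groups = {}
        for r in range(total):
            digit = (r // stride) % p
            key = r - digit * stride  # rank with digit i cleared names the group
            groups.setdefault(key, []).append(r)
        stages.append(list(groups.values()))
        stride *= p
    return stages


def all_reduce(data, factors):
    """Simulate staged reduce-scatter, then all-gather in reverse order."""
    total = prod(factors)
    n_elems = len(data[0])
    assert len(data) == total and n_elems % total == 0
    buf = [list(v) for v in data]
    offset = [0] * total          # start of the segment each learner works on
    length = n_elems              # current segment length
    stages = stage_groups(factors)

    # Reduce-scatter stages: each group sums its segment and splits it p ways.
    for p, groups in zip(factors, stages):
        piece = length // p
        for group in groups:
            g = offset[group[0]]  # all members share the same segment start
            sums = [sum(buf[r][g + k] for r in group) for k in range(length)]
            for local, r in enumerate(group):
                lo = local * piece
                buf[r][g + lo:g + lo + piece] = sums[lo:lo + piece]
                offset[r] = g + lo
        length = piece

    # All-gather stages in reverse: each group reassembles its segment.
    for p, groups in zip(reversed(factors), reversed(stages)):
        for group in groups:
            g = offset[group[0]]  # member with local rank 0 holds the first piece
            for local, src in enumerate(group):
                lo = g + local * length
                for dst in group:
                    buf[dst][lo:lo + length] = buf[src][lo:lo + length]
            for r in group:
                offset[r] = g
        length *= p
    return buf


stages = stage_groups((3, 2, 2))
print([len(s) for s in stages])  # [4, 6, 6]
```

With this layout, each stage has $|P|/p_i$ groups (4, 6, and 6 here), and after the reverse all-gather pass every simulated learner holds the element-wise sum of all 12 input vectors.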
6\r\n                                       Reduce-Scatter          Reduce-Scatter          Reduce-Scatter            All-Gather             All-Gather              All-Gather\r\n                                        A0 -A1 -A2                A0 -B0                  A0 -C0                  A0 -C0                  A0 -B0               A0 -A1 -A2   \r\n                                        B0 -B1 -B2                A1 \u2013B1                  A1 \u2013C1                  A1 \u2013C1                 A1 \u2013B1                B0 -B1 -B2   \r\n                                        C0 -C1 -C2                A2 \u2013B2                  A2 \u2013C2                  A2 \u2013C2                 A2 \u2013B2                C0 -C1 -C2   \r\n                                        D0 -D1 -D2                C0 -D0                  B0 -D0                  B0 -D0                 C0 -D0                D0 -D1 -D2   \r\n                                                                  C1 \u2013D1                  B1 \u2013D1                  B1 \u2013D1                 C1 \u2013D1\r\n                                                                  C2 \u2013D2                  B2 \u2013D2                  B2 \u2013D2                 C2 \u2013D2\r\n                                  (a) all-reduceisdecomposedintothemultiple stages of reduce-scatter /all-gatheroperations.\r\n                                        w0                                        w0\r\n                          G00.00       G01.00      G02.00            G06.00      G07.00      G08.00                 G[00-02].00    *           *              G[06-08].00    *           *\r\n                          G00.01       G01.01      G02.01            G06.01      G07.01      G08.01                 G[00-02].01    *           *              G[06-08].01    *           *\r\n                          G00.02       G01.02      G02.02            G06.02      G07.02      G08.02                 G[00-02].02    *           *    
          G[06-08].02    *           *\r\n                          G00.03       G01.03      G02.03            G06.03      G07.03      G08.03                 G[00-02].03    *           *              G[06-08].03    *           *\r\n                          G00.04       G01.04      G02.04            G06.04      G07.04      G08.04                    *        G[00-02].04    *                 *        G[06-08].04    *\r\n                          G00.05       G01.05      G02.05            G06.05      G07.05      G08.05                    *        G[00-02].05    *                 *        G[06-08].05    *\r\n                          G00.06       G01.06      G02.06            G06.06      G07.06      G08.06                    *        G[00-02].06    *                 *        G[06-08].06    *\r\n                          G00.07       G01.07      G02.07            G06.07      G07.07      G08.07                    *        G[00-02].07    *                 *        G[06-08].07    *\r\n                          G00.08       G01.08      G02.08            G06.08      G07.08      G08.08                    *           *        G[00-02].08          *           *        G[06-08].08\r\n                          G00.09       G01.09      G02.09            G06.09      G07.09      G08.09                    *           *        G[00-02].09          *           *        G[06-08].09\r\n                          G00.10       G01.10      G02.10            G06.10      G07.10      G08.10                    *           *        G[00-02].10          *           *        G[06-08].10\r\n                          G00.11       G01.11      G02.11            G06.11      G07.11      G08.11                    *           *        G[00-02].11          *           *        G[06-08].11\r\n                           A0          A1           A2                C0          C1          C2\r\n                                        w0                                        w0                                w1  
                                      w1\r\n                          G03.00       G04.00      G05.00            G09.00      G10.00      G11.00                 G[03-05].00    *           *              G[09-11].00    *           *\r\n                          G03.01       G04.01      G05.01            G09.01      G10.01      G11.01                 G[03-05].01    *           *              G[09-11].01    *           *\r\n                          G03.02       G04.02      G05.02            G09.02      G10.02      G11.02                 G[03-05].02    *           *              G[09-11].02    *           *\r\n                          G03.03       G04.03      G05.03            G09.03      G10.03      G11.03                 G[03-05].03    *           *              G[09-11].03    *           *\r\n                          G03.04       G04.04      G05.04            G09.04      G10.04      G11.04                    *        G[03-05].04    *                 *        G[09-11].04    *\r\n                          G03.05       G04.05      G05.05            G09.05      G10.05      G11.05                    *        G[03-05].05    *                 *        G[09-11].05    *\r\n                          G03.06       G04.06      G05.06            G09.06      G10.06      G11.06                    *        G[03-05].06    *                 *        G[09-11].06    *\r\n                          G03.07       G04.07      G05.07            G09.07      G10.07      G11.07                    *        G[03-05].07    *                 *        G[09-11].07    *\r\n                          G03.08       G04.08      G05.08            G09.08      G10.08      G11.08                    *           *        G[03-05].08          *           *        G[09-11].08\r\n                          G03.09       G04.09      G05.09            G09.09      G10.09      G11.09                    *           *        G[03-05].09          *           *        G[09-11].09\r\n                          G03.10       
G04.10      G05.10            G09.10      G10.10      G11.10                    *           *        G[03-05].10          *           *        G[09-11].10\r\n                          G03.11       G04.11      G05.11            G09.11      G10.11      G11.11                    *           *        G[03-05].11          *           *        G[09-11].11\r\n                           B0          B1           B2                D0          D1          D2\r\n                      (b) step 1: 4 parallel reduce-scatter operations with w                                    (c) step 2: 6 parallel reduce-scatter operations with w\r\n                                                                                                   0                                                                                         1\r\n                         G[00-05].00    *           *              G[06-11].00    *            *                    G[00-11].00    *           *                 *           *           *\r\n                         G[00-05].01    *           *              G[06-11].01    *            *                       *           *           *              G[00-11].01    *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *        G[00-05].04    *                 *        G[06-11].04     *                       *        G[00-11].04    *                 *           *           *\r\n                            *        G[00-05].05    *                 *        G[06-11].05     *                       *           *           *                 *        G[00-11].05    *\r\n                            *          
 *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *        G[00-05].08          *           *         G[06-11].08                *           *        G[00-11].08          *           *           *\r\n                            *           *        G[00-05].09          *           *         G[06-11].09                *           *           *                 *           *        G[00-11].09\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                                                                                                                      A0          A1           A2                C0         C1           C2\r\n                                                            w2\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                         G[00-05].02    *           *              G[06-11].02    *            *                    G[00-11].02    *           *                 *           *           *\r\n                         G[00-05].03    *           *            
  G[06-11].03    *            *                       *           *           *              G[00-11].03    *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *        G[00-05].06    *                 *        G[06-11].06     *                       *        G[00-11].06    *                 *           *           *\r\n                            *        G[00-05].07    *                 *        G[06-11].07     *                       *           *           *                 *        G[00-11].07    *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *           *                 *           *            *                       *           *           *                 *           *           *\r\n                            *           *         G[00-05].10         *           *         G[06-11].10                *           *        G[00-11].10          *           *           *\r\n                            *           *         G[00-05].11         *           *         G[06-11].11                *           *           *                 *           *        G[00-11].11\r\n                                                                                                                      B0          B1           B2                D0         D1           D2\r\n                      (d) step 3: 6 parallel reduce-scatter operations with w2                                                 (e) the \ufb01nal 
reduce-scatter result\r\n                    Figure 4. BlueConnect reduce-scatter example for 12 GPUs with |P| = p p p where p = 3,p = 2, and p = 2. The reverse\r\n                                                                                                                          0 1 2              0         1                2\r\n                    steps with all-gather shall be taken to complete all-reduce.\r\n                    3.2     Algorithm                                                                        step which shall be executed online following the decompo-\r\n                    In this section, we describe BlueConnect algorithm. The key                              sition as in Section 3.2.2. The pseudo code of BlueConnect\r\n                    idea in BlueConnect is to decompose the synchronization                                  onalearner is presented and explained in Section 3.2.3.\r\n                    or all-reduceofgradientsacross all learners into mul-                                    3.2.1      Decomposition\r\n                    tiple/concurrent reduce-scatter and all-gather\r\n                    operations based on Section 3.1, then map them to the                                    While all-reduce can be decomposed into various\r\n                    under-lying network fabrics. Therefore, BlueConnect has a                                ways, BlueConnect does so to optimize against the network\r\n                    decomposition step which can be done of\ufb02ine based on the                                 topology. 
First, BlueConnect decomposes all-reduce\r\n                    network topology as in Section 3.2.1 and all-reduce                                      into the same number of reduce-scatter and\r\n                              BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy\r\n               all-gatherstagesasthenumberofnetworkhierarchy                           as possible, in order to reduce the hop count and maxi-\r\n               levels (i.e., k in Eq. (8)). Then, the amount of parallelism            mize the bandwidth utilization on each switch.\r\n               in each stage is determined by the number of elements in              \u2022 BlueConnect reduces the latency overhead through de-\r\n               each network hierarchy level. Fig. 4 (a) shows an example               composition. Such decomposition in BlueConnect also\r\n               whereBlueConnectdecomposes|P| = p p p = 3\u00d72\u00d72\r\n                                                          0 1 2                        enables to use recursive halving/double approaches\r\n               mapping to w0,w1, and w2 respectively for Fig. 1, because               for non-power-of-two |P|. For example, if |P| = 96,\r\n               there are 3 GPUs within a node, forming a binary tree. This             the two-level approaches cannot use recursive halv-\r\n               way, BlueConnect can avoid the bandwidth bottleneck in                  ing/double (without expensive preprocessing), but\r\n               the ring-based scheme (see Eq. (5)). 
Note that arrays to                BlueConnectcandecomposeinto|P| = 16\u00d76anduse\r\n               all-reduce are partitioned and notated as G[a \u2212 b].c                    recursive methods for the \ufb01rst reduce-scatter\r\n               which represents a partially reduced result over the learners           and the last all-gather stages to further reduce\r\n               from a to b (inclusive) at the partition index c in Fig. 1.             latencies.\r\n               For example, G[00\u221205].05 represents the reduced results\r\n               across the learners {0,1,2,3,4,5} with respect to the parti-          \u2022 BlueConnect runs multiple ring communication pat-\r\n               tion index 5.                                                           terns over a single link, maximizing bandwidth uti-\r\n               3.2.2   All-Reduce                                                      lization (Sivakumar et al., 2000; Yildirim et al., 2016).\r\n                                                                                       A         \u2192B          runoverw1concurrentlyinFig.4\r\n                                                                                         {0,1,2}      {0,1,2}\r\n               Oncedecomposition is completed, BlueConnect executes                    (a). Such multiple parallel rings easily sustain full link\r\n               reduce-scatterandall-gatheroperationsonvar-                             utilization, leaving no idle time. We found BlueCon-\r\n               ious partitions of the input data, in a MPI-compliant man-              nect hit the near-theoretical bandwidth limit in most\r\n               ner, which will be explained in Section 3.2.3. Considering              cases, while a single ring does not.\r\n               all-reduce for Fig. 
1, BlueConnect performs the fol-                  \u2022 Themultiple ring patterns in BlueConnect obviously\r\n               lowing steps:                                                           require learners to share switches. In Fig. 4 (a), six dis-\r\n               Fig. 4 (b): Four reduce-scatter operations are per-                     joint sets of ring communication patterns, A{0,1,2} \u2192\r\n                                                                                       C         and B         \u2192D           share s   , leaving\r\n                    formed concurrently with w and within a node. Note                   {0,1,2}       {0,1,2}       {0,1,2}        2.0\r\n                                                  0                                    w\r\n                                                                                         2 to each stream. Such reduced bandwidth per stream\r\n                    that data size for each instance is N.                              6\r\n                                                                                       is compensated by the reduced amount of data to trans-\r\n                                                                                       fer (i.e., N ). Since BlueConnect exercises all learners\r\n               Fig. 4 (c): Six short reduce-scatter operations are                               6\r\n                    performed concurrently with w . A            \u2192B                    at any moment and each learner sends data to a single\r\n                                                     1    {0,1,2}     {0,1,2}          leaner, we can easily compute the bandwidth fraction\r\n                    run over s    , while C        \u2192D           run over s   ,\r\n                               1.0         {0,1,2}      {0,1,2}           1.1          for each ring by dividing the bandwidth by the number\r\n                    all concurrently. 
Note that data size for each instance            of learners under the corresponding network hierarchy\r\n                    is N.\r\n                        3                                                                    w        w\r\n                                                                                       (e.g.,  1 and   2 ).\r\n               Fig. 4 (d): Six independent reduce-scatter opera-                              3       6\r\n                    tions, A         \u2192C           and B          \u2192D\r\n                             {0,1,2}       {0,1,2}       {0,1,2}      {0,1,2}     3.2.3   Pseudo Code\r\n                    are performed concurrently over s2.0 with w2, yet the\r\n                    data size for each instance is only N.                        In this section, we describe BlueConnect in Algorithm 1\r\n                                                         6\r\n               Fig. 4 (e): All the reduce-scatter stages are com-                 and its implementation details in the MPI context. Algo-\r\n                    pleted, and the reduced gradients are evenly distributed.     rithm 1 assumes that topology-aware decomposition can be\r\n                    all-gatherwill begin in the exactly same but re-              described by a utility like the rank \ufb01le (OpenMPI) so that\r\n                    verse order to complete all-reduce.                           |P| has been decomposed according to W. 
Then for a given\r\n                                                                                  gradient G[N] and a global rank r (MPI-Tutorial), Blue-\r\n                                                                                  Connect performs the one-time preparation step in lines 2-5.\r\n               AsinFig.4,BlueConnectfully leverages the heterogeneous             For a network switch type i, line 3 obtains a set of learn-\r\n               network bandwidths with inexpensive multiple/concurrent            ers which will work with a local learner r over different\r\n               reduce-scatterandall-gatheroperations. Blue-                       bandwidths. For instance, c(A ,w ) = {A ,A ,A } yet\r\n               Connect distributes data over all available nodes, which                                            1   0         0   1   2\r\n                                                                                  c(A ,w ) = {A ,B }inFig.4(a). Then, line 4 computes\r\n               provides the following key differences from the two-level              1   1        1    1\r\n                                                                                  the local rank of the current learner r among c[i]. As de-\r\n               scheme:                                                            scribed in Fig. 4, BlueConnect performs a series of concur-\r\n                                                                                  rent reduce-scatter followed by all-gather col-\r\n                  \u2022 BlueConnect decomposes P according to the network             lectives which is in lines 7-16. Both reduce-scatter\r\n                    topology and hierarchy. 
The goal of such decomposi-           and all-gatheroperateonG[g : g + N)onacommu-\r\n                                                                                                                              n\r\n                    tion is to keep traf\ufb01c within each switch level as much       nicator c[i]. While moving up the network topology, the\r\n                             BlueConnect: Decomposing All-Reduce for Deep Learning on Heterogeneous Network Hierarchy\r\n                                                                N                w                    N                w     w\r\n                       T   =2T (p ,N,\u03b1,w )+2T (p ,                 , \u03b1,min{w ,     1})+2T (p ,           , \u03b1,min{w ,     1,    2 })\r\n                        blc      r/c  0         0       r/c   1 p             0 p           r/c   2 p p              0 p    p p\r\n                                                                  0               0                   0 1                0   0 1\r\n                               +...+2T      (p       ,     N      , \u03b1,  min {      wj    })\r\n                                         r/c   |W|\u22121 Q                          Q\r\n                                                         |W|\u22122p       0\u2264j<|W|     j\u22121 p\r\n                                                         j=0    j                 k=0 k\r\n                                |W|\u22121\r\n                           =2 X Tr/c(pi,Q N              , \u03b1, min {Q wj      })                                                         (9)\r\n                                                  i\u22121 p     0\u2264j<i     j\u22121 p\r\n                                 i=0              j=0 j               k=0 k\r\n               size of the reduce-scatter problem decreases with a             can obtain the following BlueConnect performance model\r\n               growing n, and the gradient offset g is adjusted accord-        
ontorus.\r\n               ingly. Then, in the reverse order as in line 12, the size of\r\n                                                                                           |W|\u22121\r\n               the all-gatherproblemgrowswithadecreasingn,and                               X                N                     w\r\n                                                                                  T    =2        T    (p , Q       , \u03b1, min {w0,     j })\r\n               the gradient offset g is adjusted accordingly as well. Since         blc            r/c  i    i\u22121       1\u2264j<i       p\r\n                                                                                                                 p                   0\r\n               Algorithm 1 is for one learner and all other learners perform                i=0              j=0 j\r\n               the same procedure with a different global rank r, BlueCon-                                                             (10)\r\n               nect keeps all learners busy and leaves no idle hosts unlike    Note that Eq. (6, 7) are still valid on torus, as both run a\r\n               the two-level scheme.                                           single communication stream which will be bottlenecked by\r\n                                                                               the most narrow bandwidth.\r\n               3.3  PerformanceModel\r\n                                                               Q\r\n               Assume topology-aware decomposition P =            |W|\u22121p       3.4   Limitations\r\n                                                                  j=0     j\r\n               for a fat-tree like topology as in Fig. 1. The performance      BlueConnect highly relies on all-reduce decomposi-\r\n               modelofBlueConnectcanbestatedasinEq.(9). 
Wecan                  tion, thus if there is no feasible case for Eq (8), BlueCon-\r\n               easily prove that BlueConnect offers smaller latency than       nect gets degenerated into a simple one-level ring scheme.\r\n               the two-level scheme in Eq. (7).                                However, considering the reality that all the hosts have the\r\n               WealsoshowaBlueConnectperformancemodelontorus                   samenumberofGPUsoverasymmetricnetworktopology\r\n               topology which is another popular network topology and          in most cases, BlueConnect can deliver high-performance\r\n               could be cheaper than fat-tree scheme (Solnushkin, 2013).       all-reduceinpractice.\r\n               ThekeyadvantageofusingBlueConnect on torus is that it\r\n               would reduce the bandwidth sharing on inter-node commu-         4    EXPERIMENTAL RESULTS\r\n               nication, as torus has dedicated connections between hosts.     We implemented BlueConnect (BLC) for GPU in C++\r\n               Assuming p is the number of learners within a node, we\r\n                           0                                                   based on CUDA-aware MPI(IBM,2017b)andNCCLver.\r\n                                                                               2(NVidia, 2017a) (without using all-reduce APIs ) to\r\n                                                         Q|W|\u22121                exchange gradients ef\ufb01ciently. BLC picks the best perform-\r\n               Algorithm 1 BlueConnect(G[N],P =                   p )\r\n                                                           j=0     j           ing reduce-scatter and all-gather implementa-\r\n                1: r = global-rank()                                           tion directly from MPI and NCCL, or from custom imple-\r\n                2: for i \u2208 0 : |W| \u2212 1 do                                      mentations. 
We performed two sets of experiments to study\r\n                3:    c [i] = c(r,wi)                                          the ef\ufb01ciency of BLC.\r\n                4:    l [i] = local-rank(r,c[i])\r\n                5: end for                                                     4.1   BLCcomparedwithNCCL2\r\n                6: n = 1,g = 0                                                 In this section, we report the pure all-reduce per-\r\n                7: for i \u2208 0 : |W| \u2212 1 do      N                               formance comparison between BLC and NCCL (i.e., nc-\r\n                8:    reduce-scatter(g, n,c[i])                                clAllReduce) on two different setups. In one setup, we used\r\n                9:    n=n\u00d7pi                                                   two Intel Xeon(R) CPU E5-2680 systems with 4 Nvidia\r\n               10:    g = g + Nl[i]\r\n               11: end for     n                                               Telsa P100-PCIE-16GB GPUs each, connected through\r\n               12: for i \u2208 |W| \u2212 1 : 0 do                                      10Gbps Ethernet. Within the Intel systems, the GPUs\r\n               13:    n=n\u00f7p                                                    are connected through PCIe gen3. In the other setup, we\r\n                                 i                                             used two IBM S822LC systems with 4 NVidia Tesla P100-\r\n               14:    g = g \u2212 Nl[i]\r\n                               n                                               SXM2GPUseach,connectedthrough100GbpsIn\ufb01niBand.\r\n               15:    all-gather(g, N,c[i])\r\n               16: end for               n                                     Within the IBM systems, the GPUs are connected through\r\n                                                                               NVLink(IBM,2017a; NVLink, 2017). Fig. 
5 shows that BLC outperforms NCCL by exploiting the network hierarchy within systems as well as between systems on both setups, over a wide range of FP32 floating-point number counts. Thanks to the faster network in the IBM platform, both BLC and NCCL perform about 10x faster than on the Intel platform, but BLC is about 1.6x faster than NCCL in both cases.

[Figure 5. All-reduce performance on two systems: runtime (msec) versus #FP32s for NCCL2 and BLC. (a) Intel systems with PCIe gen3 and 10Gbps Ethernet. (b) IBM systems with NVLink and 100Gbps InfiniBand.]

4.2   BLC integrated in Caffe2

Our BLC implementation is packaged as a new communication operator for Caffe2 (Facebook, a), following the existing communication operator implementation. To evaluate the performance of BLC, we used a cluster of 48 IBM S822LC systems on Red Hat Enterprise Linux with cuDNN, each equipped with 4 NVidia Tesla P100-SXM2 GPUs connected through NVLink (IBM, 2017a; NVLink, 2017). The systems were organized into 3 racks with 16 nodes each, connected via a single-port 100Gbps InfiniBand network. Each rack was equipped with a rack switch that was connected to a director switch. The ImageNet-1K benchmark was preloaded onto a RAM disk on each system to prevent performance degradation due to disk I/O. We compared BLC with the following deep learning communication techniques/libraries under the identical environment:

MPI Allreduce  The all-reduce function in MPI (Thakur et al., 2005), which is optimized for generic communication of various sizes/topologies.

Ring  Single-level ring-based all-reduce algorithm as in (Baidu, 2017), designed for deep learning.

GLOO  Two-level all-reduce algorithm in (Facebook, b; Jia et al., 2018), designed for deep learning based on NCCL (NVidia, 2017a) and ib verb.

We used Resnet-50 (Goyal et al., 2017; He et al., 2015) and ImageNet-1K to measure scaling efficiency and communication overheads for 4 GPUs, 8 GPUs, up to 192 GPUs, while maintaining a fixed batch size of 32 per GPU (e.g., the effective batch size is 6144 at 192 GPUs). We found that Resnet-50 has about 100MB of gradients in FP32. Since (Goyal et al., 2017; You et al., 2017a) have demonstrated successful convergence to best accuracy for the batch size of 8192, the scaling efficiency number is meaningful. We do not focus on the convergence/accuracy in this paper, as all three techniques compute all-reduce results synchronously and accurately. Nevertheless, we confirmed through several tests that our BLC integration into Caffe2 does not alter the convergence behavior.

We present our results in Fig. 6 without MPI Allreduce results (due to its poor performance beyond 32 GPUs). To accurately measure the communication overhead (actual all-reduce time, interface overhead to Caffe2, jitter from network/OS/GPU-scheduling, required memory copy, and so on), we first measure the single-GPU performance, which is 163.0 msec per iteration or 196.3 images/sec. Our experimental results in Fig. 6 are summarized as follows:

   • (a) plots the overall communication overhead per iteration over various GPU counts. We subtracted the baseline number (163.0 msec as mentioned above) to capture the total communication overhead reliably and comprehensively. With 4 GPUs (which are all in a single node), BLC and GLOO show similar performance because both simply use NCCL (while Ring does not). However, BLC incurs less communication overhead with more GPUs.

   • The communication overhead in (a) for BLC on 192 GPUs is about 31.0 msec, where the jitter accounts for 5-10 msec. GLOO scales much better than Ring, but BLC offers the best scaling overall, with about 87% reduction in communication overhead over GLOO on 192 GPUs (58.0 vs 31.0 msec; that is, GLOO's overhead is about 1.87× BLC's). If we assume the jitter is 5 msec, then the actual communication overhead improvement of BLC over GLOO is about 2× on 192 GPUs.
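As a quick sanity check on the numbers reported in this section, the following minimal Python sketch recomputes the single-GPU throughput from the 163.0 msec baseline and the GLOO-to-BLC overhead ratio on 192 GPUs, with and without an assumed common jitter. The constant and function names are illustrative assumptions and are not part of BlueConnect or Caffe2; only the input figures come from the text.

```python
# Figures quoted in Section 4.2; all names here are illustrative only.
BATCH_PER_GPU = 32                          # images per GPU per iteration
BASELINE_MS = 163.0                         # single-GPU iteration time (msec)
OVERHEAD_MS = {'GLOO': 58.0, 'BLC': 31.0}   # overhead on 192 GPUs (msec)


def images_per_sec(iteration_ms, batch=BATCH_PER_GPU):
    '''Per-GPU throughput for a given iteration time in msec.'''
    return batch / (iteration_ms / 1000.0)


def overhead_ratio(jitter_ms=0.0):
    '''GLOO overhead divided by BLC overhead after removing a common jitter.'''
    gloo = OVERHEAD_MS['GLOO'] - jitter_ms
    blc = OVERHEAD_MS['BLC'] - jitter_ms
    return gloo / blc


if __name__ == '__main__':
    print(round(images_per_sec(BASELINE_MS), 1))  # single-GPU images/sec
    print(round(overhead_ratio(), 2))             # raw overhead ratio
    print(round(overhead_ratio(5.0), 2))          # ratio assuming 5 msec jitter
    print(BATCH_PER_GPU * 192)                    # effective batch on 192 GPUs
```

With these inputs the raw overhead ratio is about 1.87, and removing 5 msec of jitter from both measurements gives about 2.04, in line with the roughly 2× improvement stated for 192 GPUs.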
[Figure 6. Training performance comparison over 192 GPUs for Ring, GLOO, and BLC. (a) Communication Overhead (msec) versus #GPUs. (b) Scaling Efficiency: speedup versus #GPUs, with the Ideal line. (c) Training Throughput: normalized throughput (images/sec) versus #GPUs. (d) Training Throughput (projected with FP16).]

   • (b) highlights how communication overhead impacts the scaling efficiency, one of the key metrics in large-scale deep learning. Note that our scaling efficiency is compared with respect to a single-GPU performance, instead of a single-node performance (Goyal et al., 2017). It shows that BLC scales best due to efficient synchronization in SGD. Ring scales worst, keeping GPUs idle for an extended period; it wastes 48% of GPU computing power on 192 GPUs (equivalent to 92 GPUs).

   • (c) shows that BLC delivers the best images/sec throughput over other communication techniques. On 192 GPUs, BLC delivers 11% higher throughput than GLOO, and 45% higher throughput than Ring.

   • (d) presents the projected throughput with FP16 on the future generation GPUs (i.e., Ampere GPUs) by scaling down the single-GPU iteration time by 4.8x (2x faster than Volta GPUs (NVidia, 2017b)) and cutting the communication overhead by half. It indicates that the throughput gap between BLC and others would get wider (e.g., from 11% to 22% on 192 GPUs compared with GLOO), supporting our claim that a faster communication algorithm is crucial for deep learning on more powerful computing resources.

5    CONCLUSION AND FUTURE WORK

We have proposed BlueConnect, an efficient communication library for training complex deep neural networks with a large number of GPUs, thus offering a viable strategy to reduce training time from weeks to hours. Such rapid turnaround can accelerate the improvement of existing neural networks, the design of new neural networks, and the exploration of new application domains. To proliferate this technology to the masses, more research needs to be done, because massive GPU scaling relies on successful training to good accuracy for large batch size. Prior techniques such as (Goyal et al., 2017; You et al., 2017a) have been demonstrated on some neural network types, but we need to extend them to other popular neural network types, in particular, recurrent neural networks. The whole training has to be made resilient and elastic, since it is very likely that some devices will malfunction when the number of devices increases. Automation and usability issues have to be addressed to enable more turnkey operation, especially in a cloud environment.

6    ACKNOWLEDGMENT

We thank the IBM PowerAI team for assistance with BlueConnect implementation and testing, and Brad Neimanich, Alex Habeger, Bryant Nelson, Nicolas Castet, and Bill Armstrong for BlueConnect integration and productization into IBM PowerAI DDL.

REFERENCES

Abadi, Martin, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, Kudlur, Manjunath, Levenberg, Josh, Monga, Rajat, Moore, Sherry, Murray, Derek G., Steiner, Benoit, Tucker, Paul, Vasudevan, Vijay, Warden, Pete, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Al-Fares, Mohammad, Loukissas, Alexander, and Vahdat, Amin. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, August 2008.

Almási, George, Heidelberger, Philip, Archer, Charles J., Martorell, Xavier, Erway, C. Chris, Moreira, José E., Steinmacher-Burow, B., and Zheng, Yili. Optimization of mpi collective communication on bluegene/l systems. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS '05, pp. 253–262, 2005. ISBN 1-59593-167-8.

Amazon. https://aws.amazon.com/ec2/instance-types/p2.

Amodei, Dario, Anubhai, Rishita, Battenberg, Eric, Case, Carl, Casper, Jared, Catanzaro, Bryan, Chen, Jingdong, Chrzanowski, Mike, Coates, Adam, Diamos, Greg, Elsen, Erich, Engel, Jesse, Fan, Linxi, Fougner, Christopher, Han, Tony, Hannun, Awni Y., Jun, Billy, LeGresley, Patrick, Lin, Libby, Narang, Sharan, Ng, Andrew Y., Ozair, Sherjil, Prenger, Ryan, Raiman, Jonathan, Satheesh, Sanjeev, Seetapun, David, Sengupta, Shubho, Wang, Yi, Wang, Zhiqian, Wang, Chong, Xiao, Bo, Yogatama, Dani, Zhan, Jun, and Zhu, Zhenyao. Deep speech 2: End-to-end speech recognition in english and mandarin. CoRR, abs/1512.02595, 2015.

Baidu. https://github.com/baidu-research/baidu-allreduce. 2017.

Balles, Lukas, Romero, Javier, and Hennig, Philipp. Coupling adaptive batch sizes with learning rates. CoRR, abs/1612.05086, 2016.

Chen, Jianmin, Monga, Rajat, Bengio, Samy, and Józefowicz, Rafal. Revisiting distributed synchronous SGD. CoRR, abs/1604.00981, 2016.

Chen, Tianqi, Li, Mu, Li, Yutian, Lin, Min, Wang, Naiyan, Wang, Minjie, Xiao, Tianjun, Xu, Bing, Zhang, Chiyuan, and Zhang, Zheng. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.

Dichev, K. and Lastovetsky, A. Optimization of collective communication for heterogeneous HPC platforms, pp. 95–114. Wiley Series on Parallel and Distributed Computing. 2014.

Facebook. https://caffe2.ai. a.

Facebook. https://github.com/facebookincubator/gloo. b.

Goyal, Priya, Dollár, Piotr, Girshick, Ross B., Noordhuis, Pieter, Wesolowski, Lukasz, Kyrola, Aapo, Tulloch, Andrew, Jia, Yangqing, and He, Kaiming. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

Iandola, Forrest N., Ashraf, Khalid, Moskewicz, Matthew W., and Keutzer, Kurt. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR, abs/1511.00175, 2015.

IBM. https://www.ibm.com/us-en/marketplace/high-performance-computing. 2017a.

IBM. https://www.ibm.com/us-en/marketplace/spectrum-mpi. 2017b.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp. 448–456, 2015.

Jia, Xianyan, Song, Shutao, He, Wei, Wang, Yangzihao, Rong, Haidong, Zhou, Feihu, Xie, Liqiang, Guo, Zhenyu, Yang, Yuanzhou, Yu, Liwei, Chen, Tiegang, Hu, Guangxiao, Shi, Shaohuai, and Chu, Xiaowen. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. 2018.

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

Jouppi, Norman P., Young, Cliff, Patil, Nishant, Patterson, David, Agrawal, Gaurav, Bajwa, Raminder, Bates, Sarah, Bhatia, Suresh, Boden, Nan, Borchers, Al, Boyle, Rick, Cantin, Pierre-luc, Chao, Clifford, Clark, Chris, Coriell, Jeremy, Daley, Mike, Dau, Matt, Dean, Jeffrey, Gelb, Ben, Ghaemmaghami, Tara Vazir, Gottipati, Rajendra, Gulland, William, Hagmann, Robert, Ho, Richard C., Hogberg, Doug, Hu, John, Hundt, Robert, Hurt, Dan, Ibarz, Julian, Jaffey, Aaron, Jaworski, Alek, Kaplan, Alexander, Khaitan, Harshit, Koch, Andy, Kumar, Naveen, Lacy, Steve, Laudon, James, Law, James, Le, Diemthu, Leary, Chris, Liu, Zhuyuan, Lucke, Kyle, Lundin, Alan, MacKean, Gordon, Maggiore, Adriana, Mahony, Maire, Miller, Kieran, Nagarajan, Rahul, Narayanaswami, Ravi, Ni, Ray, Nix, Kathy, Norrie, Thomas, Omernick, Mark, Penukonda, Narayana, Phelps, Andy, Ross, Jonathan, Salek, Amir, Samadiani, Emad, Severn, Chris, Sizikov, Gregory, Snelham, Matthew, Souter, Jed, Steinberg, Dan, Swing, Andy, Tan, Mercedes, Thorson, Gregory, Tian, Bo, Toma, Horia, Tuttle, Erick, Vasudevan, Vijay, Walter, Richard, Wang, Walter, Wilcox, Eric, and Yoon, Doe Hyun. In-datacenter performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.

Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016.

Keuper, Janis. Distributed training of deep neuronal networks: Theoretical and practical limits of parallel scalability. CoRR, abs/1609.06870, 2016.

Krizhevsky, Alex. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105. 2012.

Lian, Xiangru, Zhang, Ce, Zhang, Huan, Hsieh, Cho-Jui, Zhang, Wei, and Liu, Ji. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. 2017.

MPI-Tutorial. http://mpitutorial.com/tutorials/introduction-to-groups-and-communicators.

Niitani, Yusuke, Ogawa, Toru, Saito, Shunta, and Saito, Masaki. Chainercv: a library for deep learning in computer vision. CoRR, abs/1708.08169, 2017.

Niu, Feng, Recht, Benjamin, Re, Christopher, and Wright, Stephen J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pp. 693–701, 2011.

NVidia. https://devblogs.nvidia.com/parallelforall/dgx-1-fastest-deep-learning-system. a.

NVidia. https://www.nvidia.com/en-us/data-center/dgx-saturnv. b.

NVidia. https://developer.nvidia.com/nccl. 2017a.

NVidia. https://devblogs.nvidia.com/parallelforall/inside-volta. 2017b.

NVLink. https://en.wikipedia.org/wiki/NVLink. 2017.

OpenMPI. https://www.open-mpi.org/projects/hwloc.

Seide, Frank and Agarwal, Amit. Cntk: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 2135–2135, 2016.

Sivakumar, H., Bailey, S., and Grossman, R. L. Psockets: The case for application-level network striping for data intensive applications using high speed wide area networks. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, 2000.

Solnushkin, Konstantin S. Automated design of torus networks. CoRR, abs/1301.6180, 2013.

Sridharan, Srinivas, Vaidyanathan, Karthikeyan, Kalamkar, Dhiraj, Das, Dipankar, Smorkalov, Mikhail E., Shiryaev, Mikhail, Mudigere, Dheevatsa, Mellempudi, Naveen, Avancha, Sasikanth, Kaul, Bharat, and Dubey, Pradeep. On scale-out deep learning training for cloud and hpc. 2018.

Thakur, Rajeev, Rabenseifner, Rolf, and Gropp, William. Optimization of collective communication operations in mpich. Int. J. High Perform. Comput. Appl., 19(1):49–66, February 2005.

Uber. https://eng.uber.com/horovod. 2017.

Wang, Jianyu and Joshi, Gauri. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. 2018.

Yildirim, E., Arslan, E., Kim, J., and Kosar, T. Application-level optimization of big data transfers through pipelining, parallelism and concurrency. IEEE Transactions on Cloud Computing, 4(1):63–75, 2016.

You, Yang, Gitman, Igor, and Ginsburg, Boris. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017a.

You, Yang, Zhang, Zhao, Hsieh, Cho-Jui, Demmel, James, and Keutzer, Kurt. Imagenet training in minutes. 2017b.

Zhang, Sixin, Choromanska, Anna, and LeCun, Yann. Deep learning with elastic averaging sgd. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pp. 685–693, 2015a.

Zhang, Wei, Gupta, Suyog, Lian, Xiangru, and Liu, Ji. Staleness-aware async-sgd for distributed deep learning. CoRR, abs/1511.05950, 2015b.
", "award": [], "sourceid": 130, "authors": [{"given_name": "Minsik", "family_name": "Cho", "institution": "IBM"}, {"given_name": "Ulrich", "family_name": "Finkler", "institution": "IBM Research"}, {"given_name": "David", "family_name": "Kung", "institution": "IBM Research"}, {"given_name": "Hillery", "family_name": "Hunter", "institution": "IBM Research"}]}