{"title": "TicTac: Accelerating Distributed Deep Learning with Communication Scheduling", "book": "Proceedings of Machine Learning and Systems", "page_first": 418, "page_last": 430, "abstract": "State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity of models and input data. In this work, we identify an opportunity for accelerating distributed DNN training in systems that rely on graph representation for computation, such as TensorFlow and PyTorch, through communication scheduling. We develop a system, TicTac, that reduces the iteration time by identifying and enforcing parameter transfers in the order in which the parameters are consumed by the underlying computational model, thereby guaranteeing near-optimal overlap of communication and computation. Our system is implemented over TensorFlow and enforces the optimal ordering by prioritization of parameter transfers at the Parameter Server in data parallel training. TicTac requires no changes to the model or developer inputs and improves the throughput by up to $37.7\\%$ in inference and $19.2\\%$ in training, while also reducing straggler effect by up to $2.3\\times$. Our code is publicly available.", "full_text": "                           TICTAC: ACCELERATING DISTRIBUTED DEEP LEARNING WITH\r\n                                                      COMMUNICATIONSCHEDULING\r\n                                         SayedHadiHashemi*1 SangeethaAbduJyothi*1 RoyHCampbell1\r\n                                                                           ABSTRACT\r\n                     State-of-the-art deep learning systems rely on iterative distributed training to tackle the increasing complexity\r\n                     of models and input data. In this work, we identify an opportunity for accelerating distributed DNN training in\r\n                     systems that rely on graph representation for computation, such as TensorFlow and PyTorch, through commu-\r\n                     nication scheduling. We develop a system, TicTac, that reduces the iteration time by identifying and enforcing\r\n                     parameter transfers in the order in which the parameters are consumed by the underlying computational model,\r\n                     thereby guaranteeing near-optimal overlap of communication and computation. Our system is implemented over\r\n                     TensorFlow and enforces the optimal ordering by prioritization of parameter transfers at the Parameter Server in\r\n                     data parallel training. TicTac requires no changes to the model or developer inputs and improves the throughput\r\n                     by up to 37.7% in inference and 19.2% in training, while also reducing straggler effect by up to 2.3\u00d7. Our code\r\n                     is publicly available.\r\n                1    INTRODUCTION                                                    improve the learning time by hours in these long-running\r\n                Deep learning has grown signi\ufb01cantly in the past decade,             learning jobs.\r\n                fuelled by the \ufb02exibility of development offered by ma-              Theiteration time in deep learning systems depends on the\r\n                chine learning frameworks, availability of rich data, and            time taken by (i) computation, (ii) communication and (iii)\r\n                readily accessible distributed high-performance comput-              the overlap between the two. 
When workers receive the pa-\r\n                ing. The computational cost of training sophisticated deep           rametersfromtheparameterserveratthebeginningofeach\r\n                learning modelshaslongoutgrownthecapabilitiesofasin-                 iteration, all parameters are not used simultaneously; they\r\n                gle high-end machine, leading to distributed training being          are consumed based on the dependencies in the underlying\r\n                the norm in a typical AI pipeline. Training a deep learning          DAG.Whileoneparticularscheduleofparametertransfers\r\n                model is an iterative job which may take days to weeks in            (over the complete set of parameters in a given model in a\r\n                high-end clusters today.                                             single iteration) may facilitate faster computation, another\r\n                Computational graphs are used to represent the training              may cause blockage. Hence, identifying the best schedule\r\n                jobs in state-of-the-art systems (Abadi et al., 2016; Chen           of parameter transfers is critical for reducing the blocking\r\n                et al., 2015; Paszke et al., 2017). In the commonly-used             on computation (determined by DAG dependencies), and\r\n                Model Replica or data parallel mode of training, the input           in turn improving the overlap and the iteration time.\r\n                data is partitioned and processed at participating workers           We observe that the schedule of data transfers in current\r\n                using identical computational graphs. Each iteration typi-           systems(Abadietal.,2016;Chenetal.,2015;Paszkeetal.,\r\n                cally lasts milliseconds to seconds. At the end of each it-          2017) is determined arbitrarily during execution without\r\n                eration, servers exchange a relatively large amount of data          considering the impact on overlap. We quantify the ob-\r\n                associated with parameter updates to aggregate the results           served combinations in TensorFlow and \ufb01nd that in a trial\r\n                of the iteration. This communication overhead has a sub-             with 1000 iterations on ResNet-V2-50, every iteration had\r\n                stantial impact on throughput of the system and also limits          a unique order of received parameters which has not been\r\n                its scalability (Sridharan et al., 2018; Alistarh et al., 2017).     observedpreviously. Thisrandomorderofparametertrans-\r\n                Evenasmallimprovementincommunicationoverheadcan                      fers at workers has two performance implications. First,\r\n                  *                    1                                             the iteration time, and in turn throughput (number of sam-\r\n                   Equal contribution   Department of Computer Science, Uni-         ples processed per second), suffers signi\ufb01cantly due to sub-\r\n                versity of Illinois at Urbana-Champaign, Urbana, IL. Correspon-      optimal overlap. Second, even in the same iteration, multi-\r\n                dence to: Sayed Hadi Hashemi <hashemi3@illinois.edu>.                ple workers might follow different schedules of data trans-\r\n                Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA,         fers, leading to stragglers during synchronized training.\r\n                2019. 
Copyright 2019 by the author(s).\r\n                                      TicTac: Accelerating Distributed Deep Learning with Communication Scheduling\r\n               Past work has attempted to address this issue by enforc-          are reading parameters from the PS or decentralized work-\r\n               ing the same order of parameter transfers at all workers.         ers as shown in Figure 3. While decentralized aggrega-\r\n               However, these solutions are restricted to earlier systems        tion techniques (such as all-reduce and Horovod (Sergeev\r\n               with layer-by-layer model representation (Arnold, 2016;           &Balso, 2018)) are gaining traction in high performance\r\n               Cuietal., 2016; Zhang et al., 2017) where \ufb01nding the opti-        networking, TicTac does not address such systems and is\r\n               mal order of execution is trivial (Cui et al., 2014). In mod-     focused on PS.\r\n               ern systems with DAG representation (Abadi et al., 2016;          In this section, we give a brief overview of deep learning\r\n               Paszke et al., 2017), this is a non-trivial challenge.            systems, prior techniques proposed in these systems to mit-\r\n               Inthiswork,wedeviseasystematicmethodologyforderiv-                igate network overhead, and opportunities for further opti-\r\n               ing near-optimal schedules of parameter transfers through         mization.\r\n               critical path analysis on the underlying computational\r\n               graph. This allows maximal overlap of computation and             2.1   NetworkOptimizationinDNNtraining\r\n               communication and prevents stragglers arising from ran-           In deep learning systems, high GPU utilization can be\r\n               dom order of parameter transfers at workers.        We also       achievedintwoways: (i)whentotalcommunicationtimeis\r\n               develop a lightweight resource-level enforcement mecha-           less than or equal to the computation time and (ii) with ef\ufb01-\r\n               nism over TensorFlow (Abadi et al., 2016). These tech-            cient overlap of communication and computation. Several\r\n               niques form the core of our system, TicTac, which achieves        techniqueshavebeenproposedtoimproveGPUutilization.\r\n               substantial performance improvement while requiring no\r\n               changes in the model or developer inputs.                         Increasing computation time:          The fraction of com-\r\n               In summary, we make the following contributions:                  putation time relative to communication time can be in-\r\n               \u2022 We identify an opportunity for improving performance            creased by increasing the batch size (Iandola et al., 2016).\r\n                  in state-of-the-art deep learning systems with Parameter       However, this approach suffers from decreased accu-\r\n                  Server-based aggregation through prioritized parameter         racy (Keskar et al., 2016) and may not be generally appli-\r\n                  transfers (\u00a72).                                                cable under resource constraints. (Goyal et al., 2017; Cho\r\n               \u2022 We de\ufb01ne a metric to quantify the ef\ufb01ciency of a given          et al., 2017; You et al., 2017; Akiba et al., 2017).\r\n                  execution: the overlap coef\ufb01cient (\u00a73).                       
 Decreasing communicationtime: Solutions for reducing\r\n               \u2022 We propose two heuristics, TIC and TAC, for near-               networkcommunicationhavetakenmultipleapproaches\u2014\r\n                  optimal scheduling of computation and communication            modifying the machine learning algorithm to reduce com-\r\n                  in Model Replica with Parameter Server.                        munication cost (Alistarh et al., 2017; Wen et al., 2017;\r\n               \u2022 We implement our system over TensorFlow (\u00a7 5). The              Zhang et al., 2017), reducing the precision of parameter\r\n                  code is publicly available. 1                                  representation (Vanhoucke et al., 2011; Courbariaux et al.,\r\n                                                                                 2015; Gupta et al., 2015), changing the network primitives\r\n               \u2022 We extensively evaluate the performance of our system           to collective (e.g. all reduce) (Goyal et al., 2017; Cho et al.,\r\n                  in GPU and high-end CPU environments under training            2017; Amodei et al., 2015; You et al., 2017; Akiba et al.,\r\n                  and inference of DNN models and show that throughput           2017) or broadcast (Zhang et al., 2017).\r\n                  can be improved by up to 37% (\u00a76).                             Smarter interleaving of computation and communica-\r\n               2    BACKGROUNDANDMOTIVATION                                      tion:  Several layer-by-layer systems (Arnold, 2016; Cui\r\n                                                                                 et al., 2016; Zhang et al., 2017), where the models are se-\r\n               Oursystemfocusesonnetworkoptimizationindeeplearn-                 quential and obtaining the order is trivial (Cui et al., 2014),\r\n               ing frameworks with DAG representation of computational           adopt this approach. These solutions are not applicable\r\n               graphs (Abadi et al., 2016; Paszke et al., 2017), Model           to current DAG-based systems such as TensorFlow (Abadi\r\n               Replica (MR) mode of distribution and Parameter Servers.          et al., 2016) and PyTorch (Paszke et al., 2017). The inter-\r\n               The performance improvement provided by TicTac is ben-            resource dependency considered in (Cui et al., 2016) (with\r\n               e\ufb01cial in two key environments. First, it improves through-       GPUmemory) and in (Zhang et al., 2017) (with network)\r\n               put and iteration time in clud environment with commodity         is constrained to layer-by-layer models.\r\n               hardware or on-demand clusters where high resiliency is           In this work, we focus on improving the iteration time\r\n               critical (workers may be preempted). Second, in online re-        through better and predictable overlap of communication\r\n               inforcementlearningwithworkersfortrainingandseparate              and computation. Techniques for optimizing communica-\r\n               active agents for inference, enforced ordering can improve        tion and communication time are orthogonal to our system\r\n               the inference time. 
In this environment, the active agents
and may be used in parallel with TicTac.
1 https://github.com/xldrx/tictac

Figure 1: Impact of multi-resource operation ordering on performance. (a) Toy computational graph; (b) good execution order; (c) bad execution order.

Figure 2: Distributed execution of Model-Replica with Parameter Server.

Figure 3: A general reinforcement learning setup.

2.2 Opportunity for Optimization

We demonstrate the opportunity for accelerating DNN training
through a better understanding of the internal\r\n                          computational model in TensorFlow which is a Directed                                                              ResNet-v2-50,Inception-v3andVGG-16networksandob-\r\n                          AcyclicGraph(DAG).Theparametertransfersaredenoted                                                                  serve the order of network transfers at a single worker. The\r\n                          by send and recv operations in the DAG. In MR, each                                                                observed order of parameter transfer is unique in ResNet-\r\n                          worker has an identical copy of the computational DAG.                                                             v2-50 and Inception-v3 networks across the 1000 runs. In\r\n                          In the worker DAG, all recv ops are roots and send ops                                                             VGG-16,weobserve493uniquecombinationsacross1000\r\n                          are leaves. Thus recv ops can block the initialization of                                                          runs.\r\n                          a computation branch in the DAG. Since the activation of\r\n                          various branches of computation in the DAG is dependent                                                            2.3       ComparisonwithOtherDistributedSystems\r\n                          on the recv at the root of the branch, the ordering in MR                                                          It is worth noting that deep learning systems with computa-\r\n                          can be reduced to the ordering of recv ops in workers.                                                             tional graphs are fundamentally different from graph pro-\r\n                          DAGatthe PS is different from that at workers. PS DAG                                                              cessing systems (Malewicz et al., 2010; Hoque & Gupta,\r\n                          has \ufb01ve ops per parameter: aggregation, send, recv, read,                                                          2013;Xinetal.,2013). Indeeplearning, thegraphisarep-\r\n                          and update. Since send and recv at the PS are not blocked                                                          resentation of the computation to be done on the input data.\r\n                          bycomputation, our focus is on the worker DAG.                                                                     In graph processing systems, the graph itself is the input to\r\n                          In the simple DAG shown in Figure 1a, a sample worker                                                              the computation. As a result, graphs in DNN frameworks\r\n                          DAG,therearetwopossibleschedulesforparametertrans-                                                                 are a few orders of magnitude smaller than a typical large-\r\n                          fers. If recv (parameter 1 transfer from PS to the worker)                                                         scale graph processing system. 
Iterations in DNN frame-\r\n                                                1\r\n                          happens before recv (parameter 2 transfer), it reduces the                                                         works are identical, and network communication pattern is\r\n                                                               2\r\n                          blocking on computation time and improves the overlap.                                                             \ufb01xed. This may not be true for graph processing systems.\r\n                          The reverse order results in increased iteration time due to                                                       Instreamprocessingsystems,therelationshipbetweenpro-\r\n                          blocking on computation. Thus, in a distributed environ-                                                           cessing elements are represented using graphs. These sys-\r\n                          ment, network can block computation based on dependen-                                                             temsallowpipelining,withdifferentpartitionsofinputdata\r\n                          cies in the DAG. This can lead to under-utilization of com-                                                        being processed in different elements along the pipeline at\r\n                          putational capacity, in turn resulting in sub-optimal perfor-                                                      the same time. In contrast, DNN frameworks process the\r\n                          mance. In addition, variation in iteration time caused by                                                          entire batch of input at a processing element at a worker.\r\n                          random order of parameter transfers across multiple work-                                                          Pipelining is not employed in this environment. Hence, op-\r\n                          ers can lead to straggling effect.                                                                                 timizations proposed for stream processing cannot be bor-\r\n                          Theimpactofpooroverlapcanbesigni\ufb01cantinDNNtrain-                                                                   rowed here.\r\n                          ing due to complexity of state-of-the-art models. For in-\r\n                          stance, ResNet-v2-152 (He et al., 2016) has 363 param-                                                             3       QUANTIFYING PERFORMANCE\r\n                          eters with an aggregate size of 229.5MB. The computa-\r\n                          tional graph associated with this neural network has 4655                                                          In this section, we explore methods for quantitatively com-\r\n                          operations in the TensorFlow framework. Finding the op-                                                            paring the ef\ufb01ciency of multiple schedules. Towards this\r\n                          timal schedule in this complex DAG involves evaluating                                                             goal, we formally de\ufb01ne the scheduling problem and inves-\r\n                          363! combinations. Werun1000iterationsoflearningover                                                               tigate the feasibility of \ufb01nding an optimal solution. 
Finally, we define a metric that is used to quantify the efficiency of a schedule.

3.1 Scheduling Problem

The objective is to find the optimal schedule of network transfers that minimizes the iteration time by improving the communication/computation overlap. The network transfers of parameters (recv ops) are roots in the computational graph at the worker. The branch of computation ops dependent on a recv op can be executed only after the network transfer is completed. Thus, the order of network transfers can determine the order of computation as well as the extent of overlap. We focus on improving the overlap, and in turn the iteration time, by choosing a near-optimal schedule of parameter transfers.

The inputs to this optimization problem are: (a) the worker DAG, and (b) a time oracle. The time oracle (Time(op)) predicts the execution time of a given op. For computation ops, this indicates the elapsed time on a computation resource. For communication ops, this represents the transfer time on the communication medium. We compute the time assuming that the resource is dedicated to the op under consideration.

The output of the scheduling algorithm is a feasible schedule of ops in the DAG, tagged by priorities. Ops in a computational DAG may have multiple feasible topological orders. However, some of them may result in a bad iteration time (as explained in Figure 1). We want to limit the execution path to the one that improves the training performance. We achieve this with priority numbers. A priority number is a positive integer assigned to an op in the DAG. A higher-priority op is given a lower priority number. An op may not be assigned a priority if it need not be ordered. Multiple ops may be assigned the same priority if their relative order is insignificant.

The order is enforced in the following manner. When we need to select a new op from the ready-to-execute queue, we randomly choose from among the set of ops that carry the lowest priority number and those without any priority number. It is worth noting that priority only specifies relative order among candidate ops in the ready-to-execute queue at a given resource, and the resulting order will still respect the topological order specified by the DAG.

The problem of finding the optimal schedule is NP-hard. A simpler version of the optimal execution problem with homogeneous hardware can be formally defined as follows (using the notation in (Pinedo, 2008)): Pm | Mi, prec | Cmax. In this formulation, Pm represents multiple parallel resources with identical performance. Mi assigns the operations to specific resources, i.e., computation ops vs. communication. prec describes the dependency relation of ops in the DAG. Cmax represents the goal of scheduling: to minimize the completion time of the last node. This problem is still open (Brucker & Knust, 2007) and simpler cases are proven to be NP-hard. While there exist approximations for relaxed versions of this problem, to the best of our knowledge, there is no solution or approximation with guaranteed bounds for our original problem.

3.2 Defining Overlap Coefficient

The two major contributors to total DNN iteration time (T) are the network transfer time, or communication time (N), and the computation time (C). Since computation and communication may overlap, the total time T ≤ N + C. Given a GPU/CPU/TPU environment, we assume the computation time, C, to be constant. We ignore the case of computation stragglers and focus on communication.

We define two metrics that determine the DNN iteration time: (a) the communication/computation ratio, ρ, and (b) the overlap coefficient, α. The ratio of communication to computation, denoted by ρ, determines the extent of benefits achievable. When ρ < 1, communication time is smaller than the total computation time, providing ample opportunity for running GPUs at high utilization.

The second factor affecting the GPU utilization is the overlap coefficient, α = (N + C − T) / min(N, C). N + C is the iteration time when there is no overlap, and T is the actual iteration time. The difference between these quantities is the extent of overlap. The maximum overlap possible is given by min(N, C), which is achieved when the smaller quantity completely overlaps with the larger quantity. The difference is normalized by this factor to obtain the overlap coefficient, α ∈ [0, 1].

The GPU utilization (U = C/T) can be represented in terms of these coefficients:

    U = C / (N + C − α · min(N, C)) = 1 / (1 + ρ − α · min(ρ, 1))

The goal of our scheduling algorithms is to achieve high GPU efficiency by maximizing α, i.e., increasing the overlap of communication and computation. The impact of our scheduling algorithm on α, and in turn on GPU utilization, is plotted in Figure 4 using Inception v3 with 2 workers and 1 PS as an example.

Figure 4: Improvement in GPU utilization with TicTac (overlap coefficient α versus communication-to-computation ratio ρ, with GPU utilization contours, for TensorFlow with and without TicTac).
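To make the two metrics concrete, the following is a minimal sketch (not part of TicTac) that computes ρ, α, and the resulting utilization from per-iteration timings N, C, and T, for example as reported by a profiler; the function name and the sample numbers are illustrative only.

def overlap_metrics(N, C, T):
    # N: communication time, C: computation time, T: measured iteration time
    rho = N / C                          # communication-to-computation ratio
    alpha = (N + C - T) / min(N, C)      # overlap coefficient, in [0, 1]
    utilization = C / T                  # equals 1 / (1 + rho - alpha * min(rho, 1))
    return rho, alpha, utilization

# Example: 40 ms of transfers, 60 ms of compute, 70 ms measured iteration time.
rho, alpha, util = overlap_metrics(N=40.0, C=60.0, T=70.0)
print(rho, alpha, util)   # approximately 0.67, 0.75, 0.86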
4 SCHEDULING ALGORITHMS

In this section, we present two heuristics to derive the optimal schedule of recv ops using a given worker DAG (§3). The intuition behind our heuristics is to prioritize transfers that speed up the critical path in the DAG by reducing blocking on computation caused by parameter transfers.

Timing-Independent Communication scheduling (TIC): In TIC, we assign priorities based only on vertex dependencies in the DAG (ignoring the execution time of each op). Higher priorities are given to transfers which are least blocking on computation. In this algorithm, we ignore the time oracle, Time, and assume all ops have equal cost.

Timing-Aware Communication scheduling (TAC): In this algorithm, we prioritize transfers that maximize α by using information on (a) vertex dependencies among ops specified by the computational DAG, and (b) the execution time of each op estimated with the time oracle.

4.1 Op properties

Before delving into the algorithms, we define properties associated with ops that are used in the scheduling algorithms. The inputs are the worker dataflow DAG (G), a time oracle (Time), the available communication channels on a device (C), and the set of outstanding (to-be-activated) recv ops (R). We assume that recv ops not in R have their corresponding transfers completed. These properties are updated using Algorithm 1.

Algorithm 1: Property Update Algorithm
    // Update properties for the given set of outstanding recv ops R
    Function UpdateProperties(G, Time, R):
        foreach op ∈ G do
            op.M ← Σ over r ∈ op.dep ∩ R of Time(r)
        foreach op ∈ R do
            op.P ← 0;  op.M+ ← +∞
        foreach op ∈ G − R do
            D ← op.dep ∩ R
            if |D| = 1 then  ∀r ∈ D: r.P ← r.P + Time(op)
            if |D| > 1 then  ∀r ∈ D: r.M+ ← min{r.M+, op.M}

Communication Dependency (op.dep): This is the set of recv ops that an op is directly or transitively dependent on. For example, in Figure 1a, op2.dep = {recv1, recv2}. We extract the communication dependencies using a depth-first post-fix graph traversal on the DAG.

Communication Time (op.M): The communication time of an op is the total network transfer time required to complete that op. For a recv op, this is the time required to complete its corresponding transfer, given by Time(recvOp). For other ops, this is the total time to complete all outstanding dependent transfers, given by Σ over r ∈ op.dep ∩ R of Time(r). For example, in Figure 1a, op1.M = Time(recv1) and op2.M = Time(recv1) + Time(recv2).

For recv ops, we define two additional properties.

Directly-Dependent Compute Load (recvOp.P): This property represents the computational benefit of completing a recv op. More specifically, it is the total Time(op) for all ops that can be activated only by completing this recvOp, but not without it. These ops are those whose communication dependencies contain only this outstanding recvOp (it is admissible to have communication dependencies on other completed recv operations). For example, in Figure 1a, recv1.P = Time(op1) and recv2.P = 0, since no op can execute with the completion of only recv2.

Impending Communication Load (recvOp.M+): This property helps us to identify candidate recv ops to be activated, given that the current recv is completed. In more detail, it is the minimum communication cost to activate a computation op which has multiple recv dependencies, including the one under consideration. For example, in Figure 1a, recv1.M+ = recv2.M+ = Time(recv1) + Time(recv2). Please note that recvOp.M+ includes the communication time of that recvOp.
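The sketch below illustrates the property update of Algorithm 1 in plain Python. The dictionary-based representation, the helper name update_properties, and the toy DAG (modeled on Figure 1a) are our own illustrations, not the TicTac implementation.

import math

def update_properties(ops, time, outstanding):
    # ops: {name: set of recv names the op depends on (op.dep)}
    # time: {name: estimated cost}; outstanding: recv names still to be transferred (R)
    props = {}
    # Communication time M: total outstanding transfer time the op still waits on.
    for name, dep in ops.items():
        props[name] = {'M': sum(time[r] for r in dep & outstanding)}
    # Initialize P and M+ for outstanding recv ops.
    for r in outstanding:
        props[r]['P'] = 0.0            # directly-dependent compute load
        props[r]['M_plus'] = math.inf  # impending communication load
    # Accumulate P and M+ over the remaining ops.
    for name, dep in ops.items():
        if name in outstanding:
            continue
        D = dep & outstanding
        if len(D) == 1:                # blocked by exactly one outstanding recv
            (r,) = D
            props[r]['P'] += time[name]
        elif len(D) > 1:               # needs several recvs; track the cheapest way to unblock it
            for r in D:
                props[r]['M_plus'] = min(props[r]['M_plus'], props[name]['M'])
    return props

# Toy DAG from Figure 1a: op1 depends on recv1; op2 depends on recv1 and recv2.
ops = {'recv1': {'recv1'}, 'recv2': {'recv2'},
       'op1': {'recv1'}, 'op2': {'recv1', 'recv2'}}
time = {'recv1': 2.0, 'recv2': 3.0, 'op1': 4.0, 'op2': 1.0}
print(update_properties(ops, time, outstanding={'recv1', 'recv2'}))
# recv1: P = 4.0 (it alone unblocks op1), recv2: P = 0.0, and both have M+ = 5.0.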
4.2 Timing-Independent Communication Scheduling (TIC)

The goal of this algorithm is to prioritize those transfers which reduce blocking on network transfers. Our intuition is that information on the DAG structure alone can provide significant improvement.

To achieve this goal, we define a generic time function which only uses the number of communication ops instead of the time taken by an op. We use this simple cost function to generate the schedule in Timing-Independent Communication scheduling (TIC).

General Time Oracle: We define a simple universal time oracle as follows:

    Time_General(op) = 0 if op is not recv, 1 if op is recv    (1)

The complete solution is given in Algorithm 2.

Algorithm 2: Timing-Independent Communication Scheduling (TIC)
    Function TIC(G)
        FindDependencies(G)
        UpdateProperties(G, R, Time = {Computation: 0, Communication: 1})
        ∀op in G, if op is recv: op.priority ← op.M+
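As a follow-on to the update_properties() sketch above, a TIC-style priority assignment can be expressed in a few lines. With the unit-cost oracle of Equation 1, a recv op's M+ simply counts the outstanding transfers needed to unblock some computation it feeds, and Algorithm 2 uses that value directly as the priority number (ops may share a priority); the helper name is ours.

def tic_priorities(ops, recvs):
    # Unit-cost oracle from Equation 1: recv ops cost 1, every other op costs 0.
    unit_time = {name: (1.0 if name in recvs else 0.0) for name in ops}
    props = update_properties(ops, unit_time, outstanding=set(recvs))
    return {r: props[r]['M_plus'] for r in recvs}

# With the toy DAG above, both transfers feed op2, so each gets priority 2.0.
print(tic_priorities(ops, recvs={'recv1', 'recv2'}))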
4.3 Timing-Aware Communication Scheduling (TAC)

The goal of this algorithm is to prioritize those transfers which reduce the blocking of computation, i.e., speeding up transfers on the critical path. To achieve this goal, the algorithm focuses on two cases. First, it considers the opportunity for overlapping communication and computation. Second, in the case of equal overlap or the absence of it, it looks at the impending transfers to choose the one which eliminates the computation block sooner.

To better describe the logic, we begin with an example for each case.

Figure 5: Sample DAG. (a) Case 1; (b) Case 2.

Case 1: In Figure 5a, when deciding between two recv ops, A and B, A should precede B iff:

    A ≺ B ⇔ T(A→B) < T(B→A)
          ⇔ M_A + max{P_A, M_B} + P_B < M_B + max{P_B, M_A} + P_A
          ⇔ M_A + P_A + M_B − min{P_A, M_B} + P_B < M_B + P_B + M_A − min{P_B, M_A} + P_A
          ⇔ min{P_B, M_A} < min{P_A, M_B}

Therefore:

    A ≺ B → min{P_B, M_A} < min{P_A, M_B}    (2)
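A quick numeric sanity check of this equivalence, under the two-transfer cost model T(X→Y) = M_X + max{P_X, M_Y} + P_Y used above; the sample values and the helper name are arbitrary illustrations.

def iteration_time(m_first, p_first, m_second, p_second):
    # Cost of running the first transfer, then the second, in the two-op model above.
    return m_first + max(p_first, m_second) + p_second

for m_a, p_a, m_b, p_b in [(2, 5, 3, 1), (1, 0, 4, 6), (3, 3, 3, 3)]:
    a_first_is_faster = iteration_time(m_a, p_a, m_b, p_b) < iteration_time(m_b, p_b, m_a, p_a)
    criterion = min(p_b, m_a) < min(p_a, m_b)    # Equation 2
    assert a_first_is_faster == criterion
print('Equation 2 agrees with the direct comparison on the sample values.')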
Case 2: In Figure 5b, when all recv ops are outstanding, their P is 0, making them equivalent under the comparison in Equation 2. Obviously, recv_A and recv_B should precede the other recvs. Hence, we use M+ to break the ties: recv_A.M+ = recv_B.M+ = Time(recv_A) + Time(recv_B) < recv_C.M+ < recv_D.M+.

Comparator: We combine the results from the two cases to make a comparator that extends to multiple recv ops. This is an approximate induction, which may not be correct in general. The result is the Comparator function in Algorithm 3. It is easy to prove that this function is transitive and can be used for partial ordering.

The ordering algorithm takes a partition graph on a worker and calculates the communication dependencies; then, while there is an outstanding recv op, it updates the properties, finds the smallest recv op with respect to the comparator, removes that recv from the outstanding set, and assigns it a higher priority relative to the others.

Algorithm 3: Timing-Aware Communication Scheduling (TAC)
    // Compare two given recv ops
    Function Comparator(Op_A, Op_B): Bool
        A ← min(P_A, M_B)
        B ← min(P_B, M_A)
        if A ≠ B then
            return A < B
        else
            return M+_A < M+_B
    Function TAC(G, Time)
        FindDependencies(G)
        R ← {op | op ∈ G, op is recv}
        count ← 0
        while R is not empty do
            UpdateProperties(G, R, Time)
            find the minimum op from R with respect to Comparator
            remove op from R
            op.priority ← count
            count ← count + 1
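The ordering loop can be sketched in Python as below, again reusing update_properties() and the toy DAG from the §4.1 sketch. The pairwise comparison follows the criterion of Equation 2 (A precedes B when min{P_B, M_A} < min{P_A, M_B}) with the M+ tie-break of Algorithm 3; the helper names are ours and this is an illustration, not the TicTac code.

import functools

def tac_priorities(ops, time, recvs):
    outstanding = set(recvs)
    priorities, count = {}, 0
    while outstanding:
        props = update_properties(ops, time, outstanding)

        def precedes(a, b):
            # Equation 2: a goes first when min(P_b, M_a) < min(P_a, M_b).
            lhs = min(props[b]['P'], props[a]['M'])
            rhs = min(props[a]['P'], props[b]['M'])
            if lhs != rhs:
                return -1 if lhs < rhs else 1
            # Tie-break on the impending communication load M+ (Algorithm 3).
            if props[a]['M_plus'] != props[b]['M_plus']:
                return -1 if props[a]['M_plus'] < props[b]['M_plus'] else 1
            return 0

        best = min(outstanding, key=functools.cmp_to_key(precedes))
        outstanding.remove(best)
        priorities[best] = count      # lower number = transferred earlier
        count += 1
    return priorities

# With the toy DAG and costs above, recv1 (which unblocks op1) is scheduled first.
print(tac_priorities(ops, time, recvs={'recv1', 'recv2'}))   # recv1 gets 0, recv2 gets 1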
We\r\n                  Case 2:         In Figure 5b, when all recv ops are out-                        execute each operation 5 times and measure the time taken\r\n                  standing, their P is 0, making them equivalent under the                        in each run. Our Time Oracle implementation chooses the\r\n                  comparison in Equation 2. Obviously, recv                   and recv            minimum of all measured runs for a given op as the time\r\n                                                                           A              B\r\n                  should precede other recvs. Hence, we use M+ to break                           for that op.\r\n                  the ties: recv .M+ = recv .M+ = Time(recv ) +\r\n                                    A                   B                            A\r\n                  Time(recv ) < recv .M+ < recv .M+.                                              OrderingWizard: Thismoduleisresponsibleforassign-\r\n                                 B            C                  D\r\n                                                         TicTac: Accelerating Distributed Deep Learning with Communication Scheduling\r\n                                              Base Model                                                                                    Receiver        RPC                     RPC         Sender\r\n                                                                   Ordering            Time                                                   Application Framework   Network    Framework Application\r\n                                                                    Wizard             Oracle\r\n                                                                                                                                                Recv                                           Send     A\r\n                                                                                                                                           B    Create \r\n                                                Model                                                                                           Request\r\n                                                                                                                                                            Send \r\n                                                                    Priority        Time Oracle                                                            Request\r\n                                                                     List            Estimator                                                                         Transfer\r\n                                                                                                                                                                       Request\r\n                                                                                                                                                                                   Receive \r\n                                                                                                                                                                                   Request\r\n                                                         TensorFlow                                                                                                                           Receive \r\n                                                                 
                                                                                                                             Request\r\n                                              Execution          Enforcement                                                                                                                  Prepare \r\n                                                Engine             Module                                                                                                                    Response   C\r\n                                                                                                                                                                                    Send \r\n                                                                                                                                                                                  Response\r\n                                                                                                                                                                       Transfer\r\n                                                                                                                                                                      Response\r\n                                                Tracing                               Timing                                                               Receive \r\n                                                Module                                 Stats                                                              Response\r\n                                                                                                                                                Process \r\n                                                                                                                                               Response\r\n                          Figure 6: System Design. Components of our system are in                                                          Figure 7: Life time of a network transfer.\r\n                                                 blue sharp-edged rectangles.\r\n                       ing priorities to recv ops on a single worker. The sched-                                            the sender. If the send op is also active at the sender, the\r\n                       ule may be computed based on TIC or TAC. In TAC, the                                                 transfer may be initiated by gRPC. In this data\ufb02ow, there\r\n                       ordering module relies on the time estimated by the time                                             are three possible candidate locations for enforcing order-\r\n                       oracle. In TIC, the order is determined based on the DAG                                             ing \u2014 at the receiver before the request is initiated, at the\r\n                       alone. The estimated priorities are sent to the enforcement                                          sender before the send op is activated or at the sender be-\r\n                       module. The priority list is calculated of\ufb02ine before the                                            fore the transfer is sent to gRPC. Alternatively, this may\r\n                       execution; all iterations follow the same order.                                                     
also be enforced as a direct dependency in the DAG.\r\n                       Enforcement Module:                       This module takes as input the                             We implement the enforcement module at the sender, i.e.\r\n                       priority list computed by the ordering module and enforces                                           the PS, before the transfer is sent to gRPC. This choice\r\n                       this order on the network transfers per worker.                                                      is guided by several practical concerns.                               Enforcing di-\r\n                                                                                                                            rectly on the DAG is conservative since each transfer has\r\n                       5.1      Implementation                                                                              to wait for the completion of the previous transfer. This\r\n                       We implement our system over TensorFlow 1.8. We de-                                                  prevents pipelining and drastically reduces the communi-\r\n                       scribe our implementation in detail.                                                                 cation throughput. Ordering the activation of recv or send\r\n                                                                                                                            ops is not suf\ufb01cient since it could change throughout the\r\n                       Time Oracle:              We use the TensorFlow internal tracer to                                   data \ufb02ow. For example, a larger transfer request may take\r\n                       measure the time of computation ops. We extend the capa-                                             longer to reach the response state on the sender side. Dur-\r\n                       bility (115 LOC C++) of this tracer to collect information                                           ing this interval, a smaller transfer with lower priority may\r\n                       on network transfer at all workers. Our code is publicly                                             catch up.\r\n                       available (https://github.com/xldrx/tictac).                                                         For the purpose of enforcement, the priorities are sequen-\r\n                                                                                                                            tially assigned to an integer in the range of [0,n). Thus,\r\n                       OrderingWizard: WeimplementTICandTACasof\ufb02ine                                                         the priority number of a transfer represents the number\r\n                       analyzers (250 LOC in Python). The implementation takes                                              of transfers that have to complete before it. The sender\r\n                       time oracle and base model in the TensorFlow DAG format                                              (PS server) maintains a counter for each worker per itera-\r\n                       and generates the priority of recv ops.                                                              tion which is incremented when a corresponding transfer\r\n                                                                                                                            is handed to the gRPC. 
During our experiments, we noticed that gRPC may not always process transfers in the order in which they are queued. This affects the effectiveness of our ordering in some cases; however, the number of such occurrences at the gRPC level is very small. On the Inception model (one of the tested models), this error was 0.5% with TIC and 0.4% with TAC.

6 RESULTS

In this section, we evaluate TicTac under a wide range of input/system parameters to answer the following questions:

• How does TicTac perform as the number of workers scales out?
• How is TicTac affected by the number of parameter servers?
• How do the benefits accrued with TicTac change with the communication and computation cost?
• How well do the proposed heuristics perform in terms of consistency and straggler mitigation?
Setup: We use in-graph replication for Distributed TensorFlow (Google) with synchronized training and synthetic input data.

We test TicTac under two environments. (a) Cloud GPU environment (envG): Standard NC6 virtual machines (6 cores, 56 GB memory, 1 Nvidia K80 GPU with 12 GB memory) on the Azure cloud; for parameter servers we use Standard F64s v2 machines (CPU only, 64 cores, 128 GB memory). (b) High-end CPU cluster (envC): a commodity cluster (32 cores, 64 GB memory, 1 GbE network). In both environments, we test 2 to 16 workers and 1 to 4 PS. To understand the impact of batch size, we test the networks with the standard batch size multiplied by the factors [0.5, 1, 2]. We tested our method on 10 well-known models (details of the models are in Table 1 in the Appendix).

We evaluate the performance under two workloads: training and inference. In training, we use Stochastic Gradient Descent (SGD) as the optimizer; the training workload is identical to the training jobs used in practice. We emulate the inference workload of agents in reinforcement learning with online training. In this environment, parameter servers store the parameters, which are updated by a set of training worker nodes (which we do not consider in the inference workload). The inference agents are responsible for reading the parameters from the PS and running the inference; this is the phase we evaluate in this workload.

In each test, we discard the first 2 iterations to limit the warm-up effect (initialization of GPUs, caches, etc.). This is necessary since the first iteration takes much longer than the rest. We record the next 10 iterations. For throughput, we report the mean across the 10 iterations; for straggler effect and overlap coefficient, we report the maximum. Computing the TIC and TAC heuristics takes approximately 10 seconds. Note that these heuristics are computed before training/inference begins; hence, they add no overhead during execution.
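This measurement protocol is simple to restate as code. The sketch below is a minimal illustration (not the actual evaluation harness), assuming a list of per-iteration wall-clock times and a fixed per-worker batch size as inputs.

    def summarize_run(iteration_times, batch_size, warmup=2, measured=10):
        """Reduce per-iteration wall-clock times (in seconds) to the reported statistics:
        drop the warm-up iterations, then average over the next `measured` iterations."""
        times = iteration_times[warmup:warmup + measured]
        throughput = [batch_size / t for t in times]  # samples/second for each iteration
        return {
            "mean_throughput": sum(throughput) / len(throughput),
            "mean_iteration_time": sum(times) / len(times),
        }

Straggler effect and overlap coefficient are per-worker quantities, so for those we keep the maximum observed value across the recorded iterations rather than the mean.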
We use the ImageNet dataset for our experiments. We evaluated both synthetic and real data and observed less than 3% difference in iteration time on a single machine. The data is read in the TFRecord format from a shared NFS-connected Azure storage; samples are resized, augmented, and prefetched during training. TicTac does not alter the computational flow of the model; it only chooses one of the feasible orders of network transfers. Hence, it does not affect the accuracy of training (shown in Figure 8).

Figure 8: Loss value throughout the first 500 iterations of training InceptionV3 on ImageNet (comparing No Ordering and TIC).

Next, we compare the performance metrics across the various heuristics. Specifically, we evaluate throughput, overlap coefficient, and the prevalence of stragglers (slow workers that force others to wait, thereby increasing the iteration time). The performance of TIC is only marginally worse than that of TAC (shown in Figure 15 in the Appendix). This indicates that, for current models, DAG-level information is sufficient for obtaining a near-optimal schedule. However, we expect the gap between TIC and TAC to widen as the complexity of models increases.

We attempted to compare TicTac with Poseidon (Zhang et al., 2017). However, only the binaries of Poseidon are publicly available. In our experiments, Poseidon performed extremely poorly compared to TicTac, and even to vanilla TensorFlow 1.8. Since Poseidon is based on an older version of TensorFlow (1.0) and CUDA (8.0), we were unable to attribute the poor performance to their methodology. Hence, we exclude the results, since the comparison is inconclusive. Additionally, since order extraction is not explained in their paper, we were unable to reimplement their strategy.

6.1 Throughput

Scaling the number of workers: In Figure 9, we evaluate the impact of scaling the number of workers with the ratio of PS to workers fixed at 1:4. We obtain up to 37.7% speedup in throughput across networks. The gains are measured relative to the baseline (no scheduling). Larger networks have higher performance gains. The speedup depends on two factors: communication load and extent of overlap. As the number of workers increases, the communication load on the PS increases. When the communication load increases, scheduling can provide benefits through better overlap, up to a threshold. When the communication load is much higher than the computation load, the impact of overlap diminishes; hence, beyond this threshold, the benefits accrued with scheduling reduce. This threshold varies across models. Also, the gains are measured with respect to a baseline that chooses a random schedule, leading to variations in performance. Hence, we observe varying trends across networks based on network-specific characteristics. In small networks, with a small number of workers and parameter servers, the overhead associated with scheduling may overshadow the benefits of better overlap. In such rare cases, we observe a slowdown of up to 4.2%. This suggests that scheduling of network transfers may be disabled in small networks at small training and inference sizes.

Figure 9: Impact of scaling the number of workers on throughput. The gains are measured with respect to the baseline (no scheduling). Measured on envG with PS:Workers in the ratio 1:4.

Scaling the number of Parameter Servers: In Figure 10, we evaluate the impact of scaling the number of parameter servers with 8 workers in envG (cloud with GPUs) across various networks. In general, we obtain higher gains in the inference phase than in training. Even in the presence of multiple parameter servers, enforcing ordering with TicTac provides significant performance improvement.

Figure 10: Impact of scaling the number of Parameter Servers on envG cloud GPU environment with 8 workers.

Scaling the computational load: In Figure 11, we show the impact of varying the computational load by testing each model with the prescribed batch size multiplied by three factors: 0.5, 1, and 2. There are two factors affecting the scaling of computational load: computation time and opportunity for overlap. The relative ratio of communication and computation determines the opportunity for overlap. As the batch size increases, the computation time increases. If the communication time is higher than the computation time, an increase in computation time increases the opportunity for overlap. If the communication time is smaller than the computation time, scaling will reduce throughput as the opportunity for overlap reduces.

Figure 11: Impact of scaling the computational load on envG cloud GPU environment with 4 workers.

Scalability with network size: We show the improvement in throughput (samples/second) achieved with TIC compared to the baseline with no scheduling in Figure 12. We observe that larger networks obtain higher gains. This can be attributed to the larger variance in parameter transfer orders in larger DAGs in the absence of scheduling.
Figure 12: Throughput speedup with training and inference as a function of DAG size, represented in number of ops.

6.2 Overlap Coefficient

To validate the overlap coefficient metric, we run training of Inception v2 1000 times, each with and without the scheduling algorithm (TAC), in envC. The overlap coefficient can predict the step time accurately, with a high R² score of 0.98, as seen in Figure 13 (a). This proves that most of the variation in iteration time arises from random schedules of parameter transfers. We also observe that, in the absence of enforced scheduling, the step time and the overlap coefficient have a large variance. With scheduling, the step time is reduced and its variance is minimal. Moreover, most runs have an overlap coefficient approaching 1, indicating near-optimal scheduling in TAC.

Figure 13: In envC, on Inception v2, (a) regression test of scheduling efficiency (overlap coefficient α) against normalized step time, with a linear regression fit (R² = 0.98); (b) CDF of normalized step time across scheduling mechanisms (No Ordering vs. TAC).
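The regression check behind Figure 13 (a) is straightforward to reproduce given per-run measurements. The sketch below is a minimal illustration, assuming two arrays collected from the 1000 runs (overlap coefficients and normalized step times); it is not the paper's analysis script.

    from scipy import stats

    def regression_check(overlap, step_time):
        """Fit normalized step time against the overlap coefficient and report the
        R^2 of the linear fit (0.98 in the measurements reported above)."""
        fit = stats.linregress(overlap, step_time)
        return {"slope": fit.slope, "intercept": fit.intercept, "r_squared": fit.rvalue ** 2}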
6.3 Performance Consistency

In Figure 13 (b), we compare the consistency in performance obtained with and without scheduling (TAC) in inference on InceptionV2, with 1000 runs in envC. We see that TAC has consistent performance, denoted by a sharp curve in the CDF. The baseline (no scheduling), on the other hand, has a large variance. For comparison, the 95th percentile of normalized step time is 0.63403 for the baseline and 0.99825 for TAC.

Straggler Effect: Performance inconsistency creates a straggling-worker effect when multiple workers have different makespans. As a result, all workers have to wait for the slowest one. We quantify the straggler time as the maximum time spent by any worker waiting, relative to the total iteration time (represented as a percentage).
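Concretely, this metric can be computed per iteration as in the small sketch below, assuming we log each worker's wait time and the total iteration time (both hypothetical inputs here).

    def straggler_time_percent(wait_times, iteration_time):
        """Straggler time: the longest time any worker spent waiting,
        as a percentage of the total iteration time."""
        return 100.0 * max(wait_times) / iteration_time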
Figure 14: Effect of stragglers with TIC in the GPU environment, envG.

In Figure 14, we show the impact of stragglers. The straggler effect is caused by two factors: system-level performance variations and the efficiency of scheduling on individual workers. In the baseline, workers follow arbitrary schedules. Hence, a worker with a bad order forces the other workers into a long wait, more than 50% of the total iteration time in some cases. On average, scheduling limits the straggler effect, with larger benefits in bigger DNNs (higher number of ops). Enforcing any order reduces the straggler effect, regardless of the quality of the chosen order.

7 CONCLUSION

In this work, we elucidate the importance of communication scheduling in distributed deep learning systems. We devised a metric for quantifying the efficiency of a given schedule of data transfers and developed two heuristics for efficient scheduling. Through extensive testing of these heuristics across a variety of workloads, we demonstrated that significant gains are achievable through communication scheduling. For a typical DNN training job, which runs for days to weeks, a 20% improvement in iteration time can save significant compute power.

Our study encourages further research in network scheduling for parameter servers as well as for other unexplored aggregation techniques such as all-reduce. In the future, we can also take into account additional metrics, such as congestion in the network fabric, for better network performance. These results also provide motivation for extending the scheduling to additional resource types such as memory and storage.

8 ACKNOWLEDGEMENT

We thank Paul Barham and Brighten Godfrey for their feedback. This material is based on work supported by the National Science Foundation under Grant No. 1725729. Azure Cloud resources used in this paper were provided through a Microsoft Azure Sponsorship.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Akiba, T., Suzuki, S., and Fukuda, K. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. CoRR, abs/1711.04325, 2017. URL http://arxiv.org/abs/1711.04325.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1707–1718, 2017.

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., Elsen, E., Engel, J., Fan, L., Fougner, C., Han, T., Hannun, A. Y., Jun, B., LeGresley, P., Lin, L., Narang, S., Ng, A. Y., Ozair, S., Prenger, R., Raiman, J., Satheesh, S., Seetapun, D., Sengupta, S., Wang, Y., Wang, Z., Wang, C., Xiao, B., Yogatama, D., Zhan, J., and Zhu, Z. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015. URL http://arxiv.org/abs/1512.02595.

Arnold, S. An introduction to distributed deep learning. https://seba1511.com/dist_blog/, 2016.

Brucker, P. and Knust, S. Complexity results for scheduling problems, 2007.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Cho, M., Finkler, U., Kumar, S., Kung, D., Saxena, V., and Sreedhar, D. PowerAI DDL. arXiv preprint arXiv:1708.02188, 2017.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

Cui, H., Tumanov, A., Wei, J., Xu, L., Dai, W., Haber-Kucharsky, J., Ho, Q., Ganger, G. R., Gibbons, P. B., Gibson, G. A., et al. Exploiting iterative-ness for parallel ML computations. In Proceedings of the ACM Symposium on Cloud Computing, pp. 1–14. ACM, 2014.

Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, pp. 4. ACM, 2016.

Google. Distributed TensorFlow. https://www.tensorflow.org/deploy/distributed.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/1603.05027.

Hoque, I. and Gupta, I. LFGraph: Simple and fast distributed graph analytics. In Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems, pp. 9. ACM, 2013.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146. ACM, 2010.

Paszke, A., Gross, S., Chintala, S., and Chanan, G. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration, 2017.

Pinedo, M. L. Scheduling: Theory, Algorithms, and Systems. Springer Publishing Company, Incorporated, 3rd edition, 2008. ISBN 0387789340, 9780387789347.

Sergeev, A. and Balso, M. D. Horovod: Fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018. URL http://arxiv.org/abs/1802.05799.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Sridharan, S., Vaidyanathan, K., Kalamkar, D., Das, D., Smorkalov, M. E., Shiryaev, M., Mudigere, D., Mellempudi, N., Avancha, S., Kaul, B., et al. On scale-out deep learning training for cloud and HPC. arXiv preprint arXiv:1801.08030, 2018.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015. URL http://arxiv.org/abs/1512.00567.

Vanhoucke, V., Senior, A., and Mao, M. Z. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, pp. 4. Citeseer, 2011.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1508–1518, 2017.

Xin, R. S., Gonzalez, J. E., Franklin, M. J., and Stoica, I. GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems, pp. 2. ACM, 2013.

You, Y., Zhang, Z., Hsieh, C., and Demmel, J. 100-epoch ImageNet training with AlexNet in 24 minutes. CoRR, abs/1709.05011, 2017. URL http://arxiv.org/abs/1709.05011.

Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E. P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pp. 181–193, Santa Clara, CA, 2017. USENIX Association. ISBN 978-1-931971-38-6. URL https://www.usenix.org/conference/atc17/technical-sessions/presentation/zhang.

APPENDIX

A DNN MODELS

In Table 1, we present the model characteristics of the 10 deep learning models used in our evaluation.
The number of parameters, the total size of all parameters, the number of computational operations in inference and training modes, and the standard batch size are given below.

Table 1: DNN model characteristics.

Neural Network Model                        #Par   Total Par Size (MiB)   #Ops (Inference/Training)   Batch Size
AlexNet v2 (Krizhevsky, 2014)               16     191.89                 235/483                     512
Inception v1 (Szegedy et al., 2014)         116    25.24                  1114/2246                   128
Inception v2 (Ioffe & Szegedy, 2015)        141    42.64                  1369/2706                   128
Inception v3 (Szegedy et al., 2015)         196    103.54                 1904/3672                   32
ResNet-50 v1 (He et al., 2015)              108    97.39                  1114/2096                   32
ResNet-101 v1 (He et al., 2015)             210    169.74                 2083/3898                   64
ResNet-50 v2 (He et al., 2016)              125    97.45                  1423/2813                   64
ResNet-101 v2 (He et al., 2016)             244    169.86                 2749/5380                   32
VGG-16 (Simonyan & Zisserman, 2014)         32     527.79                 388/758                     32
VGG-19 (Simonyan & Zisserman, 2014)         38     548.05                 442/857                     32

B TIC VS. TAC

In Figure 15, we plot the increase in throughput achieved with the scheduling schemes (TIC and TAC) in envC, relative to the baseline with no scheduling. We observe that both TIC and TAC offer significant speedup compared to the baseline. The performance of TIC is comparable to that of TAC, indicating that we can achieve improved performance without relying on runtime statistics for current models. Due to the simplicity of the TIC algorithm, we use it as the representative scheduling algorithm in the cloud GPU environment (envG).

Figure 15: Increase in throughput with the scheduling schemes (TIC and TAC) compared to the baseline (no scheduling). Measured on envC (CPU-only).