{"title": "Beyond Data and Model Parallelism for Deep Neural Networks.", "book": "Proceedings of Machine Learning and Systems", "page_first": 1, "page_last": 13, "abstract": "Existing deep learning systems commonly parallelize deep neural network (DNN) training using data or model parallelism, but these strategies often result in suboptimal parallelization performance. We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions. We present FlexFlow, a deep learning engine that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy\u2019s performance and is three orders of magnitude faster than prior approaches that execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow increases training throughput by up to 3.3\u00d7 over state-of-the-art approaches, even when including its search time, and also improves scalability.", "full_text": "BEYOND DATA AND MODEL PARALLELISM FOR DEEP NEURAL NETWORKS\r\nZhihao Jia 1, Matei Zaharia 1, Alex Aiken 1\r\nABSTRACT\r\nExisting deep learning systems commonly parallelize deep neural network (DNN) training using data or model parallelism, but these strategies often result in suboptimal parallelization performance. We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions. 
We present FlexFlow, a deep learning engine that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy\u2019s performance and is three orders of magnitude faster than prior approaches that execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow increases training throughput by up to 3.3\u00d7 over state-of-the-art approaches, even when including its search time, and also improves scalability.\r\n1 Introduction\r\nAs deep learning methods have evolved, DNN models have gotten progressively larger and more computationally expensive to train. As a result, it is now standard practice to parallelize DNN training across distributed heterogeneous clusters (Dean et al., 2012; Abadi et al., 2016). Although DNN models and the clusters used to parallelize them are increasingly complex, the strategies used by today\u2019s deep learning frameworks (e.g., TensorFlow, Caffe2, and MXNet) to parallelize training remain simple, and often suboptimal.\r\nThe most common parallelization strategy is data parallelism (Krizhevsky et al., 2012), which places a replica of the entire neural network on each device, so that each device processes a subset of the training data, and synchronizes network parameters across replicas at the end of an iteration. Data parallelism is efficient for compute-intensive operators with a few trainable parameters (e.g., convolution) but achieves suboptimal parallelization performance for operators with a large number of parameters (e.g., embedding).\r\nAnother common parallelization strategy is model parallelism (Dean et al., 2012), which assigns disjoint subsets of a neural network each to a dedicated device. This approach eliminates parameter synchronization between devices but requires data transfers between operators. ColocRL (Mirhoseini et al., 2017) uses reinforcement learning to learn efficient operator assignments for model parallelism but only explores parallelism in the operator dimension.\r\nWe recently proposed OptCNN (Jia et al., 2018), which uses layer-wise parallelism for parallelizing CNNs with linear computation graphs. OptCNN uses dynamic programming to jointly optimize how to parallelize each operator but does not consider parallelism across different operators. Moreover, OptCNN does not apply to many DNNs used for language modeling, machine translation, and recommendations, which tend to be RNNs or other non-linear networks.\r\nIn this paper, we introduce a comprehensive SOAP (Sample-Operator-Attribute-Parameter) search space of parallelization strategies for DNNs that generalizes and goes beyond previous approaches. The operator dimension describes how different operators in a DNN are parallelized. For a single operator, the sample and parameter dimensions indicate how training samples and model parameters are distributed across devices. Finally, the attribute dimension defines how different attributes within a sample are partitioned (e.g., the height and width dimensions of an image).\r\nWe use SOAP in FlexFlow, a deep learning engine that automatically finds fast parallelization strategies in the SOAP search space for arbitrary DNNs. Existing approaches only consider one or a subset of SOAP dimensions. For example, data parallelism only explores the sample dimension, while OptCNN parallelizes linear CNNs in the sample, attribute and parameter dimensions. FlexFlow considers parallelizing any DNN (linear or non-linear) in all SOAP dimensions and explores a more comprehensive search space that includes existing approaches as special cases. As a result, FlexFlow is able to find parallelization strategies that significantly outperform existing approaches.\r\n1 Stanford University. Correspondence to: Zhihao Jia <zhihao@cs.stanford.edu>.\r\nProceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\nThe key challenge FlexFlow must address is how to efficiently explore the SOAP search space, which is much larger than those considered in previous systems and includes more sophisticated parallelization strategies. To this end, FlexFlow uses two main components: a fast, incremental execution simulator to evaluate different parallelization strategies, and a Markov Chain Monte Carlo (MCMC) search algorithm that takes advantage of the incremental simulator to rapidly explore the large search space.\r\nFlexFlow\u2019s execution simulator can accurately predict the performance of a parallelization strategy in the SOAP search space for arbitrary DNNs and is three orders of magnitude faster than profiling real executions. We borrow the idea from OptCNN of measuring the performance of an operator once for each configuration and feed these measurements into a task graph that models both the architecture of a DNN model and the network topology of a cluster. The execution simulator estimates the performance of a parallelization strategy by simulating the execution on the task graph. In addition, we introduce a delta simulation algorithm that simulates a new strategy using incremental updates to previous simulations and further improves performance over naive simulations by up to 6.9\u00d7.\r\nThe execution simulator achieves high accuracy for predicting parallelization performance. We evaluate the simulator with six real-world DNNs on two different GPU clusters and show that, for all the measured executions, the relative difference between the real and simulated execution time is less than 30%. Most importantly for the search, we test different strategies for a given DNN and show that their simulated execution time preserves real execution time ordering.\r\nUsing the execution simulator as an oracle, the FlexFlow execution optimizer uses a MCMC search algorithm to explore the SOAP search space and iteratively propose candidate strategies based on the simulated performance of previous candidates. The execution simulator can also work with other search strategies, such as learning-based search algorithms. When the search procedure is finished, the execution optimizer returns the best strategy it has discovered.\r\nWe evaluate FlexFlow on a variety of real-world DNNs including AlexNet (Krizhevsky et al., 2012), ResNet-101 (He et al., 2016), Inception-v3 (Szegedy et al., 2016), RNN Text Classification (Kim, 2014), RNN Language Modeling (Zaremba et al., 2014) and Neural Machine Translation (Wu et al., 2016). Compared to data/model parallelism and strategies manually designed by domain experts (Krizhevsky, 2014; Wu et al., 2016), FlexFlow increases training throughput by up to 3.3\u00d7, reduces communication costs by up to 5\u00d7, and achieves significantly better scaling. In addition, FlexFlow outperforms the strategies found by ColocRL by 3.4-3.8\u00d7 on the same hardware configuration evaluated in ColocRL. Finally, FlexFlow also outperforms OptCNN, even on linear DNNs, by supporting a larger search space.\r\nTable 1. The parallelism dimensions used by different approaches. S, O, A, and P indicate parallelism in the Sample, Operator, Attribute, and Parameter dimensions. Hybrid parallelism indicates an approach supports parallelizing an operator in a combination of the sample, attribute, and parameter dimensions (see Figure 2).\r\nParallelization Approach | Parallelism Dimensions | Hybrid Parallelism | Supported DNNs\r\nData Parallelism | S | no | partial **\r\nModel Parallelism | O, P | no | all\r\nKrizhevsky (2014) | S, P | no | CNNs **\r\nWu et al. (2016) | S, O | no | RNNs #\r\nColocRL | O | no | partial #\r\nOptCNN | S, A, P | yes | linear %\r\nFlexFlow (this paper) | S, O, A, P | yes | all\r\n** Does not work for DNNs whose entire model cannot fit on a single device.\r\n# Does not work for DNNs with large operators that cannot fit on a single device.\r\n% Only works for DNNs with linear computation graphs.\r\n2 Related Work\r\nData and model parallelism have been widely used by existing deep learning systems to distribute training across devices. Data parallelism (Krizhevsky et al., 2012) is inefficient for operators with a large number of parameters (e.g., densely-connected layers) and becomes a scalability bottleneck in large scale distributed training. Model parallelism (Dean et al., 2012) splits a DNN into disjoint subsets and trains each subset on a dedicated device, which reduces communication costs for synchronizing network parameters but exposes limited parallelism.\r\nExpert-designed parallelization strategies manually optimize parallelization for specific DNNs by using experts\u2019 domain knowledge and experience. For example, Krizhevsky (2014) introduces \u201cone weird trick\u201d that uses data parallelism for convolutional and pooling layers and switches to model parallelism for densely-connected layers to accelerate CNNs. To parallelize RNNs, Wu et al. (2016) uses data parallelism that replicates the entire RNN on each node and switches to model parallelism for intra-node parallelization. Although these expert-designed strategies improve performance over data and model parallelism, they are suboptimal. We use these expert-designed strategies as baselines in our experiments and show that FlexFlow can further improve training throughput by up to 2.3\u00d7.\r\nAutomated frameworks have been proposed for finding efficient parallelization strategies in a limited search space. ColocRL (Mirhoseini et al., 2017) uses reinforcement learning to find efficient device placement for model parallelism. OptCNN (Jia et al., 2018) uses dynamic programming to parallelize linear CNNs. OptCNN\u2019s approach does not explore parallelism across operators and is not applicable to DNNs with non-linear computation graphs. Gao et al. (2017; 2019) exploited hybrid parallelization on tiled domain-specific hardware and proposed various dataflow optimizations for both intra-layer and inter-layer data communication.\r\nFigure 1. FlexFlow overview. (Diagram labels: Operator Graph with MatMul, Concat, and Conv nodes; Device Topology with Network, CPU, and GPU nodes; Execution Optimizer containing MCMC Search Alg. and Execution Simulator, with Candidate Strategy and Simulated Performance between them; Best Found Strategy; Distributed Runtime.)\r\nTable 2. Parallelizable dimensions for different operators. The sample and channel dimension index different samples and neurons, respectively. For images, the length and the combination of height and width dimensions specify a position in an image.\r\nOperator | (S)ample | (A)ttribute | (P)arameter\r\n1D pooling | sample | length, channel | \r\n1D convolution | sample | length | channel\r\n2D convolution | sample | height, width | channel\r\nMatrix multiplication | sample | | channel\r\nFigure 2. Example parallelization configurations for 1D convolution. Dashed lines show partitioning the tensor. (Panels: Data Parallelism (S); Model Parallelism (P); Hybrid Parallelism (S, P); Hybrid Parallelism (S, A, P). Axes: Sample, Length, Channel.)\r\nTable 1 summarizes the parallelism dimensions explored by existing approaches. Data parallelism uses the sample dimension to parallelize training, while model parallelism exploits the parameter and operator dimensions. Expert-designed strategies exploit parallelism in the sample or parameter dimension to parallelize an operator but do not support hybrid parallelism that uses a combination of the sample, attribute, and parameter dimensions to parallelize an operator (see Figure 2). Compared to these manually designed strategies, FlexFlow considers more sophisticated, and often more efficient, strategies to parallelize a single operator. In addition, compared to existing automated frameworks (e.g., ColocRL and OptCNN), FlexFlow supports\r\nlution, etc.), and each edge (o_i, o_j) \u2208 G is a tensor (i.e., a n-dimensional array) that is an output of o_i and an input of o_j. In addition, FlexFlow also takes a device topology graph D = (D_N, D_E) describing all available hardware devices and their interconnections, as shown in Figure 1. Each node d_i \u2208 D_N represents a device (e.g., a CPU or a GPU), and each edge (d_i, d_j) \u2208 D_E is a hardware connection (e.g., a NVLink, a PCI-e, or a network link) between device d_i and d_j. The edges are labeled with the bandwidth and latency of the connection.\r\nFlexFlow takes an operator graph and a device topology as inputs and automatically finds an efficient strategy in the SOAP search space. 
All strategies in the search space\r\n                                                          moregeneric DNNs and \ufb01nds strategies that are up to 3.8\u00d7                                                                                                                                                                                                                perform the same computation de\ufb01ned by the DNN and\r\n                                                          faster by exploring a signi\ufb01cantly larger search space.                                                                                                                                                                                                                 therefore maintains the same model accuracy by design.\r\n                                                          Graph-based cluster schedulers. Previous work has pro-                                                                                                                                                                                                                 ThemaincomponentsofFlexFlowareshowninFigure1.\r\n                                                          posed cluster schedulers that schedule cluster-wide tasks                                                                                                                                                                                                              Theexecution optimizer uses a MCMC search algorithm to\r\n                                                          byusing graph-based algorithms. 
For example, Quincy (Is-                                                                                                                                                                                                                explore the space of possible parallelization strategies and\r\n                                                          ard et al., 2009) maps task scheduling to a \ufb02ow network                                                                                                                                                                                                                 iteratively proposes candidate strategies that are evaluated\r\n                                                          and uses a min-cost max-\ufb02ow (MCMF) algorithm to \ufb01nd                                                                                                                                                                                                                     by an execution simulator. The execution simulator uses\r\n                                                          ef\ufb01cient task placement. Firmament (Gog et al., 2016) gen-                                                                                                                                                                                                              a delta simulation algorithm that simulates a new strategy\r\n                                                          eralizes Quincy by employing multiple MCMF optimiza-                                                                                                                                                                                                                    using incremental updates to previous simulations. The\r\n                                                          tion algorithms to reduce task placement latencies. 
Existing                                                                                                                                                                                                            simulated execution time guides the search in generating\r\n                                                          graph-based schedulers optimize task placement by assum-                                                                                                                                                                                                                future candidates. Whenthesearchtimebudgetisexhausted,\r\n                                                          ing a \ufb01xed task graph. However, FlexFlow solves a different                                                                                                                                                                                                             the execution optimizer sends the best discovered strategy to\r\n                                                          problem that requires jointly optimizing how to partition an                                                                                                                                                                                                            a distributed runtime for parallelizing the actual executions.\r\n                                                          operator into tasks by exploiting parallelism in the SOAP\r\n                                                          dimensions and how to assign tasks to devices.                                                                                                                                                                                                                          
4                 TheSOAPSearchSpace\r\n                                                          3                  Overview                                                                                                                                                                                                                                            This section introduces the SOAP search space of paral-\r\n                                                                                                                                                                                                                                                                                                                                  lelization strategies for DNNs. To parallelize a DNN oper-\r\n                                                          Similar to existing deep learning frameworks (e.g., Tensor-                                                                                                                                                                                                             ator across devices, we require each device to compute a\r\n                                                          Flow and PyTorch), FlexFlow uses an operator graph G                                                                                                                                                                                                                    disjoint subset of the operator\u2019s output tensors. Therefore,\r\n                                                          to describe all operators and state in a DNN. 
Each node                                                                                                                                                                                                                wemodeltheparallelization of an operator o by de\ufb01ning\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                i\r\n                                                          o \u2208 G is an operator (e.g., matrix multiplication, convo-                                                                                                                                                                                                               howtheoutput tensor of o is partitioned.\r\n                                                               i                                                                                                                                                                                                                                                                                                                                                                               i\r\n                                                                BeyondDataandModelParallelismforDeepNeuralNetworks\r\n                                      U (output)           V (input)          W (input)                        5     Execution Simulator\r\n                                    S)Channelout (P)     S)Channelin (P)    ) Channelout (P)\r\n                                    (      
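The idea of parallelizing an operator by partitioning its output tensor can be illustrated with a small sketch. This is not FlexFlow code: the function name partition_output, the dictionary-based tensor description, and the tensor sizes (8 samples, 6 output channels) are assumptions made for illustration. The sketch assumes equal-size splits along each dimension and a per-dimension split count, called degrees here, set to 2 and 2 to mirror the configuration listed for Figure 3.

```python
from itertools import product


def partition_output(shape, degrees):
    """Split an output tensor into equal, disjoint blocks (hypothetical helper).

    shape   -- dict mapping a dimension name to its size
    degrees -- dict mapping a dimension name to the number of equal parts
    Returns a list of tasks; each task maps a dimension name to the
    half-open index range (start, stop) that the task computes.
    """
    for dim, degree in degrees.items():
        if shape[dim] % degree != 0:
            raise ValueError(f"{dim}: size {shape[dim]} is not divisible by {degree}")
    per_dim = []
    for dim, size in shape.items():
        degree = degrees.get(dim, 1)  # dimensions without a degree stay whole
        step = size // degree
        per_dim.append([(dim, (k * step, (k + 1) * step)) for k in range(degree)])
    # The Cartesian product of the per-dimension ranges yields disjoint blocks.
    return [dict(block) for block in product(*per_dim)]


# Illustrative sizes only; the degrees mirror Figure 3's configuration.
tasks = partition_output({"sample": 8, "channel_out": 6},
                         {"sample": 2, "channel_out": 2})
devices = ["GPU1", "GPU2", "GPU3", "GPU4"]
assignment = dict(zip(devices, tasks))  # one task per device
```

The number of tasks equals the product of the per-dimension degrees (here 2 × 2 = 4), and because the blocks are disjoint, no two devices compute the same output element.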
For an operator o_i, we define its parallelizable dimensions P_i as the set of all divisible dimensions in its output tensor. P_i always includes a sample dimension. Every other dimension in P_i is called a parameter dimension if partitioning over that dimension requires splitting the model parameters, and an attribute dimension otherwise. Table 2 shows the parallelizable dimensions of some example operators. Finally, we also consider parallelism across different operators in the operator dimension.

A parallelization configuration c_i of an operator o_i defines how the operator is parallelized across multiple devices. Figure 2 shows some example configurations for parallelizing a 1D convolution operator in a single dimension as well as combinations of multiple dimensions.

For each parallelizable dimension in P_i, c_i includes a positive integer that is the degree of parallelism in that dimension. |c_i| is the product of the parallelism degrees for all parallelizable dimensions of c_i. We use equal size partitions in each dimension to guarantee well-balanced workload distributions. A parallelization configuration c_i partitions the operator o_i into |c_i| independent tasks, denoted as t_{i:1}, ..., t_{i:|c_i|}. The configuration c_i also includes the device assignment for each task t_{i:k} (1 ≤ k ≤ |c_i|). Given the output tensor of a task and its operator type, we can infer the necessary input tensors to execute each task.

Figure 3 shows an example parallelization configuration for a matrix multiplication operator (i.e., U = VW). The operator is partitioned into four independent tasks assigned to different GPU devices. The input and output tensors of the tasks are shown in the figure.

[Figure 3. An example parallelization configuration for a matrix multiplication operator. Tensors: U (output), Sample (S) × Channel_out (P); V (input), Sample (S) × Channel_in (P); W (input), Channel_in (P) × Channel_out (P). Configuration: Degree(Sample) = 2, Degree(Channel_out) = 2; Devices = {GPU1, GPU2, GPU3, GPU4}.]

A parallelization strategy S describes one possible parallelization of an application. S includes a parallelization configuration c_i for each operator o_i, and each o_i's configuration can be chosen independently from among all possible configurations for o_i.

5 Execution Simulator

In this section, we describe the execution simulator, which takes an operator graph G, a device topology D, and a parallelization strategy S as inputs and predicts the execution time to run G on D using strategy S.

The simulator depends on the following assumptions:

A1. The execution time of each task is predictable with low variance and is independent of the contents of input tensors.
A2. For each connection (d_i, d_j) between devices d_i and d_j with bandwidth b, transferring a tensor of size s from d_i to d_j takes s/b time (i.e., the communication bandwidth can be fully utilized).
A3. Each device processes the assigned tasks with a FIFO (first-in-first-out) scheduling policy. This is the policy used by modern devices such as GPUs.
A4. The runtime has negligible overhead. A device begins processing a task as soon as its input tensors are available and the device has finished previous tasks.

To simulate an execution, we borrow the idea from OptCNN (Jia et al., 2018) to measure the execution time of each distinct operator once for each configuration and include these measurements in a task graph, which includes all tasks derived from operators and dependencies between tasks. The simulator can generate an execution timeline by running a simulation algorithm on the task graph.

5.1 Task Graph

A task graph models dependencies between individual tasks derived from operators. To unify the abstraction, we model each hardware connection between devices as a communication device that can only perform communication tasks (i.e., data transfers). Note that devices and hardware connections are modeled as separate devices, which allows computation (i.e., normal tasks) and communication (i.e., communication tasks) to be overlapped if possible.

Given an operator graph G, a device topology D, and a parallelization strategy S, we use the following steps to construct a task graph T = (T_N, T_E), where each node t ∈ T_N is a task (i.e., a normal task or a communication task) and each edge (t_i, t_j) ∈ T_E is a dependency that task t_j cannot start until task t_i is completed. Note that the edges in the task graph are simply ordering constraints; the edges do not indicate data flow, as all data flow is included in the task graph as communication tasks.

   1. For each operator o_i ∈ G with parallelization configuration c_i, we add tasks t_{i:1}, ..., t_{i:|c_i|} to T_N.
   2. For each tensor (o_i, o_j) ∈ G, which is an output of operator o_i and an input of o_j, we compute the output sub-tensors written by tasks t_{i:k_i} (1 ≤ k_i ≤ |c_i|) and the
      input sub-tensors read by tasks t_{j:k_j} (1 ≤ k_j ≤ |c_j|). For every task pair t_{i:k_i} and t_{j:k_j} with shared tensors, if the two tasks are assigned to the same device, we add an edge (t_{i:k_i}, t_{j:k_j}) into T_E, indicating a dependency between the two tasks, and no communication task is needed. If t_{i:k_i} and t_{j:k_j} with shared tensors are assigned to different devices, we add a communication task t_c to T_N and two edges (t_{i:k_i}, t_c) and (t_c, t_{j:k_j}) to T_E. The new task t_c is assigned to the communication device between the devices that perform t_{i:k_i} and t_{j:k_j}.

[Figure 4. Simulating an example parallelization strategy. (a) An example parallelization strategy: operators o1 and o2 (embedding layer) use configurations c1 and c2 with # batch = 2, # channel = 1, tasks on GPU1; o3 and o4 (recurrent layer) use c3 and c4 with # batch = 2, # channel = 1, tasks on GPU2; o5 and o6 (linear layer) use c5 and c6 with # batch = 1, # channel = 1, tasks on GPU3. (b) The corresponding task graph. (c) The task graph after the full simulation algorithm. (d) The task graph after the delta simulation algorithm. The tasks' exeTime and device are shown on the top of each column. In Figures 4c and 4d, the labels "r" and "s" indicate the readyTime and startTime of each task, respectively, and the dashed edges indicate the nextTask.]

Figure 4a shows an example parallelization strategy for a standard 3-layer RNN consisting of an embedding layer, a recurrent layer, and a linear layer. It represents commonly used model parallelism that assigns operators in each layer to a dedicated GPU. Figure 4b shows the corresponding task graph. Each square indicates a normal task, each hexagon indicates a communication task, and each directed edge represents a dependency between tasks.

Table 3. Properties for each task in the task graph.

  Property     Description
  Properties set in graph construction
  exeTime      The elapsed time to execute the task.
  device       The assigned device of the task.
  I(t)         {t_in | (t_in, t) ∈ T_E}
  O(t)         {t_out | (t, t_out) ∈ T_E}
  Properties set in simulation
  readyTime    The time when the task is ready to run.
  startTime    The time when the task starts to run.
  endTime      The time when the task is completed.
  preTask      The previous task performed on device.
  nextTask     The next task performed on device.
  Internal properties used by the full simulation algorithm
  state        Current state of the task, which is one of NOTREADY, READY, and COMPLETE.

Table 3 lists the properties for each task in the task graph. For a normal task derived from an operator, its exeTime is the time to execute the task on the given device and is estimated by running the task multiple times on the device and measuring the average execution time (assumption A1). A task's exeTime is cached, and all future tasks with the same operator type and input/output tensor shapes will use the cached value without rerunning the task. For a communication task, its exeTime is the time to transfer a tensor (of size s) between devices with bandwidth b and is estimated as s/b (assumption A2).

In addition to exeTime, FlexFlow also sets device, I(t), and O(t) (defined in Table 3) during graph construction. Other properties in Table 3 remain unset and must be filled in by the simulation.

5.2 Full Simulation Algorithm

We now describe a full simulation algorithm that we use as a baseline for comparisons with our delta simulation algorithm. The full simulation algorithm first builds a task graph using the method described in Section 5.1 and then sets the properties for each task using a variant of Dijkstra's shortest-path algorithm (Cormen et al., 2009). Tasks are enqueued into a global priority queue when ready (i.e., all predecessor tasks are completed) and are dequeued in increasing order by their readyTime. Therefore, when a task t is dequeued, all tasks with an earlier readyTime have been scheduled, and we can set the properties for task t while maintaining the FIFO scheduling order (assumption A3). Figure 4c shows the execution timeline of the example parallelization strategy.

5.3 Delta Simulation Algorithm

FlexFlow uses an MCMC search algorithm that proposes a new parallelization strategy by changing the parallelization configuration of a single operator in the previous strategy (see Section 6.2). As a result, in the common case, most of the execution timeline does not change from one simulated strategy to the next. Based on this observation, we introduce a delta simulation algorithm that starts from a previous task graph and only re-simulates tasks involved in the portion of the execution timeline that changes, an optimization that dramatically speeds up the simulator, especially for strategies for large distributed machines.

To simulate a new strategy, the delta simulation algorithm first updates tasks and dependencies from an existing task graph and enqueues all modified tasks into a global priority queue. Similar to the Bellman-Ford shortest-path algorithm (Cormen et al., 2009), the delta simulation algorithm iteratively dequeues updated tasks and propagates the updates to subsequent tasks.

the search algorithm iteratively proposes new candidates until one of the following two criteria is satisfied: (1) the search time budget for the current initial strategy is exhausted;
or (2) the search procedure cannot further improve the best\r\n               For the example in Figure 4, consider a new parallelization        discovered strategy for half of the search time.\r\n               strategy derived from the original strategy (Figure 4a) by         7    FlexFlow Runtime\r\n               only reducing the parallelism of operator o to 1 (i.e., |c | =\r\n                                                           3             3\r\n               1). Figure 4d shows the task graph for the new paralleliza-        We found that existing deep learning systems (e.g., Ten-\r\n               tion strategy, which can be generated from the original task       sorFlow, PyTorch, Caffe2, and MXNet) only support paral-\r\n               graph (in Figure 4c) by updating the simulation properties         lelizing an operator in the sample dimension through data\r\n               of tasks in the grey area.                                         parallelism, and it is non-trivial to parallelize an operator in\r\n               6    Execution Optimizer                                           other dimensions or combinations of several SOAP dimen-\r\n                                                                                  sions in these systems.\r\n               Theexecution optimizer takes an operator graph and a de-           To support parallelizing DNN models using any strategy\r\n               vice topology as inputs and automatically \ufb01nds an ef\ufb01cient         in the SOAP search space, we implemented the FlexFlow\r\n               parallelization strategy. 
Using the simulator as an oracle,        distributed runtime in Legion (Bauer et al., 2012), a high-\r\n               FlexFlow transforms the parallelization optimization prob-         performance parallel runtime for distributed heterogeneous\r\n               leminto a cost minimization problem, namely minimizing             architectures, and use cuDNN (Chetlur et al., 2014) and\r\n               the predicted execution time.                                      cuBLAS(cuBLAS)astheunderlyinglibraries for process-\r\n               Finding the optimal parallelization strategy is NP-hard, by        ing DNNoperators. We use the Legion high-dimensional\r\n               an easy reduction from minimum makespan (Lam & Sethi,              partitioning interface (Treichler et al., 2016) to support paral-\r\n               1977). In addition, the number of possible strategies is           lelizing an operator in any combination of the parallelizable\r\n               exponential in the number of operators of an operator graph        dimensions.\r\n               (see Section 4), which makes it intractable to exhaustively\r\n               enumerate the search space. To \ufb01nd a low-cost strategy,            8    Evaluation\r\n               FlexFlow uses a cost minimization search to heuristically\r\n               explore the space and returns the best strategy discovered.        
This section evaluates the performance of FlexFlow on six\r\n                                                                                  real-world DNN benchmarks and two GPU clusters.\r\n               6.1   MCMCSampling                                                 Table 4 summarizes the DNNs used in our experiments.\r\n               This section brie\ufb02y introduces the Metropolis-Hastings al-         AlexNet, Inception-v3, and ResNet-101 are three CNNs\r\n               gorithm (Hastings, 1970) we use for MCMC sampling in               that achieved the best accuracy in the ILSVRC competi-\r\n               the execution optimizer. The algorithm maintains a current         tions (Russakovsky et al., 2015). For AlexNet, the per-\r\n               strategy S and randomly proposes a new strategy S\u2217. S\u2217             iteration training time is smaller than the time to load train-\r\n               is accepted and becomes the new current strategy with the          ingdatafromdisk. WefollowthesuggestionsinTensorFlow\r\n               following probability:                                             Benchmarks \u2217 and use synthetic data to benchmark the per-\r\n                                   \u0010                                \u0001\u0011           formanceofAlexNet. Forallotherexperiments,thetraining\r\n                 \u03b1(S\u2217|S) = min 1,exp \u03b2 \u00b7(cost(S)\u2212cost(S\u2217)                 (1)     data is loaded from disk in the training procedure.\r\n                                                                                  RNNTC, RNNLM and NMT are sequence-to-sequence\r\n               MCMCtendstobehaveasagreedysearchalgorithm,pre-                     RNNmodelsfortextclassi\ufb01cation, language modeling, and\r\n               ferring to move towards lower cost whenever that is readily        neural machine translation, respectively. RNNTC uses four\r\n               available, but can also escape local minima.   
                    LSTMlayerswithahiddensizeof1024. RNNLMusestwo\r\n               6.2   Search Algorithm                                             LSTMlayerswithahiddensizeof2048. BothRNNmodels\r\n                                                                                  include a softmax linear after the last LSTM layer. NMT\r\n               Ourmethodforgenerating proposals is simple: an operator            includes an encoder and a decoder, both of which consist\r\n               in the current parallelization strategy is selected at random,     of 2 LSTMlayers with a hidden size of 1024. To improve\r\n               and its parallelization con\ufb01guration is replaced by a random       model accuracy, we also use an attention layer on top of the\r\n               con\ufb01guration. We use the predicted execution time from the         last decoder LSTM layer (Bahdanau et al., 2014). Figure 13\r\n               simulator as the cost function in Equation 1 and use existing      illustrates the structure of the NMT model. For all three\r\n               strategies (e.g., data parallelism, expert-designed strategies)    RNNmodels,wesetthenumberofunrollingstepsfor each\r\n               as well as randomly generated strategies as the initial can-       recurrent layer to 40.\r\n               didates for the search algorithm. For each initial strategy,          \u2217https://www. tensor\ufb02ow.org/performance/benchmarks\r\n                                                       BeyondDataandModelParallelismforDeepNeuralNetworks\r\n                                                          Table 4. Details of the DNNs and datasets used in evaluation.\r\n                    DNN            Description                                                         Dataset                           Reported Acc.      
OurAcc.\r\n                                                                        Convolutional Neural Networks (CNNs)\r\n                    AlexNet        A12-layer CNN                                                       Synthetic data                    -                  -\r\n                    Inception-v3   A102-layer CNNwithInception modules (Szegedy et al., 2014)          ImageNet                          78.0%a             78.0%a\r\n                    ResNet-101     A101-layer residual CNN with shortcut connections                   ImageNet                          76.4%a             76.5%a\r\n                                                                           Recurrent Neural Networks (RNNs)\r\n                    RNNTC          4recurrent layers followed by a softmax layer                       MovieReviews(Movies)              79.8%              80.3%\r\n                                                                                                                                              b                 b\r\n                    RNNLM          2recurrent layers followed by a softmax layer                       PennTreebank (Marcus et al.)      
78.4               76.1\r\n                                                                                                                                               c                 c\r\n                    NMT            4recurrent layers followed by an attention and a softmax layer      WMTEnglish-German(WMT)            19.67              19.85\r\n                    a top-1 accuracy for single crop on the validation dataset (higher is better).\r\n                    b word-level test perplexities on the Peen Treebank dataset (lower is better).\r\n                    c BLEUscores(Papineni et al., 2002) on the test dataset (higher is better).\r\n                                                                                                     Data Parallelism (P100)              Data Parallelism (K80)\r\n                                Network                             Network                          Expert-designed Strategy (P100)      Expert-designed Strategy (K80)\r\n                     100 Gb/s                            56 Gb/s                                     FlexFlow (P100)                      FlexFlow (K80)\r\n                         CPUs              CPUs              CPUs              CPUs                 2500AlexNet (batch size = 256)   200Inception_v3 (batch size = 64)\r\n                      P100   P100      P100   P100        K80     K80       K80    K80              2000                             150\r\n                                                                                                    1500\r\n                      P100   P100      P100   P100        K80     K80       K80    K80                                               100\r\n                                                                                                    1000\r\n                  (a) The P100 Cluster (4 nodes). (b) The K80 Cluster (16 nodes).                    
500                              50\r\n                                                                                                       1(1)02(1)4(1)8(2)16(4)32(8)64(16)1(1)02(1)4(1)8(2)16(4)32(8)64(16)\r\n                 Figure 5. Architectures of the GPU clusters used in the experi-                       ResNet-101 (batch size = 64)       RNNTC (batch size = 64)\r\n                  ments. An arrow line indicates a NVLink connection. A solid line                   200                             600\r\n                                                                                                                                     500\r\n                  is a PCI-e connection. Dashed lines are In\ufb01niband connections                      150\r\n                                                                                                                                     400\r\n                  across different nodes.                                                            100                             300\r\n                                                                                                                                     200\r\n                                                                                                      50\r\n                                                                                                                                     100\r\n                                                                                                  Num. 
Samples/second/GPU1(1)02(1)4(1)8(2)16(4)32(8)64(16)1(1)02(1)4(1)8(2)16(4)32(8)64(16)\r\n                                                                                                     400 RNNLM (batch size = 64)     400   NMT (batch size = 64)\r\n                 We follow prior work (Krizhevsky et al., 2012; Szegedy                              350                             350\r\n                  et al., 2016; He et al., 2016; Kim, 2014; Zaremba et al.,                          300                             300\r\n                                                                                                     250                             250\r\n                  2014; Wu et al., 2016) to construct operator graphs and set                        200                             200\r\n                  hyperparameters (e.g., learning rates, weight decays). We                          150                             150\r\n                                                                                                     100                             100\r\n                  use synchronous training and a per-GPU batch size of 64                             50                              50\r\n                  for all DNN benchmarks, except for AlexNet, which has a                              1(1)02(1)4(1)8(2)16(4)32(8)64(16)1(1)02(1)4(1)8(2)16(4)32(8)64(16)\r\n                  muchsmaller model and uses a per-GPU batch size of 256.                                                   Num. Devices\r\n                 To evaluate the performance of FlexFlow with different                        Figure 6. Per-iteration training performance on six DNNs. Num-\r\n                  device topologies, we performed the experiments on two                       bers in parenthesis are the number of compute nodes used in the\r\n                  GPUclusters, as shown in Figure 5. The \ufb01rst cluster con-                     experiments. 
The dash lines show the ideal training throughput.\r\n                  tains 4 compute nodes, each of which is equipped with two\r\n                  Intel 10-core E5-2600 CPUs, 256GB main memory, and                           8.1    Parallelization Performance\r\n                  four NVIDIA Tesla P100 GPUs. GPUs on the same node\r\n                  are connected by NVLink, and nodes are connected over                        8.1.1    Per-iteration Performance\r\n                 100GB/s EDRIn\ufb01niband. The second cluster consists of 16\r\n                  nodes, each of which is equipped with two Intel 10-core E5-                  We compare the per-iteration training performance of\r\n                  2680 GPUs, 256GB main memory, and four NVIDIA Tesla                          FlexFlow with the following baselines. Data parallelism\r\n                  K80 GPUs. Adjacent GPUs are connected by a separate                          is commonly used in existing deep learning systems. To\r\n                  PCI-e switch, and all GPUs are connected to CPUs through                     control for implementation differences, we ran data paral-\r\n                  a shared PCI-e switch. Compute nodes in the cluster are                      lelism experiments in TensorFlow r1.7, PyTorch v0.3, and\r\n                  connected over 56 GB/s EDR In\ufb01niband.                                        ourimplementationandcomparedtheperformancenumbers.\r\n                  Unless otherwise stated, we set 30 minutes as the time                       ComparedtoTensorFlowandPyTorch,FlexFlowachieves\r\n                  budget for the execution optimizer and use data parallelism                  the same or better performance numbers on all six DNN\r\n                  and a randomly generated strategy as the initial candidates                  benchmarks, and therefore we report the data parallelism\r\n                  for the search. 
As shown in Table 5, the search terminates                   performance achieved by FlexFlow in the experiments.\r\n                  in a few minutes in most cases.                                              Expert-designed strategies optimize parallelization based\r\n                                                                                                BeyondDataandModelParallelismforDeepNeuralNetworks\r\n                                                                                                                                                                                                      10\r\n                                       3.0            2.6                         70 65.8                                   40                                                                                                                 TensorFlow\r\n                                                                                  60                                           35.7\r\n                                       2.5                                                                                  35\r\n                                           1.9                                    50                                        30            28.2      28.7                                                8                                      FlexFlow\r\n                                       2.0\r\n                                                                                 40                                         25\r\n                                       1.5                                                                                  20\r\n                                                                1.1               30           24.2\r\n                                       1.0                                                                                  15\r\n                                                                 
                 20                                                                                                                    6\r\n                                                                                                          12.1              10\r\n                                       0.5                                        10\r\n                                      Time (seconds)                                                                         5\r\n                                       0.0Data     Expert   FlexFlow          Total Data TransfersPer Iteration (GB)0DataExpertFlexFlow0DataExpertFlexFlow\r\n                                   Per-iteration ExecutionParallelDesigned         Parallel Designed                    Total Task ComputationTime Per Iteration (s)ParallelDesigned                    4\r\n                              (a) Per-iteration                         (b) Overall data trans-(c) Overall task run                                                                                     2\r\n                              execution time.                            fers per iteration.                       time per iteration.                                                               Average Training Loss\r\n                                                                                                                                                                                                        00             5            10            15            20\r\n                              Figure 7. Parallelization performance for NMT on 64 K80 GPUs                                                                                                                         Training Time (hours)\r\n                              (16 nodes). FlexFlow reduces per-iteration execution time by 1.7-                                                                      Figure 8. 
Training curves of Inception-v3 in different systems. The\r\n                              2.4\u00d7anddatatransfers by 2-5.5\u00d7 compared to other approaches.                                                                           modelis trained on 16 P100 GPUs (4 nodes).\r\n                              FlexFlow achieves similar overall task computation time as expert-                                                                               400                                377                         8000\r\n                              designed strategy, which is 20% fewer than data parallelism.                                                                                     350         ColocRL                                            7000                                    OptCNN\r\n                                                                                                                                                                                           FlexFlow                                                                5846               FlexFlow\r\n                                                                                                                                                                               300                                                            6000\r\n                              on domain experts\u2019 knowledge and experience.                                                                          
For                        250                                                            5000\r\n                                                                                                                                                                               200                                                            4000              3791          3749       3923\r\n                                                                                                                                                                                               152\r\n                              CNNs, (Krizhevsky, 2014) uses data parallelism for par-                                                                                          150                                                            3000                         2408       2659\r\n                                                                                                                                                                                                              107                                    19302208\r\n                              allelizing convolutional and pooling layers and switches                                                                                         100                                                            2000\r\n                              to model parallelism for densely-connected layers. 
For                                                                                       Training Throughput(samples per second)5045                   Training Throughput(samples per second)1000\r\n                                                                                                                                                                                 0    Inception_v3             NMT                              0   Inception   RNNTC      RNNLM        NMT\r\n                              RNNs,(Wuetal.,2016)usesdataparallelismthatreplicates\r\n                              the entire operator graph on each compute node and uses                                                                                                   (a) ColocRL                                                      (b) OptCNN\r\n                              model parallelism that assign operators with the same depth                                                                            Figure 9. Comparison among the parallelization strategies found\r\n                              to the same GPU on each node. These expert-designed                                                                                    bydifferent automated frameworks.\r\n                              strategies are used as a baseline in our experiments. Model\r\n                              parallelism only exposes limited parallelism by itself, and                                                                            ferent task computation time. For the matrix multiplication\r\n                              we compare against model parallelism as a part of these                                                                                operator in the NMT model, parallelizing it in the chan-\r\n                              expert-designed strategies.                                                                                                            
nel dimension reduces the operator\u2019s overall computation\r\n                              Figure 6 shows the per-iteration training performance on                                                                               time by 38% compared to parallelizing the operator in the\r\n                              all six DNN benchmarks. For ResNet-101, FlexFlow \ufb01nds                                                                                  batch dimension. Figure 7c shows that FlexFlow reduces\r\n                              strategies similar to data parallelism (except using model                                                                             the overall task computation time by 20% compared to data\r\n                              parallelism on a single node for the last fully-connected                                                                              parallelism for the NMT model. The expert-designed strat-\r\n                              layer) and therefore achieves similar parallelization perfor-                                                                          egy achieves slightly better total task computation time than\r\n                              mance. For other DNN benchmarks, FlexFlow \ufb01nds more                                                                                    FlexFlow. However, this is achieved by using model paral-\r\n                              ef\ufb01cient strategies than the baselines and achieves 1.3-3.3\u00d7                                                                           lelism on each node, which disables any parallelism within\r\n                              speedup. Note that FlexFlow performs the same operators                                                                                each operator and results in imbalanced workloads. 
as data parallelism and expert-designed strategies, and the performance improvement is achieved by using faster parallelization strategies. We found that the parallelization strategies discovered by FlexFlow have two advantages over data parallelism and expert-designed strategies.

Reducing overall communication costs. Similar to existing deep learning systems, the FlexFlow distributed runtime supports overlapping data transfers with computation to hide communication overheads. However, as we scale the number of devices, the communication overheads increase, but the computation time used to hide communication remains constant. Therefore, reducing overall communication costs is beneficial for large-scale distributed training. Figure 7b shows that, to parallelize the NMT model on 64 K80 GPUs (16 nodes), FlexFlow reduces the per-iteration data transfers by 2-5.5× compared to other parallelization approaches.

Reducing overall task computation time. Data parallelism always parallelizes an operator in the batch dimension. However, as reported in (Jia et al., 2018), parallelizing an operator through different dimensions can result in dif-

As a result, the expert-designed strategy achieves even worse execution performance than data parallelism (see Figure 7a). FlexFlow reduces the task computation time while enabling parallelism within an operator and maintaining load balance.

8.1.2 End-to-end Performance

FlexFlow performs the same computation as other deep learning systems for a DNN model and therefore achieves the same model accuracy. Table 4 verifies that FlexFlow achieves the state-of-the-art accuracies on the DNN benchmarks used in the experiments.

In this experiment, we compare the end-to-end training performance between FlexFlow and TensorFlow on Inception-v3. We train Inception-v3 on the ImageNet dataset until the model reaches the single-crop top-1 accuracy of 72% on the validation set. The training processes in both frameworks use stochastic gradient descent (SGD) with a learning rate of 0.045 and a weight decay of 0.0001. Figure 8 illustrates the training curves of the two systems and shows that FlexFlow reduces the training time by 38% compared to TensorFlow.

Table 5. The end-to-end search time with different simulation algorithms (seconds).
Num.  AlexNet               ResNet                Inception             RNNTC                 RNNLM                 NMT
GPUs  Full  Delta  Speedup  Full  Delta  Speedup  Full  Delta  Speedup  Full  Delta  Speedup  Full  Delta  Speedup  Full  Delta  Speedup
4     0.11  0.04   2.9×     1.4   0.4    3.2×     14    4.1    3.4×     16    7.5    2.2×     21    9.2    2.3×     40    16     2.5×
8     0.40  0.13   3.0×     4.5   1.4    3.2×     66    17     3.9×     91    39     2.3×     76    31     2.5×     178   65     2.7×
16    1.4   0.48   2.9×     22    7.3    3.1×     388   77     5.0×     404   170    2.4×     327   121    2.7×     998   328    3.0×
32    5.3   1.8    3.0×     107   33     3.2×     1746  298    5.9×     1358  516    2.6×     1102  342    3.2×     2698  701    3.8×
64    18    5.9    3.0×     515   158    3.3×     8817  1278   6.9×     4404  1489   3.0×     3406  969    3.6×     8982  2190   4.1×

8.1.3 Automated Frameworks

We compare against two automated frameworks that find parallelization strategies in a limited search space.

ColocRL uses reinforcement learning to learn device placement for model parallelism. We are not aware of any publicly available implementation of ColocRL, so we compare against the learned device placement for Inception-v3 and NMT, as reported in the paper, and performed the experiments on the same machine.

Figure 9a compares the training throughput of the strategies found by FlexFlow and ColocRL for four K80 GPUs on a single node. The parallelization strategies found by FlexFlow achieve 3.4-3.8× speedup compared to ColocRL. We attribute the performance improvement to the larger search space explored by FlexFlow.

Besides improving training performance, FlexFlow has two additional advantages over ColocRL. First, ColocRL requires executing each strategy in the hardware environment to get reward signals and takes 12-27 hours to find the best placement, while FlexFlow finds efficient parallelization strategies for these executions in 14-40 seconds. Second, ColocRL uses up to 160 compute nodes (with 4 GPUs on each node) to find the placement in time, while FlexFlow uses a single compute node to run the execution optimizer.

OptCNN (Jia et al., 2018) uses dynamic programming to parallelize linear DNNs. To evaluate OptCNN's performance on non-linear RNNs, we explicitly fuse all recurrent nodes sharing the same parameters to a single operator. We compare the performance of FlexFlow and OptCNN for different DNNs on 16 P100 GPUs. FlexFlow and OptCNN found the same parallelization strategies for AlexNet and ResNet with linear operator graphs and found different strategies for the other DNNs as shown in Figure 9b. For these DNNs with non-linear operator graphs, FlexFlow achieves 1.2-1.6× speedup compared to OptCNN by using parallelization strategies that exploit parallelism across different operators. We show two examples in Section 8.4.

8.2 Execution Simulator

We evaluate the performance of the simulator using two metrics: simulator accuracy and simulator execution time.

Simulator accuracy. We first compare the estimated execution time predicted by the execution simulator with the real execution time measured by actual executions. Figure 10 shows the results for different DNNs and different available devices. The dashed lines indicate a relative difference of 0% and 30%, respectively, which encompasses the variance between actual and predicted execution time. In addition, for different parallelization strategies with the same operator graph and device topology (i.e., points of the same shape in the figure), their simulated execution time preserves actual execution time ordering, which shows that simulated execution time is an appropriate metric to evaluate the performance of different strategies.

Figure 10. Comparison between the simulated and actual execution time for different DNNs and device topologies. Panels: (a) Inception-v3 and (b) NMT. Both panels plot real execution time (s) against simulated execution time (s) for 4 x P100 (1 node), 16 x P100 (4 nodes), 4 x K80 (1 node), and 16 x K80 (4 nodes).

Simulator execution time. Figure 11 shows the search performance with different simulation algorithms for finding a strategy for the NMT model on 16 P100 GPUs on 4 nodes. The full and delta simulation algorithms terminate in 16 and 6 minutes, respectively. If the allowed time budget is less than 8 minutes, the full simulation algorithm will find a worse strategy than the delta simulation algorithm.

Figure 11. Search performance with the full and delta simulation algorithms for the NMT model on 16 P100 GPUs (4 nodes). The plot shows the expected run time of the best found strategy (ms) against elapsed time (minutes) for Full Simulation and Delta Simulation.

We compare the end-to-end search time of the execution optimizer with different simulation algorithms. For a given DNN model and device topology, we measure the average execution time of the optimizer using 10 random initial strategies. The results are shown in Table 5. The delta simulation algorithm is 2.2-6.9× faster than the full simulation algorithm. Moreover, the speedup over the full simulation algorithm increases as we scale the number of devices.

8.3 Search Algorithm

We compare the best discovered strategies with the global optimal strategies for small executions. To obtain a search space of reasonable size, we limit the number of devices to 4 and consider the following two DNNs. LeNet (LeCun, 2015) is a 6-layer CNN. The second DNN is a variant of RNNLM where the number of unrolling steps for each recurrent layer is restricted to 2. We use depth-first search to explore the space and use A* (Cormen et al., 2009) to prune the search. Finding the optimal strategies for LeNet and RNNLM took 0.8 and 18 hours, respectively. For both DNNs, FlexFlow finds the same global optimal strategy in less than 1 second.

8.4 Case Studies

Inception-v3. Figure 12 shows the best discovered strategy for parallelizing Inception-v3 on four P100 GPUs, which exploits intra-operator parallelism for operators on the critical path and uses a combination of intra- and inter-operator parallelism for operators on different branches. This results in a well-balanced workload and reduces data transfers for parameter synchronization. Compared to data parallelism, this strategy reduces the parameter synchronization costs by 75% and the per-iteration execution time by 12%.

Figure 12. The best discovered strategy for parallelizing Inception-v3 on 4 P100 GPUs. For each operator, the vertical and horizontal dimensions indicate parallelism in the sample and parameter dimension, respectively. Each GPU is denoted by a color.

For parallelizing the same Inception-v3 model on four K80 GPUs with asymmetric connections between GPUs (see Figure 5b), we observe that the best discovered strategy tends to parallelize operators on adjacent GPUs with a direct connection to reduce the communication costs.

NMT. Figure 13 shows the best discovered strategy for parallelizing NMT on four P100 GPUs. First, for a layer with a large number of network parameters and little computation (e.g., embed layers), it performs the computation on a single GPU to eliminate parameter synchronization. Second, for a layer with a large number of parameters and heavy computation (e.g., softmax layers), FlexFlow uses parallelism in the parameter dimension and assigns the computation for a subset of parameters to each task. This reduces parameter synchronization costs while maintaining load balance. Third, for multiple recurrent layers (e.g., LSTM and attention layers), FlexFlow uses concurrency among different layers as well as parallelism within each operator to reduce parameter synchronization costs while balancing load.

Figure 13. The best discovered strategy for parallelizing NMT on 4 P100 GPUs. For each operator, the vertical and horizontal dimensions indicate parallelism in the sample and parameter dimension, respectively. Each grey box denotes a layer, whose operators share the same network parameters. Each GPU is denoted by a color. Layers shown: Encoder Embed, Encoder LSTM1, Encoder LSTM2, Decoder Embed, Decoder LSTM1, Decoder LSTM2, Attention, and Softmax.

9 Conclusion

This paper presents FlexFlow, a deep learning system that automatically finds efficient parallelization strategies in the SOAP search space for DNN training. FlexFlow uses a guided randomized search procedure to explore the space and includes an execution simulator that is an efficient and accurate predictor of DNN performance. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show FlexFlow significantly outperforms state-of-the-art parallelization approaches.

Acknowledgements

We thank the anonymous reviewers for their feedback on this work. This work was supported by NSF grant CCF-1409813, the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and is based on research sponsored by DARPA under agreement number FA84750-14-2-0006. This research was also supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP and VMware) as well as DARPA grant FA8750-17-2-0095 (D3M) and NSF grant CNS-1651570. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied of DARPA or the U.S. Government.

References

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57(1):97–109, 1970.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

cuBLAS. Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas, 2016.

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In NIPS, 2012.

Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C. Tetris: Scalable and efficient neural network acceleration with 3d memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, 2017.

Gao, M., Yang, X., Pu, J., Horowitz, M., and Kozyrakis, C. Tangram: Optimized coarse-grained dataflow for scalable nn accelerators. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, 2019.

Gog, I., Schwarzkopf, M., Gleave, A., Watson, R. N. M., and Hand, S. Firmament: Fast, centralized cluster scheduling at scale. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 99–115, Savannah, GA, 2016. USENIX Association.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.

Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., and Goldberg, A. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pp. 261–276. ACM, 2009.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in accelerating convolutional neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research. PMLR, 2018.

Kim, Y. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014. URL http://arxiv.org/abs/1408.5882.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014. URL http://arxiv.org/abs/1404.5997.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.

Lam, S. and Sethi, R. Worst case analysis of two scheduling algorithms. SIAM Journal on Computing, 6, 1977.

LeCun, Y. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 2015.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of english: The penn treebank. Comput. Linguist., 19.

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. 2017.

Movies. Movie review data. https://www.cs.cornell.edu/people/pabo/movie-review-data/, 2005.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 2002.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/1409.4842.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Treichler, S., Bauer, M., Sharma, R., Slaughter, E., and Aiken, A. Dependent partitioning. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '16. ACM, 2016.

WMT. Conference on machine translation. http://www.statmt.org/wmt16, 2016.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.

Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. CoRR, abs/1409.2329, 2014. URL http://arxiv.org/abs/1409.2329.

A Full Simulation Algorithm

Algorithm 1 shows the pseudocode of the full simulation algorithm. It first builds a task graph using the method described in Section 5.1 and then sets the properties for each task using a variant of Dijkstra's shortest-path algorithm (Section 24.3 of Cormen et al. (2009)). Tasks are enqueued into a global priority queue when ready (i.e., all predecessor tasks are completed) and are dequeued in increasing order by their readyTime. Therefore, when a task t is dequeued, all tasks with an earlier readyTime have been scheduled, and we can set the properties for task t while maintaining the FIFO scheduling order (assumption A3).

Algorithm 1 Full Simulation Algorithm.
 1: Input: An operator graph G, a device topology D, and a parallelization strategy S.
 2: T = BUILDTASKGRAPH(G, D, S)
 3: readyQueue = {} // a priority queue sorted by readyTime
 4: for t ∈ T_N do
 5:     t.state = NOTREADY
 6:     if I(t) = {} then
 7:         t.state = READY
 8:         readyQueue.enqueue(t)
 9: while readyQueue ≠ {} do
10:     Task t = readyQueue.dequeue()
11:     Device d = t.device
12:     t.state = COMPLETE
13:     t.startTime = max{t.readyTime, d.last.endTime}
14:     t.endTime = t.startTime + t.exeTime
15:     d.last = t
16:     for n ∈ O(t) do
17:         n.readyTime = max{n.readyTime, t.endTime}
18:         if all tasks in I(n) are COMPLETE then
19:             n.state = READY
20:             readyQueue.enqueue(n)
21: return max{t.endTime | t ∈ T_N}

B Delta Simulation Algorithm

Algorithm 2 shows the pseudocode of the delta simulation algorithm. It first updates tasks and dependencies from an existing task graph and enqueues all modified tasks into a global priority queue (lines 4-5). Similar to the Bellman-Ford shortest-path algorithm (Section 24.1 of Cormen et al. (2009)), the delta simulation algorithm iteratively dequeues updated tasks and propagates the updates to subsequent tasks (lines 6-14). The full and delta simulation algorithms always produce the same timeline for a given task graph.

Algorithm 2 Delta Simulation Algorithm.
 1: Input: An operator graph G, a device topology D, an original task graph T, and a new configuration c'_i for operator o_i.
 2: updateQueue = {} // a priority queue sorted by readyTime
 3: /* UPDATETASKGRAPH returns the updated task graph and a list of tasks with new readyTime */
 4: T, L = UPDATETASKGRAPH(T, G, D, c_i, c'_i)
 5: updateQueue.enqueue(L)
 6: while updateQueue ≠ {} do
 7:     Task t = updateQueue.dequeue()
 8:     t.startTime = max{t.readyTime, t.preTask.endTime}
 9:     t.endTime = t.startTime + t.exeTime
10:     for n ∈ O(t) do
11:         if UPDATETASK(n) then
12:             updateQueue.push(n)
13:     if UPDATETASK(t.nextTask) then
14:         updateQueue.push(t.nextTask)
15: return max{t.endTime | t ∈ T_N}
16:
17: function UPDATETASK(t)
18:     t.readyTime = max{p.endTime | p ∈ I(t)}
19:     /* Swap t with other tasks on the device to maintain FIFO. */
20:     t.startTime = max{t.readyTime, t.preTask.endTime}
21:     if t's readyTime or startTime is changed then
22:         return True
23:     else
24:         return False

C Artifact Appendix

C.1 Abstract

This artifact appendix helps readers to reproduce the main experimental results in this paper. In the artifact evaluation, we compare the average training throughput of different parallelization strategies in FlexFlow.

C.2 Artifact check-list (meta-information)

• Compilation: GCC 4.8 or above, CUDA 8.0 or above, cuDNN 6.0 or above
• Run-time environment: Linux Ubuntu 16.04 or above
• Hardware: A compute node with multiple GPUs, such as Amazon EC2 p2.8xlarge or p3.8xlarge instances. Note that a single GPU is able to verify the functionality of FlexFlow but cannot show FlexFlow's performance improvement over the baselines (e.g., data parallelism).
• Metrics: The primary metric of comparison is the average training throughput.
• How much disk space required (approximately)?: About 2 GB of disk storage should be sufficient for all experiments.
• How much time is needed to prepare workflow (approximately)?: About one hour to install all dependencies and compile the FlexFlow runtime.
• How much time is needed to complete experiments (approximately)?: About 20 minutes for all experiments.
• Publicly available?: Yes
• Code licenses (if publicly available)?: Apache License, Version 2.0.
• Archived (provide DOI)?: https://doi.org/10.5281/zenodo.2564262

• NVIDIA cuDNN and cuBLAS libraries are used to perform DNN operations.
• Legion (Bauer et al., 2012) is the underlying runtime FlexFlow is built on.
• (Optional) GASNet† is used for distributed executions.

The following software versions were used in our experiments: cuDNN 7.3, CUDA 9.0, Legion 18.02.0, and GASNet 1.28.0.

C.4 Installation

The README.md file includes detailed instructions on how to install the FlexFlow runtime. The Legion and GASNet submodules can be initialized by the following command lines:

    git submodule init
    git submodule update

The ffcompile.sh script compiles a DNN model in FlexFlow:

    ./ffcompile.sh dnn.cc

where dnn.cc defines the operators in the DNN model.

C.5 Experiment workflow

The run_experiments.sh script automatically builds and evaluates two example DNN models (i.e., AlexNet (Krizhevsky et al., 2012) and ResNet (He et al., 2016)) in FlexFlow. All experiments were run with synthetic data in GPU memory to remove the side effects of data transfers between CPU and GPU.

For each DNN model, we compare the training throughputs of data parallelism and FlexFlow's optimized parallelization strategies on 1, 2, and 4 GPUs on a compute node.

C.6 Evaluation and expected result

The run_experiments.sh script prints the training throughputs of different parallelization strategies. By running the script on a multi-GPU node, you should observe that FlexFlow's optimized strategies consistently outperform the data parallelism baseline.

C.7 Experiment customization

FlexFlow can be used to optimize parallelization for arbitrary DNN models.
We refer users to the README.md \ufb01le in this artifact\r\n                C.3   Description                                                    evaluation for detailed instructions on how to use FlexFlow for\r\n                                                                                     other DNN models.\r\n                C.3.1    Hardwaredependencies\r\n                TheexperimentsinthepaperwereperformedontwoGPUclusters,\r\n                as described in Figure 5. To reproduce the experiments, we suggest\r\n                to run this artifact evaluation on a compute node with multiple\r\n                GPUs,suchasAmazonEC2p2.x8largeorp3.x8large instances.\r\n                This will be suf\ufb01cient to demonstrate FlexFlow\u2019s performance\r\n                improvement over the widely used data parallelism baseline.\r\n                C.3.2    Software dependencies\r\n                FlexFlow depends on the following software libraries:                    \u2020http://gasnet.lbl.gov/", "award": [], "sourceid": 16, "authors": [{"given_name": "Zhihao", "family_name": "Jia", "institution": "Stanford University"}, {"given_name": "Matei", "family_name": "Zaharia", "institution": "Stanford and Databricks"}, {"given_name": "Alex", "family_name": "Aiken", "institution": "Stanford University"}]}