{"title": "Optimizing DNN Computation with Relaxed Graph Substitutions", "book": "Proceedings of Machine Learning and Systems", "page_first": 27, "page_last": 39, "abstract": "Existing deep learning frameworks optimize the computation graph of a DNN model by performing greedy rule-based graph transformations, which generally only consider transformations that strictly improve runtime performance. We propose relaxed graph substitutions that enable the exploration of complex graph optimizations by relaxing the strict performance improvement constraint, which greatly increases the space of semantically equivalent computation graphs that can be discovered by repeated application of a suitable set of graph transformations. We introduce a backtracking search algorithm over a set of relaxed graph substitutions to find optimized networks and use a flow-based graph split algorithm to recursively split a computation graph into smaller subgraphs to allow efficient search. We implement relaxed graph substitutions in a system called MetaFlow and show that MetaFlow improves the inference and training performance by 1.1-1.6\u00d7 and 1.1-1.2\u00d7 respectively over existing deep learning frameworks.", "full_text": "
OPTIMIZING DNN COMPUTATION WITH RELAXED GRAPH SUBSTITUTIONS

Zhihao Jia 1  James Thomas 1  Todd Warszawski 1  Mingyu Gao 1 2  Matei Zaharia 1  Alex Aiken 1

ABSTRACT
Existing deep learning frameworks optimize the computation graph of a DNN model by performing greedy rule-based graph transformations, which generally only consider transformations that strictly improve runtime performance.
We propose relaxed graph substitutions that enable the exploration of complex graph optimizations by relaxing the strict performance improvement constraint, which greatly increases the space of semantically equivalent computation graphs that can be discovered by repeated application of a suitable set of graph transformations. We introduce a backtracking search algorithm over a set of relaxed graph substitutions to find optimized networks and use a flow-based graph split algorithm to recursively split a computation graph into smaller subgraphs to allow efficient search. We implement relaxed graph substitutions in a system called MetaFlow and show that MetaFlow improves the inference and training performance by 1.1-1.6× and 1.1-1.2× respectively over existing deep learning frameworks.

1 INTRODUCTION

Deep neural networks (DNNs) have driven advances in many practical problems, such as image classification (Krizhevsky et al., 2012; He et al., 2016), machine translation (Wu et al., 2016; Bahdanau et al., 2014), and game playing (Silver et al., 2016). Over time, state-of-the-art DNNs become larger and deeper, resulting in increased computational requirements.

To mitigate the increasing computational requirements, it is standard to optimize computation in a DNN, which is defined by a computation graph of mathematical operators (e.g., matrix multiplication, convolution, etc.). Existing deep learning systems such as TensorFlow, PyTorch, and TVM optimize an input computation graph by performing greedy rule-based substitutions on the graph (Abadi et al., 2016; PyTorch; Chen et al., 2018). Each substitution replaces a subgraph matching a specific pattern with a new subgraph that computes the same result. For example, operator fusion combines several operators into one, which can eliminate intermediate results and increase the granularity of the operators, thereby reducing system overheads such as memory accesses and kernel launches.

Existing deep learning optimizers consider performance-improving substitutions, which they greedily and repeatedly apply to a computation graph until no further substitutions can be made. More involved sequences of transformations where not all intermediate states are strict improvements are not considered. As a result, current optimizers miss many more complex optimization opportunities: we show that exploring a larger space of substitutions can improve the performance of widely used DNNs by up to 1.6× over existing rule-based optimizers.

In this paper, we propose relaxed graph substitutions. We increase the space of optimizations considered by relaxing the strict performance constraint, allowing any substitutions that preserve semantics whether or not they improve performance. These "downgrading" graph substitutions are useful as intermediate steps in transforming graph architectures and eventually discovering new graphs with significantly better runtime performance. To efficiently explore this larger space of computation graphs, we use backtracking search over a set of relaxed graph substitutions to find improved networks after multiple substitution steps.

As a motivating example, we show how we can optimize the widely used ResNet architecture (He et al., 2016) using our approach, as shown in Figure 1. The left-most graph shows an optimized graph after greedy operator fusions, which combine a convolution and a following activation (i.e., relu) into a "convolution with activation". However, by adaptively applying relaxed graph substitutions (shown as the arrows in the figure), it is possible to generate a final graph (right-most) that is 1.3× faster than the original graph (left-most) on an NVIDIA V100 GPU. Note that the first graph substitution increases a convolution's kernel size from 1x1 to 3x3 by padding the kernel with extra 0's. This downgrades runtime performance (since a convolution with a larger kernel runs slower) but enables additional subsequent kernel fusions, resulting in an overall improvement. Section 3 describes the other graph substitutions in more detail.

1 Stanford University  2 Tsinghua University. Correspondence to: Zhihao Jia .
Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).

[Figure 1: sequence of five graph diagrams; substitution steps include enlarging a conv kernel, fusing conv operators, fusing a conv and an add, and fusing a conv and a relu.]
Figure 1. A sequence of relaxed graph substitutions on a ResNet module (He et al., 2016). Each arrow is a graph substitution, and the dotted subgraphs in the same color indicate the source and target graph of a substitution. "conv axbxc" indicates a convolution with kernel size a × b and c output channels. The final graph (right-most) is 1.3× faster than the original graph (left-most) on an NVIDIA V100 GPU.

Adding relaxed graph substitutions to existing DNN optimizers and applying them greedily could easily result in degraded performance. For example, the enlarge operator substitution in Figure 1 will likely degrade performance if the resulting convolution cannot be fused with another operator. While one could attempt to address this by adding special-case rules and heuristics to an existing system, we believe such an approach would be error-prone and brittle in the face of new architectures and new substitution rules. Instead we use cost-based backtracking search to effectively explore the large space of computation graphs generated by applying relaxed graph substitutions, without requiring optimizer developers to implement numerous new rules.

First, we introduce a cost model that incorporates multiple cost dimensions (e.g., FLOPs, execution time, memory usage, etc.) and can accurately estimate the performance of different computation graphs. The cost model allows us to quickly compare different graphs.

Second, we propose a backtracking search algorithm that quickly finds efficient solutions for small graphs. However, the computation graphs of state-of-the-art DNNs are too large to directly explore the search space of all equivalent computation graphs. Therefore, we use a graph split algorithm that recursively splits an original computation graph into individual subgraphs with smaller sizes. The graph is split in a way that minimizes the number of graph substitutions spanning different subgraphs and is computed by solving a max-flow problem (Cormen et al., 2009). These subgraphs are optimized by the backtracking search and then stitched back together to form the final optimized graph. Figure 3 depicts an overview of our graph optimization process.

We implement relaxed graph substitutions in a system called MetaFlow, which can be used to optimize DNN computation graphs for any existing deep learning framework. In particular, we show that TensorFlow, TensorFlow XLA, and TensorRT can directly use MetaFlow's optimized graphs to improve both inference and training performance.

We evaluate MetaFlow on five real-world DNNs, including Inception-v3 (Szegedy et al., 2016), SqueezeNet (Iandola et al., 2016), ResNet-50 (He et al., 2016), RNN Text Classification (Kim, 2014), and Neural Machine Translation (Wu et al., 2016). MetaFlow's search algorithm is able to optimize each of these DNNs in under 5 minutes. We show that MetaFlow outperforms existing deep learning optimizers with speedups ranging from 1.1-1.6× for inference and 1.1-1.2× for training. The performance improvement is achieved by discovering efficient computation graphs that decrease the overall memory usage by up to 1.5× and the total number of kernel launches by up to 3.3×. Finally, we show that MetaFlow's optimized graphs can be directly fed into existing frameworks and improve their inference performance by up to 1.3×.

To summarize, our contributions are:
• We introduce relaxed graph substitutions, which enable the exploration of complex graph optimizations inaccessible to existing deep learning frameworks.
• We propose a cost-based search algorithm that can automatically find optimized computation graphs in the search space generated by relaxed graph substitutions.
• We implement MetaFlow, the first relaxed graph substitution optimizer for DNNs. On a collection of standard DNNs, we show that compared to existing frameworks MetaFlow improves runtime performance by 1.1-1.6×, while maintaining the same network accuracy.

2 OVERVIEW

Similar to existing DNN optimizers (Abadi et al., 2016; Chen et al., 2018; PyTorch), MetaFlow uses a computation graph G to define computation and state in a DNN model. Each node is a mathematical operator (e.g., matrix multiplication, convolution, etc.), and each edge is a tensor (i.e., n-dimensional array). For a computation graph G taking input tensors I and producing output tensors O, we define its computation as O = G(I).

We define two computation graphs G and G′ to be equivalent if G and G′ compute mathematically equivalent outputs for arbitrary inputs (i.e., ∀I : G(I) = G′(I)). For a given computation graph G, MetaFlow automatically finds an equivalent computation graph G′ with optimized runtime performance by using compositions of provided graph substitutions.

For a DNN model, the inference and training procedures are defined by different computation graphs, as shown in Figure 2. An inference graph includes a single input and one or more outputs, while a training graph generally has two inputs (i.e., training samples and labels) and multiple outputs (i.e., derivatives for trainable parameters in each operator). MetaFlow merely treats inference and training as different graphs to optimize and applies the same techniques on both graphs.

[Figure 2: (a) an inference graph Convolution → BatchNorm → FullyConnected → Softmax; (b) the corresponding training graph with backward operators and derivative outputs.]
Figure 2. The inference and training graphs of a 4-layer example CNN model. Dotted edges are the inputs and outputs of each computation graph.

Figure 3 shows the main components of MetaFlow. First, for any input computation graph, MetaFlow uses a flow-based graph split algorithm to recursively divide the input graph into subgraphs that are amenable to direct search. Second, MetaFlow optimizes each individual subgraph with a backtracking search on the search space defined by repeated application of relaxed graph substitutions to each subgraph. Finally, MetaFlow generates an optimized computation graph of the input graph by using the optimized subgraphs as basic building blocks.

[Figure 3: Input Comp. Graph → (Flow-based Graph Split) → Independent Subgraphs → (Search-based Graph Subst.) → Optimized Subgraphs → (Final Graph Generation) → Optimized Comp. Graph.]
Figure 3. MetaFlow Overview.

MetaFlow is a framework-agnostic computation graph optimizer: an optimized computation graph by MetaFlow can be executed on various deep learning runtimes, such as TensorRT (TensorRT), TensorFlow (Abadi et al., 2016), and TensorFlow XLA.1
1 https://www.tensorflow.org/xla

3 RELAXED GRAPH SUBSTITUTIONS

This section introduces relaxed graph substitutions, each of which consists of a source graph that can map to particular subgraphs in the computation graph of a DNN and a target graph that defines how to create a new subgraph to replace a mapped subgraph.
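In code form, a relaxed graph substitution can be pictured as a source pattern plus a constraint plus a target constructor. The sketch below is an illustrative Python rendering, anticipating the two-convolution fusion of Figure 4a described later in this section; the dict-based operator encoding and every name in it are our own stand-ins, not MetaFlow's actual API.

```python
# A source-graph node matches an operator by type; "*" acts as a wildcard node.
def matches(pattern_type, op):
    return pattern_type == "*" or pattern_type == op["op"]

# Extra constraint from the Figure 4a substitution: the two convolutions
# must agree on kernel size, stride, and padding to be fusable.
def same_conv_params(conv1, conv2):
    return (conv1["kernel"] == conv2["kernel"]
            and conv1["stride"] == conv2["stride"]
            and conv1["padding"] == conv2["padding"])

# Target constructor: one wider convolution whose output is split back
# into the two original channel groups (conv3 + split in Figure 4a).
def fuse_convs(conv1, conv2):
    fused = dict(conv1, outChannels=conv1["outChannels"] + conv2["outChannels"])
    split_sizes = [conv1["outChannels"], conv2["outChannels"]]
    return fused, split_sizes

conv1 = {"op": "conv", "kernel": (3, 3), "stride": (1, 1),
         "padding": "same", "outChannels": 256}
conv2 = dict(conv1)  # same parameters, so the constraint holds
assert same_conv_params(conv1, conv2)
fused, sizes = fuse_convs(conv1, conv2)
print(fused["outChannels"], sizes)  # 512 [256, 256]
```

A real matcher would additionally enumerate subgraph mappings and check the data dependencies and external edges described below; this sketch only captures the rule's shape.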
Source graph. A source graph defines the structure of valid subgraphs for a substitution. Each node in a source graph is associated with a type and can only be mapped to an operator of the same type. A source graph can also include wildcard nodes, each of which can be mapped to any single operator. The wildcard nodes are useful when the type of an operator does not affect the substitution procedure and allow a source graph to describe multiple substitution scenarios that are similar. In addition to type constraints, a source graph can also incorporate additional constraints on one or multiple operators to further restrict mapping. Figure 4a demonstrates a substitution for fusing two convolutions, which defines constraints on conv1 and conv2 to guarantee they can only be mapped to convolutions with the same kernel size, stride, and padding.

Edges in a source graph describe data dependencies between operators. A graph substitution requires the mapped subgraph to have the same data dependencies as the source graph. Each operator can optionally have an external edge (shown as dotted edges in Figure 4) that can map to zero, one, or multiple edges connecting to external operators in the computation graph. An external edge indicates that the operator's output can be accessed by external operators and must be preserved in the substitution.

Target graph. A target graph describes how to construct a new subgraph to substitute for the mapped subgraph. For each newly created operator, the target graph defines how to set parameters and compute weights by using parameters and weights in the source graph. For each external edge in the source graph, there is a corresponding external edge in the target graph (also shown as dotted edges). Any external operator originally connecting to a mapped operator in the source graph should now connect to the corresponding operator in the target graph.

[Figure 4a: source graph: op1 feeding conv1 and conv2, with external edges op1.out, conv1.out, conv2.out; target graph: op2 feeding conv3 followed by a split.]
# Constraints on the source graph:
conv1.kernel == conv2.kernel
conv1.stride == conv2.stride
conv1.padding == conv2.padding
# Construct the target graph:
op2._ = op1._
conv3._ = conv1._
conv3.outChannels = conv1.outChannels + conv2.outChannels
conv3.weights = concat(conv1.weights, conv2.weights)
split.sizes = [conv1.outChannels, conv2.outChannels]
(a) Fuse two convolutions.

[Figure 4b: source graph: a convolution whose output is combined by an add, with external edge add.out; target graph: a single convolution conv2.]
# Constraints on the source graph:
conv1.stride == (1, 1)
# Construct the target graph:
conv2.inChannels = conv1.inChannels + conv1.outChannels
conv2.outChannels = conv1.outChannels
# I is an identity matrix
conv2.weights = concat(conv1.weights, I)
(b) Fuse a convolution and an add.

Figure 4. Example relaxed graph substitutions. The substitution in (a) was used in the second (green) step of Figure 1, and the substitution in (b) was used in the third (yellow) step.

Correctness. We define a graph substitution to be valid if its source and target graphs compute mathematically equivalent outputs for all external edges. This definition is similar to our definition of equivalent computation graphs if each external edge is considered as an output of the graph. Any composition of valid graph substitutions preserves equivalence among generated computation graphs.

Composition. Many complex graph optimizations can be decomposed into a sequence of simple relaxed graph substitutions. Recall that Figure 1 demonstrates a potential optimization on ResNet that fuses two convolutions with different kernel sizes by enlarging the kernel of one convolution. As another example, the following equations show how to simplify the computation in a Simple Recurrent Unit (Equations 2 and 4 in Lei et al. (2017)) by using a sequence of graph substitutions that distribute multiplications, reorder commutative operators, and factor out common terms, respectively.

   ~x ⊗ ~y + (~1 − ~x) ⊗ ~z        (4 operators)
⇒ ~x ⊗ ~y + ~1 ⊗ ~z − ~x ⊗ ~z    (5 operators)
⇒ ~x ⊗ ~y − ~x ⊗ ~z + ~z         (4 operators)
⇒ ~x ⊗ (~y − ~z) + ~z            (3 operators)

Note that both optimizations involve complex sequences of graph substitutions that require temporarily decreasing runtime performance in intermediate states.
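Each step of this rewrite chain preserves semantics, which can be spot-checked numerically. The sketch below evaluates every intermediate expression elementwise on small sample vectors (⊗ is elementwise multiplication, as in the SRU equations); the helper `ew` and the test values are our own, chosen as exact binary fractions so the comparison is exact.

```python
# Elementwise map over equal-length vectors; models the elementwise
# multiply (⊗) and +/− operators used in the SRU rewrite above.
def ew(f, *vectors):
    return [f(*xs) for xs in zip(*vectors)]

# Sample vectors (exact binary fractions, so float equality is exact).
x, y, z = [0.25, 0.5], [1.0, -3.0], [0.5, 4.0]

step0 = ew(lambda a, b, c: a * b + (1 - a) * c, x, y, z)    # x⊗y + (1−x)⊗z
step1 = ew(lambda a, b, c: a * b + 1 * c - a * c, x, y, z)  # distribute
step2 = ew(lambda a, b, c: a * b - a * c + c, x, y, z)      # 1⊗z = z, reorder
step3 = ew(lambda a, b, c: a * (b - c) + c, x, y, z)        # factor out x

print(step0, step1, step2, step3)  # all equal: [0.625, 0.5]
```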
4 THE METAFLOW SEARCH ALGORITHM

Relaxed graph substitutions provide a search space of potential computation graphs that are equivalent to an initial computation graph but have different runtime performance. Finding optimal graphs in the search space is challenging, since the search space can be infinite depending on which substitution rules are used. It is certainly infeasible to exhaustively enumerate the search space for today's DNN models.

This section describes the key techniques used in MetaFlow to efficiently prune the search space and quickly find optimized (but not necessarily optimal) graphs. In particular, Section 4.1 introduces a cost model that incorporates multiple cost dimensions (e.g., FLOPs, execution time, memory usage, etc.) and can accurately predict the execution performance of various computation graphs. Section 4.2 introduces a backtracking search algorithm that effectively finds an optimized candidate graph in the search space under the cost model. Because the computation graphs of state-of-the-art DNNs are too large to directly optimize, we use a flow-based graph split algorithm (Section 4.3) to recursively divide a computation graph into smaller individual subgraphs while maximizing graph substitution opportunities.

4.1 Cost Model

We introduce a cost model that incorporates multiple dimensions to evaluate the runtime performance of a computation graph. The cost model computes metrics for each operator in a graph and combines them appropriately to obtain a total cost. This includes both metrics that can be computed statically (e.g., FLOPs, memory usage, and number of kernel launches) as well as dynamic metrics that usually require measurements on specific hardware (e.g., execution time on a particular GPU or CPU). For dynamic metrics, previous work (Jia et al., 2018) shows that it is possible to accurately predict the execution time of a computation graph by only measuring a few representative operators on hardware. Since most DNN operators involve dense linear algebra with no branches, their performance on hardware is highly consistent and predictable given the same parameters. For example, once we have measured and stored the execution time of a convolution with particular parameters (i.e., kernel size, stride, padding, etc.), we can use that execution time for other convolutions with the same parameters.

Our cost model can optimize a single cost dimension (e.g., minimizing overall FLOPs) as well as incorporate multiple cost dimensions, such as minimizing execution time while maintaining a memory usage limit (by returning an infinite cost if the memory usage limit is exceeded). We observe that many graph substitutions result in a tradeoff among several cost dimensions instead of improving all of them. For example, the graph substitution in Figure 4b reduces memory accesses and kernel launches at the cost of increasing FLOPs.

Algorithm 1 A Backtracking Search Algorithm
1: Input: An initial computation graph G0, a cost model Cost(·), a list of valid graph substitutions {S1, ..., Sm}, and a hyperparameter α
2: Output: An optimized computation graph.
3:
4: // Q is a priority queue of graphs sorted by Cost(·).
5: Q = {G0}
6: while Q ≠ {} do
7:   G = Q.dequeue()
8:   for i = 1 to m do
9:     G′ = Si(G)
10:    if Cost(G′) < Cost(Gopt) then
11:      Gopt = G′
12:    end if
13:    if Cost(G′) < α × Cost(Gopt) then
14:      Q.enqueue(G′)
15:    end if
16:  end for
17: end while
18: return Gopt

4.2 Backtracking Search

We now describe a backtracking search algorithm to automatically find optimized computation graphs under the cost model. Algorithm 1 shows the pseudocode. All candidate graphs are enqueued into a global priority queue and are dequeued in increasing order by their costs. For each dequeued graph G, the search algorithm generates and enqueues new graphs by applying potential graph substitutions on G. The search algorithm uses a parameter α (line 13 in the algorithm) to trade off between the search time and the best-discovered solution. By setting α = 1, the search algorithm becomes a simple greedy algorithm and only considers graph substitutions that strictly reduce cost. As α increases, the search algorithm explores a larger part of the search space.
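Algorithm 1 can be rendered in executable form. The sketch below is a simplified, illustrative implementation: the toy graphs, costs, and substitution functions are stand-ins (loosely mirroring the enlarge-then-fuse sequence of Figure 1), and the visited set is our own addition to guarantee termination on this toy state space; it is not MetaFlow's actual implementation.

```python
import heapq
import itertools

def backtracking_search(g0, cost, substitutions, alpha):
    """Best-first search over graphs, keeping any candidate whose cost
    is below alpha * Cost(G_opt), as in Algorithm 1 (line 13)."""
    tie = itertools.count()  # tie-breaker so the heap never compares graphs
    queue = [(cost(g0), next(tie), g0)]
    best, best_cost = g0, cost(g0)
    visited = {g0}
    while queue:
        _, _, g = heapq.heappop(queue)          # dequeue cheapest graph
        for substitute in substitutions:
            for g_new in substitute(g):          # apply each substitution
                if g_new in visited:
                    continue
                visited.add(g_new)
                c = cost(g_new)
                if c < best_cost:
                    best, best_cost = g_new, c
                if c < alpha * best_cost:        # relaxed pruning bound
                    heapq.heappush(queue, (c, next(tie), g_new))
    return best

# Toy state space: enlarging a kernel raises cost but unlocks a fusion
# that lowers it. All names and costs here are made up for illustration.
COST = {"conv1x1;conv3x3": 10, "conv3x3;conv3x3": 12, "fused conv3x3": 6}
enlarge = lambda g: ["conv3x3;conv3x3"] if g == "conv1x1;conv3x3" else []
fuse = lambda g: ["fused conv3x3"] if g == "conv3x3;conv3x3" else []

greedy = backtracking_search("conv1x1;conv3x3", COST.get, [enlarge, fuse], alpha=1.0)
relaxed = backtracking_search("conv1x1;conv3x3", COST.get, [enlarge, fuse], alpha=1.3)
print(greedy, "->", relaxed)  # greedy is stuck; relaxed finds the fused graph
```

With α = 1.0 the enlarged graph (cost 12) is never enqueued, so the search behaves greedily and returns the initial graph; with α = 1.3 the temporarily worse graph survives pruning and the cheaper fused graph is discovered.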
4.3 Flow-Based Recursive Graph Split

Many state-of-the-art DNN models are too large to optimize directly with the backtracking search. We use a flow-based graph split algorithm to recursively divide a computation graph into smaller disjoint subgraphs that are amenable to backtracking search. This is motivated by our observation that graph substitutions are performed on a few locally connected operators, and splitting a computation graph into smaller individual subgraphs can still preserve most graph substitutions.

To split a graph into two disjoint subgraphs, we aim at minimizing the number of graph substitutions spanning the two subgraphs, since these graph substitutions cannot be performed on either subgraph. For each operator oi ∈ G, we define its capacity Cap(oi) to be the number of graph substitutions that map to at least one in-edge and one out-edge of operator oi. These graph substitutions are disabled if operator oi is used to split the graph. By using Cap(oi) as the weight for each operator, we map the graph split problem to a minimum vertex cut problem (Cormen et al., 2009) and can use any max-flow algorithm to find a minimum cut.

A max-flow algorithm splits an arbitrary graph into two disjoint subgraphs by minimizing spanning graph substitutions. Using the max-flow algorithm as a subroutine, Algorithm 2 shows a graph split algorithm that recursively divides an entire computation graph into individual subgraphs smaller than a threshold.

Algorithm 2 A Flow-based Graph Split Algorithm
1: Input: An initial computation graph G
2:
3: function GRAPHSPLIT(G)
4:   if |G| ≤ threshold then
5:     return G
6:   else
7:     // MIN-CUT(·) returns a minimum vertex cut.
8:     C = MIN-CUT(G)
9:     G1 = {oi ∈ G | oi is reachable from C}
10:    G2 = G − G1
11:    return {GRAPHSPLIT(G1), GRAPHSPLIT(G2)}
12:  end if
13: end function

After running the backtracking search algorithm to optimize individual subgraphs, MetaFlow stitches the optimized subgraphs back together to constitute an entire computation graph. Finally, a local backtracking search around each splitting point is performed for substitutions spanning the splitting point.

We would like to point out that while the flow-based graph split algorithm is sufficient and achieves good performance for all DNNs used in the experiments, we do not claim that it is an optimal graph split algorithm. We have examined another graph split algorithm, balanced partitioning (Andreev & Racke, 2006), to see if the results differ. Both algorithms achieve the same performance due to the existence of natural splitting points in the graphs we examined. For example, none of our substitutions cross the boundary between fire modules in SqueezeNet (Iandola et al., 2016), yielding an easy way to split the graph. However, if either the set of substitution rules or the computation graph were different, another graph split algorithm may prove more effective.

Table 1. DNNs used in our experiments.
DNN           Description
Convolutional Neural Networks (CNNs)
Inception-v3  A 102-layer CNN with Inception modules
SqueezeNet    A 42-layer CNN with fire modules
ResNet50      A 50-layer CNN with residual modules
Recurrent Neural Networks (RNNs)
RNNTC         A 3-layer RNN for text classification
NMT           A 4-layer RNN for neural machine translation
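The recursion in Algorithm 2 can be sketched in a simplified setting. Instead of a max-flow minimum vertex cut over a general DAG, the toy below works on a linear operator chain and cuts at the interior operator with the smallest capacity (the fewest spanning substitutions it would disable); all operator names and capacity values are hypothetical.

```python
def graph_split(ops, cap, threshold):
    """Recursively split a chain of operators, mimicking GRAPHSPLIT:
    stop when a piece is small enough, otherwise cut at the operator
    whose capacity (number of substitutions it would disable) is minimal."""
    if len(ops) <= threshold:
        return [list(ops)]
    # Brute-force stand-in for MIN-CUT on a chain: pick the cheapest
    # interior operator to serve as the splitting point.
    cut = min(range(1, len(ops) - 1), key=lambda i: cap[ops[i]])
    return (graph_split(ops[:cut], cap, threshold)
            + graph_split(ops[cut:], cap, threshold))

# Hypothetical chain; capacities count substitutions spanning each operator.
ops = ["conv_a", "relu_a", "concat", "conv_b", "relu_b", "add"]
cap = {"conv_a": 9, "relu_a": 3, "concat": 0, "conv_b": 5, "relu_b": 2, "add": 9}
print(graph_split(ops, cap, threshold=3))
```

The chain splits first at `concat` (capacity 0, a natural splitting point of the kind observed between SqueezeNet fire modules), then the remaining piece splits again until every subgraph is at or below the threshold.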
5 IMPLEMENTATION

MetaFlow is a framework-agnostic DNN optimizer for arbitrary computation graphs. The MetaFlow cost model and runtime use existing deep learning libraries (e.g., cuDNN (Chetlur et al., 2014) and cuBLAS (cuBLAS) for GPUs, and MKL2 for CPUs) to estimate the execution time of a computation graph and perform real executions on different devices. MetaFlow accepts a user-defined cost function that incorporates one or multiple cost dimensions and finds a computation graph optimizing the cost function. An optimized graph by MetaFlow can be automatically transformed to the formats accepted by existing deep learning frameworks, including TensorRT, TensorFlow, and TensorFlow XLA (TensorRT; Abadi et al., 2016). This allows existing deep learning frameworks to directly use MetaFlow's optimized graphs as inputs to improve runtime performance. In particular, we show that MetaFlow can further improve the runtime performance of existing deep learning frameworks by up to 1.3×, even though these systems internally perform rule-based graph transformations before executing an input computation graph.
2 https://01.org/mkl-dnn

6 EVALUATION

This section evaluates both inference and training performance of MetaFlow by answering the following questions:
• How does MetaFlow compare to existing deep learning frameworks that rely on rule-based graph transformations?
• Can MetaFlow's graph optimization be used to improve the runtime performance of these deep learning frameworks?
• Can MetaFlow improve both the inference and training performance of different real-world DNNs?

6.1 Experimental Setup

Table 1 summarizes the DNNs used in our experiments. We use three representative CNNs for image classification: Inception-v3 (Szegedy et al., 2016), SqueezeNet (Iandola et al., 2016), and ResNet50 (He et al., 2016). They use different DNN modules to improve model accuracy and exhibit different graph architectures. RNNTC and NMT are two sequence-to-sequence RNN models from (Lei et al., 2017) for text classification and neural machine translation, respectively. RNNTC uses an embedding layer, a recurrent layer with a hidden size of 1024, and a softmax layer. NMT includes an encoder and a decoder, both of which consist of an embedding layer and two recurrent layers each with a hidden size of 1024. We follow previous work and use SRU (Lei et al., 2017) as the recurrent units for RNNTC and NMT. All experiments were performed on a GPU node with a 10-core Intel E5-2600 CPU and 4 NVIDIA Tesla V100 GPUs.

In all experiments, MetaFlow considers all applicable graph substitutions in TensorFlow XLA as well as all substitutions described in Section 3 and Figure 4. Overall, a total of 14 graph substitutions are used in all experiments. The cost model used in the experiments was to minimize execution time. Unless otherwise stated, we use α = 1.05 as the pruning parameter for our backtracking search algorithm (see Algorithm 1). The graph split algorithm recursively divides subgraphs with more than 30 operators. This allows MetaFlow's search procedure to finish in less than 5 minutes for all the experiments.

6.2 Inference Performance

6.2.1 End-to-end performance

We first compare the end-to-end inference performance between MetaFlow and existing deep learning frameworks, including TensorFlow, TensorFlow XLA, and TensorRT, on an NVIDIA V100 GPU. MetaFlow can automatically transform optimized computation graphs to standard formats accepted by the baseline frameworks, therefore we also evaluate the performance of the baseline frameworks with MetaFlow's optimized computation graphs.

[Figure 5: grouped bar charts of execution time for Inception-v3, SqueezeNet, ResNet-50, RNNTC, and NMT; per-model speedups over the best baseline: 1.1×, 1.1×, 1.4×, 1.3×, 1.6×.]
Figure 5. End-to-end inference performance comparison among MetaFlow, TensorFlow, TensorFlow XLA, and TensorRT. For TensorFlow, TensorFlow XLA and TensorRT, we also measure the performance with MetaFlow's optimized graphs. The experiments were performed using a single inference sample on an NVIDIA V100 GPU. The right-most orange bars indicate the inference time of MetaFlow's optimized graphs on the MetaFlow engine, which achieves similar performance as TensorRT on CNNs and is faster on RNNs. This difference is due to a more efficient implementation of the concat and split operators that are introduced in MetaFlow's graph optimizations. For each DNN model, the blue and red lines indicate the performance achieved by the best existing system and MetaFlow, respectively. The number above each red line indicates the relative speedup over the best baseline.

Figure 5 shows the comparison results.
The blue lines\r\n 1.5 1.4\r\n 1.3 1.3 1.2 show the best performance achieved among the three\r\n 1.1 1.2 1.1 1.1 1.1\r\n 1.1 baseline frameworks without using MetaFlow\u2019s optimized\r\n 1.0 graphs, and the red lines show the MetaFlow performance.\r\n MetaFlowoutperforms existing deep learning inference en-\r\n 0.5 gines with speedups ranging from 1.1\u00d7 to 1.6\u00d7. In addition,\r\n when running MetaFlow\u2019s optimized graphs on baseline\r\n frameworks, MetaFlow also improves the inference perfor-\r\n Relative Speedups over TensorRT manceofTensorFlow, TensorFlow XLA and TensorRT by\r\n 0.0 A B C D E F G H I J K up to 1.3\u00d7. Note that all existing systems internally per-\r\n form rule-based graph transformations before executing a\r\n Figure 6. Performance comparison between MetaFlow and Ten- computation graph, therefore the performance improvement\r\n sorRT on individual subgraphs in Inception-v3 (Szegedy et al., comes from other graph optimizations beyond rule-based\r\n 2016). The experiments were performed on a NVIDIA V100 graph transformations.\r\n GPU. We further study the performance difference between\r\n MetaFlowandexistingrule-baseddeeplearningframeworks\r\n MetaFlow\u2019ssearchprocedureto\ufb01nishinlessthan5minutes on multiple cost dimensions, including the overall mem-\r\n for all the experiments. ory accesses, the number of FLOPs, the number of kernel\r\n launches and the device utilization. For this experiment, we\r\n 6.2 Inference Performance use TensorRT as the baseline as it has the best performance\r\n amongexisting deep learning frameworks. For TensorRT,\r\n 6.2.1 End-to-end performance the cost metrics are collected through its IProfiler in-\r\n We\ufb01rstcomparetheend-to-end inference performance be- terface.\r\n tween MetaFlow and existing deep learning frameworks, Tables 2 compares different cost metrics between TensorRT\r\n including TensorFlow, TensorFlow XLA, and TensorRT, on and MetaFlow. 
Compared to TensorRT, MetaFlow reduces\r\n a NVIDIAV100GPU.MetaFlowcanautomaticallytrans- the overall memory accesses by up to 1.6\u00d7 and the number\r\n form optimized computation graphs to standard formats of kernel launches by up to 3.7\u00d7. For the CNNs in our\r\n accepted by the baseline frameworks, therefore we also\r\n Optimizing DNNComputationwithRelaxedGraphSubstitutions\r\n Table 2. Performance comparison between MetaFlow and TensorRT on multiple cost dimensions. The experiments were performed on a\r\n NVIDIAV100GPU.ForTensorRT,thecostmetricsarecollectedthroughitsProfilerinterface. Thedeviceutilization is computed by\r\n normalizing the FLOPs by the execution time (TFLOPs per second). For each cost dimension, a number in bold shows the one with better\r\n performance.\r\n DNN Execution Time (ms) MemoryAccesses(GB) LaunchedKernels FLOPs(GFLOPs) Device Utilization\r\n TensorRT MetaFlow TensorRT MetaFlow TensorRT MetaFlow TensorRT MetaFlow TensorRT MetaFlow\r\n Inception-v3 5.51 5.00 95.4 62.2 138 115 5.68 5.69 1.03 1.14\r\n SqueezeNet 0.94 0.75 62.1 46.1 50 40 0.64 1.00 0.68 1.35\r\n ResNet50 1.97 1.86 37.2 35.8 70 67 0.52 0.54 0.26 0.29\r\n RNNTC 0.91 0.60 1.33 1.17 220 83 0.22 0.20 0.24 0.33\r\n NMT 2.45 1.56 5.32 4.68 440 135 0.84 0.78 0.34 0.50\r\n input input input\r\n conv1x1x384 conv1x1x320 conv1x1x448 conv1x1x1152 conv1x1x1152\r\n relu relu relu relu relu\r\n split split\r\n conv1x3x384 conv3x1x384 conv3x3x384 conv3x3x768 conv3x3x384 conv1x3x384 conv3x1x384 conv3x3x384\r\n relu relu relu relu relu relu relu relu\r\n concat conv1x3x384 conv3x1x384 conv3x3x768 concat conv1x3x384 conv3x1x384\r\n relu relu relu relu relu\r\n concat concat concat\r\n (a) Original graph. (b) Optimized graph for V100. (c) Optimized graph for K80.\r\n Figure 7. The original and MetaFlow\u2019s optimized computation graphs of an Inception module on different GPUs. 
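The device utilization column in Table 2 is just the FLOP count normalized by the measured execution time, in TFLOPs per second; since GFLOP per ms is numerically identical to TFLOP/s, it can be checked directly against the Inception-v3 row (the helper name here is ours, not part of MetaFlow):

```python
def tflops_per_sec(gflops, ms):
    # (gflops * 1e9 flop) / (ms * 1e-3 s) = (gflops / ms) * 1e12 flop/s,
    # so GFLOPs divided by milliseconds is already TFLOP/s.
    return gflops / ms

# Inception-v3 row of Table 2: FLOPs (GFLOPs) and execution time (ms)
trt_util = tflops_per_sec(5.68, 5.51)  # TensorRT
mf_util = tflops_per_sec(5.69, 5.00)   # MetaFlow

print(round(trt_util, 2), round(mf_util, 2))  # 1.03 1.14, matching the table
```

The same arithmetic shows how MetaFlow can win overall while sometimes doing more raw work: a near-identical FLOP count finished in less time is exactly a higher device utilization.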
For the CNNs in our experiments, MetaFlow achieves performance improvement at the cost of increasing FLOPs in a computation graph. This allows MetaFlow to opportunistically fuse multiple operators to reduce memory accesses and kernel launches. For example, in an Inception module, MetaFlow enlarges a conv1x3 and a conv3x1 operator both to conv3x3 operators to fuse them into a single conv3x3 operator (see Figure 7). This reduces both memory accesses and kernel launches.

For the RNNs, MetaFlow can also decrease the FLOPs compared to TensorRT. Section 3 shows how MetaFlow transforms the computation in a recurrent unit from 4 element-wise operators to 3 by composing a sequence of simple graph substitutions. This is a potential but currently missing optimization in TensorRT (v4.0.1, the latest version as of Sep 2018).

(a) Original graph. (b) Optimized graph for V100. (c) Optimized graph for K80.
Figure 7. The original and MetaFlow's optimized computation graphs of an Inception module on different GPUs. Dotted boxes in the same color indicate mapped operators in different computation graphs, and shadow boxes highlight MetaFlow's graph optimizations. Note that on K80, MetaFlow does not expand conv1x3 and conv3x1 to conv3x3 due to less available hardware parallelism.

6.2.2 Subgraph performance

We evaluate whether MetaFlow can improve the performance of individual subgraphs in a DNN. Figure 6 compares the performance of TensorRT and MetaFlow on individual subgraphs in Inception-v3. The figure shows that MetaFlow can consistently find faster computation graphs than TensorRT, which leads to an end-to-end performance improvement of 1.25×.

Table 3. Performance comparison between MetaFlow's backtracking search (with α = 1.05) and a baseline exhaustive search on AlexNet, VGG16, ResNet18, and an Inception module shown in Figure 7a. A check mark indicates the backtracking search found the same optimal graph as the exhaustive search under the cost model.

Graph       Exhaustive Search   Backtracking Search   Same Result?
AlexNet     5.0 seconds         0.1 seconds           ✓
VGG16       2.3 minutes         0.2 seconds           ✓
InceptionE  12.8 minutes        0.29 seconds          ✓
ResNet18    3.1 hours           0.99 seconds          ✓

6.2.3 Comparison among different devices

For a given input graph, MetaFlow may discover different optimized graphs on different devices. For example, Figure 7 shows the original and MetaFlow's optimized computation graphs of an Inception module on a V100 and a K80 GPU, respectively. The graph substitutions performed on each GPU are highlighted in shadow boxes. Note that the substitution that fuses a conv1x3 and a conv3x1 into a conv3x3 improves the runtime performance on a V100 but decreases the performance on a K80.

We have also observed other graph substitutions whose value depends on the specific hardware. This situation makes existing greedy rule-based graph transformations less reliable for optimizing computation graphs on different devices, since substitutions that increase the runtime performance on some devices may decrease performance on other devices. On the other hand, MetaFlow's search-based approach is better positioned for generating hardware-specific computation graphs by leveraging the actual performance of different graph substitutions on the hardware.
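The enlarge-and-fuse substitution above has a simple algebraic core: a conv1x3 (or conv3x1) is exactly a conv3x3 whose extra kernel entries are zero, so both branches can be served by one 3x3 kernel launch with two output channels. A minimal single-channel sketch in plain Python (helper names are ours, not MetaFlow's API):

```python
def conv2d_same(x, k):
    """2D cross-correlation with zero padding; output has the input's size."""
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    H, W = len(x), len(x[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    ii, jj = i + di - ph, j + dj - pw
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += x[ii][jj] * k[di][dj]
            out[i][j] = acc
    return out

def enlarge_to_3x3(k):
    """Embed a 1x3 or 3x1 kernel into the center of a zero 3x3 kernel."""
    out = [[0.0] * 3 for _ in range(3)]
    r0, c0 = (3 - len(k)) // 2, (3 - len(k[0])) // 2
    for i, row in enumerate(k):
        for j, v in enumerate(row):
            out[r0 + i][c0 + j] = v
    return out

x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
k13 = [[1, 2, 3]]        # conv1x3 kernel
k31 = [[1], [2], [3]]    # conv3x1 kernel

# Enlarging each kernel to 3x3 leaves its output unchanged ...
assert conv2d_same(x, k13) == conv2d_same(x, enlarge_to_3x3(k13))
assert conv2d_same(x, k31) == conv2d_same(x, enlarge_to_3x3(k31))

# ... so the two branches collapse into one conv3x3 over stacked kernels:
# conceptually a single kernel launch producing two output channels.
fused = [conv2d_same(x, enlarge_to_3x3(k)) for k in (k13, k31)]
```

On a V100 the single enlarged conv3x3 is faster despite the extra multiplications against zero weights; on a K80 the added FLOPs dominate, which is why the optimized graph in Figure 7c keeps the original conv1x3 and conv3x1 operators.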
Figure 8. End-to-end inference performance comparison between MetaFlow and TVM on an NVIDIA V100 GPU.

6.3 Training Performance

Graph substitution optimizations are applicable to arbitrary computation graphs, including both inference and training. To evaluate how MetaFlow improves the training performance on different DNNs, we run both the original computation graphs and MetaFlow's optimized graphs on TensorFlow. We follow the suggestions in TensorFlow Benchmarks³ and use synthetic data to benchmark the training performance. The experiments were performed on four NVIDIA V100 GPUs on a single compute node, with data parallelism and a global batch size of 64.

Figure 9 shows the training throughput comparison. We observe that a training graph generally involves more data dependencies than its corresponding inference graph, as shown in Figure 2. As a result, MetaFlow's graph optimizations generally achieve smaller performance improvements for training than for inference. However, MetaFlow can still discover computation graphs that increase training throughput by up to 1.2×.

Figure 9. Training performance comparison between TensorFlow and TensorFlow w/ MetaFlow's graph optimizations. The experiments were performed on 4 NVIDIA V100 GPUs with data parallelism and a global batch size of 64.

6.4 Search Algorithm Performance

We now compare the backtracking search algorithm described in Section 4.2 with a baseline exhaustive search algorithm that enumerates all computation graphs in the search space.
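The relaxed search can be sketched in a few lines: candidate graphs are explored best-first, and a substituted graph is kept only if its cost stays below α times the best cost found so far, so substitutions that temporarily hurt performance remain explorable. This is a toy sketch of the idea (assuming hashable graphs and a cost model), not MetaFlow's actual interface; the real Algorithm 1 also relies on the flow-based graph split to bound subgraph size:

```python
import heapq

def backtracking_search(g0, substitutions, cost, alpha=1.05):
    """Best-first backtracking search over relaxed graph substitutions.

    substitutions(g) yields graphs reachable from g by one substitution;
    cost(g) is the cost-model estimate. Candidates costing more than
    alpha * best-so-far are pruned; alpha = 1.0 degenerates to greedy.
    """
    best, best_cost = g0, cost(g0)
    frontier = [(best_cost, 0, g0)]  # (cost, tie-breaker, graph)
    seen, tie = {g0}, 0
    while frontier:
        c, _, g = heapq.heappop(frontier)
        if c < best_cost:
            best, best_cost = g, c
        for g2 in substitutions(g):
            c2 = cost(g2)
            if g2 not in seen and c2 < alpha * best_cost:
                seen.add(g2)
                tie += 1
                heapq.heappush(frontier, (c2, tie, g2))
    return best, best_cost

# Toy chain 0 -> 1 -> 2 where the intermediate graph is slightly worse:
costs = {0: 10.0, 1: 10.4, 2: 7.0}
subs = lambda g: [g + 1] if g + 1 in costs else []

assert backtracking_search(0, subs, costs.get, alpha=1.05) == (2, 7.0)
# Greedy pruning (alpha = 1.0) discards the uphill step and gets stuck:
assert backtracking_search(0, subs, costs.get, alpha=1.0) == (0, 10.0)
```

The two assertions show exactly the gap the paper targets: the 7.0-cost graph is only reachable through a state that is 4% worse than the start, which a strictly-improving optimizer never visits.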
To allow the exhaustive search to complete in reasonable time, we use small DNN models including AlexNet (Krizhevsky et al., 2012), VGG16 (Simonyan & Zisserman, 2014), ResNet18, and an Inception module shown in Figure 7a.

Table 3 compares the search time of the two algorithms. Compared to the baseline exhaustive search, MetaFlow's backtracking search finds the same optimal graph for the four DNNs and reduces the search time by orders of magnitude over the baseline.

Second, we evaluate the performance of our backtracking search algorithm with different pruning parameters α. Figure 10 shows the performance of the best discovered graphs and the end-to-end search time for running Inception-v3 on a V100 GPU with different α. The figure shows that a relatively small α (e.g., 1.05 in this case) allows us to find a highly optimized computation graph while maintaining low search cost.

Figure 10. The performance of the best discovered graphs (shown as the red line) and the end-to-end search time for running Inception-v3 on a V100 GPU with different α.

³ https://www.tensorflow.org/guide/performance/benchmarks

6.2.4 Comparison with code generation techniques

Figure 8 compares the end-to-end inference performance between MetaFlow and TVM (Chen et al., 2018). Our current implementation of MetaFlow directly uses the cuDNN and cuBLAS libraries to run individual operators, while TVM uses auto-generated high-performance kernels, especially for convolutions, making it competitive on some benchmarks despite its lack of the higher-level graph optimizations MetaFlow provides. The optimizations in TVM operate at a lower level than the optimizations in MetaFlow, so they could easily be composed. In the future, we plan to integrate TVM as a backend for MetaFlow so that we can improve performance via both graph optimization and individual kernel code generation.

7 RELATED WORK

Greedy rule-based graph transformation has been widely used by existing deep learning frameworks (Abadi et al., 2016; TensorRT; PyTorch) to improve the runtime performance of a computation graph. Existing systems require each rule to improve the runtime performance, preventing a large number of potential graph substitutions from being considered. The key difference between existing deep learning frameworks and MetaFlow is that MetaFlow considers relaxed graph substitutions that may temporarily decrease runtime performance and uses a search algorithm to discover optimized computation graphs in the search space.

Automatic kernel generation. Recent work has proposed different approaches to automatically generate high-performance kernels for specific hardware (Vasilache et al., 2018; Chen et al., 2018; Ragan-Kelley et al., 2013). These kernel generation techniques solve an orthogonal problem of how to improve the performance of individual operators, while MetaFlow aims at optimizing computation graphs using relaxed graph substitutions. We believe it is possible to combine relaxed graph substitutions with automatic code generation and leave this as future work.

Optimizing distributed DNN training. Recent work has also proposed deep learning frameworks that automatically find efficient parallelization strategies for distributed DNN training. For example, ColocRL (Mirhoseini et al., 2017) uses reinforcement learning to find efficient device assignment for model parallelism across multiple GPUs. FlexFlow (Jia et al., 2019) introduces a comprehensive search space of parallelization strategies for DNNs and uses randomized search to find efficient strategies in the search space. These frameworks optimize distributed DNN training by assuming a fixed computation graph, and it still remains an open problem to combine MetaFlow's graph optimizations with these frameworks to further improve the runtime performance of distributed DNN training.

8 CONCLUSION

Existing deep learning optimizers use greedy methods to optimize computation graphs by applying graph substitutions that are strictly performance increasing. This approach misses potential performance gains from more complex transformations where some intermediate states are not improvements. We identify the potential of performing such transformations, and propose relaxed graph substitutions to achieve them. We provide a system, MetaFlow, for optimizing DNN computation graphs using relaxed graph substitutions, and show that MetaFlow can achieve up to 1.6× performance improvements on a variety of widely used DNNs. Finally, we demonstrate that relaxed graph substitutions are widely applicable, as we show that adding them to existing frameworks such as TensorFlow XLA and TensorRT results in further performance improvements.

ACKNOWLEDGEMENTS

We thank Oded Padon, Sen Wu, Karthik Srinivasa Murthy, and the anonymous reviewers for their feedback on this work. This work was supported by NSF grant CCF-1409813, the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and is based on research sponsored by DARPA under agreement number FA8750-14-2-0006. This research was also supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP, and VMware), as well as DARPA grant FA8750-17-2-0095 (D3M) and NSF grant CNS-1651570.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.

Andreev, K. and Racke, H. Balanced graph partitioning. Theory of Computing Systems, 39(6):929-939, 2006.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.

Chen, T., Moreau, T., Jiang, Z., Shen, H., Yan, E. Q., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A. TVM: end-to-end optimization stack for deep learning. CoRR, abs/1802.04799, 2018. URL http://arxiv.org/abs/1802.04799.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014. URL http://arxiv.org/abs/1410.0759.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

cuBLAS. Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.

Lei, T., Zhang, Y., and Artzi, Y. Training RNNs as fast as CNNs. CoRR, abs/1709.02755, 2017. URL http://arxiv.org/abs/1709.02755.

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. 2017.

PyTorch. Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org, 2017.

Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, 2013.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484-489, 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR, abs/1802.04924, 2018. URL http://arxiv.org/abs/1802.04924.

Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning, SysML '19, 2019.

Kim, Y. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014. URL http://arxiv.org/abs/1408.5882.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

TensorRT. NVIDIA TensorRT: Programmable inference accelerator. https://developer.nvidia.com/tensorrt, 2017.

Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. CoRR, abs/1802.04730, 2018.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J.
Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.

A ARTIFACT APPENDIX

A.1 Abstract

This artifact appendix helps readers to reproduce the main experimental results in this paper. In this artifact evaluation, we show (1) how MetaFlow can automatically search for optimized computation graphs for different DNN models, and (2) how MetaFlow's optimized graphs can be directly used as inputs to improve the runtime performance of existing deep learning systems, including TensorFlow, TensorFlow XLA, and TensorRT.

A.2 Artifact check-list (meta-information)

• Run-time environment: Linux Ubuntu 16.04+
• Hardware: NVIDIA Tesla P100 or V100 GPUs
• Metrics: The primary metric of comparison is the end-to-end inference latency.
• How much disk space required (approximately)?: A hundred MB of disk storage should be sufficient for all experiments.
• How much time is needed to prepare workflow (approximately)?: About one hour to install all dependencies and compile the MetaFlow runtime.
• How much time is needed to complete experiments (approximately)?: About 20 minutes for all experiments.
• Publicly available?: Yes
• Code licenses (if publicly available)?: Apache License, Version 2.0.
• Workflow framework used?: TensorFlow r1.12 and TensorRT 5.0.2.6.
• Archived (provide DOI)?: https://doi.org/10.5281/zenodo.2549853

A.3 Description

A.3.1 Hardware dependencies

This artifact evaluation depends on an NVIDIA GPU. All experiments in this paper were performed on an NVIDIA V100 GPU. We have also run experiments on an NVIDIA P100 GPU and observed similar performance improvements.

A.3.2 Software dependencies

MetaFlow depends on the following software libraries:

• The MetaFlow runtime was implemented on top of the cuDNN (Chetlur et al., 2014) and cuBLAS (cuBLAS) libraries.
• (Optional) TensorFlow, TensorFlow XLA, and TensorRT are optionally required to run MetaFlow's optimized computation graphs on these systems.

The following software versions were used in our experiments: cuDNN 7.3, CUDA 9.0, TensorFlow r1.12, and TensorRT 5.0.2.6.

A.4 Installation

A.4.1 MetaFlow runtime

The MetaFlow runtime can be installed by downloading source code from an archived DOI website⁴ or from a public git repository⁵. The install.sh script automatically builds all binaries used in this artifact evaluation.

A.4.2 TensorRT runtime

The TensorRT runtime can be installed following the instructions at https://developer.nvidia.com/tensorrt. The experiments in the paper were performed with TensorRT 5.0.2.6. We have also verified MetaFlow's usability on several older versions of TensorRT (e.g., 4.0.1.6).

A.4.3 TensorFlow runtime

The TensorFlow runtime can be installed following the instructions at https://www.tensorflow.org/install/. The experiments in this paper were done with TensorFlow version 1.12. Note that XLA support is not linked by default in older versions of TensorFlow. If you would like to use an older version with XLA, you must compile from source. Instructions can be found at https://www.tensorflow.org/install/source.

A.5 Experiment workflow

The following experiments are included in this artifact evaluation. All experiments were run with synthetic input data in GPU device memory to remove the side effects of data transfers between CPU and GPU.

A.5.1 MetaFlow experiments

The following command line automatically finds an optimized computation graph for a DNN model and measures the inference latency of the optimized graph in the MetaFlow runtime.

./mf --dnn model

The example DNN models included in this artifact evaluation are Inception-v3 (Szegedy et al., 2016), SqueezeNet (Iandola et al., 2016), ResNet-50 (He et al., 2016), and RNNTC (Kim, 2014). You can run the example models by replacing model with inception, squeezenet, resnet50, or rnntc.
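For scripted sweeps over the example models, the artifact's runners can be driven from Python. The small helper below is our own convenience sketch, not part of the artifact; it only assembles the command lines documented in this appendix (--dnn, --export, --noopt, --graph_file, --xla), and the graph-file path in the example is hypothetical:

```python
def artifact_command(target, runtime="mf", export=None, noopt=False, xla=False):
    """Build an argv list for the artifact's runners.

    runtime: "mf" (MetaFlow), "mf-trt" (TensorRT), or "tf" (TensorFlow).
    For "mf"/"mf-trt", target is a model name; for "tf", an exported graph file.
    """
    if runtime in ("mf", "mf-trt"):
        cmd = ["./" + runtime, "--dnn", target]
        if export:
            cmd += ["--export", export]   # write the computation graph to a file
        if noopt:
            cmd.append("--noopt")         # export the unoptimized graph instead
        return cmd
    # The TensorFlow executor takes a previously exported graph file.
    cmd = ["python", "tf_executor.py", "--graph_file", target]
    if xla:
        cmd.append("--xla")
    return cmd

assert artifact_command("squeezenet") == ["./mf", "--dnn", "squeezenet"]
assert artifact_command("squeezenet", runtime="mf-trt") == \
    ["./mf-trt", "--dnn", "squeezenet"]
assert artifact_command("graphs/squeezenet.graph", runtime="tf", xla=True) == \
    ["python", "tf_executor.py", "--graph_file", "graphs/squeezenet.graph", "--xla"]
```

Each list can be passed to subprocess.run on a machine with the artifact installed; the assertions above only check command construction.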
A.5.2 TensorRT experiments

The following command line measures the inference latency of MetaFlow's optimized computation graph in TensorRT.

./mf-trt --dnn model

where model can be one of inception, squeezenet, resnet50, or rnntc.

⁴ https://doi.org/10.5281/zenodo.2549853
⁵ https://github.com/jiazhihao/metaflow_sysml19

DNN: SqueezeNet with Complex Bypass.
Baseline Graph:
End-to-end runtime = 1.4037 ms
Estimated runtime = 1.4171 ms
Floating point operations = 0.6364 Gflop
Memory accesses = 62.0473 MB
GPU kernel launches = 50
Optimized Graph:
End-to-end runtime = 1.1923 ms
Estimated runtime = 1.1820 ms
Floating point operations = 0.8180 Gflop
Memory accesses = 46.6183 MB
GPU kernel launches = 42
Optimized Graph on TensorRT:
Average over 10 runs is 1.15658 ms.

Figure 11. An example output of this artifact evaluation.

A.5.3 TensorFlow and TensorFlow XLA experiments

First, run MetaFlow using the --export file_name flag to output the computation graph to a file. You can optionally include the --noopt flag to output an unoptimized graph. See the script code/export_graphs.sh for some examples of how to export graphs.

Next, run the script tensorflow_py/tf_executor.py on a graph file generated as described above.

python tf_executor.py --graph_file path_to_graph_file [--xla]

The --xla flag controls whether TensorFlow will run with XLA turned on.
You can run python tf_executor.py --help for a full list of options.

A.6 Evaluation and expected result

Each execution outputs the end-to-end inference time of an original computation graph as well as MetaFlow's optimized computation graph. When running on an NVIDIA V100 GPU, this artifact evaluation should reproduce all experimental results in Figure 5. Figure 11 shows an example output from running mf-trt on squeezenet.

A.7 Experiment customization

MetaFlow can be used to optimize arbitrary DNN computation graphs on any GPU device. We refer users to the four running examples in this artifact evaluation for more details on MetaFlow usage.