{"title": "Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization", "book": "Proceedings of Machine Learning and Systems", "page_first": 497, "page_last": 511, "abstract": "Modern neural networks are increasingly bottlenecked by the limited capacity of on-device GPU memory. Prior work explores dropping activations as a strategy to scale to larger neural networks with fixed memory. However, these heuristics assume uniform cost per layer and only consider simple linear chain architectures, limiting their usability. In this paper, we formalize the problem of trading-off computation time and memory requirements for DNN training as the tensor rematerialization optimization problem. We develop a new system to optimally solve the problem in reasonable times (under an hour) using off-the-shelf MILP solvers. These schedules subsequently accelerate millions of training iterations. Our optimization pass in TensorFlow 2.0 automatically yields real training speedups of up to 4.8x over prior work, and can enable up to 5x increase in input size for real-world large networks.", "full_text": "CHECKMATE: BREAKING THE MEMORY WALL WITH OPTIMAL TENSOR REMATERIALIZATION

Paras Jain* Ajay Jain* Aniruddha Nrusimha Amir Gholami Pieter Abbeel Kurt Keutzer Ion Stoica Joseph E. Gonzalez

ABSTRACT
We formalize the problem of trading-off DNN training time and memory requirements as the tensor rematerialization optimization problem, a generalization of prior checkpointing strategies. We introduce Checkmate, a system that solves for optimal rematerialization schedules in reasonable times (under an hour) using off-the-shelf MILP solvers or near-optimal schedules with an approximation algorithm, then uses these schedules to accelerate millions of training iterations. Our method scales to complex, realistic architectures and is hardware-aware through the use of accelerator-specific, profile-based cost models.
In addition to reducing training cost, Checkmate enables real-world networks to be trained with up to 5.1x larger input sizes. Checkmate is an open-source project, available at https://github.com/parasj/checkmate.

1 INTRODUCTION

Deep learning training workloads demand large amounts of high bandwidth memory. Researchers are pushing the memory capacity limits of hardware accelerators such as GPUs by training neural networks on high-resolution images (Dong et al., 2016; Kim et al., 2016; Tai et al., 2017), 3D point-clouds (Chen et al., 2017; Yang et al., 2018), and long natural language sequences (Vaswani et al., 2017; Devlin et al., 2018; Child et al., 2019). In these applications, training memory usage is dominated by the intermediate activation tensors needed for backpropagation (Figure 3). The limited availability of high bandwidth on-device memory creates a memory wall that stifles exploration of novel architectures. Across applications, authors of state-of-the-art models cite memory as a limiting factor in deep neural network (DNN) design (Krizhevsky et al., 2012; He et al., 2016; Chen et al., 2016a; Gomez et al., 2017; Pohlen et al., 2017; Child et al., 2019; Liu et al., 2019; Dai et al., 2019).

Figure 1. This 32-layer deep neural network requires 30GB of memory during training in order to cache forward pass activations for the backward pass. Freeing certain activations early and rematerializing them later reduces memory requirements by 21GB at the cost of a modest runtime increase. Rematerialized layers are denoted as shaded blue regions. We present Checkmate, a system to rematerialize large neural networks optimally. Checkmate is hardware-aware, memory-aware and supports arbitrary DAGs. (Plot: RAM used (GB) over time, retaining all activations versus rematerializing activations.)

*Equal contribution. Department of EECS, UC Berkeley. Correspondence to: Paras Jain.
Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

As there is insufficient RAM to cache all activation tensors for backpropagation, some select tensors can be discarded during forward evaluation. When a discarded tensor is necessary as a dependency for gradient calculation, the tensor can be rematerialized. As illustrated in Figure 1, rematerializing values allows a large DNN to fit within memory at the expense of additional computation.

Griewank & Walther (2000) and Chen et al. (2016b) present heuristics for rematerialization when the forward pass forms a linear graph, or path graph. They refer to the problem as checkpointing. However, their approaches cannot be applied generally to nonlinear DNN structures such as residual connections, and rely on the strong assumption that all nodes in the graph have the same cost. Prior work also assumes that gradients may never be rematerialized. These assumptions limit the efficiency and generality of prior approaches.

Our work formalizes tensor rematerialization as a constrained optimization problem. Using off-the-shelf numerical solvers, we are able to discover optimal rematerialization strategies for arbitrary deep neural networks in TensorFlow with non-uniform computation and memory costs. We demonstrate that optimal rematerialization allows larger batch sizes and substantially reduced memory usage with minimal computational overhead across a range of image classification and semantic segmentation architectures. As a consequence, our approach allows researchers to easily explore larger models, at larger batch sizes, on more complex signals with minimal computation overhead.

Figure 2. Overview of the Checkmate system: a user-specified architecture is passed through static reverse-mode auto-differentiation and a hardware cost model; the LP is constructed and optimized (minutes); the static graph is rebuilt with rematerialization; and the training loop runs (days).

Figure 3. Memory consumed by activations far outweighs parameters for popular model architectures. Moreover, advances in GPU DRAM capacity are quickly utilized by researchers; the dashed line notes the memory limit of the GPU used to train each model. (Stacked memory components per model: features, workspace memory, parameter gradients, and parameters.)

In particular, the contributions of this work include:
- a formalization of the rematerialization problem as a mixed integer linear program with a substantially more flexible search space than prior work, in Section 4.7,
- a fast approximation algorithm based on two-phase deterministic LP rounding, in Section 5,
- Checkmate, a system implemented in TensorFlow that enables training models with up to 5.1x larger input sizes than prior art at minimal overhead.
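The memory/compute trade illustrated in Figure 1 can be sketched with a toy peak-memory calculation. The sizes below are illustrative assumptions (a uniform 1 GB activation per layer and a simplified memory model), not the paper's profiled numbers:

```python
# Toy peak-memory comparison inspired by Figure 1 (hypothetical sizes and a
# simplified memory model; the paper's numbers come from profiled networks).
# A linear chain of n layers each produces a 1 GB activation. Retaining every
# activation for the backward pass needs O(n) memory, while sqrt(n)-style
# checkpointing (Chen et al., 2016b) keeps only segment endpoints and
# rematerializes one segment at a time during the backward pass.
import math

def retain_all_peak_gb(n, act_gb=1.0):
    """Peak memory when all n forward activations are cached."""
    return n * act_gb

def checkpoint_peak_gb(n, act_gb=1.0):
    """Peak memory with ~sqrt(n) checkpoints plus one rematerialized segment."""
    seg = math.ceil(math.sqrt(n))
    return (seg + math.ceil(n / seg)) * act_gb

def extra_compute_fraction(n):
    """Fraction of layers recomputed: each non-checkpoint layer is evaluated
    twice (forward pass + rematerialization), roughly one extra forward pass."""
    seg = math.ceil(math.sqrt(n))
    return (n - seg) / n
```

For a 32-layer chain this toy model gives 32 GB retained versus 12 GB with checkpointing, at the cost of recomputing roughly 80% of the layers; the savings reported in Figure 1 (30 GB reduced by 21 GB) instead come from a profile-based, per-layer cost model rather than uniform sizes.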
2 MOTIVATION

Memory consumption during training consists of (a) intermediate features, or activations, whose size depends on input dimensions and (b) parameters and their gradients whose size depends on weight dimensions. Given that inputs are often several orders of magnitude larger than kernels, most memory is used by features, as demonstrated in Figure 3.

Frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017; 2019) store all activations during the forward pass. Gradients are backpropagated from the loss node, and each activation is freed after its gradient has been calculated. In Figure 1, we compare this memory intensive policy and a rematerialization strategy for a real neural network. Memory usage is significantly reduced by deallocating some activations in the forward pass and recomputing them in the backward pass. Our goal is to fit an arbitrary network within our memory budget while incurring the minimal additional runtime penalty from recomputation.

Most prior work assumes networks have linear graphs. For example, Chen et al. (2016b) divides the computation into sqrt(n) segments, each with sqrt(n) nodes. Each segment endpoint is stored during the forward pass. During the backward pass, segments are recomputed in reverse order at O(n) cost. Linear graph assumptions limit applicability of prior work. For example, the popular ResNet50 (He et al., 2016) requires each residual block to be treated as a single node, leading to inefficient solutions. For other networks with larger skip connections (e.g., U-Net (Ronneberger et al., 2015)), the vast majority of the graph is incompatible. Prior work also assumes all layers are equally expensive to recompute. In the VGG19 (Simonyan & Zisserman, 2014) architecture, the largest layer is seven orders of magnitude more expensive than the smallest layer.

Our work makes few assumptions on neural network graphs. We explore a solution space that allows for (a) arbitrary graphs with several inputs and outputs for each node, (b) variable memory costs across layers and (c) variable computation costs for each layer (such as FLOPs or profiled runtimes). We constrain solutions to simply be correct (a node's dependencies must be materialized before it can be evaluated) and within the RAM budget (at any point during execution, resident tensors must fit into RAM).

To find solutions to this generalized problem, we find solutions that minimize the amount of time it takes to perform a single training iteration, subject to the correctness and memory constraints outlined above. We project schedules into space and time, allowing us to cast the objective as a linear expression. This problem can then be solved using off-the-shelf mixed integer linear program solvers such as GLPK or COIN-OR Branch-and-Cut (Forrest et al., 2019). An optimal solution to the MILP will minimize the amount of additional compute cost within the memory budget.

3 RELATED WORK

We categorize related work as checkpointing, reversible networks, distributed computation, and activation compression.

Checkpointing and rematerialization. Chen et al. (2016b) propose a heuristic for checkpointing idealized unit-cost linear n-layer graphs with O(sqrt(n)) memory usage. Griewank & Walther (2000) checkpoint similar linear unit-cost graphs with O(log n) memory usage and prove optimality for linear chain graphs with unit per-node cost and memory. In practice, DNN layers vary significantly in memory usage and computational cost (Sze et al., 2017), so these heuristics are not optimal in practice. Chen et al. (2016b) also develop a greedy algorithm that checkpoints layers of a network in roughly memory equal segments, with a hyperparameter b for the size of such segments. Still, neither procedure is cost-aware nor deallocates checkpoints when possible. Gruslys et al. (2016) develop a dynamic programming algorithm for checkpoint selection in unrolled recurrent neural network training, exploiting their linear forward graphs. Feng & Huang (2018) provide a dynamic program to select checkpoints that partition branching networks but ignore layer costs and memory usage. Siskind & Pearlmutter (2018a) develop a divide-and-conquer strategy in programs. Beaumont et al. (2019) use dynamic programming for checkpoint selection in a specific architecture with joining sub-networks.

Intermediate value recomputation is also common in register allocation. Compiler backends lower an intermediate representation of code to an architecture-specific executable binary. During lowering, an abstract static single assignment (SSA) graph of values and operations (Rosen et al., 1988; Cytron et al., 1991) is concretized by mapping values to a finite number of registers. If insufficient registers are available for an SSA form computation graph, values are spilled to main memory by storing and later loading the value. Register allocation has been formulated as a graph coloring problem (Chaitin et al., 1981), integer program (Goodwin & Wilken, 1996; Lozano et al., 2018), and network flow (Koes & Goldstein, 2006). Register allocators may recompute constants and values with register-resident dependencies if the cost of doing so is less than the cost of a spill (Chaitin et al., 1981; Briggs et al., 1992; Punjani, 2004). While similar to our setup, register rematerialization is limited to exceptional values that can be recomputed in a single instruction with dependencies already in registers. For example, memory offset computations can be cheaply recomputed, and loads of constants can be statically resolved. In contrast, Checkmate can recompute entire subgraphs of the program's data-flow.

During the evaluation of a single kernel, GPUs spill per-thread registers to a thread-local region of global memory (i.e. local memory) (Micikevicius, 2011; NVIDIA, 2017). NN training executes DAGs of kernels and stores intermediate values in shared global memory. This produces a high range of value sizes, from 4 byte floats to gigabyte tensors, whereas CPU and GPU registers range from 1 to 64 bytes. Our problem of inter-kernel memory scheduling thus differs in scale from the classical problem of register allocation within a kernel or program. Rematerialization is more appropriate than copying values out of core as the cost of spilling values from global GPU memory to main memory (RAM) is substantial (Micikevicius, 2011; Jain et al., 2018), though possible (Meng et al., 2017).

Reversible networks. Gomez et al. (2017) propose a reversible (approximately invertible) residual DNN architecture, where intermediate temporary values can be recomputed from values derived later in the standard forward computation. Reversibility allows forward pass activations to be recomputed during the backward pass rather than stored, similar to gradient checkpointing. Bulo et al. (2018) replace only ReLU and batch normalization layers with invertible variants, reconstructing their inputs during the backward pass, reducing memory usage up to 50%. However, this approach has a limit to memory savings, and does not support a range of budgets. Reversibility is not yet widely used to save memory, but is a promising complementary approach.

Distributed computation. An orthogonal approach to address the limited memory problem is distributed-memory computations and gradient accumulation. However, model parallelism requires access to additional expensive compute accelerators, fast networks, and non-trivial partitioning of model state to balance communication and computation (Gholami et al., 2018; Jia et al., 2018b; McCandlish et al.). Gradient accumulation enables larger batch sizes by computing the gradients in sub-batches across a minibatch. However, gradient accumulation often degrades performance as batch normalization performs poorly on small minibatch sizes (Wu & He, 2018; Ioffe & Szegedy, 2015).

Activation compression. In some DNN applications, it is possible to process compressed representations with minimal accuracy loss. Gueguen et al. (2018) classify discrete cosine transforms of JPEG images rather than raw images. Jain et al. (2018) quantize activations, cutting memory usage in half. Compression reduces memory usage by a constant factor, but reduces accuracy. Our approach is mathematically equivalent and incurs no accuracy penalty.

METHOD | DESCRIPTION | GENERAL GRAPHS | COST AWARE | MEMORY AWARE
Checkpoint all (Ideal) | No rematerialization. Default in deep learning frameworks. | yes | no | no
Griewank et al. log n | Griewank & Walther (2000) REVOLVE procedure | no | no | no
Chen et al. sqrt(n) | Chen et al. (2016b) checkpointing heuristic | no | no | no
Chen et al. greedy | Chen et al. (2016b), with search over parameter b | no | no | partial
AP sqrt(n) | Chen et al. sqrt(n) on articulation points + optimal R solve | partial | no | no
AP greedy | Chen et al. greedy on articulation points + optimal R solve | partial | no | partial
Linearized sqrt(n) | Chen et al. sqrt(n) on topological sort + optimal R solve | yes | no | no
Linearized greedy | Chen et al. greedy on topological sort + optimal R solve | yes | no | partial
Checkmate ILP | Our ILP as formulated in Section 4 | yes | yes | yes
Checkmate approx. | Our LP rounding approximation algorithm (Section 5) | yes | yes | yes

Table 1. Rematerialization baselines and our extensions to make them applicable to non-linear architectures.

4 OPTIMAL REMATERIALIZATION

In this section, we develop an optimal solver that schedules computation and garbage collection during the evaluation of general data-flow graphs including those used in neural network training. Our proposed scheduler minimizes computation or execution time while guaranteeing that the schedule will not exceed device memory limitations. The rematerialization problem is formulated as a mixed integer linear program (MILP) that can be solved with standard commercial or open-source solvers.

Our search space generalizes prior checkpointing strategies (Griewank & Walther, 2000; Chen et al., 2016b; Gruslys et al., 2016; Siskind & Pearlmutter, 2018b; Feng & Huang, 2018), as values can be retained and deallocated many times, but comes at the cost of O(Tn) decision variables. To trade-off the number of decision variables and schedule flexibility, we limit T to T = n. This allows for O(n^2) operations and constant memory in linear graphs.
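The resulting decision space can be made concrete with a brute-force toy. The sketch below enumerates stage-wise boolean recompute/checkpoint matrices for a hypothetical four-node chain with one skip edge; the costs, sizes, and the simplified per-stage memory model are illustrative assumptions, not the paper's formulation, which delegates this search to an MILP solver:

```python
# Brute-force toy of the rematerialization search space: stage-wise boolean
# matrices R (recompute) and S (checkpoint) over a hypothetical 4-node chain
# with one long skip edge. Costs, sizes, and the simplified per-stage memory
# model are illustrative assumptions; Checkmate delegates this search to an
# MILP solver rather than enumeration.
from itertools import product

n = 4
edges = [(1, 2), (2, 3), (3, 4), (1, 4)]   # v4 also depends on v1 (skip edge)
C = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}       # compute cost per node
M = {1: 1.0, 2: 2.0, 3: 1.0, 4: 1.0}       # output size per node (v2 is large)

def feasible(R, S, budget):
    for t in range(1, n + 1):
        # dependency constraint: inputs of computed nodes must be resident
        for (i, j) in edges:
            if R[t][j] and not (R[t][i] or S[t][i]):
                return False
        # checkpoint validity: only values materialized in stage t-1 survive
        if t > 1 and any(S[t][i] and not (R[t - 1][i] or S[t - 1][i])
                         for i in range(1, n + 1)):
            return False
        # simplified memory model: everything touched in stage t is resident
        if sum(M[i] for i in range(1, n + 1) if R[t][i] or S[t][i]) > budget:
            return False
    return True

def best_cost(budget):
    """Minimum total compute cost over all feasible (R, S) schedules."""
    best = None
    # frontier-advancing stages: R[t][t] = 1 and lower-triangular R, S
    free_R = [(t, i) for t in range(2, n + 1) for i in range(1, t)]
    free_S = [(t, i) for t in range(2, n + 1) for i in range(1, t)]
    for r_bits in product([0, 1], repeat=len(free_R)):
        for s_bits in product([0, 1], repeat=len(free_S)):
            R = {t: {i: int(i == t) for i in range(1, n + 1)}
                 for t in range(1, n + 1)}
            S = {t: {i: 0 for i in range(1, n + 1)} for t in range(1, n + 1)}
            for (t, i), b in zip(free_R, r_bits):
                R[t][i] = b
            for (t, i), b in zip(free_S, s_bits):
                S[t][i] = b
            if feasible(R, S, budget):
                cost = sum(C[i] * R[t][i] for t in R for i in R[t])
                best = cost if best is None else min(best, cost)
    return best
```

With ample memory (budget 4) every node is computed once (cost 4); tightening the budget to 3 forces v1 to be recomputed for the skip edge rather than held resident (cost 5); budget 2 is infeasible. This is precisely the compute-for-memory trade the MILP optimizes, at a scale where enumeration is impossible.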
Even without a memory constraint,\r\n commercial or open-source solvers. our solver must ensure that checkpointed and computed op-\r\n erations have dependencies resident in memory. Minimizing\r\n 4.1 Problemde\ufb01nition the total cost of computation across stages with dependency\r\n Acomputation or data-\ufb02ow graph G = (V,E) is a directed constraints yields objective (1a):\r\n acyclic graph with n nodes V = {v ,...,v }thatrepresent n t\r\n 1 n arg min XXCR (1a)\r\n operations yielding values (e.g. tensors). Edges represent i t,i\r\n dependencies between operators, such as layer inputs in a R,S t=1 i=1\r\n neural network. Nodes are numbered according to a topo- subject to\r\n logical order, such that operation v may only depend on\r\n j R \u2264R +S \u2200t \u2200(v ,v ) \u2208 E, (1b)\r\n the results of operations v . t,j t,i t,i i j\r\n i k . By the \ufb01nal factor\r\n U =M +2M + MS (2) 2 1\r\n t,0 input param i t,i in (5), we have FREEt,i,k \u2264 1\u2212Rt,k = 0, which is a\r\n | {z } | {z } 1 2\r\n Constant overhead i=1 Checkpoints contradiction.\r\n Suppose U bytes of memory are in use after evaluating\r\n t,k 4.5 Linear reformulation of memory constraint\r\n v . Before evaluating v , v and dependencies (parents)\r\n k k+1 k While the recurrence (2-3) de\ufb01ning U is linear, the right\r\n of vk may be deallocated if there are no future uses. Then,\r\n an output tensor for the result of v is allocated, consum- hand size of (5) is a polynomial. To express FREE in our\r\n k+1 ILP, it must be de\ufb01ned via linear constraints. We rely on\r\n ing memory M . 
The timeline is depicted in Figure 4,\r\n k+1 Lemma4.1and4.2toreformulate(5)into a tractable form.\r\n yielding recurrence (3):\r\n Lemma4.1(LinearReformulationofBinaryPolynomial).\r\n U =U \u2212memfreed (v )+R M , (3)\r\n t,k+1 t,k t k t,k+1 k+1 If x1,...,xn \u2208 {0,1}, then\r\n n ( P\r\n where mem freed (v ) is the amount of memory freed by Y 1 n (1\u2212x)=0\r\n t k x = i=1 i\r\n deallocating v and its parents at stage t. Let i\r\n k 0 otherwise\r\n i=1\r\n DEPS[k] = {i : (v ,v ) \u2208 E}, and\r\n i k\r\n P\r\n USERS[i] = {j : (v ,v ) \u2208 E} n\r\n i j Proof. If all x ,...,x =1,then (1 \u2212x ) = 0 and\r\n 1 n i=1 i\r\n we have \u03a0n x = 1. If otherwise any x = 0, then we\r\n denote parents and children of a node, respectively. Then, i=1 i j\r\n have \u03a0n xi = 0, as desired. This can also be seen as an\r\n in terms of auxiliary variable FREE , for (v ,v ) \u2208 E, i=1\r\n t,i,k i k application of De Morgan\u2019slawsforbooleanarithmetic.\r\n mem freed (v ) = P M \u2217FREE , and (4)\r\n t k i\u2208DEPS[k] i t,i,k Lemma4.2(LinearReformulationofIndicatorConstraints).\r\n \u222a{k} Y Given 0 \u2264 y \u2264 \u03ba where y is integral and \u03ba is a constant\r\n FREE =R \u2217(1\u2212S ) (1 \u2212R ) (5) upper bound on y, then\r\n t,i,k t,k t+1,i t,j\r\n | {z }j\u2208USERS[i]| {z } (\r\n Notcheckpoint j>k Notdep.\r\n x= 1 y=0\r\n 1While gradients can be deleted after updating parameters, we 0 otherwise\r\n reserve constant space since many parameter optimizers such as\r\n SGDwithmomentummaintaingradientstatistics. if and only if x \u2208 {0,1} and (1 \u2212 x) \u2264 y \u2264 \u03ba(1 \u2212 x).\r\n Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization\r\n Proof. For the forward direction, \ufb01rst note that by con- 4.7 CompleteInteger Linear Program formulation\r\n struction, x \u2208 {0,1}. If y = 0 and x = 1, then The complete memory constrained MILP follows in (9),\r\n (1 \u2212 x) = 0 \u2264 y \u2264 0 = \u03ba(1 \u2212 x). 
Similarly, if y \u2265 1 with O(|V||E|) variables and constraints.\r\n and x = 0, then 1 \u2264 y \u2264 \u03ba, which is true since 0 \u2264 y \u2264 \u03ba n t\r\n and y is integral. The converse holds similarly. XX\r\n arg min CR\r\n R,S,U,FREE i t,i\r\n Toreformulate Constraint 5, let num hazards(t,i,k) be the t=1 i=1\r\n numberofzerofactors on the RHS of the constraint. This is subject to (1b),(1c),(1f),(2),(3), (9)\r\n a linear function of the decision variables, (7a),(7b),(7c),(8a),(8b),(8c),\r\n X U \u2264M\r\n num hazards(t,i,k) = (1\u2212Rt,k)+St+1,i+ Rt,j t,k budget\r\n j\u2208USERS[i] 4.8 Constraints implied by optimality\r\n j>k\r\n Applying Lemma4.1 to the polynomial constraint, we have, Problem 9 can be simpli\ufb01ed by removing constraints im-\r\n ( plied by optimality of a solution. FREEt,k,k = 1 only if\r\n 1 num hazards(t,i,k) = 0 operation k is spuriously evaluated with no uses of the re-\r\n FREEt,i,k = (6) sult. Hence, the solver can set R = 0 to reduce cost.\r\n 0 otherwise t,k\r\n 2\r\n We eliminate |V| variables FREEt,k,k, assumed to be 0,\r\n By Lemma 4.2, if \u03ba is the maximum value that by modifying (4) to only sum over i \u2208 DEPS[k]. These\r\n num hazards(t,i,k) can assume, the following constraints variables can be computed inexpensively after solving.\r\n are equivalent to (6),\r\n 4.9 Generating an execution plan\r\n FREEt,i,k \u2208 {0,1} (7a) Given a feasible solution to (9), (R,S,U,FREE), Algo-\r\n 1\u2212FREEt,i,k \u2264 num hazards(t,i,k) (7b) rithm 1 generates an execution plan via a row major scan\r\n \u03ba(1\u2212FREEt,i,k) \u2265 num hazards(t,i,k) (7c) of R and S with deallocations determined by FREE. An\r\n execution plan is a program P = (s1,...,sk) with k state-\r\n 4.6 Tractability via frontier-advancing stages ments. When statement %r = compute v is interpreted,\r\n operation v is evaluated. 
The symbol %r denotes a vir-\r\n Fixing the execution order of nodes in the graph can im- tual register used to track the resulting value. Statement\r\n prove the running time of the algorithm. In eager-execution deallocate %rmarksthevaluetrackedbyvirtualreg-\r\n frameworks such as PyTorch, the order is given by user ister %r for garbage collection.\r\n code and operations are executed serially. Separating order- The execution plan generated by Algorithm 1 is further\r\n ing and allocation is common in compiler design, and both optimized by moving deallocations earlier in the plan when\r\n LLVM(Lattner, 2002) and GCC (Olesen, 2011) have sepa- possible. Spuriouscheckpointsthatareunusedinastagecan\r\n rate instruction scheduling and register allocation passes. be deallocated at the start of the stage rather than during the\r\n Anytopological order of the nodes is a possible execution stage. Still, this code motion is unnecessary for feasibility\r\n order. Given a topological order, such as the one intro- as the solver guarantees that the unoptimized schedule will\r\n duced in Section 4.1, we partition the schedule into frontier- not exceed the desired memory budget.\r\n advancing stages such that node v is evaluated for the \ufb01rst\r\n i Theexecution plan can either be interpreted during training,\r\n time in stage i. We replace constraints (1d, 1e) that ensure or encoded as a static computation graph. 
In this work, we\r\n the last node is computed with stricter constraints (8a-8c), \u2032 \u2032 \u2032\r\n generate a static graph G = (V ,E ) from the plan, which\r\n R =1 \u2200i (frontier-advancing partitions) (8a) is executed by a numerical machine learning framework.\r\n P i,i See Section 6.2 for implementation details.\r\n i\u2265t St,i = 0 (lower tri., no initial checkpoints) (8b)\r\n P R =0 (lowertriangular) (8c)\r\n i>t t,i 4.10 Cost model\r\n This reduces the feasible set, constraining the search space To estimate the runtime of a training iteration under a re-\r\n and improving running time. For an 8 layer (n = 17) materialization plan, we apply an additive cost model (1a),\r\n linear graph neural network with unit C ,M at a memory incurring cost C when node v is evaluated. Costs are de-\r\n i i i i\r\n budget of 4, Gurobi optimizes the unpartitioned MILP in termined prior to MILP construction by pro\ufb01ling network\r\n 9.4 hours and the partitioned MILP in 0.23 seconds to the layers on target hardware with random inputs across a range\r\n sameobjective. In Appendix A, we analyze the integrality of batch sizes and input shapes, and exclude static graph\r\n gap of both forms of the problem to understand the speedup. construction and input generation time. As neural network\r\n Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization\r\n Algorithm 1 Generate execution plan to solve. For DenseNet161 (Huang et al., 2017), no feasible\r\n Input: graph G = (V,E), feasible (R,S,FREE) solution was found within one day.\r\n Output: execution plan P = (s1,...,sk) For many classical NP-hard problems, approximation al-\r\n Initialize REGS[1...|V |] = \u22121, r = 0, P = (). gorithms give solutions close to optimal with polynomial\r\n for t = 1 to |V | do runtime. We review a linear program that produces frac-\r\n for k = 1 to |V | do tional solutions in polynomial time in Section 5.1. 
Using\r\n if R then\r\n t,k the fractional solutions, we present a two-phase rounding\r\n // Materialize vk algorithm in Section 5.2 that rounds a subset of the decision\r\n add %r = compute v toP\r\n k variables, then \ufb01nds a minimum cost, feasible setting of the\r\n REGS[k] = r remaining variables to \ufb01nd near-optimal integral solutions.\r\n r = r +1\r\n endif 5.1 Relaxing integrality constraints\r\n // Free v anddependencies\r\n k\r\n for i \u2208 DEPS[k] \u222a {k} do By relaxing integrality constraints (1f), the problem be-\r\n if FREEt,i,k then comestrivial to solve as it is a linear program over continu-\r\n add deallocate %REGS[i]toP ousvariables. It is well known that an LP is solvable in poly-\r\n endif nomial time via Karmarkar\u2019s algorithm (Karmarkar, 1984)\r\n endfor or barrier methods (Nesterov & Nemirovskii, 1994). With\r\n endfor relaxation R,S,FREE \u2208 [0,1], the objective (1a) de\ufb01nes a\r\n endfor lower-bound for the cost of the optimal integral solution.\r\n return P Rounding is a common approach to \ufb01nd approximate inte-\r\n gral solutions given the result of an LP relaxation. For exam-\r\n ple, one can achieve a 3-approximation for MAX SAT (Yan-\r\n operations consist of dense numerical kernels such as matrix 4\r\n nakakis, 1994) via a simple combination of randomized\r\n multiplication, these runtimes are low variance and largely \u0002 int \u0003 \u2217\r\n rounding (Pr x =1 =xi)anddeterministicrounding\r\n independent of the speci\ufb01c input data (Jia et al., 2018a; Si- int \u2217 i\r\n (x =1ifxi \u2265p,wherecommonlyp=0.5).\r\n vathanu et al., 2019). However, forward pass time per batch i\r\n item decreases with increasing batch size due to improved Weattempt to round the fractional solution R\u2217,S\u2217 using\r\n data parallelism (Canziani et al., 2016), so it is important to these two strategies, and then apply Algorithm 1 to Rint,Sint.\r\n compute costs with appropriate input dimensions. 
However, direct application of deterministic rounding returns infeasible results: the rounded solution violates constraints. Randomized rounding may show more promise, as a single relaxed solution can be used to sample many integral solutions, some of which are hopefully feasible. Unfortunately, using randomized rounding with the LP relaxation for VGG16 at a 4× smaller budget than the default, we could not find a single feasible solution out of 50,000 samples.

The memory consumption of each value in the data-flow graph is computed statically, as input and output sizes are known. Values are dense, multi-dimensional tensors stored at 4-byte floating-point precision. The computed consumption M_i is used to construct the memory constraints (2-3).

5 APPROXIMATION

Many of our benchmark problem instances are tractable to solve using off-the-shelf integer linear program solvers, with practical solve times ranging from seconds to an hour. ILP results in this paper are obtained with a 1 hour time limit on a computer with at least 24 cores. Relative to training time, e.g. 21 days for the BERT model (Devlin et al., 2018), solving the ILP adds less than a percent of runtime overhead. While COTS solvers such as COIN-OR (Forrest et al., 2019) leverage methods like branch-and-bound to aggressively prune the decision space, they can take superpolynomial time in the worst case, and solving ILPs is NP-hard in general. In the worst case, for neural network architectures with hundreds of layers, it is not feasible to solve the rematerialization problem via our ILP. An instance of the VGG16 architecture (Simonyan & Zisserman, 2014) takes seconds to solve.

5.2 A two-phase rounding strategy

To find feasible solutions, we introduce two-phase rounding, detailed in Algorithm 2. Two-phase rounding is applicable when a subset of variables can be solved in polynomial time given the remaining variables. Our approximation algorithm only rounds the checkpoint matrix S^*. Given S^*, we solve for the conditionally optimal binary computation matrix R^int by setting as few values to 1 as possible.
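This two-phase scheme can be rendered compactly in Python; a sketch of Algorithm 2's correction passes, assuming vertices are numbered in topological order (so i < j for every edge (i, j)) and omitting the FREE simulation:

```python
import numpy as np

def two_phase_round(s_frac, edges, n):
    """Round the checkpoint matrix S, then derive the fewest
    recomputations R needed to restore correctness."""
    S = (s_frac > 0.5).astype(int)   # phase 1: deterministic rounding of S*
    R = np.eye(n, dtype=int)         # each stage computes its frontier node
    # Correct (1c): a stage-t checkpoint must have been computed or
    # checkpointed at stage t-1.
    for t in range(1, n):
        for i in range(n):
            if S[t, i] > R[t - 1, i] + S[t - 1, i]:
                R[t - 1, i] = 1
    # Correct (1b): each dependency v_i of a computed node v_j must be
    # resident in the same stage. Scanning edges by decreasing target
    # handles cascading violations in a single pass.
    for t in range(n):
        for i, j in sorted(edges, key=lambda e: -e[1]):
            if R[t, j] > R[t, i] + S[t, i]:
                R[t, i] = 1
    return S, R
```

On a three-node chain where the LP only requests a checkpoint of the first node at the last stage, the passes insert the recomputations needed to honor it.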
Algorithm 2 begins with an all-zero matrix R^int = 0, then iteratively corrects violated correctness constraints. Note that once some R^int_{i,j} is set to 1 during any of these steps, the variable is never changed. Algorithm 2 corrects constraints in a particular order, so that constraints that are satisfied continue to be satisfied as other violated constraints are corrected.

Algorithm 2 Two-phase rounding
  Input: fractional checkpoint matrix S^* from LP
  Output: binary S^int, R^int, FREE
  Round S^* deterministically: S^int_{t,i} <- 1[S^*_{t,i} > 0.5]
  R^int <- I_n, thereby satisfying (8a)
  while exists t >= 2, i in [n] such that S^int_{t,i} > R^int_{t-1,i} + S^int_{t-1,i}, i.e. (1c) violated, do
    Compute v_i to materialize the checkpoint: R^int_{t-1,i} <- 1
  end while
  while exists t >= 1, (i,j) in E such that R^int_{t,j} > R^int_{t,i} + S^int_{t,i}, i.e. (1b) violated, do
    Compute v_i as a temporary for the dependency: R^int_{t,i} <- 1
  end while
  Evaluate FREE by simulating execution
  return S^int, R^int, FREE

The matrix R^int generated by this rounding scheme is optimal up to the choice of S^int, as every entry in R^int is set to 1 if and only if it is necessary to satisfy a constraint. In implementation, we detect and correct violations of (1b) in reverse topological order for each stage, scanning the R^int, S^int matrices from right to left.

6.1 Baselines and generalizations

Table 1 summarizes the baseline rematerialization strategies. The nominal evaluation strategy stores all features generated during the forward pass for use during the backward pass; this is the default in frameworks such as TensorFlow. Hence, every layer is computed once. We refer to this baseline as Checkpoint all, an ideal approach given ample memory.

On linear graph architectures, such as VGG16 and MobileNet (v1), we directly apply prior work from Griewank & Walther (2000) and Chen et al. (2016b), baselines referred to as Griewank & Walther log n, Chen et al. √n, and Chen et al. greedy. To build a tradeoff curve of computation versus memory budget, we search over the segment-size hyperparameter b in the greedy strategy. However, these baselines cannot be used for modern architectures with residual connections. For a fair comparison, we extend the √n and greedy algorithms to apply to general computation graphs with residual connections or branching structure (e.g. ResNet50 and U-Net). Chen et al. (2016b) suggest manually annotating good checkpointing candidates in a computation graph.

5.3 Memory budget feasibility

Since we approximate S by rounding the fractional solution, S^int, R^int can be infeasible with respect to the budget constraint U_{t,k} ≤ M_budget.
While the fractional solution may come under the budget, and two-phase rounding preserves the correctness constraints, the rounding procedure makes no attempt to maintain budget feasibility. Therefore, we leave an allowance on the total memory budget constraint, U_{t,k} ≤ (1 - ε) M_budget. We empirically find ε = 0.1 to work well.

6 EVALUATION

In this section, we investigate the impact of tensor rematerialization on the cost and memory usage of DNN training. We study the following experimental questions: (1) What is the trade-off between memory usage and computational overhead when using rematerialization? (2) Are large inputs practical with rematerialization? (3) How well can we approximate the optimal rematerialization policy?

For the first extensions, denoted AP √n and AP greedy, we automatically identify articulation points, or cut vertices: vertices that disconnect the forward-pass DAG if removed. We use these as checkpointing candidates. The heuristics then select a subset of these candidates, and we work backwards from the checkpoints to identify which nodes require recomputation. Still, some networks have few articulation points, including U-Net. We therefore also extend the heuristics by treating the original graph as a linear network, with nodes connected in topological order, again backing out the minimal recomputations from the selected checkpoints. These extensions are referred to as Linearized √n and Linearized greedy. Sections B.1 and B.2 provide more details on our generalizations. Note that all proposed generalizations exactly
reproduce the original heuristics on linear networks.

We compare our proposed solver against baseline heuristics on representative image classification and high-resolution semantic segmentation models, including VGG16, VGG19, ResNet50, MobileNet, U-Net, FCN with VGG layers, and SegNet. As prior work is largely limited to linear graphs, we propose novel extensions where necessary for comparison. Results show that optimal rematerialization allows significantly lower computational overhead than the baselines at all memory budgets, and lower memory usage than previously possible. As a consequence, optimal rematerialization allows training with larger input sizes than previously possible, up to 5.1× higher batch sizes on the same accelerator. Finally, we find that our two-phase rounding approximation algorithm finds near-optimal solutions in polynomial time.

6.2 Evaluation setup

Checkmate is implemented in TensorFlow 2.0 (Abadi et al., 2016), accepting user-defined models expressed via the high-level Keras interface. We extract the forward and backward computation graph, then construct and solve optimization problem (9) with the Gurobi mathematical programming library as an integer linear program. Finally, Checkmate translates solutions into execution plans and constructs a new static training graph. Together, these components form the Checkmate system, illustrated in Figure 2.

[Figure 5: computational overhead (×) versus memory budget (GB) for VGG16 (batch size 256, 224×224), MobileNet (batch size 512, 224×224), and U-Net (batch size 32, 416×608), comparing Chen et al. greedy and √n (in linearized and AP adaptations), Griewank & Walther, Checkpoint all (ideal), and Checkmate (proposed).]

Figure 5. Computational overhead versus memory budget for (a) the VGG16 image classification NN (Simonyan & Zisserman, 2014), (b) the MobileNet image classification NN, and (c) the U-Net semantic segmentation NN (Ronneberger et al., 2015).
Overhead is with respect to the best possible strategy without a memory restriction, based on a profile-based cost model of a single NVIDIA V100 GPU. For U-Net (c), at the 16 GB V100 memory budget, we achieve a 1.20× speedup over the best baseline (linearized greedy) and a 1.38× speedup over the next best (linearized √n). Takeaway: our model- and hardware-aware solver produces in-budget solutions with the lowest overhead on linear networks (a-b), and dramatically lowers memory consumption and overhead on complex architectures (c).

Takeaways: For all three DNNs, Checkmate produces clearly faster execution plans than the algorithms proposed by Chen et al. (2016b) and Griewank & Walther (2000): over 1.2× faster than the next best on U-Net at the NVIDIA V100 memory budget. Our framework allows training a U-Net at a batch size of 32 images per GPU with less than 10% higher overhead. This would require 23 GB of memory without rematerialization, or with the original baselines without our generalizations.

To accelerate problem construction, the decision variables R and S are expressed as lower triangular matrices, as are the accounting variables U. FREE is represented as a |V| × |E| matrix. Except for our maximum batch size experiments, solutions are generated with a user-configurable time limit of 3600 seconds, though the majority of problems solve within minutes. Problems with exceptionally large batch sizes or heavily constrained memory budgets may reach this time limit while the solver attempts to prove that the problem is infeasible. The cost of a solution is measured with a profile-based cost model and compared to the (perhaps unachievable) cost with no recomputation (Section 4.10).

6.3 What is the trade-off between memory usage and computational overhead?
Figure 5 compares rematerialization strategies on VGG16, MobileNet, and U-Net. The y-axis shows the computational overhead of checkpointing in terms of time, as compared to baseline. The time is computed by profiling each individual layer of the network. The x-axis shows the total memory budget required to run each model with the specified batch size, computed for single-precision training. Except for the √n heuristics, each rematerialization algorithm has a knob to trade off the amount of recomputation and memory usage, where a smaller memory budget leads to higher overhead.

The feasible set of our optimal ILP formulation is a strict superset of those of the baseline heuristics. We implement each baseline as a static policy for the decision variable S, and then solve for the lowest-cost recomputation schedule using a procedure similar to that described in Algorithm 2.

6.4 Are large inputs practical with rematerialization?

The maximum batch size enabled by different rematerialization strategies is shown in Figure 6. The y-axis shows the theoretical maximum batch size we could feasibly train with bounded compute cost. This is calculated by enforcing that the total cost must be less than the cost of performing just one additional forward pass; that is, in Figure 6 the cost is at most one additional forward pass higher, provided the specified batch size fits in GPU memory. We reformulate Problem (9) to maximize a batch size variable B ∈ N, subject to modified memory constraints that use B * M_i in place of M_i and to an additional cost constraint:

  sum_{t=1}^{n} sum_{i=1}^{t} C_i R_{t,i} <= 2 sum_{v_i in G_fwd} C_i + sum_{v_i in G_bwd} C_i    (10)

The modified integer program has quadratic constraints and is difficult to solve. We set a time limit of one day for the experiment, but Gurobi may be unable to reach optimality within that limit.
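Constraint (10) caps a schedule's total compute at one extra forward pass. As a hedged illustration (a hypothetical helper over plain Python lists, not Checkmate's data structures), a candidate schedule R can be checked against it directly:

```python
def within_one_extra_forward(R, cost, fwd_nodes, bwd_nodes):
    # Left-hand side of (10): total cost of every evaluation the
    # (lower-triangular) schedule R performs across all stages.
    total = sum(cost[i] for t in range(len(R)) for i in range(t + 1) if R[t][i])
    # Right-hand side: two forward passes plus one backward pass.
    budget = 2 * sum(cost[i] for i in fwd_nodes) + sum(cost[i] for i in bwd_nodes)
    return total <= budget

# Four unit-cost nodes: two forward, two backward; no recomputation.
print(within_one_extra_forward(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    [1, 1, 1, 1], [0, 1], [2, 3]))  # True
```

In the reformulated program this inequality is a constraint on the decision variables rather than a post-hoc check.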
Figure 6 then provides a lower bound on the maximum batch size that Checkmate can achieve.

For a fair comparison on the non-linear graphs used in U-Net, FCN, and ResNet, we use the AP √n and Linearized greedy baseline generalizations described in Section 6.1.

[Figure 6: normalized maximum batch size (bars) for U-Net, FCN8, SegNet, VGG19, ResNet50, and MobileNet under Checkpoint all, AP √n, Linearized greedy, and Checkmate (ours); Checkmate reaches, e.g., batch size 61 on U-Net and 1105 on MobileNet.]

Figure 6. Maximum batch size possible on a single NVIDIA V100 GPU when using different generalized rematerialization strategies with at most a single extra forward pass. We enable increasing the batch size by up to 5.1× over the current practice of caching all activations (on MobileNet), and up to 1.73× over the best checkpointing scheme (on U-Net).

Table 2. Approximation ratios for baseline heuristics and our LP rounding strategy. Results are given as the geometric mean speedup of the optimal ILP across feasible budgets.

             Chen √n   Chen greedy   Griewank log n   Two-phase LP rounding
  MobileNet  1.14×     1.07×         7.07×            1.06×
  VGG16      1.28×     1.06×         1.44×            1.01×
  VGG19      1.54×     1.39×         1.75×            1.00×
  U-Net      1.27×     1.23×         -                1.03×
  ResNet50   1.20×     1.25×         -                1.05×

6.5 How well can we approximate the optimal rematerialization policy?

To understand how well our LP rounding strategy (Section 5) approximates the ILP, we measure the ratio COST_approx / COST_opt, i.e. the speedup of the optimal schedule, in FLOPs. As in Section 6.3, we solve each strategy at a range of memory budgets, then compute the geometric mean of the ratio across budgets. The aggregated ratio is used because some budgets are feasible via the ILP but not via the approximations. Table 2 shows the results.
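The aggregation can be computed as follows; an illustrative snippet with hypothetical per-budget cost lists, not the evaluation harness itself:

```python
import math

def geomean_ratio(cost_approx, cost_opt):
    # Geometric mean of per-budget ratios COST_approx / COST_opt,
    # taken over the budgets feasible for both strategies.
    ratios = [a / o for a, o in zip(cost_approx, cost_opt)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Two budgets with ratios 2.0 and 4.0 aggregate to sqrt(8) ~ 2.83.
print(round(geomean_ratio([2.0, 8.0], [1.0, 2.0]), 2))  # 2.83
```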
The two-phase deterministic rounding approach has approximation factors close to optimal, at most 1.06× for all tested architectures.

Let M_fixed = 2 M_param, as in (2), and let M_@1 be the memory a baseline strategy uses at batch size 1. The maximum baseline batch size is estimated with (11), where the minimization is taken with respect to the hyperparameters, if any:

  max B = floor( (16 GB - M_fixed) / (min M_@1 - M_fixed) )    (11)

Costs are measured in FLOPs, determined statically. The U-Net, FCN8, and SegNet semantic segmentation networks use a resolution of 416×608, and the classification networks ResNet50, VGG19, and MobileNet use a resolution of 224×224.

Takeaways: We can increase the batch size of U-Net to 61 at high resolution, an unprecedented result. For many tasks such as semantic segmentation, where U-Net is commonly used, it is not possible to use batch sizes greater than 16, depending on resolution. This is sub-optimal for batch normalization layers, and being able to increase the batch size by 3.8× (61 vs. 16 for a representative resolution) is quite significant. Orthogonal approaches to achieve this include model parallelism and distributed-memory batch normalization, which can be significantly more difficult to implement and have high communication costs. Furthermore, for MobileNet, Checkmate allows a batch size of 1105, which is 1.73× higher than the best baseline solution (a greedy heuristic) and 5.1× higher than common practice (checkpointing all activations). The same schedules can also be used to increase image resolution rather than batch size.

7 CONCLUSIONS

One of the main challenges when training large neural networks is the limited capacity of high-bandwidth memory on accelerators such as GPUs and TPUs. This has created a memory wall that limits the size of the models that can be trained. The bottleneck for state-of-the-art model development is now memory rather than data and compute availability, and we expect this trend to worsen in the future. To address this challenge, we proposed a novel rematerialization algorithm which allows large models to be trained with limited available memory. Our method does not make the strong assumptions required in prior work, supporting general non-linear computation graphs such as residual networks and capturing the impact of non-uniform memory usage and computation cost throughout the graph with a hardware-aware, profile-guided cost model. We presented an ILP formulation of the problem, implemented the Checkmate system for optimal rematerialization in TensorFlow, and tested the proposed system on a range of neural network models. In evaluation, we found that optimal rematerialization has minimal computational overhead at a wide range of memory budgets, and we showed that Checkmate enables practitioners to train high-resolution models with significantly larger batch sizes. Finally, a novel two-phase rounding strategy closely approximates the optimal solver.

ACKNOWLEDGEMENTS

We would like to thank Barna Saha and Laurent El Ghaoui for guidance on approximation, Mong H. Ng for help with evaluation, and the paper and artifact reviewers for their helpful suggestions. In addition to NSF CISE Expeditions Award CCF-1730628, this work is supported by gifts from Alibaba, Amazon Web Services, Ant Financial, Capital One, Ericsson, Facebook, Futurewei, Google, Intel, Microsoft, NVIDIA, Scotiabank, Splunk, and VMware. This work is also supported by the NSF GRFP under Grant No. DGE-1752814. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

REFERENCES

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. June 2016a. arXiv:1606.00915.

Chen, T., Xu, B., Zhang, C., and Guestrin, C.
Training Deep Nets with Sublinear Memory Cost. April 2016b. arXiv:1604.06174.

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. March 2016.

Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. Multi-view 3D Object Detection Network for Autonomous Driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6526-6534. IEEE, 2017.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating Long Sequences with Sparse Transformers. April 2019. arXiv:1904.10509.

Cytron, R., Ferrante, J., Rosen, B. K., Wegman, M. N., and Zadeck, F. K. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., 13(4):451-490, October 1991.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. January 2019. arXiv:1901.02860.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. October 2018. arXiv:1810.04805.

Beaumont, O., Herrmann, J., Pallez, G., and Shilova, A. Optimal memory-aware backpropagation of deep join networks. Research Report RR-9273, Inria, May 2019.

Briggs, P., Cooper, K. D., and Torczon, L. Rematerialization. In Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation, PLDI '92, pp. 311-321, New York, NY, USA, 1992.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Bulo, S. R., Porzi, L., and Kontschieder, P. In-place Activated BatchNorm for Memory-Optimized Training of DNNs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5639-5647. IEEE, June 2018.

Canziani, A., Paszke, A., and Culurciello, E. An Analysis of Deep Neural Network Models for Practical Applications. May 2016. arXiv:1605.07678.

Dong, C., Loy, C. C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295-307, February 2016.

Feng, J. and Huang, D. Cutting Down Training Memory by Re-forwarding. July 2018.

Forrest, J. J., Vigerske, S., Ralphs, T., Santos, H. G., Hafer, L., Kristjansson, B., Fasano, J., Straver, E., Lubin, M., rlougee, jpgoncal1, Gassmann, H. I., and Saltzman, M. COIN-OR Branch-and-Cut solver, June 2019.

Gholami, A., Azad, A., Jin, P., Keutzer, K., and Buluc, A. Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pp. 77-86. ACM, 2018.

GLPK. GNU Project - Free Software Foundation (FSF).

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The Reversible Residual Network: Backpropagation Without Storing Activations. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2214-2224. Curran Associates, Inc., 2017.

Chaitin, G. J., Auslander, M. A., Chandra, A. K., Cocke, J., Hopkins, M.
E., and Markstein, P. W. Register allocation via coloring. Computer Languages, 6(1):47-57, January 1981.

Goodwin, D. W. and Wilken, K. D. Optimal and Near-optimal Global Register Allocation Using 0-1 Integer Programming. Software: Practice and Experience, 26(8):929-965, 1996.

Griewank, A. and Walther, A. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software, 26(1):19-45, March 2000.

Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A. Memory-efficient Backpropagation Through Time. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pp. 4132-4140, USA, June 2016. Curran Associates Inc.

Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., and Yosinski, J. Faster Neural Networks Straight from JPEG. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 3933-3944. Curran Associates, Inc., 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Holder, L. Graph Algorithms: Applications, 2008.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning, February 2015.

Jia, Z., Zaharia, M., and Aiken, A. Beyond Data and Model Parallelism for Deep Neural Networks. SysML Conference, pp. 13, February 2018b.

Karmarkar, N. A new polynomial-time algorithm for linear programming. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pp. 302-311. ACM, 1984.

Kim, J., Lee, J. K., and Lee, K. M. Accurate image super-resolution using very deep convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1646-1654, June 2016. doi: 10.1109/CVPR.2016.182.

Koes, D. R. and Goldstein, S. C. A Global Progressive Register Allocator. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pp. 204-215, New York, NY, USA, 2006. ACM.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012.

Lattner, C. LLVM: An Infrastructure for Multi-Stage Optimization. Master's thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, December 2002.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. July 2019. arXiv:1907.11692.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient Data Encoding for Deep Neural Network Training. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA '18, pp. 776-789, Piscataway, NJ, USA, 2018. IEEE Press.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks. In International Conference on Machine Learning, pp. 2274-2283, July 2018a.

Lozano, R. C., Carlsson, M., Blindell, G. H., and Schulte, C. Combinatorial Register Allocation and Instruction Scheduling. April 2018. arXiv:1804.02452.

McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An Empirical Model of Large-Batch Training. arXiv:1812.06162.

Meng, C., Sun, M., Yang, J., Qiu, M., and Gu, Y. Training Deeper Models by GPU Memory Optimization on TensorFlow. pp. 8, December 2017.

Micikevicius, P. Local Memory and Register Spilling, 2011.

Nakata, I. On Compiling Algorithms for Arithmetic Expressions. Commun. ACM, 10(8):492-494, August 1967. ISSN 0001-0782. doi: 10.1145/363534.363549.

Nesterov, Y. and Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. SIAM, 1994.

NVIDIA. NVIDIA Tesla V100 GPU Architecture, August 2017. URL https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.

Olesen, J. S. Register Allocation in LLVM 3.0, November 2011.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Pohlen, T., Hermans, A., Mathias, M., and Leibe, B. Full-resolution residual networks for semantic segmentation in street scenes. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.

Punjani, M. Register Rematerialization in GCC. In GCC Developers' Summit, volume 2004. Citeseer, 2004.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Navab, N., Hornegger, J., Wells, W. M., and Frangi, A. F. (eds.), Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, Lecture Notes in Computer Science, pp. 234-241. Springer International Publishing, 2015. ISBN 978-3-319-24574-4.

Rosen, B. K., Wegman, M. N., and Zadeck, F. K. Global Value Numbers and Redundant Computations. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '88, pp. 12-27, New York, NY, USA, 1988. ACM.

Sethi, R. Complete Register Allocation Problems. pp. 14, April 1973.

Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. September 2014. arXiv:1409.1556.

Siskind, J. M. and Pearlmutter, B. A. Divide-and-conquer checkpointing for arbitrary programs with no user annotation. Optimization Methods and Software, 33(4-6):1288-1330, 2018a. doi: 10.1080/10556788.2018.1459621.

Siskind, J. M. and Pearlmutter, B. A. Divide-and-Conquer Checkpointing for Arbitrary Programs with No User Annotation. Optimization Methods and Software, 33(4-6):1288-1330, November 2018b.

Sivathanu, M., Chugh, T., Singapuram, S. S., and Zhou, L. Astra: Exploiting Predictability to Optimize Deep Learning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, pp. 909-923, Providence, RI, USA, 2019. ACM Press.

Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295-2329, 2017.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-9, June 2015. doi: 10.1109/CVPR.2015.7298594.

Tai, Y., Yang, J., and Liu, X. Image super-resolution via deep recursive residual network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2790-2798, July 2017. doi: 10.1109/CVPR.2017.298.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is All you Need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 5998-6008. Curran Associates, Inc., 2017.

Wu, Y. and He, K. Group Normalization. pp. 3-19, 2018.

Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500, 2017.

Yang, B., Liang, M., and Urtasun, R. HDNET: Exploiting HD Maps for 3D Object Detection. pp. 10, 2018.

Yannakakis, M. On the approximation of maximum satisfiability.
Journal of Algorithms, 17(3):475\u2013502, 1994.\r\n April 1973.\r\n Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization\r\n A INTEGRALITYGAP per row of R that \ufb01lls in entries when a needed value is not\r\n Tounderstand why the partitioned variant of the MILP (Sec- in memory by the same process described in Section 5.2.\r\n tion 4.2) is faster to solve via branch-and-bound, we can \u221a\r\n measure the integrality gap for particular problem instances. B.2 Linearized nandLinearizedgreedy\r\n Theintegrality gap is the maximum ratio between the opti- The forward graph of the DNN G =(V ,E ) can\r\n fwd fwd fwd\r\n malvalue of the ILP and its relaxation, de\ufb01ned as follows: be treated as a linear graph G =(V ,E )withedges\r\n lin fwd lin\r\n COST connecting consecutive vertices in a topological order:\r\n IG=max int ,\r\n I COSTfrac E ={(v ,v ),(v ,v ),...,(v , v )}\r\n lin 1 2 2 3 L\u22121 L\r\n where COSTint and COSTfrac are the optimal value While G does not properly encode data dependencies, it\r\n lin\r\n the ILP and that of its relaxation, respectively. I = is a linear graph that baselines can analyze. To extend a\r\n (G,C,M,M ) describes a problem instance. As our baseline, we apply it to G , generate checkpoint matrix S\r\n budget lin\r\n ILP is a minimization problem, COSTint \u2265 COSTfrac for from the resulting checkpoint set, and \ufb01nd the optimal R as\r\n all I, and IG \u2265 1. While it is not possible to measure with the AP baselines.\r\n the ratio between the ILP and LP solutions for all problem\r\n instances, the ratio for any particular problem instance gives C HARDNESSOFREMATERIALIZATION\r\n a lower bound on the integrality gap.\r\n For the 8-layer linear neural network graph discussed in Sethi (1973) reduced 3-SAT to a decision problem based on\r\n Section 4.2, frontier-advancement reduces the integrality register allocation in straight line programs, with no recom-\r\n gapfrom21.56to1.18,i.e. 
the LP relaxation is signi\ufb01cantly putation permitted. Such programs can be represented by\r\n tighter. In branch-and-bound algorithms for ILP optimiz- result-rooted Directed Ayclic Graphs (DAGs), with nodes\r\n tion, a subset of feasible solutions can be pruned if the LP corresponding to operations and edges labeled by values.\r\n relaxation over the subset yields an objective higher than In Sethi\u2019s graphs, the desired results are the roots of the\r\n the best integer solution found thus far. With a tight LP DAG.Ifaprogramhasnocommonsubexpressions,i.e. the\r\n relaxation, this condition for pruning is often met, so fewer graph forms a tree, optimal allocation is possible via a lin-\r\n solutions need to be enumerated. ear time tree traversal (Nakata, 1967). However, Sethi\u2019s\r\n reduction shows a register allocation decision problem in\r\n B GENERALIZATIONSOFPRIORWORK the general case\u2014whether a result-rooted DAG can be com-\r\n puted with fewer than k registers without recomputation\u2014is\r\n B.1 AP\u221anandAPgreedy NP-complete.\r\n Weidentify Articulation Points (AP) in the undirected form Thedecision problem characterizes computation of a DAG\r\n of the forward pass data-\ufb02ow graph as candidates for check- as a sequence of four possible moves of stones, or registers,\r\n pointing. Articulation points are vertices that increase the onthenodesofthegraph,analogoustostatementsdiscussed\r\n number of connected components (e.g. disconnect) the in Section 4.9. The valid moves are to (1) place a register at\r\n graph if removed, and can be identi\ufb01ed in time O(V + E) a leaf, computing it, or (2) pick up a register from a node.\r\n via a modi\ufb01edDFStraversal(Holder,2008). 
Anarticulation Also, if there are registers at all children of a node x, then\r\n point v is a good candidate for checkpointing as subsequent it is valid to (3) place a register at x, computing it, or (4)\r\n a moveastonetoxfromoneofthechildrenofx,computing\r\n vertices in the topological order have no dependencies on x. The register allocation problem reduces to the following\r\n vertices before v in the order. DNN computation graphs are\r\n a no-overhead rematerialization decision problem (RP-DEC):\r\n connected, so each intermediate tensor can be reconstructed De\ufb01nition C.1. (RP-DEC): Given result-terminated data-\r\n from a single articulation point earlier in the topological\r\n order, or the input if there is no such AP. APs include the \ufb02owDAGG=(V,E)correspondingtoaprogram, with\r\n input and output nodes of residual blocks in ResNet, but not unit cost to compute each node and unit memory for the\r\n vertices inside blocks. We apply Chen\u2019s heuristics to check- results of each node, does there exist an execution plan that\r\n point a subset of these candidates, then solve for the optimal evaluates the leaf (terminal) node t \u2208 V with maximum\r\n recomputation plan R to restore correctness. Solving for memoryusagebatcostatmost|V|?\r\n Rensures that the dependencies of a node are in memory RP-DECisdecidable by solving the memory-constrained\r\n whenit is computed. form of Problem 1 with suf\ufb01cient stages, then checking if\r\n Wecould \ufb01nd R by solving the optimization problem (9) the returned execution plan has cost at most |V |. RP-DEC\r\n with additional constraints on S that encode the heuristi- closely resembles Sethi\u2019s decision problem, differing only\r\n cally selected checkpoints. However, as S is given, the in subtleties. 
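The stone-game semantics described above can be made concrete with a brute-force search over game states. The sketch below is ours, not part of Checkmate: the function name, graph encoding, and breadth-first exploration are illustrative, and the exponential state space limits it to toy instances.

```python
from collections import deque

def can_evaluate(ops, target, max_stones, max_cost):
    """Brute-force the stone game: can `target` be computed while holding
    at most `max_stones` registers and paying at most `max_cost` compute
    steps?  `ops[v]` lists the operands ("children") of v; inputs have
    no operands.  Exponential search -- toy graphs only."""
    start = (frozenset(), 0)  # (nodes holding stones, compute cost so far)
    seen = {start}
    queue = deque([start])
    while queue:
        stones, cost = queue.popleft()
        if target in stones:
            return True
        candidates = []
        # move (2): pick up a stone from a node
        for v in stones:
            candidates.append((stones - {v}, cost))
        if cost < max_cost:
            for v in ops:
                if v in stones or not all(u in stones for u in ops[v]):
                    continue
                # moves (1)/(3): place a new stone at v (leaf or ready node)
                if len(stones) < max_stones:
                    candidates.append((stones | {v}, cost + 1))
                # move (4): slide a stone from an operand onto v
                for u in ops[v]:
                    candidates.append(((stones - {u}) | {v}, cost + 1))
        for state in candidates:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False

# diamond DAG: d depends on b and c, which both depend on input a
diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
```

On the diamond graph, two registers suffice without recomputation (slide a stone across each branch), while one register never suffices, since d needs both operands resident simultaneously.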
First, the register allocation DAG is rooted at the desired result t, whereas a data-flow graph terminates at the result. Second, register-based computations can be in place, e.g. a summation a + b may be written to the same location as either of the operands. In neural network computation graphs, we cannot perform all computations in place, so we did not make this assumption. To reduce Sethi's decision problem to RP-DEC, given a result-rooted DAG G, construct a result-terminated G′ by reversing all edges. Then, if Sethi's instance allows for at most k registers, allow for a memory budget of b = k + 1 bytes: one byte to temporarily write outputs of operations that would have been written in place.

Despite the hardness of register allocation, Goodwin & Wilken (1996) observe that a 0-1 integer program for optimal allocation under an instruction schedule has empirical complexity O(n^2.5), polynomial in the number of constraints. Similarly, Section 6 shows that the frontier-advancing, constrained optimization problem (9) is tractable for many networks.

D COMPARISON OF APPROXIMATIONS

In Section 5, we discussed an approximation strategy based on rounding the LP relaxation, evaluated with deterministic rounding in Section 6.5. Figure 7 compares schedules produced by our proposed two-phase rounding strategy when the S* matrix from the LP relaxation is rounded with a randomized and with a deterministic approach. While two-phase randomized rounding of S* offers a range of feasible solutions, two-phase deterministic rounding produces consistently lower-cost schedules. While appropriate for VGG16, for MobileNet our budget allowance ε = 0.1 is overly conservative, as schedules use less memory than the 16 GB budget. A search procedure over ε ∈ [0, 1] could be used to produce more efficient schedules.

[Figure 7 plot omitted: GPU time (ms) vs. activation memory usage (GB), with VGG16 and MobileNet panels comparing checkpoint-all, deterministic rounding, randomized rounding, and ILP schedules.]
Figure 7. Comparison of the two-phase LP rounding approximation with randomized rounding of S* and deterministic rounding of S* on different models. We compare memory usage and computational cost (objective), in milliseconds according to the profile-based cost model. The average of the randomized rounding costs is shown as a dotted line.

E ARTIFACT REPRODUCIBILITY INSTRUCTIONS

Checkmate is a Python package that computes memory-efficient schedules for evaluating neural network dataflow graphs created by the backpropagation algorithm. To save memory, the package deletes and rematerializes intermediate values via recomputation. The schedule with minimum recomputation for a given memory budget is chosen by solving an integer linear program. Find the software for the artifact and documentation at https://github.com/parasj/checkmate/tree/mlsys20_artifact.

E.1 Artifact check-list (meta-information)

• Algorithm: Integer linear programming (Gurobi 9.0)
• Model: Code included in setup and public, including neural network architectures VGG16, VGG19, U-Net, MobileNet, SegNet, FCN, ResNet50. Trained weights not required.
• Run-time environment: Ubuntu 18.04.3 LTS
• Hardware: 2x Intel E5-2670 CPUs, 256 GB DDR4 RAM
• Execution: Runtime varies, 1 m to 24 hr
• Metrics: Computational overhead (slowdown based on cost model), maximum supported batch size
• Output: Plot of memory budget vs. overhead. Console output of maximum supported batch size
• Experiments: Commands provided in README.md for Gurobi installation and running experiment Python scripts
• How much disk space required?: 1 GB
• Publicly available?: Yes. https://github.com/parasj/checkmate/tree/mlsys20_artifact. Archived at https://zenodo.org/badge/latestdoi/209406827.
• Code licenses: Apache 2.0 licensed
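The two rounding strategies compared in Appendix D can be sketched in a few lines. This is a hedged illustration of phase one only (the function name, threshold rule, and seed handling are ours; Checkmate's actual implementation may differ): phase two, not shown, re-solves for the recomputation plan R so that every operand is resident when needed.

```python
import numpy as np

def round_checkpoints(S_frac, method="deterministic", threshold=0.5, seed=0):
    """Phase one of two-phase rounding: convert the fractional checkpoint
    matrix S* from the LP relaxation into a 0/1 checkpoint matrix."""
    S = np.asarray(S_frac, dtype=float)
    if method == "deterministic":
        # keep a checkpoint wherever the LP "mostly" wants one
        return (S >= threshold).astype(int)
    # randomized rounding: checkpoint entry (i, j) with probability S*_ij
    rng = np.random.default_rng(seed)
    return (rng.random(S.shape) < S).astype(int)
```

Either rounding can push memory usage past the budget, which is why a slack allowance such as the ε above is needed before re-solving for R.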