{"title": "YellowFin and the Art of Momentum Tuning", "book": "Proceedings of Machine Learning and Systems", "page_first": 289, "page_last": 308, "abstract": "Hyperparameter tuning is one of the most time-consuming workloads in deep learning. State-of-the-art optimizers, such as AdaGrad, RMSProp and Adam, reduce this labor by adaptively tuning an individual learning rate for each variable. Recently researchers have shown renewed interest in simpler methods like momentum SGD as they may yield better test metrics. Motivated by this trend, we ask: can simple adaptive methods based on SGD perform as well or better? We revisit the momentum SGD algorithm and show that hand-tuning a single learning rate and momentum makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for momentum and learning rate in SGD. YellowFin optionally uses a negative-feedback loop to compensate for the momentum dynamics in asynchronous settings on the fly. We empirically show that YellowFin can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing, with a speedup of up to 3.28x in synchronous and up to 2.69x in asynchronous settings.\n", "full_text": " YELLOWFINANDTHEARTOFMOMENTUMTUNING\r\n Jian Zhang1 Ioannis Mitliagkas2\r\n ABSTRACT\r\n Hyperparameter tuning is one of the most time-consuming workloads in deep learning. State-of-the-art optimizers,\r\n such as AdaGrad, RMSProp and Adam, reduce this labor by adaptively tuning an individual learning rate for each\r\n variable. Recently researchers have shown renewed interest in simpler methods like momentum SGD as they may\r\n yield better test metrics. Motivated by this trend, we ask: can simple adaptive methods based on SGD perform as\r\n well or better? 
We revisit the momentum SGD algorithm and show that hand-tuning a single learning rate and momentum makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YELLOWFIN, an automatic tuner for momentum and learning rate in SGD. YELLOWFIN optionally uses a negative-feedback loop to compensate for the momentum dynamics in asynchronous settings on the fly. We empirically show that YELLOWFIN can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing, with a speedup of up to 3.28x in synchronous and up to 2.69x in asynchronous settings.

1 INTRODUCTION

Accelerated forms of stochastic gradient descent (SGD), pioneered by Polyak (1964) and Nesterov (1983), are the de-facto training algorithms for deep learning. Their use requires a sane choice for their hyperparameters: typically a learning rate and momentum parameter (Sutskever et al., 2013). However, tuning hyperparameters is arguably the most time-consuming part of deep learning, with many papers outlining best tuning practices written (Bengio, 2012; Orr & Müller, 2003; Bengio et al., 2012; Bottou, 2012).

Figure 1. YELLOWFIN in comparison to Adam on a ResNet (CIFAR100, cf. Section 5) in synchronous and asynchronous settings.

Deep learning researchers have proposed a number of methods to deal with hyperparameter optimization, ranging from
grid-search and smart black-box methods (Bergstra & Bengio, 2012; Snoek et al., 2012) to adaptive optimizers. Adaptive optimizers aim to eliminate hyperparameter search by tuning on the fly for a single training run: algorithms like AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014) use the magnitude of gradient elements to tune learning rates individually for each variable and have been largely successful in relieving practitioners of tuning the learning rate.

Recently some researchers have started favoring simple momentum SGD over the previously mentioned adaptive methods (Chen et al., 2016; Gehring et al., 2017), often reporting better test scores (Wilson et al., 2017). Motivated by this trend, we ask the question: can simpler adaptive methods based on momentum SGD perform as well or better? We empirically show that, with a hand-tuned learning rate, Polyak's momentum SGD achieves faster convergence than Adam for a large class of models. We then formulate the optimization update as a dynamical system and study certain robustness properties of the momentum operator. Inspired by our analysis, we design YELLOWFIN, an automatic hyperparameter tuner for momentum SGD. YELLOWFIN simultaneously tunes the learning rate and momentum on the fly, and can handle the complex dynamics of asynchronous execution. Our contribution and outline are as follows:

¹Computer Science Department, Stanford University, CA, USA. ²Mila, University of Montreal, Canada CIFAR AI Chair. Correspondence to: Jian Zhang, Ioannis Mitliagkas.

Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).

• In Section 2, we demonstrate examples where momentum offers convergence robust to learning rate misspecification and curvature variation in a class of non-convex objectives. This robustness is desirable for deep learning. It stems from a known but obscure fact: the momentum operator's spectral radius is constant in a large subset of the hyperparameter space.
• In Section 3, we use these robustness insights and a simple quadratic model analysis to motivate the design of YELLOWFIN, an automatic tuner for momentum SGD. YELLOWFIN uses on-the-fly measurements from the gradients to tune both a single learning rate and a single momentum.

• In Section 3.3, we discuss common stability concerns related to the phenomenon of exploding gradients (Pascanu et al., 2013). We present a natural extension to our basic tuner, using adaptive gradient clipping, to stabilize training for objectives with exploding gradients.

• In Section 4 we present closed-loop YELLOWFIN, suited for asynchronous training. It uses a novel component for measuring the total momentum in a running system, including any asynchrony-induced momentum, a phenomenon described in (Mitliagkas et al., 2016). This measurement is used in a negative feedback loop to control the value of algorithmic momentum.
We provide a thorough empirical evaluation of the performance and stability of our tuner. In Section 5, we demonstrate empirically that on ResNets and LSTMs YELLOWFIN can converge in fewer iterations compared to: (i) hand-tuned momentum SGD (up to 1.75x speedup); and (ii) hand-tuned Adam (0.77x to 3.28x speedup). Under asynchrony, the closed-loop control architecture speeds up YELLOWFIN, making it up to 2.69x faster than Adam. Our experiments include runs on 7 different models, randomized over at least 3 different random seeds. YELLOWFIN is stable and achieves consistent performance: the normalized sample standard deviation of test metrics varies from 0.05% to 0.6%. We released PyTorch and TensorFlow implementations¹ that can be used as drop-in replacements for any optimizer. YELLOWFIN has also been implemented in various other packages. Its large-scale deployment in industry has taught us important lessons about stability; we discuss those challenges and our solution in Section 3.3. We conclude with related work and discussion in Sections 6 and 7.

¹TensorFlow: goo.gl/zC2rjG. PyTorch: goo.gl/N4sFfs

Our goal is to explore the value of momentum adaptation for SGD and provide a prototype, efficient tuner achieving this. While we report state-of-the-art performance results in some tasks, we do not claim that on-the-fly momentum adaptation is a necessary feature of a well-performing synchronous system. In Section 5.1 we demonstrate that a simple variation of YELLOWFIN, only using the momentum value to further rescale the step size, can yield an adaptive step size method that performs almost as well in some cases.
2 THE MOMENTUM OPERATOR

In this section, we identify the main technical insight behind the design of our tuner: gradient descent with momentum can exhibit linear convergence robust to learning rate misspecification and to curvature variation. The robustness to learning rate misspecification means tolerance to a less-carefully-tuned learning rate. The robustness to curvature variation means empirical linear convergence on a class of non-convex objectives with varying curvatures. After preliminaries on momentum, we discuss these two properties, both desirable for deep learning objectives.

2.1 Preliminaries

We aim to minimize some objective f(x). In machine learning, x is referred to as the model and the objective is some loss function. A low loss implies a well-fit model. Gradient descent-based procedures use the gradient of the objective function, ∇f(x), to update the model iteratively. These procedures can be characterized by the convergence rate with respect to the distance to a minimum.

Definition 1 (Convergence rate). Let x* be a local minimum of f(x) and x_t denote the model after t steps of an iterative procedure. The iterates converge to x* with linear rate β, if ‖x_t − x*‖ = O(β^t ‖x_0 − x*‖).

Polyak's momentum gradient descent (Polyak, 1964) is one of these iterative procedures, given by

  x_{t+1} = x_t − α∇f(x_t) + μ(x_t − x_{t−1}),   (1)

where α denotes a single learning rate and μ a single momentum for all model variables. Momentum's main appeal is its established ability to accelerate convergence (Polyak, 1964). On a γ-strongly convex, δ-smooth function with condition number κ = δ/γ, the optimal convergence rate of gradient descent without momentum is O((κ−1)/(κ+1)) (Nesterov, 2013). On the other hand, for certain classes of strongly convex and smooth functions, like quadratics, the optimal momentum value,

  μ* = ((√κ − 1)/(√κ + 1))²,   (2)

yields the optimal accelerated linear convergence rate O((√κ − 1)/(√κ + 1)). This guarantee does not generalize to arbitrary strongly convex smooth functions (Lessard et al., 2016). Nonetheless, this linear rate can often be observed in practice even on non-quadratics (cf. Section 2.2).
Key insight: Consider a quadratic objective with condition number κ > 1. Even though its curvature is different along the different directions, Polyak's momentum gradient descent, with μ ≥ μ*, achieves the same linear convergence rate √μ along all directions. Specifically, let x_{i,t} and x*_i be the i-th coordinates of x_t and x*. For any μ ≥ μ* with an appropriate learning rate, the update in (1) can achieve |x_{i,t} − x*_i| ≤ √μ^t |x_{i,0} − x*_i| simultaneously along all axes i. This insight has been hidden away in proofs.

In this quadratic case, curvature is different across different axes, but remains constant on any one-dimensional slice. In the next section (Section 2.2), we extend this insight to non-quadratic one-dimensional functions. We then present the main technical insight behind the design of YELLOWFIN: a similar linear convergence rate √μ can be achieved in a class of one-dimensional non-convex objectives where curvature varies; this linear convergence behavior is robust to learning rate misspecification and to the varying curvature. These robustness properties are behind a tuning rule for learning rate and momentum in Section 2.2. We extend this rule to handle SGD noise and generalize it to multidimensional objectives in Section 3.

2.2 Robustness properties of the momentum operator

In this section, we analyze the dynamics of momentum on a class of one-dimensional, non-convex objectives. We first introduce the notion of generalized curvature and use it to describe the momentum operator. Then we discuss the robustness properties of the momentum operator.

Curvature along different directions is encoded in the different eigenvalues of the Hessian. It is the only feature of a quadratic needed to characterize the convergence of gradient descent. Specifically, gradient descent achieves a linear convergence rate |1 − αh_c| on one-dimensional quadratics with constant curvature h_c. On one-dimensional non-quadratic objectives with varying curvature, this neat characterization is lost. We can recover it by defining a new kind of "curvature" with respect to a specific minimum.
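The constant-rate behavior behind the key insight above can be checked numerically on a scalar quadratic. The sketch below (our own illustration, not from the paper's released code) builds the one-step momentum operator for curvature h and confirms that its spectral radius equals √μ for every learning rate inside the robust region, and exceeds it outside:

```python
import numpy as np

def momentum_operator(alpha, h, mu):
    # One-step momentum operator for a scalar quadratic with curvature h,
    # acting on the state [x_t - x*, x_{t-1} - x*].
    return np.array([[1.0 - alpha * h + mu, -mu],
                     [1.0, 0.0]])

def spectral_radius(M):
    return max(abs(np.linalg.eigvals(M)))

h, mu = 1.0, 0.5
# Robust region for the learning rate: (1 - sqrt(mu))^2 <= alpha*h <= (1 + sqrt(mu))^2.
lo = (1.0 - np.sqrt(mu)) ** 2 / h
hi = (1.0 + np.sqrt(mu)) ** 2 / h

radii_inside = [spectral_radius(momentum_operator(a, h, mu))
                for a in np.linspace(lo, hi, 9)]
radius_outside = spectral_radius(momentum_operator(1.5 * hi, h, mu))
```

Inside the region every radius evaluates to √0.5 ≈ 0.707, independently of α; the oversized learning rate breaks the property.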
Definition 2 (Generalized curvature). Let x* be a local minimum of f(x): R → R. Generalized curvature with respect to x*, denoted by h(x), satisfies the following:

  f′(x) = h(x)(x − x*).   (3)

Generalized curvature describes, in some sense, non-local curvature with respect to minimum x*. It coincides with curvature on quadratics. On non-quadratic objectives, it characterizes the convergence behavior of gradient descent-based algorithms. Specifically, we recover the fact that starting at point x_t, the distance from minimum x* is reduced by |1 − αh(x_t)| in one step of gradient descent. Using a state-space augmentation, we can rewrite the momentum update of (1) as

  [x_{t+1} − x*; x_t − x*] = A_t [x_t − x*; x_{t−1} − x*],   (4)

where the momentum operator A_t at time t is defined as

  A_t ≜ [1 − αh(x_t) + μ, −μ; 1, 0].   (5)

Lemma 3 (Robustness of the momentum operator). Assume that generalized curvature h and hyperparameters α, μ satisfy

  (1 − √μ)² ≤ αh(x_t) ≤ (1 + √μ)².   (6)

Then as proven in Appendix A, the spectral radius of the momentum operator at step t depends solely on the momentum parameter: ρ(A_t) = √μ, for all t. The inequalities in (6) define the robust region, the set of learning rates α and momenta μ achieving this √μ spectral radius.

We know that the spectral radius of an operator, A, describes its asymptotic behavior when applied multiple times: ‖A^t x‖ ≈ O(ρ(A)^t).² Unfortunately, the same does not always hold for the composition of different operators, even if they have the same spectral radius, ρ(A_t) = √μ. It is not always true that ‖A_t ··· A_1 x‖ = O(√μ^t). However, a homogeneous spectral radius often yields the √μ rate empirically. In other words, this linear convergence rate is not guaranteed. Instead, we demonstrate examples to expose the robustness properties: if the learning rate α and momentum μ are in the robust region, the homogeneity of spectral radii can empirically yield linear convergence with rate √μ; this behavior is robust with respect to learning rate misspecification and to varying curvature.

²For any ε > 0, there exists a matrix norm ‖·‖ such that ‖A‖ ≤ ρ(A) + ε (Foucart, 2012).
Momentum is robust to learning rate misspecification. For a one-dimensional quadratic with curvature h, we have generalized curvature h(x) = h for all x. Lemma 3 implies the spectral radius ρ(A_t) = √μ if

  (1 − √μ)²/h ≤ α ≤ (1 + √μ)²/h.   (7)

In Figure 2, we plot ρ(A_t) for different α and μ when h = 1. The solid line segments correspond to the robust region. As we increase momentum, a linear rate of convergence, √μ, is robustly achieved by an ever-widening range of learning rates: higher values of momentum are more robust to learning rate misspecification.

Figure 2. Spectral radius of the momentum operator on a scalar quadratic for varying α and μ ∈ {0.0, 0.1, 0.3, 0.5}.

This property influences the design of our tuner: more generally, for a class of one-dimensional non-convex objectives, as long as the learning rate α and momentum μ are in the robust region, i.e. satisfy (6) at every step, the momentum operators at all steps t have the same spectral radius. In the case of quadratics, this implies a convergence rate of √μ, independent of the learning rate. Having established that, we can just focus on optimally tuning momentum.
Momentum is robust to varying curvature. As discussed in Section 2.1, the intuition hidden in classic results is that for certain strongly convex smooth objectives, momentum at least as high as the value in (2) can achieve the same rate of linear convergence along all axes with different curvatures. We extend this intuition to certain one-dimensional non-convex functions with varying curvatures along their domains; we discuss the generalization to multidimensional cases in Section 3.1. Lemma 3 guarantees constant, time-homogeneous spectral radii for the momentum operators A_t assuming (6) is satisfied at every step. This assumption motivates a "long-range" extension of the condition number.
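As a small numerical illustration of this robustness (our own sketch, with arbitrarily chosen constants): running Polyak momentum on a scalar quadratic with learning rates from opposite ends of the robust region yields essentially the same empirical rate √μ:

```python
import numpy as np

def empirical_rate(alpha, mu, h=1.0, steps=200):
    # Run Polyak momentum on f(x) = (h/2) x^2 and estimate the per-step
    # contraction of the state [x_t, x_{t-1}] toward the minimum x* = 0.
    prev = cur = 1.0
    norms = []
    for _ in range(steps):
        cur, prev = cur - alpha * h * cur + mu * (cur - prev), cur
        norms.append(np.hypot(cur, prev))
    return (norms[-1] / norms[0]) ** (1.0 / (len(norms) - 1))

mu = 0.81  # robust region for h = 1: 0.01 <= alpha <= 3.61
rates = [empirical_rate(alpha, mu) for alpha in (0.05, 1.0, 2.0)]
```

All three estimates cluster around √0.81 = 0.9 even though the learning rates span almost two orders of magnitude.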
Definition 4 (Generalized condition number). We define the generalized condition number (GCN) with respect to a local minimum x* of a scalar function, f(x): R → R, to be the dynamic range of its generalized curvature h(x):

  ν = sup_{x∈dom(f)} h(x) / inf_{x∈dom(f)} h(x).   (8)

The GCN captures variations in generalized curvature along a scalar slice. From Lemma 3 we get

  μ ≥ μ* = ((√ν − 1)/(√ν + 1))²,
  (1 − √μ)² / inf_{x∈dom(f)} h(x) ≤ α ≤ (1 + √μ)² / sup_{x∈dom(f)} h(x)   (9)

as the description of the robust region. The momentum and learning rate satisfying (9) guarantee a homogeneous spectral radius of √μ for all A_t. Specifically, μ* is the smallest momentum value that allows for homogeneous spectral radii. Similar to the optimal μ* in (2) for the quadratic case, we notice that the optimal μ in (9) is objective dependent. The optimal momentum μ* is close to 1 for objectives with a large generalized condition number ν, while objectives with small ν imply an optimal momentum μ* that is close to 0.
We demonstrate with examples that by using a momentum larger than the objective-dependent μ*, homogeneous spectral radii suggest an empirical linear convergence behavior on a class of non-convex objectives. In Figure 3(a), the non-convex objective, composed of two quadratics with curvatures 1 and 1000, has a GCN of 1000. Using the tuning rule of (9) and running the momentum algorithm (Figure 3(b)) practically yields the linear convergence predicted by Lemma 3. In Figures 3(c,d), we demonstrate an LSTM as another example. As we increase the momentum value (the same value for all variables in the model), more model variables follow a √μ convergence rate. In these examples, the linear convergence is robust to the varying curvature of the objectives. This property influences our tuner design: in the next section, we extend the tuning rule of (9) to handle SGD noise; we generalize the extended rule to multidimensional cases as the tuning rule in YELLOWFIN.

Figure 3. (a) Non-convex toy example; (b) linear convergence rate achieved empirically on the example in (a) tuned according to (9); (c,d) LSTM on MNIST: as momentum increases from 0.9 to 0.99, the global learning rate and momentum fall in the robust regions of more model variables. The convergence behavior of these variables (shown in grey) follows the robust rate √μ (shown in red).

The role of generalized curvature. Generalized curvature defines a quantity that is an alternative to classic curvature and is directly related to the contraction properties of the momentum operator on non-quadratic scalar problems. Note that similar quantities, e.g. the PL condition (Karimi et al., 2016), have been used in the analysis of gradient descent. Respectively, the ensuing generalized condition number (GCN) is meant to describe the dynamic range of this contractivity around a minimum on non-quadratic functions.
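To make the rule concrete for the setting of Figure 3(a) (a small check of ours, with the curvature range 1 to 1000 taken from the text): (9) gives μ* ≈ 0.881, and at μ = μ* the admissible learning-rate interval collapses to a single point, while any larger momentum leaves a non-empty interval:

```python
import numpy as np

h_min, h_max = 1.0, 1000.0  # generalized curvature range, so GCN nu = 1000
nu = h_max / h_min
mu_star = ((np.sqrt(nu) - 1.0) / (np.sqrt(nu) + 1.0)) ** 2

def alpha_interval(mu):
    # Learning-rate interval of the robust region in (9).
    return (1.0 - np.sqrt(mu)) ** 2 / h_min, (1.0 + np.sqrt(mu)) ** 2 / h_max

lo_star, hi_star = alpha_interval(mu_star)  # collapses: lo == hi
lo_09, hi_09 = alpha_interval(0.9)          # mu > mu*: non-empty interval
```

This is exactly why μ* is the smallest momentum admitting a homogeneous spectral radius: below it, no single learning rate satisfies (6) along every slice.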
3 THE YELLOWFIN TUNER

Here we describe our tuner for momentum SGD that uses the same learning rate for all variables. We first introduce a noisy quadratic model f(x) as the local approximation of an arbitrary one-dimensional objective. On this approximation, we extend the tuning rule of (9) to SGD. In Section 3.1, we generalize the discussion to multidimensional objectives; it yields the YELLOWFIN tuning rule.

Noisy quadratic model. We consider a scalar quadratic

  f(x) = (h/2) x² + C = (1/n) Σ_i (h/2)(x − c_i)² ≜ (1/n) Σ_i f_i(x)   (10)

with Σ_i c_i = 0. f(x) is a quadratic approximation of the original objective, with h and C derived from measurements on the original objective. The function f(x) is defined as the average of the n component functions f_i. This is a common model for SGD, where we use only a single data point (or a mini-batch) drawn uniformly at random, S_t ∼ Uni([n]), to compute a noisy gradient, ∇f_{S_t}(x), for step t. Here, C = (1/(2n)) Σ_i h²c_i² denotes the gradient variance. As optimization on quadratics decomposes into scalar problems along the principal eigenvectors of the Hessian, the scalar model in (10) is sufficient to study local quadratic approximations of multidimensional objectives. Next we get an exact expression for the mean square error after running momentum SGD on the scalar quadratic in (10) for t steps in Lemma 5; we delay the proof to Appendix B.

Lemma 5. Let f(x) be defined as in (10), x_1 = x_0, and let x_t follow the momentum update (1) with stochastic gradients ∇f_{S_t}(x_{t−1}) for t ≥ 2. Let e_1 = [1, 0]ᵀ and f_1 = [1, 0, 0]ᵀ; the expectation of the squared distance to the optimum x* is

  E(x_{t+1} − x*)² = (e_1ᵀ A^t [x_1 − x*, x_0 − x*]ᵀ)² + α²C f_1ᵀ (I − B^t)(I − B)⁻¹ f_1.   (11)
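The bias term of (11) is easy to verify directly: with the gradient noise switched off (C = 0), the momentum recursion must coincide with powers of A applied to the initial state. A quick check of our own, with arbitrary constants from the robust region:

```python
import numpy as np

alpha, h, mu = 0.1, 1.0, 0.5  # satisfies (1 - sqrt(mu))^2 <= alpha*h <= (1 + sqrt(mu))^2
A = np.array([[1.0 - alpha * h + mu, -mu],
              [1.0, 0.0]])

x_prev = x_cur = 1.0            # x0 = x1 = 1, minimum at x* = 0
v0 = np.array([x_cur, x_prev])  # initial state [x1 - x*, x0 - x*]

errors = []
for t in range(1, 21):
    # Noiseless momentum update on f(x) = (h/2) x^2.
    x_cur, x_prev = x_cur - alpha * h * x_cur + mu * (x_cur - x_prev), x_cur
    predicted = (np.linalg.matrix_power(A, t) @ v0)[0]  # e1^T A^t v0
    errors.append(abs(predicted - x_cur))

max_error = max(errors)
```

The recursion and the operator-power prediction agree to machine precision at every step.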
The first and second terms in (11) correspond to squared bias and variance, and their momentum dynamics are captured by the operators

  A = [1 − αh + μ, −μ; 1, 0],
  B = [(1 − αh + μ)², μ², −2μ(1 − αh + μ); 1, 0, 0; 1 − αh + μ, 0, −μ].   (12)

Even though it is possible to numerically work on (11) directly, we use a scalar, asymptotic surrogate in (13), based on the spectral radii of the operators, to simplify analysis and expose insights. This decision is supported by our findings in Section 2: the spectral radii can capture the empirical convergence rate.

  E(x_{t+1} − x*)² ≈ ρ(A)^{2t} (x_0 − x*)² + (1 − ρ(B)^t) α²C / (1 − ρ(B))   (13)

One of our design decisions for YELLOWFIN is to always work in the robust region of Lemma 3. We know that this implies a spectral radius √μ of the momentum operator, A, for the bias. Lemma 6, as proved in Appendix C, shows that under the exact same condition, the variance operator B has spectral radius μ.
Lemma 6. The spectral radius of the variance operator B is μ, if (1 − √μ)² ≤ αh ≤ (1 + √μ)².

As a result, the surrogate objective of (13) takes the following form in the robust region:

  E(x_{t+1} − x*)² ≈ μ^t (x_0 − x*)² + (1 − μ^t) α²C / (1 − μ).   (14)

We extend this surrogate to multidimensional cases to extract a noisy tuning rule for YELLOWFIN.

3.1 Tuning rule

In this section, we present SINGLESTEP, the tuning rule of YellowFin (Algorithm 1). Based on the surrogate in (14), SINGLESTEP is a multidimensional SGD version of the noiseless tuning rule in (9). We first generalize (9) and (14) to multidimensional cases, and then discuss the rule SINGLESTEP as well as its implementation in Algorithm 1.
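Lemma 6 can be checked numerically alongside Lemma 3: sweeping αh across the robust region, the variance operator B of (12) keeps spectral radius exactly μ (a small check of ours):

```python
import numpy as np

def variance_operator(alpha, h, mu):
    a = 1.0 - alpha * h + mu
    # Operator B from (12), governing the second-moment (variance) dynamics.
    return np.array([[a * a, mu * mu, -2.0 * mu * a],
                     [1.0,   0.0,     0.0],
                     [a,     0.0,    -mu]])

mu = 0.5
lo, hi = (1.0 - np.sqrt(mu)) ** 2, (1.0 + np.sqrt(mu)) ** 2  # robust range of alpha*h
radii = [max(abs(np.linalg.eigvals(variance_operator(ah, 1.0, mu))))
         for ah in np.linspace(lo, hi, 9)]
```

Each radius evaluates to μ = 0.5; together with ρ(A) = √μ, this is what turns the surrogate (13) into the simple form (14).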
As discussed in Section 2.2, the GCN ν captures the dynamic range of generalized curvatures in a one-dimensional objective with varying curvature. The consequent robust region described by (9) implies homogeneous spectral radii. On a multidimensional non-convex objective, each one-dimensional slice passing through a minimum x* can have varying curvature. As we use a single μ and α for the entire model, if ν simultaneously captures the dynamic range of generalized curvature over all these slices, then μ and α in (9) are in the robust region for all these slices. This implies homogeneous spectral radii √μ according to Lemma 3, empirically facilitating convergence at a common rate along all directions. Given homogeneous spectral radii √μ along all directions, the surrogate in (14) generalizes to the local quadratic approximation of multidimensional objectives. On this approximation with minimum x*, the expectation of the squared distance to x*, E‖x − x*‖², decomposes into independent scalar components along the eigenvectors of the Hessian. We define the gradient variance C as the sum of the gradient variances along these eigenvectors. The one-dimensional surrogates in (14) for the independent components sum to μ^t ‖x_0 − x*‖² + (1 − μ^t) α²C/(1 − μ), the multidimensional surrogate corresponding to the one in (14).

Algorithm 1 YELLOWFIN
  function YELLOWFIN(gradient g_t, β)
    h_max, h_min ← CURVATURERANGE(g_t, β)
    C ← VARIANCE(g_t, β)
    D ← DISTANCE(g_t, β)
    μ_t, α_t ← SINGLESTEP(C, D, h_max, h_min)
    return μ_t, α_t
  end function

Let D be an estimate of the current model's distance to the minimum of a local quadratic approximation, and let C denote an estimate of the gradient variance. SINGLESTEP minimizes the multidimensional surrogate after a single step (i.e. t = 1) while ensuring that μ and α are in the robust region for all directions. A single instance of SINGLESTEP solves for a single momentum and learning rate for the entire model at each iteration. Specifically, the extremal curvatures h_min and h_max denote estimates for the smallest and largest generalized curvature respectively. They are meant to capture both generalized curvature variation along all different directions (like the classic condition number) and also variation that occurs as the landscape evolves. The constraints keep the global learning rate and momentum in the robust region (defined in Lemma 3) for slices along all directions.
The problem in (15) (SINGLESTEP) does not need an iterative solver but has an analytical solution:

  μ_t, α_t = argmin_μ  μD² + α²C
  s.t.  μ ≥ ((√(h_max/h_min) − 1) / (√(h_max/h_min) + 1))²,
        α = (1 − √μ)² / h_min.   (15)

Substituting the second constraint, the objective becomes p(x) = x²D² + (1 − x)⁴ C / h_min², with x = √μ ∈ [0, 1). By setting the gradient of p(x) to 0, we get a cubic equation whose root x = √μ_p can be computed in closed form using Vieta's substitution. As p(x) is uni-modal in [0, 1), the optimizer for (15) is exactly the maximum of μ_p and (√(h_max/h_min) − 1)² / (√(h_max/h_min) + 1)², the right-hand side of the first constraint in (15).

YELLOWFIN uses the functions CURVATURERANGE, VARIANCE and DISTANCE to measure the quantities h_max, h_min, C and D respectively. These functions can be designed in different ways. We present the implementations used in our experiments, based completely on gradients, in Section 3.2.
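The closed-form solution fits in a few lines. The sketch below is ours; it uses a numerical cubic-root call in place of Vieta's substitution (which yields the same unique real root) to solve (15) given measured C, D, h_min and h_max:

```python
import numpy as np

def single_step(C, D, h_min, h_max):
    # Minimize p(x) = x^2 D^2 + (1 - x)^4 C / h_min^2 over x = sqrt(mu) in [0, 1).
    # Setting p'(x) = 0 gives the cubic
    #   c x^3 - 3c x^2 + (3c + 2 D^2) x - c = 0,   with c = 4 C / h_min^2,
    # which is strictly increasing in x, so it has exactly one real root,
    # and that root lies in (0, 1).
    c = 4.0 * C / h_min ** 2
    roots = np.roots([c, -3.0 * c, 3.0 * c + 2.0 * D ** 2, -c])
    x = next(r.real for r in roots if abs(r.imag) < 1e-9 and 0.0 <= r.real < 1.0)
    # Clamp by the first constraint of (15), then recover alpha from the second.
    dr = np.sqrt(h_max / h_min)
    mu = max(x ** 2, ((dr - 1.0) / (dr + 1.0)) ** 2)
    alpha = (1.0 - np.sqrt(mu)) ** 2 / h_min
    return mu, alpha
```

For example, with C = D = 1 and curvature range [1, 100], the robust-region constraint binds and single_step returns μ = (9/11)² ≈ 0.669; with tiny gradient variance the cubic root dominates and the momentum shrinks toward 0.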
3.2 Measurement functions in YELLOWFIN

This section describes our implementation of the measurement oracles used by YELLOWFIN: CURVATURERANGE, VARIANCE, and DISTANCE. We design the measurement functions under the assumption of a negative log-probability objective; this is in line with typical losses in machine learning, e.g. cross-entropy for neural nets and maximum likelihood estimation in general. Under this assumption, the Fisher information matrix, i.e. the expected outer product of noisy gradients, approximates the Hessian of the objective (Duchi, 2016; Pascanu & Bengio, 2013). This allows for measurements approximated purely from minibatch gradients, with overhead linear in the model dimensionality. These implementations are not guaranteed to give accurate measurements. Nonetheless, their use in our experiments in Section 5 shows that they are sufficient for YELLOWFIN to outperform the state of the art on a variety of objectives. We also refer to Appendix D for details on zero-debias (Kingma & Ba, 2014), slow start (Schaul et al., 2013) and smoothing for curvature range estimation.

Algorithm 2 Curvature range
  state: h_max, h_min; h_i, ∀i ∈ {1, 2, 3, ...}
  function CURVATURERANGE(gradient g_t, β)
    h_t ← ‖g_t‖²
    h_{max,t} ← max_{t−w≤i≤t} h_i;  h_{min,t} ← min_{t−w≤i≤t} h_i
    h_max ← β·h_max + (1 − β)·h_{max,t}
    h_min ← β·h_min + (1 − β)·h_{min,t}
    return h_max, h_min
  end function

Algorithm 3 Gradient variance
  state: g² ← 0, g ← 0
  function VARIANCE(gradient g_t, β)
    g² ← β·g² + (1 − β)·g_t ⊙ g_t
    g ← β·g + (1 − β)·g_t
    return 1ᵀ·(g² − g ⊙ g)
  end function

Algorithm 4 Distance to opt.
  state: ‖g‖ ← 0, h ← 0
  function DISTANCE(gradient g_t, β)
    ‖g‖ ← β·‖g‖ + (1 − β)·‖g_t‖
    h ← β·h + (1 − β)·‖g_t‖²
    D ← β·D + (1 − β)·‖g‖/h
    return D
  end function

Curvature range. Let g_t be a noisy gradient; we estimate the curvature range in Algorithm 2. We notice that the outer product g_t g_tᵀ has an eigenvalue h_t = ‖g_t‖² with eigenvector g_t. Thus, under our negative log-likelihood assumption, we use h_t to approximate the curvature of the Hessian along the gradient direction g_t. Note that here we use the empirical Fisher g_t g_tᵀ instead of the Fisher information matrix. The empirical Fisher is typically used in practical natural gradient methods (Martens, 2014; Roux et al., 2008; Duchi et al., 2011). For practically efficient measurement, we use the empirical Fisher as a coarse proxy of the Fisher information matrix, which approximates the Hessian of the objective. Specifically, in Algorithm 2 we maintain h_min and h_max as running averages of the extreme curvatures h_{min,t} and h_{max,t} from a sliding window of width 20.³ As gradient directions evolve, we estimate curvatures along different directions. Thus h_min and h_max capture the curvature variations.

³We use window width 20 across all the models and experiments in our paper. We refer to Section 5 for details on selecting the window width.
3.3 Stability on non-smooth objectives

The process of training neural networks is inherently non-stationary, with the landscape abruptly switching from flat to steep areas. In particular, the objective functions of RNNs with hidden units can exhibit occasional but very steep slopes (Pascanu et al., 2013; Szegedy et al., 2013). To deal with this issue, gradient clipping has been established in the literature as a standard tool for stabilizing training on such objectives (Pascanu et al., 2013; Goodfellow et al., 2016; Gehring et al., 2017).

We use an adaptive gradient clipping heuristic as a very natural addition to our basic tuner. However, the classic tradeoff between adaptivity and stability applies: setting a clipping threshold that is too low can hurt performance; setting it too high can compromise stability. YELLOWFIN keeps running estimates of the extremal squared gradient magnitudes, h_max and h_min, in order to estimate a generalized condition number. We posit that √h_max is an ideal gradient norm threshold for adaptive clipping. In order to ensure robustness to extreme gradient spikes, like the ones in Figure 4, we also limit the growth rate of the envelope h_max in Algorithm 2 as follows:

    h_max ← β · h_max + (1 − β) · min{ h_max,t , 100 · h_max }     (16)

Our heuristic follows along the lines of classic recipes like Pascanu et al. (2013). However, instead of using the average gradient norm to clip, it uses a running estimate of the maximum norm √h_max. In Figure 4, we demonstrate the mechanism of our heuristic by presenting an example of an LSTM that exhibits 'exploding gradients': the proposed adaptive clipping stabilizes the training process using YELLOWFIN and prevents large catastrophic loss spikes.

Figure 4. A variation of the LSTM architecture in (Zhu et al., 2016) exhibits exploding gradients. The proposed adaptive gradient clipping threshold stabilizes the training loss.

We validate the proposed adaptive clipping on the convolutional sequence to sequence learning model (Gehring et al., 2017) for IWSLT 2014 German-English translation. The default optimizer (Gehring et al., 2017) uses learning rate 0.25 and Nesterov momentum 0.99, diverging to overflow due to 'exploding gradients'; it requires, as in Gehring et al. (2017), a strict manually set gradient norm threshold of 0.1 to stabilize. In Table 1, we can see that YellowFin, with adaptive clipping, outperforms the default optimizer using manually set clipping, with 0.84 higher validation BLEU4 after 120 epochs.

Table 1. German-English translation validation metrics using the convolutional seq-to-seq model.

    Optimizer           Loss      BLEU4
    Default w/o clip.   diverges  -
    Default w/ clip.    2.86      30.75
    YF                  2.75      31.59

To further demonstrate the practical applicability of our gradient clipping heuristic, in Figure 5 we demonstrate that the adaptive clipping does not hurt performance on models that do not exhibit instabilities without clipping. Specifically, for both the PTB LSTM and the CIFAR10 ResNet, the difference between YELLOWFIN with and without adaptive clipping diminishes quickly.

Figure 5. Training losses on PTB LSTM (left) and CIFAR10 ResNet (right) for YellowFin with and without adaptive clipping.

4 CLOSED-LOOP YELLOWFIN

Asynchrony is a parallelization technique that avoids synchronization barriers (Niu et al., 2011). In this section, we propose a closed momentum loop variant of YELLOWFIN to accelerate convergence in asynchronous training. After some preliminaries, we show the mechanism of the extension: it measures the dynamics on a running system and controls momentum with a negative feedback loop.

Preliminaries  When training on M asynchronous workers, staleness (the number of model updates between a worker's read and write operations) is on average τ = M − 1; i.e., the gradient in the SGD update is delayed by τ iterations, as ∇f_{S_{t−τ}}(x_{t−τ}). Asynchrony yields faster steps, but can increase the number of iterations needed to achieve the same solution, a tradeoff between hardware and statistical efficiency (Zhang & Ré, 2014). Mitliagkas et al. (2016) interpret asynchrony as added momentum dynamics. Experiments in Hadjis et al. (2016) support this finding, and demonstrate that reducing algorithmic momentum can compensate for asynchrony-induced momentum and significantly reduce the number of iterations for convergence. Motivated by that result, we use the model in (17), where the total momentum, µ_T, includes both the asynchrony-induced and the algorithmic momentum, µ, from (1).
    E[x_{t+1} − x_t] = µ_T E[x_t − x_{t−1}] − α E[∇f(x_t)]     (17)

We will use this expression to design an estimator for the value of the total momentum, µ̂_T. This estimator is a basic building block of closed-loop YELLOWFIN; it removes the need to manually compensate for the effects of asynchrony.

Figure 6. When running YELLOWFIN, total momentum µ̂_T equals the algorithmic value in synchronous settings (left); µ̂_T is greater than the algorithmic value on 16 asynchronous workers (middle). Closed-loop YELLOWFIN automatically lowers the algorithmic momentum and brings the total momentum to match the target value (right). Red dots are total momentum estimates, µ̂_T, at each iteration; the solid red line is a running average of µ̂_T.

Algorithm 5  Closed-loop YELLOWFIN
 1: Input: µ ← 0, α ← 0.0001, γ ← 0.01, τ (staleness)
 2: for t ← 1 to T do
 3:   x_t ← x_{t−1} + µ(x_{t−1} − x_{t−2}) − α ∇f_{S_t}(x_{t−τ−1})
 4:   µ*, α ← YELLOWFIN(∇f_{S_t}(x_{t−τ−1}), β)
 5:   µ̂_T ← median( (x_{t−τ} − x_{t−τ−1} + α ∇f_{S_{t−τ−1}}(x_{t−τ−1})) / (x_{t−τ−1} − x_{t−τ−2}) )   ▷ Measuring total momentum
 6:   µ ← µ + γ · (µ* − µ̂_T)   ▷ Closing the loop
 7: end for

Measuring the momentum dynamics  Closed-loop YELLOWFIN estimates the total momentum µ_T on a running system and uses a negative feedback loop to adjust the algorithmic momentum accordingly.
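The measurement on line 5 of Algorithm 5 and the feedback step on line 6 are simple enough to sketch directly. The following toy check (function names and the synthetic setup are our own, for illustration) confirms that, on iterates that exactly follow the momentum recursion, the median-of-ratios estimator recovers the true total momentum:

```python
import numpy as np

def total_momentum(x_now, x_prev, x_prev2, grad_stale, alpha):
    # Line 5 of Algorithm 5: element-wise ratio, combined across
    # coordinates with a median for robustness.
    return float(np.median((x_now - x_prev + alpha * grad_stale) / (x_prev - x_prev2)))

def close_the_loop(mu, mu_target, mu_hat, gamma=0.01):
    # Line 6 of Algorithm 5: negative feedback on algorithmic momentum.
    return mu + gamma * (mu_target - mu_hat)

# Toy check: iterates exactly follow x_{t+1} = x_t + mu_true*(x_t - x_{t-1})
# - alpha*g_t, so the estimator should recover mu_true (staleness tau = 0).
rng = np.random.default_rng(0)
mu_true, alpha = 0.6, 0.1
xs = [rng.standard_normal(5), rng.standard_normal(5)]
grads = []
for _ in range(6):
    grads.append(rng.standard_normal(5))
    xs.append(xs[-1] + mu_true * (xs[-1] - xs[-2]) - alpha * grads[-1])

mu_hat = total_momentum(xs[-1], xs[-2], xs[-3], grads[-1], alpha)
```

Here mu_hat equals mu_true up to rounding; feeding it to close_the_loop with a lower target momentum decreases the algorithmic momentum, as in Figure 6 (right).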
Equation (18) gives an estimate of µ̂_T on a system with staleness τ, based on the model in (17):

    µ̂_T = median( (x_{t−τ} − x_{t−τ−1} + α ∇f_{S_{t−τ−1}}(x_{t−τ−1})) / (x_{t−τ−1} − x_{t−τ−2}) )     (18)

We use τ-stale model values to match the staleness of the gradient, and perform the operations element-wise. This way we get a total momentum measurement from each variable; the median combines them into a more robust estimate.

Closing the asynchrony loop  Given a reliable measurement of µ_T, we can use it to adjust the value of the algorithmic momentum so that the total momentum matches the target momentum decided by YELLOWFIN in Algorithm 1. Closed-loop YELLOWFIN in Algorithm 5 uses a simple negative feedback loop to achieve this adjustment.

5 EXPERIMENTS

We empirically validate the importance of momentum tuning and evaluate YELLOWFIN in both synchronous (single-node) and asynchronous settings. In synchronous settings, we first demonstrate that, with hand-tuning, momentum SGD is competitive with Adam, a state-of-the-art adaptive method. Then, we evaluate YELLOWFIN, without any hand tuning, in comparison to hand-tuned Adam and momentum SGD. In asynchronous settings, we show that closed-loop YELLOWFIN accelerates with closed-loop momentum control, significantly outperforming Adam.

We evaluate on convolutional neural networks (CNN) and recurrent neural networks (RNN). For CNN, we train ResNet (He et al., 2016) for image recognition on CIFAR10 and CIFAR100 (Krizhevsky et al., 2014). For RNN, we train LSTMs for character-level language modeling with the TinyShakespeare (TS) dataset (Karpathy et al., 2015), word-level language modeling with the Penn TreeBank (PTB) (Marcus et al., 1993), and constituency parsing on the Wall Street Journal (WSJ) dataset (Choe & Charniak). We refer to Table 3 in Appendix E for model specifications. To eliminate the influence of a specific random seed, in our synchronous and asynchronous experiments the training loss and validation metrics are averaged over 3 runs using different random seeds. Across all experiments on the eight models, we use sliding window width 20 for estimating the extreme curvatures h_max and h_min in Algorithm 2. It is selected based on the performance on the PTB LSTM and CIFAR10 ResNet models; the selected sliding window width is directly applied to the other 6 models, including the convolutional sequence to sequence model in Section 3.3, as well as the ResNext and Tied LSTM in Appendix G.3.

Figure 7. The importance of adaptive momentum: training loss comparison between YELLOWFIN with adaptive momentum and YELLOWFIN with fixed momentum values, conducted on TS LSTM (left) and CIFAR100 ResNet (middle). Learning rate scaling based on the YELLOWFIN-tuned momentum can match the performance of full YELLOWFIN on the TS LSTM (right). However, without the YELLOWFIN-tuned momentum, hand-tuned vanilla SGD demonstrates observably larger training loss than momentum-based methods, including full YELLOWFIN, YELLOWFIN learning rate rescaling and hand-tuned momentum SGD (with the same learning rate search grid as with vanilla SGD).

5.1 Synchronous experiments

We tune Adam and momentum SGD on learning rate grids with prescribed momentum 0.9 for SGD. We fix the parameters of Algorithm 1 in all experiments, i.e.
YELLOWFIN runs without any hand tuning. We provide full specifications, including the learning rate (grid) and the number of iterations we train on each model, in Appendix F. For visualization purposes, we smooth training losses with a uniform window of width 1000. For Adam and momentum SGD on each model, we pick the configuration achieving the lowest averaged smoothed loss. To compare two algorithms, we record the lowest smoothed loss achieved by both; the speedup is then reported as the ratio of the numbers of iterations needed to achieve this loss. We use this setup to validate our claims.

Importance of adaptive momentum in YELLOWFIN  In Definition 4, we noticed that the optimally tuned µ* is highly objective-dependent. Empirically, we indeed observe that the momentum values chosen by YF range from smaller than 0.03 on the PTB LSTM to 0.89 for ResNext. We perform an ablation study to validate the importance of the objective-dependent momentum adaptivity of YELLOWFIN on CIFAR100 ResNet and TS LSTM.

Table 2. The speedup of YELLOWFIN and tuned momentum SGD over tuned Adam on ResNet and LSTM models.

              CIFAR10  CIFAR100  PTB    TS     WSJ
    Adam      1x       1x        1x     1x     1x
    mom. SGD  1.71x    1.87x     0.88x  2.49x  1.33x
    YF        1.93x    1.38x     0.77x  3.28x  2.33x
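The speedup numbers reported here follow the protocol above. A minimal sketch of that measurement (our own illustrative implementation, not the experiment scripts):

```python
import numpy as np

def speedup(loss_a, loss_b, window=1000):
    """Speedup of optimizer B over A: smooth both loss curves with a uniform
    window, take the lowest smoothed loss achieved by BOTH runs, and report
    the ratio of iterations each needs to first reach it."""
    kernel = np.ones(window) / window
    sa = np.convolve(np.asarray(loss_a, dtype=float), kernel, mode="valid")
    sb = np.convolve(np.asarray(loss_b, dtype=float), kernel, mode="valid")
    target = max(sa.min(), sb.min())            # lowest loss reached by both
    iters_a = int(np.argmax(sa <= target)) + 1  # first iteration at/below target
    iters_b = int(np.argmax(sb <= target)) + 1
    return iters_a / iters_b
```

On two synthetic exponentially decaying loss curves where B decays twice as fast as A, this returns a speedup close to 2x.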
In the experiments, YELLOWFIN tunes the learning rate. Instead of also using the momentum tuned by YF, we continuously feed the objective-agnostic prescribed momentum values 0.0 and 0.9 to the underlying momentum SGD optimizer which YF is tuning. In Figure 7, when compared to YELLOWFIN with prescribed momentum 0.0 or 0.9, YELLOWFIN with adaptively tuned momentum achieves observably faster convergence on both TS LSTM and CIFAR100 ResNet.

In Figure 8 (bottom right) and Figure 7 (right), we also observe that hand-tuned vanilla SGD typically does not match the performance of momentum-based methods (including YELLOWFIN and momentum SGD hand-tuned using the same learning rate grid as vanilla SGD). However, we can rescale the learning rate based on the YELLOWFIN-tuned momentum µ_t, and use 0 momentum in the model updates, to match the performance of momentum-based methods. Specifically, we rescale the YELLOWFIN-tuned learning rate α_t by 1/(1 − µ_t) (footnote 4). Model updates with this rescaled learning rate and 0 momentum demonstrate training loss closely matching those of YELLOWFIN and hand-tuned momentum SGD for the WSJ LSTM in Figure 8 (bottom right) and the TS LSTM in Figure 7 (right).

Momentum SGD is competitive with adaptive methods  In Table 2, we compare tuned momentum SGD and tuned Adam on ResNets, with training losses shown in Figure 9 in Appendix D. We can observe that momentum SGD achieves 1.71x and 1.87x speedups over tuned Adam on CIFAR10 and CIFAR100 respectively. In Figure 8 and Table 2, with the exception of the PTB LSTM, momentum SGD also produces better training loss, as well as better validation perplexity in language modeling and validation F1 in parsing. For the parsing task, we also compare with tuned vanilla SGD and AdaGrad, which are used in the NLP community. Figure 8 (right) shows that fixed momentum 0.9 can already speed up vanilla SGD by 2.73x, achieving better validation F1.

YELLOWFIN can match hand-tuned momentum SGD and can outperform hand-tuned Adam  In our experiments, YELLOWFIN, without any hand-tuning, yields training loss matching hand-tuned momentum SGD for all the ResNet and LSTM models in Figures 8 and 9 (Appendix D). When compared to tuned Adam in Table 2, except for being slightly slower on the PTB LSTM, YELLOWFIN achieves 1.38x to 3.28x speedups in training losses on the other four models. More importantly, YELLOWFIN consistently shows better validation metrics than tuned Adam in Figure 8. This demonstrates that YELLOWFIN can match tuned momentum SGD and outperform tuned state-of-the-art adaptive optimizers. In Appendix G.3, we show YELLOWFIN speeding up further with finer-grained manual learning rate tuning.

Figure 8. Training loss and validation metrics on (left to right) word-level language modeling with PTB, char-level language modeling with TS and constituency parsing on WSJ. The validation metrics are monotonic as we report the best values up to each number of iterations.
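The learning-rate rescaling discussed above has a simple steady-state justification (footnote 4): if the momentum update v_{t+1} = µ v_t − α ∇f(x_t) evolves smoothly, v_t approaches −(α/(1 − µ)) ∇f(x_t), so zero-momentum SGD with learning rate α/(1 − µ) takes steps of the same magnitude. A toy check (function names are ours, for illustration):

```python
import numpy as np

def momentum_update(v, grad, alpha, mu):
    # One step of the momentum velocity recursion v_{t+1} = mu*v_t - alpha*grad.
    return mu * v - alpha * grad

def rescaled_sgd_step(grad, alpha, mu):
    # Zero-momentum step with the learning rate rescaled by 1/(1 - mu).
    return -alpha / (1.0 - mu) * grad

# Under a (locally) constant gradient, the momentum velocity converges
# exactly to the rescaled zero-momentum step.
alpha, mu = 0.1, 0.9
grad = np.array([1.0, -2.0])
v = np.zeros(2)
for _ in range(500):
    v = momentum_update(v, grad, alpha, mu)
```

After the loop, v agrees with rescaled_sgd_step(grad, alpha, mu) to machine precision.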
5.2 Asynchronous experiments

In this section, we evaluate closed-loop YELLOWFIN, focusing on the number of iterations needed to reach a given solution. To that end, we run 16 asynchronous workers on a single machine and force them to update the model in a round-robin fashion, i.e. the gradient is delayed for 15 iterations. Figure 1 (right) presents training losses on the CIFAR100 ResNet, using YELLOWFIN in Algorithm 1, closed-loop YELLOWFIN in Algorithm 5, and Adam with the learning rate achieving the best smoothed loss in Section 5.1. We can observe that closed-loop YELLOWFIN achieves a 2.01x speedup over YELLOWFIN, and consequently a 2.69x speedup over Adam. This demonstrates that closed-loop YELLOWFIN (1) accelerates by reducing algorithmic momentum to compensate for asynchrony and (2) can converge in fewer iterations than Adam in asynchronous-parallel training.

6 RELATED WORK

Many techniques have been proposed for tuning the hyperparameters of optimizers. General hyperparameter tuning approaches, such as random search (Bergstra & Bengio, 2012) and Bayesian approaches (Snoek et al., 2012; Hutter et al., 2011), can directly tune optimizers. As another trend, adaptive methods, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), use per-dimension learning rates. Schaul et al. (2013) use a noisy quadratic model similar to ours to tune the learning rate in vanilla SGD; however, they do not use momentum, which is essential in training modern neural nets. Existing adaptive momentum approaches either consider the deterministic setting (Graepel & Schraudolph, 2002; Rehman & Nawi, 2011; Hameed et al., 2016; Swanston et al., 1994; Ampazis & Perantonis, 2000; Qiu et al., 1992) or only analyze stochasticity with an O(1/t) learning rate (Leen & Orr, 1994). In contrast, we aim at practical momentum adaptivity for stochastically training neural nets.

7 DISCUSSION

We presented YELLOWFIN, the first optimization method that automatically tunes the momentum as well as the learning rate of momentum SGD. YELLOWFIN outperforms the state-of-the-art adaptive optimizers on a large class of models, both in synchronous and asynchronous settings. It estimates statistics purely from the gradients of a running system, and then tunes the hyperparameters of momentum SGD based on noisy, local quadratic approximations. As future work, we believe that more accurate curvature estimation methods, like the bbprop method (Martens et al., 2012), can further improve YELLOWFIN. We also believe that our closed-loop momentum control mechanism in Section 4 could accelerate other adaptive methods in asynchronous-parallel settings.

⁴ Let v_t = x_t − x_{t−1} be the model update; this rescaling is motivated by the fact that v_{t+1} = µ_t v_t − α_t ∇f(x_t). Assuming v_t evolves smoothly, we have v_t ≈ α_t/(1 − µ_t) ∇f(x_t).

ACKNOWLEDGEMENTS

We are grateful to Christopher Ré for his valuable guidance and support. We thank Bryan He, Paroma Varma, Chris De Sa, Tri Dao, Albert Gu, Fred Sala, Alex Ratner, Theodoros Rekatsinas, Olexa Bilaniuk and Avner May for helpful discussions and feedback. We gratefully acknowledge the support of the D3M program under No. FA8750-17-2-0095, the FRQNT new researcher program (2019-NC-257943), a grant by IVADO and a Canada CIFAR AI chair. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, or the Canadian or U.S. governments.

REFERENCES

Ampazis, N. and Perantonis, S. J. Levenberg-Marquardt algorithm with adaptive momentum for the efficient training of feedforward networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 1, pp. 126–131. IEEE, 2000.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Graepel, T. and Schraudolph, N. N. Stable adaptive momentum for rapid online learning in nonlinear systems. In International Conference on Artificial Neural Networks, pp. 450–455. Springer, 2002.

Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., and Ré, C. Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv preprint arXiv:1606.04487, 2016.

Hameed, A. A., Karlik, B., and Salman, M. S. Back-propagation algorithm with variable adaptive momentum. Knowledge-Based Systems, 114:79–87, 2016.
Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pp. 437–478. Springer, 2012.

Bengio, Y. et al. Deep learning of representations for unsupervised and transfer learning. ICML Unsupervised and Transfer Learning, 27:17–36, 2012.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pp. 421–436. Springer, 2012.

Chen, D., Bolton, J., and Manning, C. D. A thorough examination of the CNN/Daily Mail reading comprehension task. arXiv preprint arXiv:1606.02858, 2016.

Choe, D. K. and Charniak, E. Parsing as language modeling. In Empirical Methods in Natural Language Processing (EMNLP), 2016.

Duchi, J. Fisher information, 2016. URL https://web.stanford.edu/class/stats311/Lectures/lec-09.pdf.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Foucart, S. University Lecture, 2012. URL http://www.math.drexel.edu/~foucart/TeachingFiles/F12/M504Lect6.pdf.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. LION, 5:507–523, 2011.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.

Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset, 2014.

Leen, T. K. and Orr, G. B. Optimal stochastic search and adaptive momentum. In Advances in Neural Information Processing Systems, pp. 477–484, 1994.

Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Martens, J., Sutskever, I., and Swersky, K. Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464, 2012.

Mitliagkas, I., Zhang, C., Hadjis, S., and Ré, C. Asynchrony begets momentum, with an application to deep learning. arXiv preprint arXiv:1605.09774, 2016.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Niu, F., Recht, B., Ré, C., and Wright, S. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Orr, G. B. and Müller, K.-R. Neural Networks: Tricks of the Trade. Springer, 2003.

Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.

Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

Schaul, T., Zhang, S., and LeCun, Y. No more pesky learning rates. ICML (3), 28:343–351, 2013.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1139–1147, 2013.

Swanston, D., Bishop, J., and Mitchell, R. J. Simple adaptive momentum: New algorithm for training multilayer perceptrons. Electronics Letters, 30(18):1498–1500, 1994.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.

Qiu, G., Varley, M., and Terrell, T. Accelerated training of backpropagation networks by using adaptive momentum step. Electronics Letters, 28(4):377–379, 1992.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. 2018.

Rehman, M. Z. and Nawi, N. M. The effect of adaptive momentum in improving the accuracy of gradient descent backpropagation algorithm on classification problems. In International Conference on Software Engineering and Computer Systems, pp. 380–390. Springer, 2011.

Roux, N. L., Manzagol, P.-A., and Bengio, Y. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pp. 849–856, 2008.

Zhang, C. and Ré, C. DimmWitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283–1294, 2014. URL http://www.vldb.org/pvldb/vol7/p1283-zhang.pdf.

Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

A PROOF OF LEMMA 3

To prove Lemma 3, we first prove a more general version in Lemma 7. By restricting f to a one-dimensional quadratic function, the generalized curvature h_t itself is the only eigenvalue, and Lemma 3 follows as a straightforward corollary. Lemma 7 also implies that, in the multi-dimensional correspondence of (4), the spectral radius ρ(A_t) = √µ if the curvature on all eigenvector directions (eigenvalues) satisfies (6).

Lemma 7. Let the gradients of a function f be described by

    ∇f(x_t) = H(x_t)(x_t − x⋆),     (19)

with H(x_t): Rⁿ → Rⁿˣⁿ. Then the momentum update can be expressed as a linear operator:

    ( y_{t+1} )   ( I − αH(x_t) + µI   −µI ) ( y_t     )         ( y_t     )
    ( y_t     ) = ( I                   0   ) ( y_{t−1} )  =  A_t ( y_{t−1} ),     (20)

where y_t ≜ x_t − x⋆. Now, assume that the following condition holds for all eigenvalues λ(H(x_t)) of H(x_t):

    (1 − √µ)² / α  ≤  λ(H(x_t))  ≤  (1 + √µ)² / α.     (21)

Then the spectral radius of A_t is controlled by momentum, with ρ(A_t) = √µ.

Proof. Let λ_t be an eigenvalue of the matrix A_t; it gives det(A_t − λ_t I) = 0. We define the blocks in A_t as C = I − αH_t + µI − λ_t I, D = −µI, E = I and F = −λ_t I, which gives

    det(A_t − λ_t I) = det F · det(C − D F⁻¹ E) = 0,

assuming F is invertible in general. Note we write H_t ≜ H(x_t) for brevity. The equation det(C − D F⁻¹ E) = 0 implies that

    det( λ_t² I − λ_t M_t + µI ) = 0     (22)

with M_t = I − αH_t + µI. In other words, λ_t satisfies λ_t² − λ_t λ(M_t) + µ = 0, with λ(M_t) being an eigenvalue of M_t, i.e.

    λ_t = ( λ(M_t) ± √( λ(M_t)² − 4µ ) ) / 2.     (23)

On the other hand, (21) guarantees that (1 − αλ(H_t) + µ)² ≤ 4µ. We know both H_t and I − αH_t + µI are symmetric. Thus, for all eigenvalues λ(M_t) of M_t, we have λ(M_t)² = (1 − αλ(H_t) + µ)² ≤ 4µ, which guarantees |λ_t| = √µ for all λ_t.
As the spectral radius is equal to the magnitude of the largest eigenvalue of A_t, the spectral radius of A_t is √µ. ∎

B PROOF OF LEMMA 5

We first prove Lemma 8 and Lemma 9 as preparation for the proof of Lemma 5. After the proof for the one-dimensional case, we discuss the straightforward generalization to the multi-dimensional case.

Lemma 8. Let h be the curvature of a one-dimensional quadratic function f and x̄_t = E x_t. We assume, without loss of generality, that the optimum of f is x⋆ = 0. Then we have the following recurrence:

    ( x̄_{t+1} )   ( 1 − αh + µ   −µ )ᵗ ( x̄_1 )
    ( x̄_t     ) = ( 1             0  )   ( x̄_0 ).     (24)

Proof. From the recurrence of momentum SGD, we have

    E x_{t+1} = E[ x_t − α ∇f_{S_t}(x_t) + µ(x_t − x_{t−1}) ]
              = E_{x_t}[ x_t − α E_{S_t} ∇f_{S_t}(x_t) + µ(x_t − x_{t−1}) ]
              = E_{x_t}[ x_t − αh x_t + µ(x_t − x_{t−1}) ]
              = (1 − αh + µ) x̄_t − µ x̄_{t−1}.

Putting the equation into matrix form, (24) is a straightforward result from unrolling the recurrence t times. Note that, as we set x_1 = x_0 with no uncertainty in momentum SGD, we have [x̄_0, x̄_1] = [x_0, x_1]. ∎

Lemma 9. Let U_t = E(x_t − x̄_t)² and V_t = E(x_t − x̄_t)(x_{t−1} − x̄_{t−1}), with x̄_t the expectation of x_t. For a quadratic function f(x) with curvature h ∈ R, we have the following recurrence:

    ( U_{t+1} )                        ( α²C )
    ( U_t     ) = (I − Bᵗ)(I − B)⁻¹ (  0  ),     (25)
    ( V_{t+1} )                        (  0  )

where

        ( (1 − αh + µ)²   µ²   −2µ(1 − αh + µ) )
    B = ( 1               0     0              )     (26)
        ( 1 − αh + µ      0    −µ              )

and C = E(∇f_{S_t}(x_t) − ∇f(x_t))² is the variance of the gradient on minibatch S_t.

Proof.
We prove this by first deriving the recurrences for $U_t$ and $V_t$ respectively, and then combining them in matrix form. For $U_t$, we have
\[
\begin{aligned}
U_{t+1} &= \mathbb{E}(x_{t+1} - \bar{x}_{t+1})^2 \\
&= \mathbb{E}\left(x_t - \alpha \nabla f_{S_t}(x_t) + \mu(x_t - x_{t-1}) - (1 - \alpha h + \mu)\bar{x}_t + \mu \bar{x}_{t-1}\right)^2 \\
&= \mathbb{E}\left(x_t - \alpha \nabla f(x_t) + \mu(x_t - x_{t-1}) - (1 - \alpha h + \mu)\bar{x}_t + \mu \bar{x}_{t-1} + \alpha\left(\nabla f(x_t) - \nabla f_{S_t}(x_t)\right)\right)^2 \\
&= \mathbb{E}\left((1 - \alpha h + \mu)(x_t - \bar{x}_t) - \mu(x_{t-1} - \bar{x}_{t-1})\right)^2 + \alpha^2\, \mathbb{E}\left(\nabla f(x_t) - \nabla f_{S_t}(x_t)\right)^2 \\
&= (1 - \alpha h + \mu)^2\, \mathbb{E}(x_t - \bar{x}_t)^2 - 2\mu(1 - \alpha h + \mu)\, \mathbb{E}(x_t - \bar{x}_t)(x_{t-1} - \bar{x}_{t-1}) \\
&\quad + \mu^2\, \mathbb{E}(x_{t-1} - \bar{x}_{t-1})^2 + \alpha^2 C, 
\end{aligned} \tag{27}
\]
where the cross terms cancel in the third equality due to the fact that $\mathbb{E}_{S_t}\left[\nabla f(x_t) - \nabla f_{S_t}(x_t)\right] = 0$.

For $V_t$, we can similarly derive
\[
\begin{aligned}
V_t &= \mathbb{E}(x_t - \bar{x}_t)(x_{t-1} - \bar{x}_{t-1}) \\
&= \mathbb{E}\left((1 - \alpha h + \mu)(x_{t-1} - \bar{x}_{t-1}) - \mu(x_{t-2} - \bar{x}_{t-2}) + \alpha\left(\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1})\right)\right)(x_{t-1} - \bar{x}_{t-1}) \\
&= (1 - \alpha h + \mu)\, \mathbb{E}(x_{t-1} - \bar{x}_{t-1})^2 - \mu\, \mathbb{E}(x_{t-1} - \bar{x}_{t-1})(x_{t-2} - \bar{x}_{t-2}).
\end{aligned} \tag{28}
\]
Again, the term involving $\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1})$ cancels in the third equality as a result of $\mathbb{E}_{S_{t-1}}\left[\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1})\right] = 0$. Equations (27) and (28) can be jointly expressed in the following matrix form:
\[
\begin{pmatrix} U_{t+1} \\ U_t \\ V_{t+1} \end{pmatrix}
= B \begin{pmatrix} U_t \\ U_{t-1} \\ V_t \end{pmatrix}
+ \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix}
= \sum_{i=0}^{t-1} B^{i} \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix}
+ B^{t} \begin{pmatrix} U_1 \\ U_0 \\ V_1 \end{pmatrix}
= (I - B^{t})(I - B)^{-1} \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix}. \tag{29}
\]
Note that the second term in the second equality is zero because $x_0$ and $x_1$ are deterministic, so $U_1 = U_0 = V_1 = 0$.

According to Lemma 8 and Lemma 9, we have $(\bar{x}_t - x^*)^2 = \left(e_1^\top A^t [x_1, x_0]^\top\right)^2$ and $\mathbb{E}(x_t - \bar{x}_t)^2 = \alpha^2 C\, e_1^\top (I - B^t)(I - B)^{-1} e_1$, where $e_1$ denotes the standard basis vector of the appropriate dimension with all zero entries except the first. Combining these two terms proves Lemma 5. Though the proof here is for one-dimensional quadratics, it trivially generalizes to multi-dimensional quadratics. Specifically, we can decompose the quadratic along its eigenvector directions, and then apply Lemma 5 to each eigenvector direction using the corresponding curvature $h$ (eigenvalue). By summing the quantities in (11) over all eigenvector directions, we obtain the multi-dimensional counterpart of (11).

C PROOF OF LEMMA 6

Again, we first present a proof of a multi-dimensional generalization of Lemma 6; the proof of Lemma 6 is the one-dimensional special case of Lemma 10. Lemma 10 also implies that, for multi-dimensional quadratics, the corresponding spectral radius $\rho(B) = \mu$ if $\frac{(1 - \sqrt{\mu})^2}{\alpha} \le h \le \frac{(1 + \sqrt{\mu})^2}{\alpha}$ on all eigenvector directions, with $h$ the eigenvalue (curvature) in each direction.

Lemma 10. Let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix and $\rho(B)$ the spectral radius of the matrix
\[
B = \begin{pmatrix}
(I - \alpha H + \mu I)^\top (I - \alpha H + \mu I) & \mu^2 I & -2\mu(I - \alpha H + \mu I) \\
I & 0 & 0 \\
I - \alpha H + \mu I & 0 & -\mu I
\end{pmatrix}. \tag{30}
\]
Then $\rho(B) = \mu$ if all eigenvalues $\lambda(H)$ of $H$ satisfy
\[
\frac{(1 - \sqrt{\mu})^2}{\alpha} \le \lambda(H) \le \frac{(1 + \sqrt{\mu})^2}{\alpha}. \tag{31}
\]

Proof.
Let \u03bb be an eigenvalue of matrix B, it gives det(B \u2212 \u03bbI) = 0 which can be alternatively expressed as\r\n det(B \u2212\u03bbI)=detFdetC \u2212DF\u22121E\u0001=0 (32)\r\n assuming F is invertible, i.e. \u03bb + \u00b5 6= 0, where the blocks in B\r\n \u0012 \u0013 \u0012 \u0013 \u0012 \u0013\r\n \u22a4 2 \u22122\u00b5M M \u22a4\r\n C = M M\u2212\u03bbI \u00b5I ,D= 0 , E = 0 , F = \u2212\u00b5I \u2212\u03bbI\r\n I \u2212\u03bbI\r\n with M = I \u2212\u03b1H +\u00b5I.(32)canbetransformedusingstraight-forward algebra as\r\n \u0012 \u22a4 2 \u0013\r\n det (\u03bb\u2212\u00b5)M M \u2212(\u03bb+\u00b5)\u03bbI (\u03bb+\u00b5)\u00b5 I =0 (33)\r\n (\u03bb+\u00b5)I \u2212(\u03bb+\u00b5)\u03bbI\r\n Using similar simpli\ufb01cation technique as in (32), we can further simplify into\r\n \u0010 2 \u22a4 \u0011\r\n (\u03bb\u2212\u00b5)det (\u03bb+\u00b5) I \u2212\u03bbM M =0 (34)\r\n 2 \u22a4 2 2\r\n if \u03bb 6= \u00b5, as (\u03bb + \u00b5) I \u2212 \u03bbM M is diagonalizable, we have (\u03bb + \u00b5) \u2212\u03bb\u03bb(M) = 0 with \u03bb(M) being an eigenvalue\r\n of symmetric M. The analytic solution to the equation can be explicitly expressed as\r\n \u03bb=\u03bb(M)2\u22122\u00b5\u00b1p(\u03bb(M)2\u22122\u00b5)2\u22124\u00b52. (35)\r\n 2\r\n Whenthecondition in (31) holds, we have \u03bb(M)2 = (1\u2212\u03b1\u03bb(H)+\u00b5)2 \u2264 4\u00b5. One can verify that\r\n 2 2 2 2 2\r\n (\u03bb(M) \u22122\u00b5) \u22124\u00b5 = (\u03bb(M) \u22124\u00b5)\u03bb(M)\r\n 2 \u0001 2 (36)\r\n = (1 \u2212\u03b1\u03c1(H)+\u00b5) \u22124\u00b5 \u03bb(M)\r\n \u2264 0\r\n Thusthe roots in (35) are conjugate with |\u03bb| = \u00b5. In conclusion, the condition in (31) can guarantee all the eigenvalues of\r\n Bhasmagnitude\u00b5. Thusthespectral radius of B is controlled by \u00b5.\r\n D PRACTICALIMPLEMENTATION\r\n In Section 3.2, we discuss estimators for learning rate and momentum tuning in YELLOWFIN. 
In our experiments, we identified a few practical implementation details that are important for improving these estimators. Zero-debiasing, proposed by Kingma & Ba (2014), accelerates the adaptation of an exponential moving average to the level of the underlying quantity early in training. We apply zero-debiasing to all exponentially averaged quantities in our estimators. In some LSTM models, we observe that the estimated curvature may decrease quickly along the optimization process. To better estimate the extremal curvatures $h_{\max}$ and $h_{\min}$ under such a fast-decreasing trend, we apply the zero-debiased exponential average to the logarithm of $h_{\max,t}$ and $h_{\min,t}$, instead of to $h_{\max,t}$ and $h_{\min,t}$ directly. In addition to these two techniques, we also implement the slow-start heuristic proposed by Schaul et al. (2013). Specifically, we use $\alpha = \min\{\alpha_t,\; t \cdot \alpha_t / (10 \cdot w)\}$ as our learning rate, with $w$ the size of the sliding window used in the $h_{\max}$ and $h_{\min}$ estimation. This discounts the learning rate in the first $10 \cdot w$ steps and helps keep the learning rate small at the beginning, when the exponentially averaged quantities are not yet accurate.

network | # layers | Conv0 | Unit 1s | Unit 2s | Unit 3s
CIFAR10 ResNet | 110 | [3x3, 4] | [3x3, 4; 3x3, 4] x6 | [3x3, 8; 3x3, 8] x6 | [3x3, 16; 3x3, 16] x6
CIFAR100 ResNet | 164 | [3x3, 4] | [1x1, 16; 3x3, 16; 1x1, 64] x6 | [1x1, 32; 3x3, 32; 1x1, 128] x6 | [1x1, 64; 3x3, 64; 1x1, 256] x6

network | # layers | Word Embed. | Layer 1 | Layer 2 | Layer 3
TS LSTM | 2 | [65 vocab, 128 dim] | 128 hidden units | 128 hidden units | --
PTB LSTM | 2 | [10000 vocab, 200 dim] | 200 hidden units | 200 hidden units | --
WSJ LSTM | 3 | [6922 vocab, 500 dim] | 500 hidden units | 500 hidden units | 500 hidden units

Table 3. Specification of ResNet and LSTM model architectures.

E MODEL SPECIFICATION

The model specifications for all the experiments in Section 5 are shown in Table 3. The CIFAR10 ResNet uses regular ResNet units, while the CIFAR100 ResNet uses bottleneck units. Only the convolutional layers are shown, with filter size, filter number, and the repetition count of the units. The layer count for the ResNets also includes batch normalization and ReLU layers. The LSTM models are likewise diversified across tasks, with different vocabulary sizes, word embedding dimensions, and numbers of layers.

F SPECIFICATION FOR SYNCHRONOUS EXPERIMENTS

Section 5.1 presents the synchronous experiments with extensive discussion. For reproducibility, we provide here the specification of the learning rate grids. The numbers of iterations and of epochs, i.e. passes over the full training set, are also listed for completeness. For YELLOWFIN, in all the experiments in Section 5 we uniformly use a sliding window of size 20 for extremal curvature estimation and $\beta = 0.999$ for smoothing. For momentum SGD and Adam,
For momentum SGD and Adam,\r\n weusethefollowing con\ufb01gurations.\r\n \u2022 CIFAR10ResNet\r\n \u2013 40k iterations (\u223c114 epochs)\r\n \u2013 MomentumSGDlearningrates{0.001,0.01(best),0.1,1.0}, momentum 0.9\r\n \u2013 Adamlearning rates {0.0001,0.001(best),0.01,0.1}\r\n \u2022 CIFAR100ResNet\r\n \u2013 120k iterations (\u223c341 epochs)\r\n \u2013 MomentumSGDlearningrates{0.001,0.01(best),0.1,1.0}, momentum 0.9\r\n \u2013 Adamlearning rates {0.00001,0.0001(best),0.001,0.01}\r\n \u2022 PTBLSTM\r\n \u2013 30kiterations (\u223c13 epochs)\r\n \u2013 MomentumSGDlearningrates{0.01,0.1,1.0(best),10.0}, momentum 0.9\r\n \u2013 Adamlearning rates {0.0001,0.001(best),0.01,0.1}\r\n \u2022 TSLSTM\r\n \u2013 \u223c21kiterations (50 epochs)\r\n \u2013 MomentumSGDlearningrates{0.05,0.1,0.5,1.0(best),5.0}, momentum 0.9\r\n \u2013 Adamlearning rates {0.0005,0.001,0.005(best),0.01,0.05}\r\n \u2013 Decrease learning rate by factor 0.97 every epoch for all optimizers, following the design by Karpathy et al.\r\n (2015).\r\n YELLOWFINandtheArtofMomentumTuning\r\n \u2022 WSJLSTM\r\n \u2013 \u223c120kiterations (50 epochs)\r\n \u2013 MomentumSGDlearningrates{0.05,0.1,0.5(best),1.0,5.0}, momentum 0.9\r\n \u2013 Adamlearning rates {0.0001,0.0005,0.001(best),0.005,0.01}\r\n \u2013 Vanilla SGD learning rates {0.05,0.1,0.5,1.0(best),5.0}\r\n \u2013 Adagrad learning rates {0.05,0.1,0.5(best),1.0,5.0}\r\n \u2013 Decrease learning rate by factor 0.9 every epochs after 14 epochs for all optimizers, following the design by Choe\r\n &Charniak.\r\n G ADDITIONALEXPERIMENTRESULTS\r\n G.1 Theimportanceofadaptivemomentum\r\n In Section 5.1, we discussed the importance of adaptive momentum by demonstrating the training loss on the TS LSTM and\r\n CIFAR100ResNetmodels. InFigure9,wefurthervalidate the importance of adaptive momentum by demonstrating the\r\n corresponding validation/test performance on the PSTM LSTM and CIFAR100 ResNet models. 
In particular, in Figure 9 (left and middle), similar to our observation in the training loss comparison, we see that neither prescribed momentum 0.0 nor 0.9 can match the performance of YELLOWFIN with adaptive momentum across the two tasks. Furthermore, in Figure 9 (right), hand-tuned vanilla SGD without momentum decreases the validation perplexity on TS LSTM more slowly than the momentum-based methods. However, by dynamically rescaling the vanilla SGD learning rate based on the YELLOWFIN-tuned momentum, it achieves a validation perplexity decrease matching that of the momentum-based methods.

Figure 9. Importance of adaptive momentum: the validation/test performance comparison between YELLOWFIN with adaptive momentum and YELLOWFIN with fixed momentum values; this comparison is conducted on TS LSTM (left) and CIFAR100 ResNet (middle). Prescribed momentum values do not match the performance of YELLOWFIN with adaptive momentum across the two tasks. An adaptive learning rate for SGD based on the YELLOWFIN-tuned momentum can match the performance of momentum-based methods on the TS LSTM (right).

G.2 Training loss and test accuracy on CIFAR10 and CIFAR100 ResNet

Figure 10 shows the training loss on CIFAR10 ResNet and CIFAR100 ResNet. Specifically, YELLOWFIN matches the performance of hand-tuned momentum SGD, and achieves 1.93x and 1.38x speedups over hand-tuned Adam on CIFAR10 and CIFAR100 ResNet respectively. In Figure 11, we show the test accuracy curves corresponding to the curves in Figure 10.
We observe that YELLOWFIN attains matching or better training loss at the end of training than hand-tuned momentum SGD, while its test accuracy is sometimes worse (e.g. on CIFAR100); this phenomenon, where better training loss does not guarantee better generalization, is often observed in deep learning results.

Figure 10. The best training loss for the 110-layer CIFAR10 ResNet (left) and the 164-layer CIFAR100 bottleneck ResNet (right).

Figure 11. Test accuracy for the 110-layer CIFAR10 ResNet (left) and the 164-layer CIFAR100 bottleneck ResNet (right). The test accuracy curves correspond to the training loss curves in Figure 10.

G.3 Tuning momentum can improve Adam in the asynchronous-parallel setting

Figure 12. Hand-tuning Adam's momentum under asynchrony.

We conduct experiments on PTB LSTM with 16 asynchronous workers using Adam, following the same protocol as in Section 5.2. Fixing the learning rate to the value achieving the lowest smoothed loss in Section 5.1, we sweep the smoothing parameter $\beta_1$ (Kingma & Ba, 2014) of the first-order moment estimate over the grid {-0.2, 0.0, 0.3, 0.5, 0.7, 0.9}. $\beta_1$ serves the same role as momentum in SGD, and we call it the momentum in Adam. Figure 12 shows that tuning the momentum for Adam under asynchrony gives measurably better training loss.
This result emphasizes the importance of momentum tuning in asynchronous settings and suggests that state-of-the-art adaptive methods can perform sub-optimally when using prescribed momentum.

G.4 Accelerating YELLOWFIN with finer-grain learning rate tuning

As an adaptive tuner, YELLOWFIN does not involve manual tuning. It can therefore offer faster development iterations on model architectures than grid search over optimizer hyperparameters. In deep learning practice for computer vision and natural language processing, after fixing the model architecture, extensive optimizer tuning (e.g. grid search or random search) can further improve the performance of a model. A natural question is whether we can also lightly tune YELLOWFIN to accelerate convergence and improve model performance. Specifically, we can manually multiply the auto-tuned YELLOWFIN learning rate by a positive number, the learning rate factor, to further accelerate training.

In this section, we empirically demonstrate the effectiveness of the learning rate factor on a 29-layer ResNext (2x64d) (Xie et al., 2016) on CIFAR10, and on a Tied LSTM model (Press & Wolf, 2016) with 650-dimensional word embeddings and two hidden-unit layers on the PTB dataset. When running YELLOWFIN, we search for the optimal learning rate factor in the grid {1/3, 0.5, 1, 2 (best for ResNext), 3 (best for Tied LSTM), 10}. Similarly, we search the same learning rate factor grid for Adam, multiplying the factor by its default learning rate 0.001. To further strengthen the Adam baseline, we also run it on the conventional logarithmic learning rate grids {5e-5, 1e-4, 5e-4, 1e-3, 5e-3} for ResNext and {1e-4, 5e-4, 1e-3, 5e-3, 1e-2} for Tied LSTM. We report the best metric from searching the union of the learning rate factor grid and the logarithmic learning rate grid as the searched Adam results.
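Concretely, the factor search above reduces to a one-dimensional grid search over a multiplier on the base learning rate. Below is a minimal sketch, where `train_and_eval` is a hypothetical stand-in for a full training run that returns a validation metric (lower is better, e.g. perplexity):

```python
# Learning rate factor grid from the text (1/3 written explicitly).
FACTORS = (1 / 3, 0.5, 1.0, 2.0, 3.0, 10.0)

def search_lr_factor(train_and_eval, base_lr, factors=FACTORS):
    """Grid-search a multiplicative factor on base_lr; return the best
    (factor, metric) pair under the given validation metric."""
    results = {f: train_and_eval(base_lr * f) for f in factors}
    best = min(results, key=results.get)
    return best, results[best]

# Toy stand-in objective whose "validation metric" is minimized at lr = 2.
best_factor, best_metric = search_lr_factor(lambda lr: (lr - 2.0) ** 2, base_lr=1.0)
print(best_factor, best_metric)  # 2.0 0.0
```

For Adam, `base_lr` would be its default 0.001; for YELLOWFIN, the factor instead multiplies the auto-tuned learning rate at every step.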
Recently, AMSGrad (AMSG) (Reddi et al., 2018) was proposed as a variant of Adam that corrects a convergence issue on certain convex problems. To provide a complete comparison, we additionally perform a learning rate factor search for AMSG with the grid {0.1, 1/3, 0.5, 1, 2, 3, 10}. Empirically, we observe that Adam and AMSG have similar convergence behavior for the same learning rate factors, and that learning rate factors 1/3 and 1.0 work best for Adam/AMSG on ResNext and Tied LSTM, respectively.

As shown in Figure 13, with the searched best learning rate factor, YELLOWFIN improves the validation perplexity on Tied LSTM from 88.7 to 80.5, an improvement of more than 9%. Similarly, the searched learning rate factor improves the test accuracy on ResNext from 92.63 to 94.75. More importantly, with the learning rate factor search on the two models, YELLOWFIN achieves better validation metrics than the searched Adam and AMSG results. This demonstrates that finer-grain learning rate tuning, i.e. the learning rate factor search, can be effectively applied to YELLOWFIN to improve the performance of deep learning models.

Figure 13. Validation perplexity on Tied LSTM and validation accuracy on ResNext. Learning rate fine-tuning using a grid-searched factor can further improve the performance of YELLOWFIN in Algorithm 1. YELLOWFIN with learning rate factor search can outperform hand-tuned Adam on validation metrics on both models.