{"title": "YellowFin and the Art of Momentum Tuning", "book": "Proceedings of Machine Learning and Systems", "page_first": 289, "page_last": 308, "abstract": "Hyperparameter tuning is one of the most time-consuming workloads in deep learning. State-of-the-art optimizers, such as AdaGrad, RMSProp and Adam, reduce this labor by adaptively tuning an individual learning rate for each variable. Recently researchers have shown renewed interest in simpler methods like momentum SGD as they may yield better test metrics. Motivated by this trend, we ask: can simple adaptive methods based on SGD perform as well or better? We revisit the momentum SGD algorithm and show that hand-tuning a single learning rate and momentum makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for momentum and learning rate in SGD. YellowFin optionally uses a negative-feedback loop to compensate for the momentum dynamics in asynchronous settings on the fly. We empirically show that YellowFin can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing, with a speedup of up to 3.28x in synchronous and up to 2.69x in asynchronous settings.\n", "full_text": "YELLOWFIN AND THE ART OF MOMENTUM TUNING\r\nJian Zhang^1, Ioannis Mitliagkas^2\r\n^1 Computer Science Department, Stanford University, CA, USA. ^2 Mila, University of Montr\u00e9al, Canada CIFAR AI Chair. Correspondence to: Jian Zhang <zjian@stanford.edu>, Ioannis Mitliagkas <ioannis@iro.umontreal.ca>.\r\nProceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\nABSTRACT\r\nHyperparameter tuning is one of the most time-consuming workloads in deep learning. State-of-the-art optimizers, such as AdaGrad, RMSProp and Adam, reduce this labor by adaptively tuning an individual learning rate for each variable. Recently researchers have shown renewed interest in simpler methods like momentum SGD as they may yield better test metrics. Motivated by this trend, we ask: can simple adaptive methods based on SGD perform as well or better? We revisit the momentum SGD algorithm and show that hand-tuning a single learning rate and momentum makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YELLOWFIN, an automatic tuner for momentum and learning rate in SGD. YELLOWFIN optionally uses a negative-feedback loop to compensate for the momentum dynamics in asynchronous settings on the fly. We empirically show that YELLOWFIN can converge in fewer iterations than Adam on ResNets and LSTMs for image recognition, language modeling and constituency parsing, with a speedup of up to 3.28x in synchronous and up to 2.69x in asynchronous settings.\r\n1 INTRODUCTION\r\nAccelerated forms of stochastic gradient descent (SGD), pioneered by Polyak (1964) and Nesterov (1983), are the de-facto training algorithms for deep learning. Their use requires a sane choice for their hyperparameters: typically a learning rate and momentum parameter (Sutskever et al., 2013). However, tuning hyperparameters is arguably the most time-consuming part of deep learning, with many papers written to outline best tuning practices (Bengio, 2012; Orr & M\u00fcller, 2003; Bengio et al., 2012; Bottou, 2012). Deep learning researchers have proposed a number of methods to deal with hyperparameter optimization, ranging from grid-search and smart black-box methods (Bergstra & Bengio, 2012; Snoek et al., 2012) to adaptive optimizers. Adaptive optimizers aim to eliminate hyperparameter search by tuning on the fly for a single training run: algorithms like AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014) use the magnitude of gradient elements to tune learning rates individually for each variable and have been largely successful in relieving practitioners of tuning the learning rate.\r\n[Figure 1. YELLOWFIN in comparison to Adam on a ResNet (CIFAR100, cf. Section 5) in synchronous and asynchronous settings. Panels: synchronous training (Adam, YellowFin) and asynchronous training (Adam, YellowFin, Closed-loop YellowFin). Axes: training loss versus iterations.]\r\nRecently some researchers have started favoring simple momentum SGD over the previously mentioned adaptive methods (Chen et al., 2016; Gehring et al., 2017), often reporting better test scores (Wilson et al., 2017). Motivated by this trend, we ask the question: can simpler adaptive methods based on momentum SGD perform as well or better? We empirically show that, with a hand-tuned learning rate, Polyak\u2019s momentum SGD achieves faster convergence than Adam for a large class of models. We then formulate the optimization update as a dynamical system and study certain robustness properties of the momentum operator. Inspired by our analysis, we design YELLOWFIN, an automatic hyperparameter tuner for momentum SGD. YELLOWFIN simultaneously tunes the learning rate and momentum on the fly, and can handle the complex dynamics of asynchronous execution. Our contribution and outline are as follows:\r\n\u2022 In Section 2, we demonstrate examples where momentum offers convergence robust to learning rate misspecification and curvature variation in a class of non-convex objectives. This robustness is desirable for deep learning. It stems from a known but obscure fact: the momentum operator\u2019s spectral radius is constant in a large subset of the hyperparameter space.\r\n\u2022 In Section 3, we use these robustness insights and a simple quadratic model analysis to motivate the design of YELLOWFIN, an automatic tuner for momentum SGD. YELLOWFIN uses on-the-fly measurements from the gradients to tune both a single learning rate and a single momentum.\r\n\u2022 In Section 3.3, we discuss common stability concerns related to the phenomenon of exploding gradients (Pascanu et al., 2013). We present a natural extension to our basic tuner, using adaptive gradient clipping, to stabilize training for objectives with exploding gradients.\r\n\u2022 In Section 4 we present closed-loop YELLOWFIN, suited for asynchronous training. It uses a novel component for measuring the total momentum in a running system, including any asynchrony-induced momentum, a phenomenon described in (Mitliagkas et al., 2016). This measurement is used in a negative feedback loop to control the value of algorithmic momentum.\r\nWe provide a thorough empirical evaluation of the performance and stability of our tuner. In Section 5, we demonstrate empirically that on ResNets and LSTMs YELLOWFIN can converge in fewer iterations compared to: (i) hand-tuned momentum SGD (up to 1.75x speedup); and (ii) hand-tuned Adam (0.77x to 3.28x speedup). Under asynchrony, the closed-loop control architecture speeds up YELLOWFIN, making it up to 2.69x faster than Adam. Our experiments include runs on 7 different models, randomized over at least 3 different random seeds. YELLOWFIN is stable and achieves consistent performance: the normalized sample standard deviation of test metrics varies from 0.05% to 0.6%. We released PyTorch and TensorFlow implementations^1 that can be used as drop-in replacements for any optimizer. YELLOWFIN has also been implemented in various other packages. Its large-scale deployment in industry has taught us important lessons about stability; we discuss those challenges and our solution in Section 3.3. We conclude with related work and discussion in Sections 6 and 7.\r\n^1 TensorFlow: goo.gl/zC2rjG. PyTorch: goo.gl/N4sFfs\r\nOur goal is to explore the value of momentum adaptation for SGD and provide a prototype, efficient tuner achieving this. While we report state-of-the-art performance results in some tasks, we do not claim that on-the-fly momentum adaptation is a necessary feature of a well-performing synchronous system. In Section 5.1 we demonstrate that a simple variation of YELLOWFIN, only using the momentum value to further rescale the step size, can yield an adaptive step size method that performs almost as well in some cases.\r\n2 THE MOMENTUM OPERATOR\r\nIn this section, we identify the main technical insight behind the design of our tuner: gradient descent with momentum can exhibit linear convergence robust to learning rate misspecification and to curvature variation. The robustness to learning rate misspecification means tolerance to a less-carefully-tuned learning rate. On the other hand, the robustness to curvature variation means empirical linear convergence on a class of non-convex objectives with varying curvatures. After preliminaries on momentum, we discuss these two properties, which are desirable for deep learning objectives.\r\n2.1 Preliminaries\r\nWe aim to minimize some objective $f(x)$. In machine learning, $x$ is referred to as the model and the objective is some loss function. A low loss implies a well-fit model. Gradient descent-based procedures use the gradient of the objective function, $\\nabla f(x)$, to update the model iteratively. These procedures can be characterized by the convergence rate with respect to the distance to a minimum.\r\nDefinition 1 (Convergence rate). Let $x^*$ be a local minimum of $f(x)$ and $x_t$ denote the model after $t$ steps of an iterative procedure. The iterates converge to $x^*$ with linear rate $\\beta$, if\r\n$$\\|x_t - x^*\\| = O(\\beta^t \\|x_0 - x^*\\|).$$\r\nPolyak\u2019s momentum gradient descent (Polyak, 1964) is one of these iterative procedures, given by\r\n$$x_{t+1} = x_t - \\alpha \\nabla f(x_t) + \\mu (x_t - x_{t-1}), \\tag{1}$$\r\nwhere $\\alpha$ denotes a single learning rate and $\\mu$ a single momentum for all model variables. Momentum\u2019s main appeal is its established ability to accelerate convergence (Polyak, 1964). On a $\\gamma$-strongly convex $\\delta$-smooth function with condition number $\\kappa = \\delta/\\gamma$, the optimal convergence rate of gradient descent without momentum is $O\\left(\\frac{\\kappa-1}{\\kappa+1}\\right)$ (Nesterov, 2013). On the other hand, for certain classes of strongly convex and smooth functions, like quadratics, the optimal momentum value,\r\n$$\\mu^* = \\left(\\frac{\\sqrt{\\kappa}-1}{\\sqrt{\\kappa}+1}\\right)^2, \\tag{2}$$\r\nyields the optimal accelerated linear convergence rate $O\\left(\\frac{\\sqrt{\\kappa}-1}{\\sqrt{\\kappa}+1}\\right)$. This guarantee does not generalize to arbitrary strongly convex smooth functions (Lessard et al., 2016). Nonetheless, this linear rate can often be observed in practice even on non-quadratics (cf. Section 2.2).\r\nKey insight: Consider a quadratic objective with condition number $\\kappa > 1$. Even though its curvature is different along the different directions, Polyak\u2019s momentum gradient descent, with $\\mu \\ge \\mu^*$, achieves the same linear convergence rate $\\sqrt{\\mu}$ along all directions. Specifically, let $x_{i,t}$ and $x_i^*$ be the $i$-th coordinates of $x_t$ and $x^*$. For any $\\mu \\ge \\mu^*$ with an appropriate learning rate, the update in (1) can achieve $|x_{i,t} - x_i^*| \\le \\sqrt{\\mu}^{\\,t} |x_{i,0} - x_i^*|$ simultaneously along all axes $i$. This insight has been hidden away in proofs.\r\nIn this quadratic case, curvature is different across different axes, but remains constant on any one-dimensional slice. In the next section (Section 2.2), we extend this insight to non-quadratic one-dimensional functions. We then present the main technical insight behind the design of YELLOWFIN: a similar linear convergence rate $\\sqrt{\\mu}$ can be achieved in a class of one-dimensional non-convex objectives where curvature varies; this linear convergence behavior is robust to learning rate misspecification and to the varying curvature. These robustness properties are behind a tuning rule for learning rate and momentum in Section 2.2. We extend this rule to handle SGD noise and generalize it to multidimensional objectives in Section 3.\r\n2.2 Robustness properties of the momentum operator\r\nIn this section, we analyze the dynamics of momentum on a class of one-dimensional, non-convex objectives. We first introduce the notion of generalized curvature and use it to describe the momentum operator. Then we discuss the robustness properties of the momentum operator.\r\nCurvature along different directions is encoded in the different eigenvalues of the Hessian. It is the only feature of a quadratic needed to characterize the convergence of gradient descent. Specifically, gradient descent achieves a linear convergence rate $|1 - \\alpha h_c|$ on one-dimensional quadratics with constant curvature $h_c$. On one-dimensional non-quadratic objectives with varying curvature, this neat characterization is lost. We can recover it by defining a new kind of \u201ccurvature\u201d with respect to a specific minimum.\r\nDefinition 2 (Generalized curvature). Let $x^*$ be a local minimum of $f(x) : \\mathbb{R} \\to \\mathbb{R}$. Generalized curvature with respect to $x^*$, denoted by $h(x)$, satisfies the following.\r\n$$f'(x) = h(x)(x - x^*). \\tag{3}$$\r\nGeneralized curvature describes, in some sense, non-local curvature with respect to minimum $x^*$. It coincides with curvature on quadratics. On non-quadratic objectives, it characterizes the convergence behavior of gradient descent-based algorithms. Specifically, we recover the fact that starting at point $x_t$, the distance from minimum $x^*$ is reduced by a factor of $|1 - \\alpha h(x_t)|$ in one step of gradient descent. Using a state-space augmentation, we can rewrite the momentum update of (1) as\r\n$$\\begin{pmatrix} x_{t+1} - x^* \\\\ x_t - x^* \\end{pmatrix} = A_t \\begin{pmatrix} x_t - x^* \\\\ x_{t-1} - x^* \\end{pmatrix} \\tag{4}$$\r\nwhere the momentum operator $A_t$ at time $t$ is defined as\r\n$$A_t \\triangleq \\begin{bmatrix} 1 - \\alpha h(x_t) + \\mu & -\\mu \\\\ 1 & 0 \\end{bmatrix}. \\tag{5}$$\r\nLemma 3 (Robustness of the momentum operator). Assume that generalized curvature $h$ and hyperparameters $\\alpha, \\mu$ satisfy\r\n$$(1 - \\sqrt{\\mu})^2 \\le \\alpha h(x_t) \\le (1 + \\sqrt{\\mu})^2. \\tag{6}$$\r\nThen as proven in Appendix A, the spectral radius of the momentum operator at step $t$ depends solely on the momentum parameter: $\\rho(A_t) = \\sqrt{\\mu}$, for all $t$. The inequalities in (6) define the robust region, the set of learning rate $\\alpha$ and momentum $\\mu$ achieving this $\\sqrt{\\mu}$ spectral radius.\r\nWe know that the spectral radius of an operator, $A$, describes its asymptotic behavior when applied multiple times: $\\|A^t x\\| \\approx O(\\rho(A)^t)$.^2 Unfortunately, the same does not always hold for the composition of different operators, even if they have the same spectral radius, $\\rho(A_t) = \\sqrt{\\mu}$. It is not always true that $\\|A_t \\cdots A_1 x\\| = O(\\sqrt{\\mu}^{\\,t})$. However, a homogeneous spectral radius often yields the $\\sqrt{\\mu}^{\\,t}$ rate empirically. In other words, this linear convergence rate is not guaranteed. Instead, we demonstrate examples to expose the robustness properties: if the learning rate $\\alpha$ and momentum $\\mu$ are in the robust region, the homogeneity of spectral radii can empirically yield linear convergence with rate $\\sqrt{\\mu}$; this behavior is robust with respect to learning rate misspecification and to varying curvature.\r\n^2 For any $\\epsilon > 0$, there exists a matrix norm $\\|\\cdot\\|$ such that $\\|A\\| \\le \\rho(A) + \\epsilon$ (Foucart, 2012).\r\nMomentum is robust to learning rate misspecification. For a one-dimensional quadratic with curvature $h$, we have generalized curvature $h(x) = h$ for all $x$. Lemma 3 implies the spectral radius $\\rho(A_t) = \\sqrt{\\mu}$ if\r\n$$(1 - \\sqrt{\\mu})^2 / h \\le \\alpha \\le (1 + \\sqrt{\\mu})^2 / h. \\tag{7}$$\r\nIn Figure 2, we plot $\\rho(A_t)$ for different $\\alpha$ and $\\mu$ when $h = 1$. The solid line segments correspond to the robust region. As we increase momentum, a linear rate of convergence, $\\sqrt{\\mu}$, is robustly achieved by an ever-widening range of learning rates: higher values of momentum are more robust to learning rate misspecification.\r\n[Figure 2. Spectral radius of momentum operator on scalar quadratic for varying $\\alpha$. Axes: spectral radius versus learning rate ($\\alpha$); curves for $\\mu = 0.0, 0.1, 0.3, 0.5$.]\r\nThis property influences the design of our tuner: more generally, for a class of one-dimensional non-convex objectives, as long as the learning rate $\\alpha$ and momentum $\\mu$ are in the robust region, i.e. satisfy (6) at every step, the momentum operators at all steps $t$ have the same spectral radius. In the case of quadratics, this implies a convergence rate of $\\sqrt{\\mu}$, independent of the learning rate. Having established that, we can just focus on optimally tuning momentum.\r\nMomentum is robust to varying curvature. As discussed in Section 2.1, the intuition hidden in classic results is that for certain strongly convex smooth objectives, momentum at least as high as the value in (2) can achieve the same rate of linear convergence along all axes with different curvatures. We extend this intuition to certain one-dimensional non-convex functions with varying curvatures along their domains; we discuss the generalization to multidimensional cases in Section 3.1. Lemma 3 guarantees constant, time-homogeneous spectral radii for momentum operators $A_t$ assuming (6) is satisfied at every step. This assumption motivates a \u201clong-range\u201d extension of the condition number.\r\nDefinition 4 (Generalized condition number). We define the generalized condition number (GCN) with respect to a local minimum $x^*$ of a scalar function, $f(x) : \\mathbb{R} \\to \\mathbb{R}$, to be the dynamic range of its generalized curvature $h(x)$:\r\n$$\\nu = \\frac{\\sup_{x \\in \\mathrm{dom}(f)} h(x)}{\\inf_{x \\in \\mathrm{dom}(f)} h(x)}. \\tag{8}$$\r\nThe GCN captures variations in generalized curvature along a scalar slice. From Lemma 3 we get\r\n$$\\mu \\ge \\mu^* = \\left(\\frac{\\sqrt{\\nu}-1}{\\sqrt{\\nu}+1}\\right)^2, \\qquad \\frac{(1 - \\sqrt{\\mu})^2}{\\inf_{x \\in \\mathrm{dom}(f)} h(x)} \\le \\alpha \\le \\frac{(1 + \\sqrt{\\mu})^2}{\\sup_{x \\in \\mathrm{dom}(f)} h(x)} \\tag{9}$$\r\nas the description of the robust region. Momentum and learning rate values satisfying (9) guarantee a homogeneous spectral radius of $\\sqrt{\\mu}$ for all $A_t$. Specifically, $\\mu^*$ is the smallest momentum value that allows for homogeneous spectral radii. Similar to the optimal $\\mu^*$ in (2) for the quadratic case, we notice that the optimal $\\mu$ in (9) is objective dependent. The optimal momentum $\\mu^*$ is close to 1 for objectives with large generalized condition number $\\nu$, while objectives with small $\\nu$ imply an optimal momentum $\\mu^*$ that is close to 0.\r\nWe demonstrate with examples that by using a momentum larger than the objective-dependent $\\mu^*$, homogeneous spectral radii suggest an empirical linear convergence behavior on a class of non-convex objectives. In Figure 3(a), the non-convex objective, composed of two quadratics with curvatures 1 and 1000, has a GCN of 1000. Using the tuning rule of (9) and running the momentum algorithm (Figure 3(b)) practically yields the linear convergence predicted by Lemma 3. In Figures 3(c,d), we demonstrate an LSTM as another example. As we increase the momentum value (the same value for all variables in the model), more model variables follow a $\\sqrt{\\mu}$ convergence rate. In these examples, the linear convergence is robust to the varying curvature of the objectives. This property influences our tuner design: in the next section, we extend the tuning rules of (9) to handle SGD noise; we generalize the extended rule to multidimensional cases as the tuning rule in YELLOWFIN.\r\nThe role of generalized curvature. Generalized curvature (GC) defines a quantity that is an alternative to classic curvature and is directly related to the contraction properties of the momentum operator on non-quadratic scalar problems. Note that similar quantities, e.g. the PL condition (Karimi et al., 2016), have been used in the analysis of gradient descent. Respectively, the ensuing generalized condition number (GCN) is meant to describe the dynamic range of this contractivity around a minimum of a non-quadratic function.\r\n3 THE YELLOWFIN TUNER\r\nHere we describe our tuner for momentum SGD that uses the same learning rate for all variables. We first introduce a noisy quadratic model $f(x)$ as the local approximation of an arbitrary one-dimensional objective. On this approximation, we extend the tuning rule of (9) to SGD. In Section 3.1, we generalize the discussion to multidimensional objectives; it yields the YELLOWFIN tuning rule.\r\nNoisy quadratic model. We consider a scalar quadratic\r\n$$f(x) = \\frac{h}{2} x^2 + C = \\sum_i \\frac{h}{2n} (x - c_i)^2 \\triangleq \\frac{1}{n} \\sum_i f_i(x) \\tag{10}$$\r\nwith $\\sum_i c_i = 0$. $f(x)$ is a quadratic approximation of the original objectives with $h$ and $C$ derived from measurement on the original objective. The function $f(x)$ is defined as the average of $n$ component functions, $f_i$. This is a common model for SGD, where we use only a single data point (or a mini-batch) drawn uniformly at random, $S_t \\sim \\mathrm{Uni}([n])$, to compute a noisy gradient, $\\nabla f_{S_t}(x)$, for step $t$. Here, $C = \\frac{1}{2n} \\sum_i h c_i^2$ denotes the gradient variance. As optimization on quadratics decomposes into scalar problems along the principal eigenvectors of the Hessian, the scalar model in (10) is sufficient to study local quadratic approximations of multidimensional objectives. Next we get an exact expression for the mean square error after running momentum SGD on the scalar quadratic in (10) for $t$ steps in Lemma 5; we delay the proof to Appendix B.\r\nLemma 5. Let $f(x)$ be defined as in (10), $x_1 = x_0$ and $x_t$ follow the momentum update (1) with stochastic gradients $\\nabla f_{S_t}(x_{t-1})$ for $t \\ge 2$. Let $e_1 = [1, 0]^\\top$ and $f_1 = [1, 0, 0]^\\top$. The expectation of squared distance to the optimum $x^*$ is\r\n$$\\mathbb{E}(x_{t+1} - x^*)^2 = \\left(e_1^\\top A^t [x_1 - x^*, x_0 - x^*]^\\top\\right)^2 + \\alpha^2 C f_1^\\top (I - B^t)(I - B)^{-1} f_1, \\tag{11}$$\r\nwhere the first and second term correspond to squared bias and variance, and their corresponding momentum dynamics\r\n                       0.8                                    3                                   \u22121\r\n                                                            10                                  10\r\n                                                              2\r\n                       0.7                                  10\r\n                                                              1                               value\r\n                       0.6                                  10                                    \u22122\r\n                                                              0                                 10\r\n                       0.5                               optimum10\r\n                                                             \u22121                               \ufb01nal\r\n                      )                                    10\r\n                      x0.4                                   \u22122                                   \u22123\r\n                      (                                    10                                   10\r\n                      f                                  from\u22123\r\n                       0.3                                 10                                 from\r\n                                                             \u22124\r\n                       0.2                                 10                                     \u22124\r\n                                                             \u22125                                 10\r\n                                                           10\r\n                      
 0.1                                   \u22126                                                 \u00b5=0.9                        \u00b5=0.99\r\n                                                         Distance10\r\n                       0.0                                   \u22127                               Distance\u22125\r\n                                                           10                                   10\r\n                         \u221220\u221215\u221210\u22125 0 5 10 15 20              0   100 200 300 400 500              0   50 100 150 200 250 300    0  50 100 150 200 250 300\r\n                                      x                                Iterations                           Iterations                   Iterations\r\n                                  (a)                                  (b)                                  (c)                              (d)\r\n                 Figure 3. (a) Non-convex toy example; (b) linear convergence rate achieved empirically on the example in (a) tuned according to (9);\r\n                 (c,d) LSTM on MNIST: as momentum increases from 0.9 to 0.99, the global learning rate and momentum falls in robust regions of more\r\n                                                                                                                           \u221a\r\n                 modelvariables. The convergence behavior (shown in grey) of these variables follow the robust rate          \u00b5(showninred).\r\n                 are captured by operators                                                 SINGLESTEP is a multidimensional SGD version of the\r\n                                       \u0014                      \u0015                            noiseless tuning rule in (9). 
We \ufb01rst generalize (9) and (14)\r\n                                 A= 1\u2212\u03b1h+\u00b5 \u2212\u00b5 ,                                            to multidimensional cases, and then discuss the rule SIN-\r\n                                               1           0                               GLESTEP as well as its implementation in Algorithm 1.\r\n                         \uf8ee                  2     2                         \uf8f9 (12)\r\n                           (1 \u2212\u03b1h+\u00b5)            \u00b5     \u22122\u00b5(1\u2212\u03b1h+\u00b5)                          Asdiscussed in Section 2.2, GCN \u03bd captures the dynamic\r\n                   B=\uf8f0             1             0              0           \uf8fb.             range of generalized curvatures in a one-dimensional ob-\r\n                             1\u2212\u03b1h+\u00b5              0             \u2212\u00b5                          jective with varying curvature. The consequent robust re-\r\n                 Eventhoughit is possible to numerically work on (11) di-                  gion described by (9) implies homogeneous spectral radii.\r\n                 rectly, we use a scalar, asymptotic surrogate in (13) based               On a multidimensional non-convex objective, each one-\r\n                 on the spectral radii of operators to simplify analysis and               dimensional slice passing a minimum x\u2217 can have varying\r\n                 expose insights. This decision is supported by our \ufb01nd-                   curvature. As we use a single \u00b5 and \u03b1 for the entire model,\r\n                 ings in Section 2: the spectral radii can capture empirical               if \u03bd simultaneously captures the dynamic range of general-\r\n                 convergence rate.                                                         
ized curvature over all these slices, \u00b5 and \u03b1 in (9) are in the\r\n                                                                                           robust region for all these slices. This implies homogeneous\r\n                                      \u2217 2                                                                  \u221a\r\n                        E(xt+1 \u2212x )                                                        spectral radii    \u00b5according to Lemma 3, empirically facili-\r\n                                                                   \u03b12C           (13)      tating convergence at a common rate along all the directions.\r\n                              2t            2                 t                                                                     \u221a\r\n                     \u2248\u03c1(A) (x \u2212x ) +(1\u2212\u03c1(B) )\r\n                                   0      \u2217                     1\u2212\u03c1(B)                     Given homogeneous spectral radii           \u00b5alongall directions,\r\n                                                                                           the surrogate in (14) generalizes on the local quadratic ap-\r\n                 Oneofourdesigndecisions for YELLOWFIN is to always                        proximation of multiple dimensional objectives. On this ap-\r\n                                                                                                                              \u2217\r\n                 work in the robust region of Lemma 3. We know that this                   proximation with minimum x , the expectation of squared\r\n                                              \u221a                                            distance to x\u2217, Ekx \u2212 x\u2217k2, decomposes into indepen-\r\n                 implies a spectral radius      \u00b5ofthemomentumoperator,A,                                          0\r\n                 for the bias. 
Lemma 6, as proved in Appendix C, shows that                dent scalar components along the eigenvectors of the Hes-\r\n                 under the exact same condition, the variance operator B has               sian. We de\ufb01ne gradient variance C as the sum of gradient\r\n                 spectral radius \u00b5.                                                        variance along these eigenvectors. The one-dimensional\r\n                 Lemma6. Thespectralradiusofthevariance operator, B                        surrogates in (14) for the independent components sum to\r\n                               \u221a                        \u221a                                    t         \u2217 2           t   2\r\n                                     2                       2                             \u00b5 kx \u2212x k +(1\u2212\u00b5 )\u03b1 C/(1\u2212\u00b5),themultidimensional\r\n                 is \u00b5, if (1 \u2212    \u00b5) \u2264\u03b1h\u2264(1+ \u00b5) .                                                0\r\n                                                                                           surrogate corresponding to the one in (14).\r\n                 Asaresult, the surrogate objective of (13), takes the follow-             Algorithm 1 YELLOWFIN\r\n                 ing form in the robust region.                                               
function YELLOWFIN(gradient g , \u03b2)\r\n                                                                                                                                     t\r\n                                                                        \u03b12C                       h     , h     \u2190CURVATURERANGE(g ,\u03b2)\r\n                                 \u2217 2       t         \u2217 2             t                              max    min                                 t\r\n                   E(xt+1 \u2212x ) \u2248 \u00b5 (x0 \u2212x ) +(1\u2212\u00b5 )1\u2212\u00b5 (14)                                       C\u2190VARIANCE(g ,\u03b2)\r\n                                                                                                                         t\r\n                                                                                                  D\u2190DISTANCE(g ,\u03b2)\r\n                 Weextend this surrogate to multidimensional cases to ex-                                                t\r\n                                                                                                  \u00b5 ,\u03b1 \u2190SINGLESTEP(C,D,h                    , h    )\r\n                 tract a noisy tuning rule for YELLOWFIN.                                           t   t                               max    min\r\n                                                                                                  return \u00b5 ,\u03b1\r\n                                                                                                             t   t\r\n                 3.1   Tuningrule                                                             endfunction\r\n                 In this section, we present SINGLESTEP, the tuning rule of                Let D be an estimate of the current model\u2019s distance to a\r\n                 YellowFin (Algorithm 1). 
Based on the surrogate in (14),                  local quadratic approximation\u2019s minimum, and C denote an\r\n                                                    YELLOWFINandtheArtofMomentumTuning\r\n              Algorithm 2 Curvature range                  Algorithm 3 Gradient variance         Algorithm 4 Distance to opt.\r\n                state: h    , h  , h ,\u2200i \u2208 {1,2,3,...}               2\r\n                        max   min   i                        state: g \u2190 0, g \u2190 0                    state: kgk \u2190 0, h \u2190 0\r\n                function CURVATURERANGE(gradient gt, \u03b2)                                             function DISTANCE(gradient g , \u03b2)\r\n                    h \u2190kg k2                                 function VARIANCE(gradient gt, \u03b2)                                  t\r\n                     t      t                                                                          kgk \u2190\u03b2\u00b7kgk+(1\u2212\u03b2)\u00b7kg k\r\n                    h     \u2190 max h ,h         \u2190 min h              2        2                                                      t\r\n                     max,t  t\u2212w\u2264i\u2264t i   min,t  t\u2212w\u2264i\u2264t i         g \u2190\u03b2\u00b7g +(1\u2212\u03b2)\u00b7gt\u2299gt                                            2\r\n                    h    \u2190\u03b2\u00b7h       +(1\u2212\u03b2)\u00b7h                     g \u2190\u03b2\u00b7g+(1\u2212\u03b2)\u00b7g                        h\u2190\u03b2\u00b7h+(1\u2212\u03b2)\u00b7kgtk\r\n                     max        max              max,t                     \u0010       \u0011 t                 D\u2190\u03b2\u00b7D+(1\u2212\u03b2)\u00b7kgk/h\r\n                    h    \u2190\u03b2\u00b7h +(1\u2212\u03b2)\u00b7h\r\n                     min        min             min,t                    T   2    2\r\n                    return h   , h                               return 1 \u00b7 g \u2212g                     
  return D\r\n                            max   min\r\n                endfunction                                  endfunction                            endfunction\r\n              estimate for gradient variance. SINGLESTEP minimizes the      Fisher information matrix\u2014i.e. the expected outer prod-\r\n              multidimensional surrogate after a single step (i.e. t = 1)   uct of noisy gradients\u2014approximates the Hessian of the\r\n              whileensuring\u00b5and\u03b1intherobustregionforalldirections.          objective (Duchi, 2016; Pascanu & Bengio, 2013). This\r\n              Asingleinstanceof SINGLESTEPsolvesasinglemomentum             allows for measurements purely being approximated from\r\n              and learning rate for the entire model at each iteration.     minibatch gradients with overhead linear to model dimen-\r\n              Speci\ufb01cally, the extremal curvatures h   andh      denote     sionality. These implementations are not guaranteed to give\r\n                                                   min       max\r\n              estimates for the largest and smallest generalized curvature  accurate measurements. Nonetheless, their use in our ex-\r\n              respectively. They are meant to capture both generalized      periments in Section 5 shows that they are suf\ufb01cient for\r\n              curvature variation along all different directions (like the  YELLOWFINtooutperformthestateoftheartonavariety\r\n              classic condition number) and also variation that occurs      of objectives. We also refer to Appendix D for details on\r\n              as the landscape evolves. 
The constraints keep the global     zero-debias (Kingma & Ba, 2014), slow start (Schaul et al.,\r\n              learning rate and momentum in the robust region (de\ufb01ned       2013) and smoothing for curvature range estimation.\r\n              in Lemma3)forslices along all directions.\r\n              The problem in (15)               (SINGLESTEP)                Curvature range     Let gt be a noisy gradient, we estimate\r\n              does not need iterative                        2    2         the curvatures range in Algorithm 2. We notice that the\r\n                                        \u00b5 ,\u03b1 =argmin\u00b5D +\u03b1 C                                   T                                 2\r\n                                         t   t                              outer product g g    has an eigenvalue h    = kg k with\r\n              solver but has an analyt-               \u00b5                                     t t                       t      t\r\n                                                  p                  !      eigenvector g . Thus under our negative log-likelihood as-\r\n              ical solution. Substitut-                                2                  t\r\n                                                     h     /h    \u22121\r\n              ing only the second con-  s.t. \u00b5 \u2265   p max min                sumption, we use ht to approximate the curvature of Hes-\r\n                                                     h     /h    +1         sian along gradient direction gt. Note here we use empirical\r\n              straint, the objective be-               max   min\r\n                               2  2                   \u221a                     Fisher g gT instead of Fisher information matrix. 
Empirical\r\n              comes p(x) = x D +                 (1\u2212 \u00b5)2                            t t\r\n                      4   2                 \u03b1=                              Fisher is typically used in practical natural gradient meth-\r\n              (1 \u2212 x) /h      C with\r\n                          min                       h\r\n              x = \u221a\u00b5 \u2208 [0,1). By                     min          (15)      ods (Martens, 2014; Roux et al., 2008; Duchi et al., 2011).\r\n              setting the gradient of p(x) to 0, we can get a cubic equa-   For practically ef\ufb01cient measurement, we use the empirical\r\n                                   \u221a                                        Fisher as a coarse proxy of Fisher information matrix which\r\n              tion whose root x =    \u00b5p can be computed in closed form      approximates the Hessian of the objective. Speci\ufb01cally in\r\n              using Vieta\u2019s substitution. As p(x) is uni-modal in [0,1),    Algorithm 2, we maintain h      and h     as running aver-\r\n              the optimizer for (15) is exactly the maximum of \u00b5p and                                   min       max\r\n               p                  2  p                  2                   ages of extreme curvature hmin,t and hmax,t, from a sliding\r\n              (  h    /h     \u22121) /( h        /h    +1) ,theright hand-\r\n                   max   min             max   min                                               3\r\n              side of the \ufb01rst constraint in (15).                          window of width 20 . As gradient directions evolve, we\r\n                                                                            estimate curvatures along different directions. 
Thus h\r\n                                                                                                                                   min\r\n              YELLOWFIN uses functions CURVATURERANGE, VARI-                and hmax capture the curvature variations.\r\n              ANCEandDISTANCEtomeasurequantitieshmax,hmin,C\r\n              and D respectively. These functions can be designed in        Gradient variance     To estimate the gradient variance in\r\n              different ways. We present the implementations used in our    Algorithm3,weuserunningaveragesg andg2 tokeeptrack\r\n              experiments, based completely on gradients, in Section 3.2.   of gt and gt \u2299 gt, the \ufb01rst and second order moment of the\r\n                                                                            gradient. As Var(g ) = Eg2 \u2212 Eg \u2299Eg , we estimate the\r\n                                                                                               t       t      t      t\r\n              3.2   Measurementfunctionsin YELLOWFIN                        gradient variance C in (15) using C = 1T\u00b7 (g2 \u2212 g2).\r\n              This section describes our implementation of the measure-     Distance to optimum      Weestimate the distance to the op-\r\n              mentoracles used by YELLOWFIN: CURVATURERANGE,                timumofthelocal quadratic approximation in Algorithm 4.\r\n              VARIANCE, and DISTANCE. We design the measurement             Inspired by the fact that k\u2207f(x)k \u2264 kHkkx \u2212 x\u22c6k for a\r\n              functions with the assumption of a negative log-probability   quadratic f(x) with Hessian H andminimizerx\u2217,wemain-\r\n              objective; this is in line with typical losses in machine learn-\r\n              ing, e.g. cross-entropy for neural nets and maximum like-        3We use window width 20 across all the models and experi-\r\n              lihood estimation in general. 
Under this assumption, the      ments in our paper. We refer to Section 5 for details on selecting\r\n                                                                            the window width\r\n                                                            YELLOWFINandtheArtofMomentumTuning\r\n                     105                              102                               the maximumnormh              . In Figure 4, we demonstrate the\r\n                               Without clipping                Without clipping                                  max\r\n                     103       Withclipping                    Withclipping             mechanismofourheuristic by presenting an example of an\r\n                   norm                             loss101\r\n                               Clipping thresh.                                         LSTMthat exhibits \u2019exploding gradients\u2019. The proposed\r\n                     100                                0\r\n                                                      10                                adaptive clipping can stabilize the training process using\r\n                   Gradient                         Training                            YELLOWFINandpreventlargecatastrophic loss spikes.\r\n                    10\u22123                             10\u22121\r\n                       0k      1k      2k      3k       0k      1k      2k      3k      Wevalidate the proposed          Table 1. German-English trans-\r\n                                Iterations                      Iterations              adaptive clipping on the         lation validation metrics using\r\n                Figure 4. A variation of the LSTM architecture in (Zhu et al., 2016)    convolutional sequence to        convolutional seq-to-seq model.\r\n                exhibits exploding gradients. 
The proposed adaptive gradient            sequence learning model                             Loss BLEU4\r\n                clipping threshold (blue) stabilizes the training loss.                 (Gehring et al., 2017)\r\n                                                                                        for IWSLT 2014 German-           Default w/o clip.      diverge\r\n                        7                                                               English translation. The          Default w/ clip.   2.86  30.75\r\n                               YFwithclipping                   YFwithclipping\r\n                     loss6     YFwithoutclipping      loss100   YFwithoutclipping       default optimizer (Gehring              YF           2.75  31.59\r\n                        5                                                               et al., 2017) uses learning\r\n                                                                                        rate 0.25 and Nesterov\u2019s momentum 0.99, diverging to\r\n                     Training4                        Training                          loss over\ufb02ow due to \u2019exploding gradient\u2019. It requires, as\r\n                      3.5                              10\u22121                             in Gehring et al. (2017), strict manually set gradient norm\r\n                        0k 5k 10k 15k 20k 25k 30k         0k   10k  20k   30k  40k      threshold 0.1 to stabilize. In Table 1, we can see YellowFin,\r\n                                Iterations                       Iterations\r\n                Figure 5. Training losses on PTB LSTM (left) and CIFAR10                with adaptive clipping, outperforms the default optimizer\r\n                ResNet (right) for YellowFin with and without adaptive clipping.        using manually set clipping, with 0.84 higher validation\r\n                                                                                        BLEU4after120epochs. 
Tofurther demonstrate the prac-\r\n                                                                                        tical applicability of our gradient clipping heuristics, in\r\n                tain h and kgk as running mean of curvature h and gradient\r\n                                                                    t                   Figure 5, we demonstrate that the adaptive clipping does not\r\n                normkgtk; the distance is approximated with kgk/h.                      hurt performance on models that do not exhibit instabilities\r\n                                                                                        without clipping. Speci\ufb01cally, for both PTB LSTM and CI-\r\n                3.3    Stability on non-smooth objectives                               FAR10ResNet,thedifference between YELLOWFIN with\r\n                Theprocess of training neural networks is inherently non-               and without adaptive clipping diminishes quickly.\r\n                stationary, with the landscape abruptly switching from \ufb02at\r\n                to steep areas. In particular, the objective functions of               4     CLOSED-LOOP YELLOWFIN\r\n                RNNs with hidden units can exhibit occasional but very                  Asynchrony is a parallelization technique that avoids syn-\r\n                steep slopes (Pascanu et al., 2013; Szegedy et al., 2013). To           chronization barriers (Niu et al., 2011). In this section, we\r\n                deal with this issue, gradient clipping has been established            propose a closed momentum loop variant of YELLOWFIN\r\n                in literature as a standard tool to stabilize the training using        to accelerate convergence in asynchronous training. After\r\n                such objectives (Pascanu et al., 2013; Goodfellow et al.,               somepreliminaries, we show the mechanism of the exten-\r\n                2016; Gehring et al., 2017).           
                                 sion: it measures the dynamics on a running system and\r\n                Weuseadaptivegradient clipping heuristics as a very natu-               controls momentum with a negative feedback loop.\r\n                ral addition to our basic tuner. However, the classic tradeoff\r\n                between adaptivity and stability applies: setting a clipping\r\n                threshold that is too low can hurt performance; setting it              Preliminaries       Whentraining on M asynchronous work-\r\n                to be high, can compromise stability. YELLOWFIN, keeps                  ers, staleness (the number of model updates between a\r\n                running estimates of extremal gradient magnitude squares,               worker\u2019sreadandwriteoperations)isonaverage\u03c4 = M\u22121,\r\n                h      and h       in order to estimate a generalized condition         i.e., the gradient in the SGD update is delayed by \u03c4 itera-\r\n                  max         min           \u221a\r\n                number. We posit that         h       is an ideal gradient norm         tions as \u2207f         (x    ).  Asynchrony yields faster steps,\r\n                                                max                                                   St\u2212\u03c4    t\u2212\u03c4\r\n                threshold for adaptive clipping. In order to ensure robust-             but can increase the number of iterations to achieve the\r\n                ness to extreme gradient spikes, like the ones in Figure 4,             samesolution, a tradeoff between hardware and statistical\r\n                                                                                                                  \u00b4\r\n                wealso limit the growth rate of the envelope h            in Algo-      ef\ufb01ciency (Zhang & Re, 2014). Mitliagkas et al. 
(2016) in-\r\n                                                                     max\r\n                rithm 2 as follows:                                                     terpret asynchrony as added momentum dynamics. Experi-\r\n                                                                                        mentsinHadjisetal.(2016)supportthis\ufb01nding,anddemon-\r\n                 h      \u2190\u03b2\u00b7h           +(1\u2212\u03b2)\u00b7min{h                , 100 \u00b7 h     }      strate that reducing algorithmic momentum can compensate\r\n                   max            max                        max,t          max\r\n                                                                               (16)     for asynchrony-induced momentumandsigni\ufb01cantlyreduce\r\n                Our heuristics follows along the lines of classic recipes               the number of iterations for convergence. Motivated by that\r\n                like (Pascanu et al., 2013). However, instead of using the              result, we use the model in (17), where the total momen-\r\n                average gradient norm to clip, it uses a running estimate of            tum, \u00b5T, includes both asynchrony-induced and algorithmic\r\n                                                       YELLOWFINandtheArtofMomentumTuning\r\n                        0.8                                    0.8                                   0.8\r\n                        0.6                 Total mom.         0.6          Asynchrony               0.6             Total mom.\r\n                     m                      Target mom.                  
-induced momentum                           Target mom.\r\n                     u  0.4                                    0.4                                   0.4\r\n                     t                                                                                               Algorithmic mom.\r\n                     en 0.2                                    0.2                                   0.2\r\n                     om                                                            Total mom.\r\n                     M  0.0                                    0.0                                   0.0\r\n                      !0.2                                   !0.2                  Target mom.      !0.2\r\n                          0k 5k 10k 15k 20k 25k 30k 35k 40k      0k 5k 10k 15k 20k 25k 30k 35k 40k     0k 5k 10k 15k 20k 25k 30k 35k 40k\r\n                                     Iterations                             Iterations                             Iterations\r\n               Figure 6. When running YELLOWFIN, total momentum \u00b5\u02c6 equals algorithmic value in synchronous settings (left); \u00b5\u02c6 is greater than\r\n                                                                       t                                                       t\r\n               algorithmic value on 16 asynchronous workers (middle). Closed-loop YELLOWFIN automatically lowers algorithmic momentum and\r\n               brings total momentum to match the target value (right). Red dots are total momentum estimates, \u00b5\u02c6 , at each iteration. The solid red line\r\n                                                                                                           T\r\n               is a running average of \u00b5\u02c6 .\r\n                                      T\r\n               momentum,\u00b5,in(1).                                                 
$$\mathbb{E}[x_{t+1} - x_t] = \mu_T\, \mathbb{E}[x_t - x_{t-1}] - \alpha\, \mathbb{E}\nabla f(x_t) \qquad (17)$$

We will use this expression to design an estimator for the value of total momentum, $\hat{\mu}_T$. This estimator is a basic building block of closed-loop YELLOWFIN; it removes the need to manually compensate for the effects of asynchrony.

Measuring the momentum dynamics. Closed-loop YELLOWFIN estimates total momentum $\mu_T$ on a running system and uses a negative feedback loop to adjust algorithmic momentum accordingly. Equation (18) gives an estimate of $\hat{\mu}_T$ on a system with staleness $\tau$, based on (17).

$$\hat{\mu}_T = \operatorname{median}\left(\frac{x_{t-\tau} - x_{t-\tau-1} + \alpha \nabla_{S_{t-\tau-1}} f(x_{t-\tau-1})}{x_{t-\tau-1} - x_{t-\tau-2}}\right) \qquad (18)$$

We use $\tau$-stale model values to match the staleness of the gradient, and perform element-wise operations. This way we get a total momentum measurement from each variable; the median combines them into a more robust estimate.

Closing the asynchrony loop. Given a reliable measurement of $\mu_T$, we can use it to adjust the value of algorithmic momentum so that the total momentum matches the target momentum as decided by YELLOWFIN in Algorithm 1. Closed-loop YELLOWFIN in Algorithm 5 uses a simple negative feedback loop to achieve the adjustment.

Algorithm 5 Closed-loop YELLOWFIN
1: Input: $\mu \leftarrow 0$, $\alpha \leftarrow 0.0001$, $\gamma \leftarrow 0.01$, $\tau$ (staleness)
2: for $t \leftarrow 1$ to $T$ do
3:     $x_t \leftarrow x_{t-1} + \mu (x_{t-1} - x_{t-2}) - \alpha \nabla_{S_t} f(x_{t-\tau-1})$
4:     $\mu^*, \alpha \leftarrow \text{YELLOWFIN}(\nabla_{S_t} f(x_{t-\tau-1}), \beta)$
5:     $\hat{\mu}_T \leftarrow \operatorname{median}\left(\frac{x_{t-\tau} - x_{t-\tau-1} + \alpha \nabla_{S_{t-\tau-1}} f(x_{t-\tau-1})}{x_{t-\tau-1} - x_{t-\tau-2}}\right)$    ▷ Measuring total momentum
6:     $\mu \leftarrow \mu + \gamma \cdot (\mu^* - \hat{\mu}_T)$    ▷ Closing the loop
7: end for

5 EXPERIMENTS

We empirically validate the importance of momentum tuning and evaluate YELLOWFIN in both synchronous (single-node) and asynchronous settings. In synchronous settings, we first demonstrate that, with hand-tuning, momentum SGD is competitive with Adam, a state-of-the-art adaptive method. Then, we evaluate YELLOWFIN without any hand tuning in comparison to hand-tuned Adam and momentum SGD. In asynchronous settings, we show that closed-loop YELLOWFIN accelerates with momentum closed-loop control, significantly outperforming Adam.

We evaluate on convolutional neural networks (CNN) and recurrent neural networks (RNN). For CNN, we train ResNet (He et al., 2016) for image recognition on CIFAR10 and CIFAR100 (Krizhevsky et al., 2014). For RNN, we train LSTMs for character-level language modeling with the TinyShakespeare (TS) dataset (Karpathy et al., 2015), word-level language modeling with the Penn TreeBank (PTB) (Marcus et al., 1993), and constituency parsing on the Wall Street Journal (WSJ) dataset (Choe & Charniak). We refer to Table 3 in Appendix E for model specifications. To eliminate influences of a specific random seed, in our synchronous and asynchronous experiments, the training loss and validation metrics are averaged from 3 runs using different random seeds. Across all experiments on the eight models, we use sliding window width 20 for estimating the extreme curvature $h_{\max}$ and $h_{\min}$ in Algorithm 2. It is selected based on the performance on the PTB LSTM and CIFAR10 ResNet models. The selected sliding window width is directly applied to the other 6 models, including the convolutional sequence to sequence model in Section 3.3, as well as the ResNext and Tied LSTM in Appendix G.3.

5.1 Synchronous experiments

We tune Adam and momentum SGD on learning rate grids with prescribed momentum 0.9 for SGD.
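The closed-loop controller of Section 4 (Equation (18) and Algorithm 5) can be illustrated with a small sketch. The Python/NumPy code below is only a toy under stated assumptions: a noiseless diagonal quadratic replaces the training objective, and the helper names `estimate_total_momentum`, `closed_loop_update` and `run_momentum_gd` are hypothetical, not taken from the authors' implementation. In the synchronous case ($\tau = 0$) the element-wise ratio of Equation (18) recovers the algorithmic momentum, which is the behaviour the left panel of Figure 6 reports.

```python
import numpy as np

def estimate_total_momentum(x_new, x_mid, x_old, grad_mid, lr):
    """Eq. (18): element-wise ratios, combined with the median.

    x_new, x_mid, x_old play the roles of x_{t-tau}, x_{t-tau-1}, x_{t-tau-2};
    grad_mid is the gradient that was evaluated at x_mid.
    """
    ratios = (x_new - x_mid + lr * grad_mid) / (x_mid - x_old)
    return float(np.median(ratios))

def closed_loop_update(mu, mu_target, mu_hat, gamma=0.01):
    """Negative-feedback step of Algorithm 5: mu <- mu + gamma * (mu* - mu_hat)."""
    return mu + gamma * (mu_target - mu_hat)

def run_momentum_gd(h, x0, mu, lr, tau, steps):
    """Momentum updates on f(x) = 0.5 * sum(h * x**2) with tau-stale gradients.

    Hypothetical toy stand-in for a training system; returns all iterates,
    with the start point duplicated so that the first velocity is zero.
    """
    xs = [x0.copy(), x0.copy()]
    for _ in range(steps):
        stale = xs[max(len(xs) - 1 - tau, 0)]      # x_{t-tau-1}
        xs.append(xs[-1] + mu * (xs[-1] - xs[-2]) - lr * h * stale)
    return xs

h = np.array([1.0, 2.0, 3.0])                      # assumed curvatures
xs = run_momentum_gd(h, np.array([1.0, -2.0, 3.0]), mu=0.5, lr=0.1, tau=0, steps=5)

# Synchronous case: the measurement matches the algorithmic momentum.
mu_hat = estimate_total_momentum(xs[-1], xs[-2], xs[-3], h * xs[-2], lr=0.1)
print(mu_hat)

# One feedback step towards a (hypothetical) target momentum of 0.4.
mu_next = closed_loop_update(mu=0.5, mu_target=0.4, mu_hat=mu_hat)
print(mu_next)
```

Calling `run_momentum_gd` with `tau > 0` gives a simple way to see that the measured value generally no longer equals the algorithmic momentum; repeated calls to `closed_loop_update` then move $\mu$ in the direction that shrinks the gap between the measurement and the target.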
We fix the parameters of Algorithm 1 in all experiments, i.e. YELLOWFIN runs without any hand tuning. We provide full specifications, including the learning rate (grid) and the number of iterations we train on each model in Appendix F. For visualization purposes, we smooth training losses with a uniform window of width 1000. For Adam and momentum SGD on each model, we pick the configuration achieving the lowest averaged smoothed loss. To compare two algorithms, we record the lowest smoothed loss achieved by both. Then the speedup is reported as the ratio of iterations to achieve this loss. We use this setup to validate our claims.

Table 2. The speedup of YELLOWFIN and tuned momentum SGD over tuned Adam on ResNet and LSTM models.

            CIFAR10   CIFAR100   PTB     TS      WSJ
Adam        1x        1x         1x      1x      1x
mom. SGD    1.71x     1.87x      0.88x   2.49x   1.33x
YF          1.93x     1.38x      0.77x   3.28x   2.33x

Momentum SGD is competitive with adaptive methods. In Table 2, we compare tuned momentum SGD and tuned Adam on ResNets with training losses shown in Figure 9 in Appendix D. We can observe that momentum SGD achieves 1.71x and 1.87x speedup over tuned Adam on CIFAR10 and CIFAR100 respectively. In Figure 8 and Table 2, with the exception of PTB LSTM, momentum SGD also produces better training loss, as well as better validation perplexity in language modeling and validation F1 in parsing. For the parsing task, we also compare with tuned Vanilla SGD and AdaGrad, which are used in the NLP community. Figure 8 (right) shows that fixed momentum 0.9 can already speed up Vanilla SGD by 2.73x, achieving better validation F1.

YELLOWFIN can match hand-tuned momentum SGD and can outperform hand-tuned Adam. In our experiments, YELLOWFIN, without any hand-tuning, yields training loss matching hand-tuned momentum SGD for all the ResNet and LSTM models in Figures 8 and 9 (Appendix D). When comparing to tuned Adam in Table 2, except being slightly slower on PTB LSTM, YELLOWFIN achieves 1.38x to 3.28x speedups in training losses on the other four models. More importantly, YELLOWFIN consistently shows better validation metrics than tuned Adam in Figure 8. It demonstrates that YELLOWFIN can match tuned momentum SGD and outperform tuned state-of-the-art adaptive optimizers. In Appendix G.3, we show YELLOWFIN further speeding up with finer-grain manual learning rate tuning.

[Figure 7: three panels (left, middle, right) of training loss against iterations. Legend entries: YellowFin, YF mom.=0.0, YF mom.=0.9, YF rescaling, Momentum SGD, Vanilla SGD.]

Figure 7. The importance of adaptive momentum: The training loss comparison between YELLOWFIN with adaptive momentum and YELLOWFIN with fixed momentum values; this comparison is conducted on TS LSTM (left) and CIFAR100 ResNet (middle). Learning rate scaling based on YELLOWFIN tuned momentum can match the performance of full YELLOWFIN on the TS LSTM (right). However, without the YELLOWFIN tuned momentum, hand-tuned Vanilla SGD demonstrates observably larger training loss than momentum based methods, including full YELLOWFIN, YELLOWFIN learning rate rescaling and hand-tuned momentum SGD (with the same learning rate search grid as with Vanilla SGD).

Importance of adaptive momentum in YELLOWFIN. In Definition 4, we noticed that the optimally tuned $\mu^*$ is highly objective-dependent. Empirically, we indeed observe the momentum values chosen by YF range from smaller than 0.03 in the PTB LSTM to 0.89 for ResNext. We perform an ablation study to validate the importance of objective-dependent momentum adaptivity of YELLOWFIN on CIFAR100 ResNet and TS LSTM. In the experiments, YELLOWFIN tunes the learning rate. Instead of also using the momentum tuned by YF, we continuously feed objective-agnostic prescribed momentum values 0.0 and 0.9 to the underlying momentum SGD optimizer which YF is tuning. In Figure 7, when comparing to YELLOWFIN with prescribed momentum 0.0 or 0.9, YELLOWFIN with adaptively tuned momentum achieves observably faster convergence on both TS LSTM and CIFAR100 ResNet.

In Figure 8 (bottom right) and Figure 7 (right), we also observe that hand-tuned vanilla SGD typically does not match the performance of momentum based methods (including YELLOWFIN and momentum SGD hand-tuned using the same learning rate grid as with vanilla SGD). However, we can rescale the learning rate based on the YELLOWFIN tuned momentum $\mu_t$, and use 0 momentum in the model updates to match the performance of momentum based methods.
Specifically, we rescale the YELLOWFIN tuned learning rate $\alpha_t$ with $1/(1 - \mu_t)$ (see footnote 4). Model updates with this rescaled learning rate and 0 momentum can demonstrate training loss closely matching those of YELLOWFIN and hand-tuned momentum SGD for WSJ LSTM in Figure 8 (bottom right) and TS LSTM in Figure 7 (right).

[Figure 8: top row, training loss against iterations; bottom row, validation perplexity (left, middle) and validation F1 (right) against iterations. Legend entries: Momentum SGD, Adam, YellowFin, plus Vanilla SGD and Adagrad in the right-hand panels.]

Figure 8. Training loss and validation metrics on (left to right) word-level language modeling with PTB, char-level language modeling with TS and constituency parsing on WSJ. The validation metrics are monotonic as we report the best values up to each number of iterations.

5.2 Asynchronous experiments

In this section, we evaluate closed-loop YELLOWFIN with focus on the number of iterations to reach a certain solution. To that end, we run 16 asynchronous workers on a single machine and force them to update the model in a round-robin fashion, i.e. the gradient is delayed for 15 iterations. Figure 1 (right) presents training losses on the CIFAR100 ResNet, using YELLOWFIN in Algorithm 1, closed-loop YELLOWFIN in Algorithm 5 and Adam with the learning rate achieving the best smoothed loss in Section 5.1. We can observe closed-loop YELLOWFIN achieves 20.1x speedup over YELLOWFIN, and consequently a 2.69x speedup over Adam. This demonstrates that closed-loop YELLOWFIN (1) accelerates by reducing algorithmic momentum to compensate for asynchrony and (2) can converge in fewer iterations than Adam in asynchronous-parallel training.

6 RELATED WORK

Many techniques have been proposed on tuning hyperparameters for optimizers. General hyperparameter tuning approaches, such as random search (Bergstra & Bengio, 2012) and Bayesian approaches (Snoek et al., 2012; Hutter et al., 2011), can directly tune optimizers. As another trend, adaptive methods, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2014), use per-dimension learning rates. Schaul et al. (2013) use a noisy quadratic model similar to ours to tune the learning rate in Vanilla SGD. However, they do not use momentum, which is essential in training modern neural nets. Existing adaptive momentum approaches either consider the deterministic setting (Graepel & Schraudolph, 2002; Rehman & Nawi, 2011; Hameed et al., 2016; Swanston et al., 1994; Ampazis & Perantonis, 2000; Qiu et al., 1992) or only analyze stochasticity with $O(1/t)$ learning rate (Leen & Orr, 1994). In contrast, we aim at practical momentum adaptivity for stochastically training neural nets.

7 DISCUSSION

We presented YELLOWFIN, the first optimization method that automatically tunes momentum as well as the learning rate of momentum SGD. YELLOWFIN outperforms the state-of-the-art adaptive optimizers on a large class of models both in synchronous and asynchronous settings. It estimates statistics purely from the gradients of a running system, and then tunes the hyperparameters of momentum SGD based on noisy, local quadratic approximations. As future work, we believe that more accurate curvature estimation methods, like the bbprop method (Martens et al., 2012), can further improve YELLOWFIN.
We also believe that our closed-loop momentum control mechanism in Section 4 could accelerate other adaptive methods in asynchronous-parallel settings.

Footnote 4: Let $v_t = x_t - x_{t-1}$ be the model update; this rescaling is motivated by the fact that $v_{t+1} = \mu_t v_t - \alpha_t \nabla f(x_t)$. Assuming that $v_t$ evolves smoothly, we have $v_t \approx -\alpha_t / (1 - \mu_t)\, \nabla f(x_t)$.

ACKNOWLEDGEMENTS

We are grateful to Christopher Ré for his valuable guidance and support. We thank Bryan He, Paroma Varma, Chris De Sa, Tri Dao, Albert Gu, Fred Sala, Alex Ratner, Theodoros Rekatsinas, Olexa Bilaniuk and Avner May for helpful discussions and feedback. We gratefully acknowledge the support of the D3M program under No. FA8750-17-2-0095, the FRQNT new researcher program (2019-NC-257943), a grant by IVADO and a Canada CIFAR AI chair. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, or the Canadian or U.S. governments.

REFERENCES

Ampazis, N. and Perantonis, S. J. Levenberg-marquardt algorithm with adaptive momentum for the efficient training of feedforward networks. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 1, pp. 126–131. IEEE, 2000.
Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pp. 437–478. Springer, 2012.
Bengio, Y. et al. Deep learning of representations for unsupervised and transfer learning. ICML Unsupervised and Transfer Learning, 27:17–36, 2012.
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
Bottou, L. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Springer, 2012.
Chen, D., Bolton, J., and Manning, C. D. A thorough examination of the cnn/daily mail reading comprehension task. arXiv preprint arXiv:1606.02858, 2016.
Choe, D. K. and Charniak, E. Parsing as language modeling.
Duchi, J. Fisher information., 2016. URL https://web.stanford.edu/class/stats311/Lectures/lec-09.pdf.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
Foucart, S. University Lecture, 2012. URL http://www.math.drexel.edu/~foucart/TeachingFiles/F12/M504Lect6.pdf.
Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
Graepel, T. and Schraudolph, N. N. Stable adaptive momentum for rapid online learning in nonlinear systems. In International Conference on Artificial Neural Networks, pp. 450–455. Springer, 2002.
Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., and Ré, C. Omnivore: An optimizer for multi-device deep learning on cpus and gpus. arXiv preprint arXiv:1606.04487, 2016.
Hameed, A. A., Karlik, B., and Salman, M. S. Back-propagation algorithm with variable adaptive momentum. Knowledge-Based Systems, 114:79–87, 2016.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. LION, 5:507–523, 2011.
Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.
Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krizhevsky, A., Nair, V., and Hinton, G. The cifar-10 dataset, 2014.
Leen, T. K. and Orr, G. B. Optimal stochastic search and adaptive momentum. In Advances in neural information processing systems, pp. 477–484, 1994.
Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
Martens, J., Sutskever, I., and Swersky, K. Estimating the hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464, 2012.
Mitliagkas, I., Zhang, C., Hadjis, S., and Ré, C. Asynchrony begets momentum, with an application to deep learning. arXiv preprint arXiv:1605.09774, 2016.
Nesterov, Y. A method of solving a convex programming problem with convergence rate o(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.
Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
Niu, F., Recht, B., Ré, C., and Wright, S. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.
Orr, G. B. and Müller, K.-R. Neural networks: tricks of the trade. Springer, 2003.
Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318, 2013.
Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.
Qiu, G., Varley, M., and Terrell, T. Accelerated training of backpropagation networks by using adaptive momentum step. Electronics letters, 28(4):377–379, 1992.
Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. 2018.
Rehman, M. Z. and Nawi, N. M. The effect of adaptive momentum in improving the accuracy of gradient descent back propagation algorithm on classification problems. In International Conference on Software Engineering and Computer Systems, pp. 380–390. Springer, 2011.
Roux, N. L., Manzagol, P.-A., and Bengio, Y. Topmoumoute online natural gradient algorithm. In Advances in neural information processing systems, pp. 849–856, 2008.
Schaul, T., Zhang, S., and LeCun, Y. No more pesky learning rates. ICML (3), 28:343–351, 2013.
Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1139–1147, 2013.
Swanston, D., Bishop, J., and Mitchell, R. J. Simple adaptive momentum: new algorithm for training multilayer perceptrons. Electronics Letters, 30(18):1498–1500, 1994.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
Zhang, C. and Ré, C. Dimmwitted: A study of main-memory statistical analytics. PVLDB, 7(12):1283–1294, 2014. URL http://www.vldb.org/pvldb/vol7/p1283-zhang.pdf.
Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

A PROOF OF LEMMA 3

To prove Lemma 3, we first prove a more generalized version in Lemma 7. By restricting $f$ to be a one-dimensional quadratic function, the generalized curvature $h_t$ itself is the only eigenvalue.
We can prove Lemma 3 as a straightforward corollary. Lemma 7 also implies, in the multiple dimensional correspondence of (4), the spectral radius $\rho(A_t) = \sqrt{\mu}$ if the curvature on all eigenvector directions (eigenvalue) satisfies (6).

Lemma 7. Let the gradients of a function $f$ be described by
$$\nabla f(x_t) = H(x_t)(x_t - x^*), \tag{19}$$
with $H(x_t) \in \mathbb{R}^n \mapsto \mathbb{R}^{n \times n}$. Then the momentum update can be expressed as a linear operator:
$$\begin{pmatrix} y_{t+1} \\ y_t \end{pmatrix} = \begin{pmatrix} I - \alpha H(x_t) + \mu I & -\mu I \\ I & 0 \end{pmatrix} \begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix} = A_t \begin{pmatrix} y_t \\ y_{t-1} \end{pmatrix}, \tag{20}$$
where $y_t \triangleq x_t - x^*$. Now, assume that the following condition holds for all eigenvalues $\lambda(H(x_t))$ of $H(x_t)$:
$$\frac{(1 - \sqrt{\mu})^2}{\alpha} \le \lambda(H(x_t)) \le \frac{(1 + \sqrt{\mu})^2}{\alpha}. \tag{21}$$
Then the spectral radius of $A_t$ is controlled by momentum with $\rho(A_t) = \sqrt{\mu}$.

Proof.
Let $\lambda_t$ be an eigenvalue of matrix $A_t$, so that $\det(A_t - \lambda_t I) = 0$. We define the blocks in $A_t - \lambda_t I$ as $C = I - \alpha H_t + \mu I - \lambda_t I$, $D = -\mu I$, $E = I$ and $F = -\lambda_t I$, which gives
$$\det(A_t - \lambda_t I) = \det F \, \det\left(C - D F^{-1} E\right) = 0$$
assuming generally $F$ is invertible. Note we use $H_t \triangleq H(x_t)$ for simplicity in writing. The equation $\det\left(C - D F^{-1} E\right) = 0$ implies that
$$\det\left(\lambda_t^2 I - \lambda_t M_t + \mu I\right) = 0 \tag{22}$$
with $M_t = (I - \alpha H_t + \mu I)$. In other words, $\lambda_t$ satisfies $\lambda_t^2 - \lambda_t \lambda(M_t) + \mu = 0$ with $\lambda(M_t)$ being one eigenvalue of $M_t$, i.e.
$$\lambda_t = \frac{\lambda(M_t) \pm \sqrt{\lambda(M_t)^2 - 4\mu}}{2}. \tag{23}$$
On the other hand, (21) guarantees that $(1 - \alpha\lambda(H_t) + \mu)^2 \le 4\mu$.
We know both $H_t$ and $I - \alpha H_t + \mu I$ are symmetric. Thus for all eigenvalues $\lambda(M_t)$ of $M_t$, we have $\lambda(M_t)^2 = (1 - \alpha\lambda(H_t) + \mu)^2 \le 4\mu$, so the discriminant in (23) is non-positive. The two roots are then complex conjugates (or a repeated real root) whose product is $\mu$, which guarantees $|\lambda_t| = \sqrt{\mu}$ for all $\lambda_t$. As the spectral radius is equal to the magnitude of the largest eigenvalue of $A_t$, the spectral radius of $A_t$ is $\sqrt{\mu}$.

B PROOF OF LEMMA 5
We first prove Lemma 8 and Lemma 9 as preparation for the proof of Lemma 5. After the proof for the one dimensional case, we discuss the trivial generalization to the multiple dimensional case.

Lemma 8. Let $h$ be the curvature of a one dimensional quadratic function $f$ and $\bar{x}_t = \mathbb{E} x_t$. We assume, without loss of generality, the optimum point of $f$ is $x^\star = 0$. Then we have the following recurrence
$$\begin{pmatrix} \bar{x}_{t+1} \\ \bar{x}_t \end{pmatrix} = \begin{pmatrix} 1 - \alpha h + \mu & -\mu \\ 1 & 0 \end{pmatrix}^t \begin{pmatrix} x_1 \\ x_0 \end{pmatrix} \tag{24}$$

Proof.
From the recurrence of momentum SGD, we have
$$\begin{aligned}
\mathbb{E} x_{t+1} &= \mathbb{E}\left[x_t - \alpha \nabla f_{S_t}(x_t) + \mu(x_t - x_{t-1})\right] \\
&= \mathbb{E}_{x_t}\left[x_t - \alpha \mathbb{E}_{S_t} \nabla f_{S_t}(x_t) + \mu(x_t - x_{t-1})\right] \\
&= \mathbb{E}_{x_t}\left[x_t - \alpha h x_t + \mu(x_t - x_{t-1})\right] \\
&= (1 - \alpha h + \mu)\bar{x}_t - \mu \bar{x}_{t-1}.
\end{aligned}$$
By putting the equation into matrix form, (24) is a straightforward result from unrolling the recurrence $t$ times. Note that as we set $x_1 = x_0$ with no uncertainty in momentum SGD, we have $[\bar{x}_0, \bar{x}_1] = [x_0, x_1]$.

Lemma 9. Let $U_t = \mathbb{E}(x_t - \bar{x}_t)^2$ and $V_t = \mathbb{E}(x_t - \bar{x}_t)(x_{t-1} - \bar{x}_{t-1})$ with $\bar{x}_t$ being the expectation of $x_t$.
For a quadratic function $f(x)$ with curvature $h \in \mathbb{R}$, we have the following recurrence
$$\begin{pmatrix} U_{t+1} \\ U_t \\ V_{t+1} \end{pmatrix} = (I - B^t)(I - B)^{-1} \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix} \tag{25}$$
where
$$B = \begin{pmatrix} (1 - \alpha h + \mu)^2 & \mu^2 & -2\mu(1 - \alpha h + \mu) \\ 1 & 0 & 0 \\ 1 - \alpha h + \mu & 0 & -\mu \end{pmatrix} \tag{26}$$
and $C = \mathbb{E}(\nabla f_{S_t}(x_t) - \nabla f(x_t))^2$ is the variance of the gradient on minibatch $S_t$.

Proof. We prove by first deriving the recurrence for $U_t$ and $V_t$ respectively and combining them into a matrix form.
For $U_t$, we have
$$\begin{aligned}
U_{t+1} &= \mathbb{E}(x_{t+1} - \bar{x}_{t+1})^2 \\
&= \mathbb{E}\left(x_t - \alpha \nabla f_{S_t}(x_t) + \mu(x_t - x_{t-1}) - (1 - \alpha h + \mu)\bar{x}_t + \mu \bar{x}_{t-1}\right)^2 \\
&= \mathbb{E}\left(x_t - \alpha \nabla f(x_t) + \mu(x_t - x_{t-1}) - (1 - \alpha h + \mu)\bar{x}_t + \mu \bar{x}_{t-1} + \alpha(\nabla f(x_t) - \nabla f_{S_t}(x_t))\right)^2 \\
&= \mathbb{E}\left((1 - \alpha h + \mu)(x_t - \bar{x}_t) - \mu(x_{t-1} - \bar{x}_{t-1})\right)^2 + \alpha^2 \mathbb{E}\left(\nabla f(x_t) - \nabla f_{S_t}(x_t)\right)^2 \\
&= (1 - \alpha h + \mu)^2 \mathbb{E}(x_t - \bar{x}_t)^2 - 2\mu(1 - \alpha h + \mu)\mathbb{E}(x_t - \bar{x}_t)(x_{t-1} - \bar{x}_{t-1}) \\
&\quad + \mu^2 \mathbb{E}(x_{t-1} - \bar{x}_{t-1})^2 + \alpha^2 C
\end{aligned} \tag{27}$$
where the cross terms cancel in the third equality because $\mathbb{E}_{S_t}[\nabla f(x_t) - \nabla f_{S_t}(x_t)] = 0$.

For $V_t$, we can similarly derive
$$\begin{aligned}
V_t &= \mathbb{E}(x_t - \bar{x}_t)(x_{t-1} - \bar{x}_{t-1}) \\
&= \mathbb{E}\left((1 - \alpha h + \mu)(x_{t-1} - \bar{x}_{t-1}) - \mu(x_{t-2} - \bar{x}_{t-2}) + \alpha(\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1}))\right)(x_{t-1} - \bar{x}_{t-1}) \\
&= (1 - \alpha h + \mu)\mathbb{E}(x_{t-1} - \bar{x}_{t-1})^2 - \mu \mathbb{E}(x_{t-1} - \bar{x}_{t-1})(x_{t-2} - \bar{x}_{t-2})
\end{aligned} \tag{28}$$
Again, the term involving $\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1})$ cancels in the third equality as a result of $\mathbb{E}_{S_{t-1}}[\nabla f(x_{t-1}) - \nabla f_{S_{t-1}}(x_{t-1})] = 0$. (27) and (28) can be jointly expressed in the following matrix form
$$\begin{pmatrix} U_{t+1} \\ U_t \\ V_{t+1} \end{pmatrix} = B \begin{pmatrix} U_t \\ U_{t-1} \\ V_t \end{pmatrix} + \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix} = \sum_{i=0}^{t-1} B^i \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix} + B^t \begin{pmatrix} U_1 \\ U_0 \\ V_1 \end{pmatrix} = (I - B^t)(I - B)^{-1} \begin{pmatrix} \alpha^2 C \\ 0 \\ 0 \end{pmatrix}. \tag{29}$$
Note the second term in the second equality is zero because $x_0$ and $x_1$ are deterministic. Thus $U_1 = U_0 = V_1 = 0$.

According to Lemma 8 and 9, we have $\mathbb{E}(\bar{x}_{t+1} - x^*)^2 = (e_1^\top A^t [x_1, x_0]^\top)^2$, with $A$ the $2 \times 2$ matrix in (24), and $\mathbb{E}(x_{t+1} - \bar{x}_{t+1})^2 = \alpha^2 C e_1^\top (I - B^t)(I - B)^{-1} e_1$, where $e_1 \in \mathbb{R}^n$ has all zero entries but the first dimension. Since $\mathbb{E}(x_{t+1} - x^*)^2 = (\bar{x}_{t+1} - x^*)^2 + \mathbb{E}(x_{t+1} - \bar{x}_{t+1})^2$, combining these two terms proves Lemma 5. Though the proof here is for one dimensional quadratics, it trivially generalizes to multiple dimensional quadratics. Specifically, we can decompose the quadratics along the eigenvector directions, and then apply Lemma 5 to each eigenvector direction using the corresponding curvature $h$ (eigenvalue). By summing the quantities in (11) over all eigenvector directions, we obtain the multiple dimensional correspondence of (11).

C PROOF OF LEMMA 6
Again we first present a proof of a multiple dimensional generalized version of Lemma 6. The proof of Lemma 6 is a one dimensional special case of Lemma 10.
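The recurrences of Appendix B, and the one dimensional spectral-radius claim that Lemma 6 makes about the matrix $B$ in (26), can be checked numerically. The sketch below is only an illustration under assumed values ($\alpha = 0.1$, $\mu = 0.25$, $h = 10$, $C = 1$, chosen so that $h$ satisfies the curvature condition); the function names are hypothetical and are not part of the paper.

```python
import numpy as np


def momentum_operators(alpha, mu, h):
    """Build the 2x2 operator of (24) and the 3x3 operator B of (26)."""
    m = 1.0 - alpha * h + mu
    A = np.array([[m, -mu],
                  [1.0, 0.0]])
    B = np.array([[m * m, mu * mu, -2.0 * mu * m],
                  [1.0, 0.0, 0.0],
                  [m, 0.0, -mu]])
    return A, B


def variance_state_closed_form(B, alpha, C, t):
    """Right-hand side of (25): (I - B^t)(I - B)^{-1} [alpha^2 C, 0, 0]^T."""
    c = np.array([alpha ** 2 * C, 0.0, 0.0])
    eye = np.eye(3)
    return (eye - np.linalg.matrix_power(B, t)) @ np.linalg.solve(eye - B, c)


def variance_state_iterated(B, alpha, C, t):
    """Unroll the first equality of (29) for t steps from [U_1, U_0, V_1] = 0."""
    c = np.array([alpha ** 2 * C, 0.0, 0.0])
    state = np.zeros(3)
    for _ in range(t):
        state = B @ state + c
    return state


# Illustrative values (assumptions): with mu = 0.25 and alpha = 0.1 the
# curvature condition reads 2.5 <= h <= 22.5, so h = 10 lies inside it.
alpha, mu, h, C = 0.1, 0.25, 10.0, 1.0
A, B = momentum_operators(alpha, mu, h)

rho_A = np.max(np.abs(np.linalg.eigvals(A)))  # Lemma 3 predicts sqrt(mu)
rho_B = np.max(np.abs(np.linalg.eigvals(B)))  # Lemma 6 predicts mu

closed = variance_state_closed_form(B, alpha, C, t=50)
iterated = variance_state_iterated(B, alpha, C, t=50)

print("rho(A) =", rho_A, " sqrt(mu) =", np.sqrt(mu))
print("rho(B) =", rho_B, " mu =", mu)
print("closed form matches unrolled recurrence:", np.allclose(closed, iterated))
```

Moving `h` outside the interval allowed by the curvature condition is a quick way to see the spectral radii depart from $\sqrt{\mu}$ and $\mu$.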
Lemma 10 also implies that for multiple dimension quadratics, the corresponding spectral radius $\rho(B) = \mu$ if $\frac{(1 - \sqrt{\mu})^2}{\alpha} \le h \le \frac{(1 + \sqrt{\mu})^2}{\alpha}$ on all the eigenvector directions with $h$ being the eigenvalue (curvature).

Lemma 10. Let $H \in \mathbb{R}^{n \times n}$ be a symmetric matrix and $\rho(B)$ be the spectral radius of matrix
$$B = \begin{pmatrix} (I - \alpha H + \mu I)^\top (I - \alpha H + \mu I) & \mu^2 I & -2\mu(I - \alpha H + \mu I) \\ I & 0 & 0 \\ I - \alpha H + \mu I & 0 & -\mu I \end{pmatrix} \tag{30}$$
We have $\rho(B) = \mu$ if all eigenvalues $\lambda(H)$ of $H$ satisfy
$$\frac{(1 - \sqrt{\mu})^2}{\alpha} \le \lambda(H) \le \frac{(1 + \sqrt{\mu})^2}{\alpha}. \tag{31}$$

Proof. Let $\lambda$ be an eigenvalue of matrix $B$, so that $\det(B - \lambda I) = 0$, which can be alternatively expressed as
$$\det(B - \lambda I) = \det F \, \det\left(C - D F^{-1} E\right) = 0 \tag{32}$$
assuming $F$ is invertible, i.e. $\lambda + \mu \ne 0$, where the blocks in $B - \lambda I$ are
$$C = \begin{pmatrix} M^\top M - \lambda I & \mu^2 I \\ I & -\lambda I \end{pmatrix}, \quad D = \begin{pmatrix} -2\mu M \\ 0 \end{pmatrix}, \quad E = \begin{pmatrix} M \\ 0 \end{pmatrix}^\top, \quad F = -\mu I - \lambda I$$
with $M = I - \alpha H + \mu I$. (32) can be transformed using straightforward algebra as
$$\det \begin{pmatrix} (\lambda - \mu) M^\top M - (\lambda + \mu)\lambda I & (\lambda + \mu)\mu^2 I \\ (\lambda + \mu) I & -(\lambda + \mu)\lambda I \end{pmatrix} = 0 \tag{33}$$
Using a similar simplification technique as in (32), we can further simplify into
$$(\lambda - \mu) \det\left((\lambda + \mu)^2 I - \lambda M^\top M\right) = 0 \tag{34}$$
If $\lambda \ne \mu$, as $(\lambda + \mu)^2 I - \lambda M^\top M$ is diagonalizable, we have $(\lambda + \mu)^2 - \lambda \lambda(M)^2 = 0$ with $\lambda(M)$ being an eigenvalue of symmetric $M$. The analytic solution to the equation can be explicitly expressed as
$$\lambda = \frac{\lambda(M)^2 - 2\mu \pm \sqrt{(\lambda(M)^2 - 2\mu)^2 - 4\mu^2}}{2}. \tag{35}$$
When the condition in (31) holds, we have $\lambda(M)^2 = (1 - \alpha\lambda(H) + \mu)^2 \le 4\mu$. One can verify that
$$\begin{aligned}
(\lambda(M)^2 - 2\mu)^2 - 4\mu^2 &= (\lambda(M)^2 - 4\mu)\lambda(M)^2 \\
&= \left((1 - \alpha\lambda(H) + \mu)^2 - 4\mu\right)\lambda(M)^2 \\
&\le 0
\end{aligned} \tag{36}$$
Thus the roots in (35) are conjugate with $|\lambda| = \mu$. The excluded cases $\lambda = \mu$ and $\lambda = -\mu$ also have magnitude $\mu$. In conclusion, the condition in (31) guarantees that all the eigenvalues of $B$ have magnitude $\mu$. Thus the spectral radius of $B$ is controlled by $\mu$.

D PRACTICAL IMPLEMENTATION
In Section 3.2, we discuss estimators for learning rate and momentum tuning in YELLOWFIN. In our experiments, we have identified a few practical implementation details which are important for improving the estimators. Zero-debias, proposed by Kingma & Ba (2014), accelerates the process by which an exponential average adapts to the level of the original quantity in the beginning. We applied zero-debias to all the exponential average quantities involved in our estimators. In some LSTM models, we observe that our estimated curvature may decrease quickly along the optimization process.
In order to better estimate the extremal curvatures $h_{max}$ and $h_{min}$ under a fast decreasing trend, we apply the zero-debiased exponential average to the logarithm of $h_{max,t}$ and $h_{min,t}$, instead of directly to $h_{max,t}$ and $h_{min,t}$. Apart from the above two techniques, we also implemented the slow start heuristic proposed by Schaul et al. (2013). More specifically, we use $\alpha = \min\{\alpha_t, t \cdot \alpha_t / (10 \cdot w)\}$ as our learning rate, with $w$ the size of our sliding window in $h_{max}$ and $h_{min}$ estimation. It discounts the learning rate in the first $10 \cdot w$ steps and helps to keep the learning rate small in the beginning, when the exponentially averaged quantities are not accurate enough.

network           # layers   Conv 0      Unit 1s                           Unit 2s                            Unit 3s
CIFAR10 ResNet    110        [3×3, 4]    [3×3, 4; 3×3, 4] ×6               [3×3, 8; 3×3, 8] ×6                [3×3, 16; 3×3, 16] ×6
CIFAR100 ResNet   164        [3×3, 4]    [1×1, 16; 3×3, 16; 1×1, 64] ×6    [1×1, 32; 3×3, 32; 1×1, 128] ×6    [1×1, 64; 3×3, 64; 1×1, 256] ×6

network           # layers   Word Embed.               Layer 1             Layer 2             Layer 3
TS LSTM           2          [65 vocab, 128 dim]       128 hidden units    128 hidden units    –
PTB LSTM          2          [10000 vocab, 200 dim]    200 hidden units    200 hidden units    –
WSJ LSTM          3          [6922 vocab, 500 dim]     500 hidden units    500 hidden units    500 hidden units

Table 3. Specification of ResNet and LSTM model architectures. Semicolons separate the stacked convolutions (filter size, filter number) inside one unit.

E MODEL SPECIFICATION
The model specification is shown in Table 3 for all the experiments in Section 5. CIFAR10 ResNet uses the regular ResNet units while CIFAR100 ResNet uses the bottleneck units. Only the convolutional layers are shown, with filter size, filter number as well as the repeating count of the units. The layer counting for ResNets also includes batch normalization and ReLU layers. The LSTM models are also diversified for different tasks with different vocabulary sizes, word embedding dimensions and numbers of layers.

F SPECIFICATION FOR SYNCHRONOUS EXPERIMENTS
In Section 5.1, we demonstrate the synchronous experiments with extensive discussions. For reproducibility, we provide here the specification of learning rate grids. The number of iterations as well as epochs, i.e. the number of passes over the full training sets, are also listed for completeness. For YELLOWFIN in all the experiments in Section 5, we uniformly use sliding window size 20 for extremal curvature estimation and $\beta = 0.999$ for smoothing.
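The two smoothing-related techniques of Appendix D can be sketched in a few lines using the settings just listed (window size $w = 20$, $\beta = 0.999$). This is a minimal illustration, not the YELLOWFIN implementation: the class and function names are hypothetical, and counting steps from $t = 1$ is an assumption.

```python
import math


class ZeroDebiasedLogEMA:
    """Zero-debiased exponential average applied to the logarithm of a
    positive quantity (for example an extremal curvature estimate)."""

    def __init__(self, beta=0.999):
        self.beta = beta
        self.avg = 0.0    # biased running average of log(value)
        self.steps = 0    # number of updates seen so far

    def update(self, value):
        self.steps += 1
        self.avg = self.beta * self.avg + (1.0 - self.beta) * math.log(value)
        # Zero-debias in the style of Kingma & Ba (2014): divide by 1 - beta^t.
        debiased = self.avg / (1.0 - self.beta ** self.steps)
        return math.exp(debiased)


def slow_start_lr(alpha_t, t, window=20):
    """Slow start: alpha = min(alpha_t, t * alpha_t / (10 * w)).

    The step counter t is assumed to start at 1 (an assumption of this sketch).
    """
    return min(alpha_t, t * alpha_t / (10.0 * window))


# Usage: a curvature estimate that shrinks quickly, smoothed on the log scale.
tracker = ZeroDebiasedLogEMA(beta=0.999)
for step in range(1, 6):
    h_max_t = 100.0 / step                      # made-up decreasing curvature
    smoothed = tracker.update(h_max_t)
    lr = slow_start_lr(alpha_t=0.1, t=step, window=20)
    print(step, round(smoothed, 3), lr)
```

Averaging on the log scale makes the smoothed value a geometric rather than arithmetic mean, which follows a quantity that drops by orders of magnitude more closely.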
For momentum SGD and Adam, we use the following configurations.
  • CIFAR10 ResNet
      – 40k iterations (∼114 epochs)
      – Momentum SGD learning rates {0.001, 0.01 (best), 0.1, 1.0}, momentum 0.9
      – Adam learning rates {0.0001, 0.001 (best), 0.01, 0.1}
  • CIFAR100 ResNet
      – 120k iterations (∼341 epochs)
      – Momentum SGD learning rates {0.001, 0.01 (best), 0.1, 1.0}, momentum 0.9
      – Adam learning rates {0.00001, 0.0001 (best), 0.001, 0.01}
  • PTB LSTM
      – 30k iterations (∼13 epochs)
      – Momentum SGD learning rates {0.01, 0.1, 1.0 (best), 10.0}, momentum 0.9
      – Adam learning rates {0.0001, 0.001 (best), 0.01, 0.1}
  • TS LSTM
      – ∼21k iterations (50 epochs)
      – Momentum SGD learning rates {0.05, 0.1, 0.5, 1.0 (best), 5.0}, momentum 0.9
      – Adam learning rates {0.0005, 0.001, 0.005 (best), 0.01, 0.05}
      – Decrease learning rate by factor 0.97 every epoch for all optimizers, following the design by Karpathy et al. (2015).
  • WSJ LSTM
      – ∼120k iterations (50 epochs)
      – Momentum SGD learning rates {0.05, 0.1, 0.5 (best), 1.0, 5.0}, momentum 0.9
      – Adam learning rates {0.0001, 0.0005, 0.001 (best), 0.005, 0.01}
      – Vanilla SGD learning rates {0.05, 0.1, 0.5, 1.0 (best), 5.0}
      – Adagrad learning rates {0.05, 0.1, 0.5 (best), 1.0, 5.0}
      – Decrease learning rate by factor 0.9 every epoch after 14 epochs for all optimizers, following the design by Choe & Charniak.

G ADDITIONAL EXPERIMENT RESULTS
G.1 The importance of adaptive momentum
In Section 5.1, we discussed the importance of adaptive momentum by demonstrating the training loss on the TS LSTM and CIFAR100 ResNet models. In Figure 9, we further validate the importance of adaptive momentum by demonstrating the corresponding validation/test performance on the TS LSTM and CIFAR100 ResNet models. Particularly in Figure 9 (left and middle), similar to our observation on the training loss comparison, we can also see that neither prescribed momentum 0.0 nor 0.9 can match the performance of YELLOWFIN with adaptive momentum across the two tasks. Furthermore, in Figure 9 (right), hand-tuned Vanilla SGD without momentum decreases the validation perplexity in TS LSTM more slowly than momentum based methods.
However, by dynamically rescaling the Vanilla SGD learning rate based on YellowFin tuned momentum, it demonstrates a validation perplexity decreasing speed matching that of momentum based methods.

[Figure 9 plots: validation perplexity versus iterations on TS LSTM (left) and test accuracy versus iterations on CIFAR100 ResNet (middle), for YellowFin, YF mom.=0.0, YF mom.=0.9 and YF rescaling; validation perplexity versus iterations on TS LSTM (right), for Momentum SGD, Vanilla SGD, YellowFin and YF rescaling.]
Figure 9. Importance of adaptive momentum: The validation/test performance comparison between YELLOWFIN with adaptive momentum and YELLOWFIN with fixed momentum values; this comparison is conducted on TS LSTM (left) and CIFAR100 ResNet (middle). Prescribed momentum values do not match the performance of YELLOWFIN with adaptive momentum across the two tasks.
An adaptive learning rate for SGD based on YELLOWFIN tuned momentum can match the performance of momentum based methods on the TS LSTM (right).

G.2 Training loss and test accuracy on CIFAR10 and CIFAR100 ResNet
In Figure 10, we demonstrate the training loss on CIFAR10 ResNet and CIFAR100 ResNet. Specifically, YELLOWFIN can match the performance of hand-tuned momentum SGD, and achieves 1.93x and 1.38x speedup compared to hand-tuned Adam on CIFAR10 and CIFAR100 ResNet respectively. In Figure 11, we show the test accuracy curves corresponding to the curves in Figure 10. We can observe that YELLOWFIN can have matching or better training loss at the end of training than hand-tuned momentum SGD, while the test accuracy is worse (e.g. CIFAR100); this phenomenon, where better training loss does not guarantee better generalization, is often observed in deep learning results.

[Figure 10 plots: training loss (log scale) versus iterations for Momentum SGD, Adam and YellowFin.]
Figure 10. The best training loss for the 100-layer CIFAR10 ResNet (left) and 164-layer CIFAR100 bottleneck ResNet (right).

[Figure 11 plots: test accuracy versus iterations for Momentum SGD, Adam and YellowFin.]
Figure 11. Test accuracy for ResNet on 100-layer CIFAR10 ResNet (left) and 164-layer CIFAR100 bottleneck ResNet (right).
The test accuracy\r\n              curves correspond to the training loss curves in Figure 10.\r\n               G.3 Tuning momentum can improve Adam in the asynchronous-parallel setting\r\n               [Plot: training loss vs. iterations, 0k to 30k; one curve per \u03b2_1 in {\u22120.2, 0.0, 0.3, 0.5, 0.7, 0.9}]\r\n                                                Figure 12. Hand-tuning Adam's momentum under asynchrony.\r\n               We conduct experiments on the PTB LSTM with 16 asynchronous workers using Adam, following the same protocol as in Section 5.2.\r\n               Fixing the learning rate to the value achieving the lowest smoothed loss in Section 5.1, we sweep the smoothing parameter\r\n               \u03b2_1 (Kingma & Ba, 2014) of the first-order moment estimate over the grid {\u22120.2, 0.0, 0.3, 0.5, 0.7, 0.9}. \u03b2_1 serves the same role as\r\n               momentum in SGD, and we call it the momentum in Adam. 
Figure 12 shows that tuning momentum for Adam under asynchrony\r\n               gives measurably better training loss. This result emphasizes the importance of momentum tuning in asynchronous settings\r\n               and suggests that state-of-the-art adaptive methods can perform sub-optimally when using prescribed momentum.\r\n               G.4 Accelerating YELLOWFIN with finer-grain learning rate tuning\r\n               As an adaptive tuner, YELLOWFIN does not require manual tuning. It can therefore enable faster development iterations on model\r\n               architectures than grid search over optimizer hyperparameters. In deep learning practice for computer vision and natural\r\n               language processing, after fixing the model architecture, extensive optimizer tuning (e.g. grid search or random search)\r\n               can further improve the performance of a model. A natural question is whether we can also lightly tune YELLOWFIN to\r\n               accelerate convergence and improve model performance. Specifically, we can manually multiply the auto-tuned learning rate\r\n               in YELLOWFIN by a positive number, the learning rate factor, to further accelerate training.\r\n               In this section, we empirically demonstrate the effectiveness of the learning rate factor on a 29-layer ResNext (2x64d) (Xie\r\n               et al., 2016) on CIFAR10 and a Tied LSTM model (Press & Wolf, 2016) with 650 dimensions for word embedding and\r\n               two hidden-unit layers on the PTB dataset. When running YELLOWFIN, we search for the optimal learning rate factor\r\n               in the grid {1/3, 0.5, 1, 2 (best for ResNext), 3 (best for Tied LSTM), 10}. Similarly, we search the same learning rate factor grid\r\n               for Adam, multiplying its default learning rate 0.001 by the factor. 
To further strengthen the performance of Adam as a\r\n               baseline, we also run it on a conventional logarithmic learning rate grid {5e\u22125, 1e\u22124, 5e\u22124, 1e\u22123, 5e\u22123} for ResNext and\r\n               {1e\u22124, 5e\u22124, 1e\u22123, 5e\u22123, 1e\u22122} for Tied LSTM. We report the best metric from searching the union of the learning rate factor\r\n               grid and the logarithmic learning rate grid as the searched Adam results. Recently, AMSGrad (AMSG) (Reddi et al., 2018) was\r\n               proposed as a variant of Adam to correct the convergence issue on certain convex problems. To provide a complete\r\n               comparison, we additionally perform a learning rate factor search with the grid {0.1, 1/3, 0.5, 1, 2, 3, 10} for AMSG. Empirically,\r\n               we observe that Adam and AMSG have similar convergence behavior with the same learning rate factors, and that learning rate factors 1/3 and\r\n               1.0 work best for both Adam and AMSG on ResNext and Tied LSTM, respectively.\r\n               As shown in Figure 13, with the searched best learning rate factor, YELLOWFIN can improve validation perplexity on Tied\r\n               LSTM from 88.7 to 80.5, an improvement of more than 9%. Similarly, the searched learning rate factor can improve test\r\n               accuracy from 92.63 to 94.75 on ResNext. More importantly, we observe that, with learning rate factor search on the two\r\n               models, YELLOWFIN can achieve better validation metrics than the searched Adam and AMSG results. This demonstrates that\r\n               finer-grain learning rate tuning, i.e. 
the learning rate factor search, can be effectively applied to YELLOWFIN to improve the\r\n               performance of deep learning models.\r\n               [Plots: validation perplexity vs. epochs (0 to 40) for Tied LSTM (left) and validation accuracy vs. epochs (0 to 200) for ResNext (right); curves: YellowFin, Adam default, AMSG default, YF searched, Adam searched, AMSG searched]\r\n              Figure 13. Validation perplexity on Tied LSTM and validation accuracy on ResNext. Learning rate fine-tuning using a grid-searched factor\r\n               can further improve the performance of YELLOWFIN in Algorithm 1. YELLOWFIN with learning rate factor search can outperform hand-tuned\r\n              Adam on validation metrics on both models.\r\n", "award": [], "sourceid": 153, "authors": [{"given_name": "Jian", "family_name": "Zhang", "institution": "Stanford University"}, {"given_name": "Ioannis", "family_name": "Mitliagkas", "institution": "Mila & University of Montreal"}]}