{"title": "A System for Massively Parallel Hyperparameter Tuning", "book": "Proceedings of Machine Learning and Systems", "page_first": 230, "page_last": 246, "abstract": "Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed computing settings. We address this challenge by first introducing a simple and robust hyperparameter optimization algorithm called ASHA, which exploits parallelism and aggressive early-stopping to tackle large-scale hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms existing state-of-the-art hyperparameter optimization methods; scales linearly with the number of workers in distributed settings; and is suitable for massive parallelism, converging to a high quality configuration in half the time taken by Vizier (Google\u2019s internal hyperparameter optimization service) in an experiment with 500 workers. We then describe several design decisions we encountered, along with our associated solutions, when integrating ASHA in SystemX, an end-to-end production-quality machine learning system that offers hyperparameter tuning as a service.", "full_text": "                      A SYSTEM FOR MASSIVELY PARALLEL HYPERPARAMETER TUNING\r\n                Liam Li 1  Kevin Jamieson 2  Afshin Rostamizadeh 3  Ekaterina Gonina 3  Jonathan Ben-Tzur 4  Moritz Hardt 5\r\n                                                        Benjamin Recht 5  Ameet Talwalkar 1 4\r\n                                                                       ABSTRACT\r\n                    Modern learning models are characterized by large hyperparameter spaces and long training times. 
These properties,\r\n                    coupled with the rise of parallel computing and the growing demand to productionize machine learning\r\n                    workloads, motivate the need to develop mature hyperparameter optimization functionality in distributed\r\n                    computing settings. We address this challenge by first introducing a simple and robust hyperparameter optimization\r\n                    algorithm called ASHA, which exploits parallelism and aggressive early-stopping to tackle large-scale\r\n                    hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms existing state-of-the-art\r\n                    hyperparameter optimization methods; scales linearly with the number of workers in distributed settings; and is\r\n                    suitable for massive parallelism, as demonstrated on a task with 500 workers. We then describe several design\r\n                    decisions we encountered, along with our associated solutions, when integrating ASHA in Determined AI\u2019s\r\n                    end-to-end production-quality machine learning system that offers hyperparameter tuning as a service.\r\n               1    INTRODUCTION                                                ter setting. Leveraging distributed computational resources\r\n               Although machine learning (ML) models have recently              presents a solution to the increasingly challenging problem\r\n               achieved dramatic successes in a variety of practical ap-        of hyperparameter optimization.\r\n               plications, these models are highly sensitive to internal pa-    4. Productionization of ML. ML is increasingly driving\r\n               rameters, i.e., hyperparameters. 
In these modern regimes,        innovations in industries ranging from autonomous vehicles\r\n               four trends motivate the need for production-quality systems     to scientific discovery to quantitative finance. As ML moves\r\n               that support massively parallel hyperparameter tuning:           from R&D to production, ML infrastructure must mature\r\n               1. High-dimensional search spaces. Models are becoming           accordingly, with hyperparameter optimization as one of the\r\n               increasingly complex, as evidenced by modern neural net-         core supported workloads.\r\n               works with dozens of hyperparameters. For such complex           In this work, we address the problem of developing\r\n               models with hyperparameters that interact in unknown ways,       production-quality hyperparameter tuning functionality in\r\n               a practitioner is forced to evaluate potentially thousands of    a distributed computing setting. Support for massive paral-\r\n               different hyperparameter settings.                               lelism is a cornerstone design criterion of such a system and\r\n               2. Increasing training times. As datasets grow larger and        thus a main focus of our work.\r\n               models become more complex, training a model has become          To this end, and motivated by the shortfalls of existing meth-\r\n               dramatically more expensive, often taking days or weeks          ods, we first introduce the Asynchronous Successive Halving\r\n               on specialized high-performance hardware. This trend is          Algorithm (ASHA), a simple and practical hyperparameter\r\n               particularly onerous in the context of hyperparameter opti-      optimization method suitable for massive parallelism that\r\n               mization, as a new model must be trained to evaluate each        exploits aggressive early stopping. 
Our algorithm is inspired\r\n               candidate hyperparameter configuration.                          by the Successive Halving algorithm (SHA) (Karnin et al.,\r\n               3. Rise of parallel computing. The combination of a grow-        2013; Jamieson & Talwalkar, 2015), a theoretically princi-\r\n               ing number of hyperparameters and longer training time per       pled early stopping method that allocates more resources to\r\n               model precludes evaluating configurations sequentially; we       promising configurations. ASHA is designed for what we\r\n               simply cannot wait years to find a suitable hyperparame-         refer to as the \u2018large-scale regime,\u2019 where to find a good hy-\r\n                                                                                perparameter setting, we must evaluate orders of magnitude\r\n                   1 Carnegie Mellon University 2 University of Washington      more hyperparameter configurations than available parallel\r\n               3 Google Research 4 Determined AI 5 University of California,    workers in a small multiple of the wall-clock time needed\r\n               Berkeley. Correspondence to: Liam Li <me@liamcli.com>.           to train a single model.\r\n               Proceedings of the 3rd MLSys Conference, Austin, TX, USA,        We next perform a thorough comparison of several hyper-\r\n               2020. Copyright 2020 by the author(s).\r\n               parameter tuning methods in both the sequential and par-         cate more training \u201cresources\u201d to promising configurations.\r\n               allel settings. 
We focus on \u2018mature\u2019 methods, i.e., well-        Previous methods such as Gy\u00f6rgy & Kocsis (2011), Agarwal\r\n               established techniques that have been empirically and/or         et al. (2011), and Sabharwal et al. (2016) provide theoretical\r\n               theoretically studied to an extent that they could be con-       guarantees under strong assumptions on the convergence\r\n               sidered for adoption in a production-grade system. In the        behavior of intermediate losses. Krueger et al. (2015) re-\r\n               sequential setting, we compare SHA with Fabolas (Klein           lies on a heuristic early-stopping rule based on sequential\r\n               et al., 2017a), Population Based Training (PBT) (Jaderberg       analysis to terminate poor configurations.\r\n               et al., 2017), and BOHB (Falkner et al., 2018), state-of-the-    In contrast, SHA (Jamieson & Talwalkar, 2015) and Hyper-\r\n               art methods that exploit partial training. Our results show      band (Li et al., 2018) are adaptive configuration evaluation\r\n               that SHA outperforms these methods, which, when coupled          approaches which do not have the aforementioned draw-\r\n               with SHA\u2019s simplicity and theoretical grounding, motivates       backs and have achieved state-of-the-art performance on\r\n               the use of a SHA-based method in production. We further          several empirical tasks. SHA serves as the inner loop for\r\n               verify that SHA and ASHA achieve similar results. 
In the         Hyperband, with Hyperband automating the choice of the\r\n               parallel setting, our experiments demonstrate that ASHA          early-stopping rate by running different variants of SHA.\r\n               addresses the intrinsic issues of parallelizing SHA, scales      While the appropriate choice of early-stopping rate is prob-\r\n               linearly with the number of workers, and exceeds the per-        lem dependent, the empirical results of Li et al. (2018) show that\r\n               formance of PBT, BOHB, and Vizier (Golovin et al., 2017),        aggressive early-stopping works well for a wide variety of\r\n               Google\u2019s internal hyperparameter optimization service.           tasks. Hence, we focus on adapting SHA to the parallel set-\r\n               Finally, based on our experience developing ASHA within          ting in Section 3, though we also evaluate the corresponding\r\n               Determined AI\u2019s production-quality machine learning sys-         asynchronous Hyperband method.\r\n               tem that offers hyperparameter tuning as a service, we de-       Hybrid approaches combine adaptive configuration selec-\r\n               scribe several systems design decisions and optimizations        tion and evaluation (Swersky et al., 2013; 2014; Domhan\r\n               that we explored as part of the implementation. We focus         et al., 2015; Klein et al., 2017a). Li et al. (2018) showed\r\n               on four key considerations: (1) streamlining the user inter-     that SHA/Hyperband outperforms SMAC with the learning-\r\n               face to enhance usability; (2) autoscaling parallel training to  curve-based early-stopping method introduced by Domhan\r\n               systematically balance the tradeoff between lower latency        et al. (2015). In contrast, Klein et al. 
(2017a) reported\r\n               in individual model training and higher throughput in total      state-of-the-art performance for Fabolas on several tasks in\r\n               configuration evaluation; (3) efficiently scheduling ML jobs     comparison to Hyperband and other leading methods. How-\r\n               to optimize multi-tenant cluster utilization; and (4) tracking   ever, our results in Section 4.1 demonstrate that under an\r\n               parallel hyperparameter tuning jobs for reproducibility.         appropriate experimental setup, SHA and Hyperband in fact\r\n               2    RELATED WORK                                                outperform Fabolas. Moreover, we note that Fabolas, along\r\n                                                                                with most other Bayesian optimization approaches, can be\r\n               We will first discuss related work that motivated our focus      parallelized using a constant liar (CL) type heuristic (Gins-\r\n               on parallelizing SHA for the large-scale regime. We then         bourger et al., 2010; Gonz\u00e1lez et al., 2016). However, the\r\n               provide an overview of methods for parallel hyperparameter       parallel version will underperform the sequential version,\r\n               tuning, from which we identify a mature subset to compare        since the latter uses a more accurate posterior to propose\r\n               to in our empirical studies (Section 4). Finally, we discuss     new points. Hence, our comparisons to these methods are\r\n               related work on systems for hyperparameter optimization.         restricted to the sequential setting.\r\n               Sequential Methods.       
Existing hyperparameter tuning         Other hybrid approaches combine Hyperband with adap-\r\n               methods attempt to speed up the search for a good con-           tive sampling. For example, Klein et al. (2017b) combined\r\n               figuration by either adaptively selecting configurations or      Bayesian neural networks with Hyperband by first train-\r\n               adaptively evaluating configurations. Adaptive configura-        ing a Bayesian neural network to predict learning curves\r\n               tion selection approaches attempt to identify promising re-      and then using the model to select promising configura-\r\n               gions of the hyperparameter search space from which to           tions to use as inputs to Hyperband. More recently, Falkner\r\n               sample new configurations to evaluate (Hutter et al., 2011;      et al. (2018) introduced BOHB, a hybrid method combining\r\n               Snoek et al., 2012; Bergstra et al., 2011; Srinivas et al.,      Bayesian optimization with Hyperband. They also proposed\r\n               2010). However, by relying on previous observations to in-       a parallelization scheme for SHA that retains synchronized\r\n               form which configuration to evaluate next, these algorithms      eliminations of underperforming configurations. We discuss\r\n               are inherently sequential and thus not suitable for the large-   the drawbacks of this parallelization scheme in Section 3\r\n               scale regime, where the number of updates to the posterior       and demonstrate that ASHA outperforms this version of\r\n               is limited. In contrast, adaptive configuration evaluation ap-   parallel SHA as well as BOHB in Section 4.2. 
We note that\r\n               proaches attempt to early-stop poor configurations and allo-     similar to SHA/Hyperband, ASHA can be combined with\r\n                adaptive sampling for more robustness to certain challenges          the performance prediction framework by Qi et al. (2017).\r\n                of parallel computing that we discuss in Section 3.\r\n                Parallel Methods. Established parallel methods for hyper-            3    ASHA ALGORITHM\r\n                parameter tuning include PBT (Jaderberg et al., 2017; Li             We start with an overview of SHA (Karnin et al., 2013;\r\n                et al., 2019) and Vizier (Golovin et al., 2017). PBT is a            Jamieson & Talwalkar, 2015) and motivate the need to adapt\r\n                state-of-the-art hybrid evolutionary approach that exploits          it to the parallel setting. Then we present ASHA and dis-\r\n                partial training to iteratively increase the fitness of a pop-       cuss how it addresses issues with synchronous SHA and\r\n                ulation of models. In contrast to Hyperband, PBT lacks               improves upon the original algorithm.\r\n                any theoretical guarantees. Additionally, PBT is primarily\r\n                designed for neural networks and is not a general approach           3.1   Successive Halving (SHA)\r\n                for hyperparameter tuning. We further note that PBT is\r\n                more comparable to SHA than to Hyperband since both                  The idea behind SHA (Algorithm 1) is simple: allocate a\r\n                PBT and SHA require the user to set the early-stopping rate          small budget to each configuration, evaluate all configura-\r\n                via internal hyperparameters.                                        
tions and keep the top 1/\u03b7, increase the budget per con-\r\n                Vizier is Google\u2019s black-box optimization service with sup-          figuration by a factor of \u03b7, and repeat until the maximum\r\n                port for multiple hyperparameter optimization methods and            per-configuration budget of R is reached (lines 5\u201311). The\r\n                early-stopping options. For succinctness, we will refer to           resource allocated by SHA can be iterations of stochastic\r\n                Vizier\u2019s default algorithm as \u201cVizier\u201d although it is simply         gradient descent, number of training examples, number of\r\n                one of the methods available on the Vizier platform. While           random features, etc.\r\n                Vizier provides early-stopping rules, the strategies only of-        Algorithm 1 Successive Halving Algorithm.\r\n                fer an approximately 3\u00d7 speedup, in contrast to the order-of-           input number of configurations n, minimum resource\r\n                magnitude speedups observed for SHA. We compare to PBT                  r, maximum resource R, reduction factor \u03b7, minimum\r\n                and Vizier in Section 4.2 and Section 4.4, respectively.                early-stopping rate s\r\n                Hyperparameter Optimization Systems. While there is a                   s_max = \u230alog_\u03b7(R/r)\u230b\r\n                large body of work on systems for machine learning, we nar-\r\n                row our focus to systems for hyperparameter optimization.               
assert n \u2265 \u03b7^{s_max \u2212 s} so that at least one configuration will\r\n                AutoWEKA (Kotthoff et al., 2017) and AutoSklearn (Feurer                be allocated R.\r\n                et al., 2015) are two established single-machine, single-user           T = get_hyperparameter_configuration(n)\r\n                systems for hyperparameter optimization. Existing systems               // All configurations trained for a given\r\n                for distributed hyperparameter optimization include Vizier              // i constitute a \u201crung.\u201d\r\n                (Golovin et al., 2017), RayTune (Liaw et al., 2018), CHOPT              for i \u2208 {0, ..., s_max \u2212 s} do\r\n                (Kim et al., 2018) and Optuna (Akiba et al.). These existing              n_i = \u230an \u03b7^{\u2212i}\u230b\r\n                systems provide generic support for a wide range of hyperpa-              r_i = r \u03b7^{i+s}\r\n                rameter tuning algorithms; both RayTune and Optuna in fact                L = {run_then_return_val_loss(\u03b8, r_i) : \u03b8 \u2208 T}\r\n                have support for ASHA. In contrast, our work focuses on a                 T = top_k(T, L, n_i/\u03b7)\r\n                specific algorithm\u2014ASHA\u2014that we argue is particularly                   end for\r\n                well-suited for massively parallel hyperparameter optimiza-             return best configuration in T\r\n                tion. We further introduce a variety of systems optimizations\r\n                designed specifically to improve the performance, usabil-            SHA requires the number of configurations n, a minimum\r\n                ity, and robustness of ASHA in production environments.              
resource r, a maximum resource R, a reduction factor\r\n                We believe that these optimizations would directly help              \u03b7 \u2265 2, and a minimum early-stopping rate s. Addition-\r\n                existing systems support ASHA effectively, and general-              ally, the get_hyperparameter_configuration(n)\r\n                izations of these optimizations could also be beneficial in          subroutine returns n configurations sampled randomly\r\n                supporting other hyperparameter tuning algorithms.                   from a given search space, and the\r\n                Similarly, we note that Kim et al. (2018) address the prob-          run_then_return_val_loss(\u03b8, r) subroutine returns the validation\r\n                lem of resource management for generic hyperparameter                loss after training the model with the hyperparameter setting\r\n                optimization methods in a shared compute environment,                \u03b8 for r resources. For a given early-stopping rate s, a\r\n                while we focus on efficient resource allocation with adaptive        minimum resource of r_0 = r \u03b7^s will be allocated to each\r\n                scheduling specifically for ASHA in Section 5.3. Addition-           configuration. Hence, lower s corresponds to more aggres-\r\n                ally, in contrast to the user-specified automated scaling capa-      sive early-stopping, with s = 0 prescribing a minimum\r\n                bility for parallel training presented in Xiao et al. (2018), we     resource of r. 
We will refer to SHA with different values\r\n                propose to automate appropriate autoscaling limits by using          of s as brackets and, within a bracket, we will refer to each\r\n                                                                                     round of promotion as a rung with the base rung numbered\r\n                                                                                         bracket s    rung i    n_i   r_i   total budget\r\n                                                                                         0              0       9     1               9\r\n                                                                                         0              1       3     3               9\r\n                                                                                         0              2       1     9               9\r\n                                                                                         1              0       9     3              27\r\n                                                                                         1              1       3     9              27\r\n                                                                                         2              0       9     9              81\r\n                  (a) Visual depiction of the promotion scheme for bracket s = 0.         (b) Promotion scheme for different brackets s.\r\n                                          Figure 1. Promotion scheme for SHA with n = 9, r = 1, R = 9, and \u03b7 = 3.\r\n               0 and increasing. 
Figure 1(a) shows the rungs for bracket           3.2   Asynchronous SHA (ASHA)\r\n               0 for an example setting with n = 9, r = 1, R = 9, and              We now introduce ASHA as an effective technique to paral-\r\n               \u03b7 = 3, while Figure 1(b) shows how resource allocations             lelize SHA, leveraging asynchrony to mitigate stragglers and\r\n               change for different brackets s. Namely, the starting budget        maximize parallelism. Intuitively, ASHA promotes configu-\r\n               per configuration r_0 \u2264 R increases by a factor of \u03b7 per            rations to the next rung whenever possible instead of waiting\r\n               increment of s. Hence, it takes more resources to explore           for a rung to complete before proceeding to the next rung.\r\n               the same number of configurations for higher s. Note that           Additionally, if no promotions are possible, ASHA simply\r\n               for a given s, the same budget is allocated to each rung but        adds a configuration to the base rung, so that more configura-\r\n               is split between fewer configurations in higher rungs.              tions can be promoted to the upper rungs. ASHA is formally\r\n               Straightforward ways of parallelizing SHA are not well              defined in Algorithm 2. Given its asynchronous nature, it\r\n               suited for the parallel regime. We could consider the em-           does not require the user to pre-specify the number of config-\r\n               barrassingly parallel approach of running multiple instances        urations to evaluate, but it otherwise requires the same inputs\r\n               of SHA, one on each worker. However, this strategy is not           as SHA. 
Note that the run_then_return_val_loss\r\n               well suited for the large-scale regime, where we would like         subroutine in ASHA is asynchronous and the code execution\r\n               results in little more than the time to train one configuration.    continues after the job is passed to the worker. ASHA\u2019s\r\n               To see this, assume that training time for a configuration          promotion scheme is laid out in the get_job subroutine.\r\n               scales linearly with the allocated resource and time(R) rep-        ASHA is well-suited for the large-scale regime, where wall-\r\n               resents the time required to train a configuration for the max-     clock time is constrained to a small multiple of the time\r\n               imum resource R. In general, for a given bracket s, the min-        needed to train a single model. For ease of comparison with\r\n               imum time to return a configuration trained to completion is        SHA, assume training time scales linearly with the resource.\r\n               (log_\u03b7(R/r) \u2212 s + 1) \u00d7 time(R), where log_\u03b7(R/r) \u2212 s + 1            Consider the example of Bracket 0 shown in Figure 1, and\r\n               counts the number of rungs. For example, consider Bracket           assume we can run ASHA with 9 machines. Then ASHA re-\r\n               0 in the toy example in Figure 1. The time needed to return         turns a fully trained configuration in 13/9 \u00d7 time(R), since 9\r\n               a fully trained configuration is 3 \u00d7 time(R), since there are\r\n               three rungs and each rung is allocated R resources. 
In con-          machines are sufficient to promote configurations to the next\r\n               trast, as we will see in the next section, our parallelization      rung in the same time it takes to train a single configuration\r\n               scheme for SHA can return an answer in just time(R).                in the rung. Hence, the training time for a configuration in\r\n                                                                                   rung 0 is 1/9 \u00d7 time(R), for rung 1 it is 1/3 \u00d7 time(R), and\r\n               Another naive way of parallelizing SHA is to distribute the         for rung 2 it is time(R). In general, \u03b7^{log_\u03b7(R) \u2212 s} machines\r\n               training of the n/\u03b7^k surviving configurations on each rung         are needed to advance a configuration to the next rung in the\r\n               k as is done by Falkner et al. (2018) and add brackets when         same time it takes to train a single configuration in the rung,\r\n               there are no jobs available in existing brackets. We will           and it takes \u03b7^{s+i\u2212log_\u03b7(R)} \u00d7 time(R) to train a configuration\r\n               refer to this method as \u201csynchronous\u201d SHA. The efficacy of          in rung i. 
Hence, ASHA can return a configuration trained\r\n               this strategy is severely hampered by two issues: (1) SHA\u2019s         to completion in time\r\n               synchronous nature is sensitive to stragglers and dropped\r\n               jobs as every configuration within a rung must complete                  ( \u2211_{i=s}^{log_\u03b7(R)} \u03b7^{i\u2212log_\u03b7(R)} ) \u00d7 time(R) \u2264 2 time(R).\r\n               before proceeding to the next rung, and (2) the estimate of\r\n               the top 1/\u03b7 configurations for a given early-stopping rate\r\n               does not improve as more brackets are run since promotions          Moreover, when training is iterative, ASHA can return an an-\r\n               are performed independently for each bracket. We demon-             swer in time(R), since incrementally trained configurations\r\n               strate the susceptibility of synchronous SHA to stragglers          can be checkpointed and resumed.\r\n               and dropped jobs on simulated workloads in Appendix A.1.            Finally, since Hyperband simply runs multiple SHA brack-\r\n               Algorithm 2 Asynchronous Successive Halving (ASHA)                 ways. First, Li et al. (2018) discuss two SHA variants:\r\n                  input minimum resource r, maximum resource R, reduc-            finite horizon (bounded resource R per configuration) and\r\n                  tion factor \u03b7, minimum early-stopping rate s                    infinite horizon (unbounded resources R per configuration).\r\n                  function ASHA()                                                 ASHA consolidates these settings into one algorithm. 
In\r\n                     repeat                                                       Algorithm 2, we do not promote configurations that have\r\n                       for each free worker do                                    been trained for R, thereby restricting the number of rungs.\r\n                          (\u03b8, k) = get_job()                                      However, this algorithm trivially generalizes to the infinite\r\n                          run_then_return_val_loss(\u03b8, r \u03b7^{s+k})                  horizon; we can remove this restriction so that the maximum\r\n                       end for                                                    resource per configuration increases naturally as configura-\r\n                       for completed job (\u03b8, k) with loss l do                    tions are promoted to higher rungs. In contrast, SHA does\r\n                          Update configuration \u03b8 in rung k with loss l.           not naturally extend to the infinite horizon setting, as it relies\r\n                       end for                                                    on the doubling trick and must rerun brackets with larger\r\n                     until desired                                                budgets to increase the maximum resource.\r\n                  end function                                                    Additionally, SHA does not return an output until a single\r\n                  function get_job()                                              bracket completes. In the finite horizon this means that there\r\n                     // Check if there is a promotable config.                    is a constant interval of (# of rungs \u00d7 time(R)) between\r\n                     for k = \u230alog_\u03b7(R/r)\u230b \u2212 s \u2212 1, ..., 1, 0 do                   receiving outputs from SHA. 
In the in\ufb01nite horizon this\r\n                       candidates = top k(rung k,|rung k|/\u03b7)                      interval doubles between outputs. In contrast, ASHA grows\r\n                       promotable = {t \u2208 candidates : t not promoted}             the bracket incrementally instead of in \ufb01xed budget intervals.\r\n                       if |promotable| > 0 then                                   Tofurther reduce latency, ASHA uses intermediate losses\r\n                          return promotable[0],k + 1                              to determine the current best performing con\ufb01guration, as\r\n                       endif                                                      opposed to only considering the \ufb01nal SHA outputs.\r\n                       // If not, grow bottom rung.\r\n                       Drawrandomcon\ufb01guration\u03b8.                                   4    EMPIRICAL EVALUATION\r\n                       return \u03b8,0\r\n                     endfor                                                       We\ufb01rstpresentresults in the sequential setting to justify our\r\n                  endfunction                                                     choice of focusing on SHA and to compare SHA to ASHA.\r\n                                                                                  Wenextevaluate ASHAinparallel environments on three\r\n                                                                                  benchmark tasks.\r\n               ets, we can asynchronously parallelize Hyperband by either         4.1   Sequential Experiments\r\n               running multiple brackets of ASHA or looping through\r\n               brackets of ASHA sequentially as is done in the original           WebenchmarkHyperbandandSHAagainstPBT,BOHB\r\n               Hyperband. 
We employ the latter looping scheme for asyn-           (synchronous SHA with Bayesian optimization as intro-\r\n               chronous Hyperband in the next section.                            duced by Falkner et al. (2018)), and Fabolas, and examine\r\n                                                                                  the relative performance of SHA versus ASHA and Hy-\r\n               3.3   Algorithm Discussion                                         perband versus asynchronous Hyperband. As mentioned\r\n               ASHA is able to remove the bottleneck associated with              previously, asynchronousHyperbandloopsthroughbrackets\r\n               synchronous promotions by incurring a small number of in-          of ASHAwithdifferent early-stopping rates.\r\n               correct promotions, i.e. con\ufb01gurations that were promoted          WecompareASHAagainstPBT,BOHB,andsynchronous\r\n               early on but are not in the top 1/\u03b7 of con\ufb01gurations in hind-      SHAontwobenchmarksforCIFAR-10: (1)tuningacon-\r\n               sight. By the law of large numbers, we expect to erroneously       volutional neural network (CNN) with the cuda-convnet\r\n               promote a vanishing fraction of con\ufb01gurations in each rung         architecture and the same search space as (Li et al., 2017);\r\n               as the number of con\ufb01gurations grows. Intuitively, in the          and (2) tuning a CNN architecture with varying number of\r\n               \ufb01rst rung with n evaluated con\ufb01gurations, the number of            layers, batch size, and number of \ufb01lters. The details for\r\n               mispromoted con\ufb01gurations is roughly \u221an, since the pro-            the search spaces considered and the settings we used for\r\n               cess resembles the convergence of an empirical cumulative          each search method can be found in Appendix A.3. 
Note\r\n               distribution function (CDF) to its expected value (Dvoretzky       that BOHB uses SHA to perform early-stopping and dif-\r\n               et al., 1956). For later rungs, although the con\ufb01gurations are     fers only in how con\ufb01gurations are sampled; while SHA\r\n               nolongeri.i.d. since they were advanced based on the empir-        uses random sampling, BOHB uses Bayesian optimization\r\n               ical CDF from the rung below, we expect this dependence            to adaptively sample new con\ufb01gurations. In the following\r\n               to be weak.                                                        experiments, we run BOHB using the same early-stopping\r\n               We further note that ASHA improves upon SHA in two                 rate as SHA and ASHA instead of looping through brackets\r\n                                                  ASystemforMassivelyParallelHyperparameterTuning\r\n                          CIFAR10 Using Small Cuda Convnet Model                 work used by Klein et al. (2017a), we present our results\r\n                    0.26                                                         on4different benchmarks comparing Hyperband to Fabo-\r\n                                         SHA           ASHA\r\n                    0.25                 Hyperband     Hyperband (async)         las in Appendix A.2. 
In summary, our results show that\r\n                    0.24                 Random        BOHB                      Hyperband, speci\ufb01cally the \ufb01rst bracket of SHA, tends to\r\n                                         PBT                                     outperform Fabolas while also exhibiting lower variance\r\n                    0.23                                                         across experimental trials.\r\n                    0.22\r\n                   Test Error0.21                                                4.2   Limited-Scale Distributed Experiments\r\n                    0.20\r\n                    0.19                                                         Wenext compare ASHA to synchronous SHA, the paral-\r\n                    0.18                                                         lelization scheme discussed in Section 3.1; BOHB; and PBT\r\n                        0       500     1000     1500     2000     2500          on the same two tasks. For each experiment, we run each\r\n                                       Duration (Minutes)                        search method with 25 workers for 150 minutes. We use the\r\n                    0.26CIFAR10 Using Small CNN Architecture Tuning Task         samesetups for ASHA and PBT as in the previous section.\r\n                                         SHA           ASHA                      WerunsynchronousSHAandBOHBwithdefaultsettings\r\n                    0.25                 Hyperband     Hyperband (async)         and the same \u03b7 and early-stopping rate as ASHA.\r\n                                         Random        BOHB\r\n                    0.24                 PBT                                     Figure 3 shows the average test error across 5 trials for\r\n                    0.23                                                         each search method. 
On benchmark 1, ASHA evaluated\r\n                                                                                 over 1000 con\ufb01gurations in just over 40 minutes with 25\r\n                   Test Error0.22                                                workers and found a good con\ufb01guration (error rate below\r\n                    0.21                                                         0.21) in approximately the time needed to train a single\r\n                                                                                 model, whereas it took ASHA nearly 400 minutes to do\r\n                    0.20                                                         so in the sequential setting (Figure 2). Notably, we only\r\n                        0       500     1000     1500     2000     2500          achieve a 10\u00d7 speedup on 25 workers due to the relative\r\n                                       Duration (Minutes)                        simplicity of this task, i.e., it only required evaluating a\r\n               Figure 2. Sequential experiments (1 worker). Average across       few hundred con\ufb01gurations to identify a good one in the\r\n               10 trials is shown for each hyperparameter optimization method.   sequential setting. In contrast, for the more dif\ufb01cult search\r\n               Gridded lines represent top and bottom quartiles of trials.       space used in benchmark 2, we observe linear speedups\r\n                                                                                 with ASHA, as the \u223c 700 minutes needed in the sequential\r\n                                                                                 setting (Figure 2) to reach a test error below 0.23 is reduced\r\n               with different early-stopping rates as is done by Hyperband.      to under 25 minutes in the distributed setting.\r\n               Theresults on these two benchmarks are shown in Figure 2.         
Compared to synchronous SHA and BOHB, ASHA \ufb01nds\r\n               Onbenchmark1,HyperbandandallvariantsofSHA(i.e.,                   a good con\ufb01guration 1.5\u00d7 as fast on benchmark 1 while\r\n               SHA, ASHA, and BOHB) outperform PBT by 3\u00d7. On                     BOHB\ufb01ndsaslightlybetter \ufb01nal con\ufb01guration. On bench-\r\n               benchmark 2, while all methods comfortably beat random            mark 2, ASHA performs signi\ufb01cantly better than syn-\r\n               search, SHA, ASHA, BOHBandPBTperformedsimilarly                   chronous SHA and BOHB due to the higher variance in\r\n               and slightly outperform Hyperband and asynchronous Hy-            training times between con\ufb01gurations (the average time re-\r\n               perband. This last observation (i) corroborates the results in    quired to train a con\ufb01guration on the maximum resource\r\n               Li et al. (2017), which found that the brackets with the most     Ris30minutes with a standard deviation of 27 minutes),\r\n               aggressive early-stopping rates performed the best; and (ii)      which exacerbates the sensitivity of synchronous SHA to\r\n               follows from the discussion in Section 2 noting that PBT          stragglers (see Appendix A.1). BOHB actually underper-\r\n               is more similar in spirit to SHA than Hyperband, as PBT /         forms synchronous SHA on benchmark 2 due to its bias\r\n               SHAbothrequireuser-speci\ufb01ed early-stopping rates (and             towards more computationally expensive con\ufb01gurations, re-\r\n               are more aggressive in their early-stopping behavior in these     ducing the number of con\ufb01gurations trained to completion\r\n               experiments). 
WeobservethatSHAandASHAarecompeti-                  within the given time frame.\r\n               tive with BOHB, despite the adaptive sampling scheme used         Wefurther note that ASHA outperforms PBT on benchmark\r\n               by BOHB. Additionally, for both tasks, introducing asyn-          1; in fact the minimum and maximum range for ASHA\r\n               chrony does not consequentially impact the performance            across 5 trials does not overlap with the average for PBT.\r\n               of ASHA (relative to SHA) or asynchronous Hyperband               Onbenchmark2,PBTslightlyoutperformsasynchronous\r\n               (relative to Hyperband). This not surprising; as discussed in     Hyperband and performs comparably to ASHA. However,\r\n               Section 3.3, we expect the number of ASHA mispromotions           note that the ranges for the searchers share large overlap\r\n               to be square root in the number of con\ufb01gurations.                 and the result is likely not signi\ufb01cant. Overall, ASHA out-\r\n               Finally, due to the nuanced nature of the evaluation frame-       performs PBT, BOHB and SHA on these two tasks. 
This improved performance, coupled with the fact that it is a more principled and general approach than either BOHB or PBT (e.g., agnostic to resource type and robust to hyperparameters that change the size of the model), further motivates its use for the large-scale regime.

[Figure 3 plots. Panels: CIFAR10 Using Small Cuda Convnet Model; CIFAR10 Using Small CNN Architecture Tuning Task. x-axis: Duration (Minutes); y-axis: Test Error. Methods: ASHA, PBT, SHA, BOHB.]
Figure 3. Limited-scale distributed experiments with 25 workers. For each searcher, the average test error across 5 trials is shown in each plot. The light dashed lines indicate the min/max ranges. The dotted black line represents the time needed to train the most expensive model in the search space for the maximum resource R. The dotted blue line represents the point at which 25 workers in parallel have performed as much work as a single machine in the sequential experiments (Figure 2).

4.3 Tuning Neural Network Architectures

Motivated by the emergence of neural architecture search (NAS) as a specialized hyperparameter optimization problem, we evaluate ASHA and competitors on two NAS benchmarks: (1) designing convolutional neural networks (CNN) for CIFAR-10 and (2) designing recurrent neural networks (RNN) for Penn Treebank (Marcus et al., 1993). We use the same search spaces as those considered by ? (see Appendix A.5 for more details).

For both benchmarks, we ran ASHA, SHA, and BOHB on 16 workers with η = 4 and a maximum resource of R = 300 epochs. The results in Figure 4 show that ASHA outperforms SHA and BOHB on both benchmarks. Our results for CNN search show that ASHA finds an architecture with test error below 10% nearly twice as fast as SHA and BOHB. ASHA also finds final architectures with lower test error on average: 3.24% for ASHA vs 3.42% for SHA and 3.36% for BOHB. Our results for RNN search show that ASHA finds an architecture with validation perplexity below 80 nearly three times as fast as SHA and BOHB and also converges to an architecture with lower perplexity: 63.5 for ASHA vs 64.3 for SHA and 64.2 for BOHB. Note that vanilla PBT is incompatible with these search spaces since it is not possible to warmstart training with weights from a different architecture. We show ASHA outperforms PBT in addition to SHA and BOHB on an additional search space for LSTMs in Appendix A.6.

[Figure 4 plots. Panels: NAS CNN Search Space on CIFAR-10 (y-axis: Test Error); NAS RNN Search Space on PTB (y-axis: Validation Perplexity). x-axis: Duration (Minutes). Methods: ASHA, BOHB, SHA.]
Figure 4. Tuning neural network architectures with 16 workers. For each searcher, the average test error across 4 trials is shown in each plot. The light dashed lines indicate the min/max ranges.

4.4 Tuning Large-Scale Language Models

In this experiment, we increase the number of workers to 500 to evaluate ASHA for massively parallel hyperparameter tuning. Our search space is constructed based off of the LSTMs considered in Zaremba et al. (2014), with the largest model in our search space matching their large LSTM (see Appendix A.7). For ASHA, we set η = 4, r = R/64, and s = 0; asynchronous Hyperband loops through brackets s = 0, 1, 2, 3. We compare to Vizier without the performance curve early-stopping rule (Golovin et al., 2017).¹

¹ At the time of running the experiment, it was brought to our attention by the team maintaining the Vizier service that the early-stopping code contained a bug which negatively impacted its performance. Hence, we omit the results here.

[Figure 5 plot. Title: LSTM on PTB. x-axis: Time (0R to 6R); y-axis: Perplexity. Methods: ASHA, Hyperband (Loop Brackets), Vizier.]
Figure 5. Large-scale ASHA benchmark requiring weeks to run with 500 workers. The x-axis is measured in units of average time to train a single configuration for R resource. The average across 5 trials is shown, with dashed lines indicating min/max ranges.

The results in Figure 5 show that ASHA and asynchronous Hyperband found good configurations for this task in 1 × time(R). Additionally, ASHA and asynchronous Hyperband are both about 3× faster than Vizier at finding a configuration with test perplexity below 80, despite being much simpler and easier to implement. Furthermore, the best model found by ASHA achieved a test perplexity of 76.6, which is significantly better than 78.4 reported for the large LSTM in Zaremba et al. (2014). We also note that asynchronous Hyperband initially lags behind ASHA, but eventually catches up at around 1.5 × time(R).

Notably, we observe that certain hyperparameter configurations in this benchmark induce perplexities that are orders of magnitude larger than the average case perplexity. Model-based methods that make assumptions on the data distribution, such as Vizier, can degrade in performance without further care to adjust this signal. We attempted to alleviate this by capping perplexity scores at 1000 but this still significantly hampered the performance of Vizier. We view robustness to these types of scenarios as an additional benefit of ASHA and Hyperband.

5 PRODUCTIONIZING ASHA

While integrating ASHA in Determined AI's software platform to deliver production-quality hyperparameter tuning functionality, we encountered several fundamental design decisions that impacted usability, computational performance, and reproducibility. We next discuss each of these design decisions along with proposed systems optimizations for each decision.

5.1 Usability

Ease of use is one of the most important considerations in production; if an advanced method is too cumbersome to use, its benefits may never be realized. In the context of hyperparameter optimization, classical methods like random or grid search require only two intuitive inputs: number of configurations (n) and training resources per configuration (R). In contrast, as a byproduct of adaptivity, all of the modern methods we considered in this work have many internal hyperparameters. ASHA in particular has the following internal settings: elimination rate η, early-stopping rate s, and, in the case of asynchronous Hyperband, the brackets of ASHA to run. To facilitate use and increase adoption of ASHA, we simplify its user interface to require the same inputs as random search and grid search, exposing the internal hyperparameters of ASHA only to advanced users.

Selecting ASHA default settings. Our experiments in Section 4 and the experiments conducted by Li et al. (2018) both show that aggressive early-stopping is effective across a variety of different hyperparameter tuning tasks. Hence, using both works as guidelines, we propose the following default settings for ASHA:
• Elimination rate: we set η = 4 so that the top 1/4 of configurations are promoted to the next rung.
• Maximum early-stopping rate: we set the maximum early-stopping rate for bracket $s_0$ to allow for a maximum of 5 rungs, which indicates a minimum resource of $r = (1/4^4)R = R/256$. Then the minimum resource per configuration for a given bracket s is $r_s = r\eta^s$.
• Brackets to run: to increase robustness to misspecification of the early-stopping rate, we default to running the three most aggressively early-stopping brackets s = 0, 1, 2 of ASHA. We exclude the two least aggressive brackets (i.e. $s_4$ with $r_4 = R$ and $s_3$ with $r_3 = R/4$) to allow for higher speedups from early-stopping. We define this default set of brackets as the 'standard' set of early-stopping brackets, though we also expose the options for more conservative or more aggressive bracket sets.

Using n as ASHA's stopping criterion. Algorithm 2 does not specify a stopping criterion; instead, it relies on the user to stop once an implicit condition is met, e.g., number of configurations evaluated, compute time, or minimum performance achieved. In a production environment, we decided to use the number of configurations n as an explicit stopping criterion both to match the user interface for random and grid search, and to provide an intuitive connection to the underlying difficulty of the search space. In contrast, setting a maximum compute time or minimum performance threshold requires prior knowledge that may not be available.

From a technical perspective, n is allocated to the different brackets while maintaining the same total training resources across brackets.
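As a rough illustration of this allocation under the default settings above (η = 4, at most 5 rungs, brackets s = 0, 1, 2), the sketch below weights each bracket by the inverse of its average per-configuration budget, expressed as a fraction of R and assuming no incorrect promotions. The helper name `bracket_shares` and its parameters are hypothetical and not taken from any ASHA implementation. The shares it yields (roughly 70.6%, 22.1%, and 7.4%) are close to, though not digit-for-digit identical with, the percentages quoted in this section.

```python
from fractions import Fraction


def bracket_shares(eta=4, max_rungs=5, brackets=(0, 1, 2)):
    """Share of the n configurations given to each bracket (hypothetical helper).

    Bracket s has (max_rungs - s) rungs. Its average budget per configuration,
    in units of R, is (# of rungs) / eta**(max_rungs - 1 - s), where
    max_rungs - 1 plays the role of floor(log_eta(R / r)).
    Shares are proportional to the inverse of that average budget.
    """
    avg_budget = {
        s: Fraction(max_rungs - s, eta ** (max_rungs - 1 - s)) for s in brackets
    }
    inverse = {s: 1 / budget for s, budget in avg_budget.items()}
    total = sum(inverse.values())
    shares = {s: inv / total for s, inv in inverse.items()}
    return avg_budget, shares


avg_budget, shares = bracket_shares()
for s in sorted(shares):
    print(f's={s}: average budget = {avg_budget[s]} R, share = {float(shares[s]):.1%}')
```

Because each share is proportional to the inverse of the bracket's average budget, the product of share and average budget is identical for every bracket, which is what keeps the total training resources equal across brackets.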
We do this by first calculating the average budget per configuration for a bracket (assuming no incorrect promotions), and then allocating configurations to brackets according to the inverse of this ratio. For concreteness, let B be the set of brackets we are considering, then the average resource for a given bracket s is

$$\bar{r}_s = \frac{\#\text{ of Rungs}}{\eta^{\lfloor \log_\eta R/r \rfloor - s}}.$$

For the default settings described above, this corresponds to $\bar{r}_0 = 5/256$, $\bar{r}_1 = 4/64$, and $\bar{r}_2 = 3/16$, and further translates to 70.5%, 22.1%, and 7.1% of the configurations being allocated to brackets $s_0$, $s_1$, and $s_2$, respectively.

Note that we still run each bracket asynchronously; the allocated number of configurations $n_s$ for a particular bracket s simply imposes a limit on the width of the bottom rung. In particular, upon reaching the limit $n_s$ in the bottom rung, the number of pending configurations in the bottom rung is at most equal to the number of workers, k. Therefore, since blocking occurs once a bracket can no longer add configurations to the bottom rung and must wait for promotable configurations, for large-scale problems where $n_s \gg k$, limiting the width of rungs will not block promotions until the bracket is near completion. In contrast, synchronous SHA is susceptible to blocking from stragglers throughout the entire process, which can greatly hurt both the latency and throughput of configurations promoted to the top rung (e.g. Section 4.2, Appendix A.1).

5.2 Automatic Scaling of Parallel Training

The promotion schedule for ASHA geometrically increases the resource per configuration as we move up the rungs of a bracket. Hence, the average training time for configurations in higher rungs increases drastically for computation that scales linearly or super-linearly with the training resource, presenting an opportunity to speed up training by using multiple GPUs. We explore autoscaling of parallel training to exploit this opportunity when resources are available.

We determine the maximum degree of parallelism for autoscaling a training task using an efficiency criterion motivated by the observation that speedups from parallel training do not scale linearly with the number of GPUs (Krizhevsky, 2014; Szegedy et al., 2014; You et al., 2017; You et al., 2017; Goyal et al., 2017). More specifically, we can use the Paleo framework, introduced by Qi et al. (2017), to estimate the cost of training neural networks in parallel given different specifications. Qi et al. (2017) demonstrated that the speedups from parallel training computed using Paleo are fairly accurate when compared to the actual observed speedups for a variety of models.

[Figure 6 plot. Title: Inception. x-axis: # of GPUs per Model (2^0 to 2^7); y-axis: # of Models Evaluated (2^0 to 2^7). One curve per time budget: 1, 3, 6, 9, 16, and 24 days.]
Figure 6. Tradeoffs for parallel training of Imagenet using Inception V3. Given that each configuration takes 24 days to train on a single Tesla K80 GPU, we chart the estimated number (according to the Paleo performance model) of configurations evaluated by 128 Tesla K80s as a function of the number of GPUs used to train each model for different time budgets. The dashed line for each color represents the number of models evaluated under perfect scaling, i.e. n GPUs train a single model n times as fast, and spans the feasible range for number of GPUs per model in order to train within the allocated time budget. As expected, more GPUs per configuration are required for smaller time budgets and the total number of configurations evaluated decreases with number of GPUs per model due to decreasing marginal benefit.

Figure 6 shows Paleo applied to Inception on ImageNet (et al., 2016) to estimate the training time with different numbers of GPUs under strong scaling (i.e. fixed batch size with increasing parallelism), Butterfly AllReduce communication scheme, specified hardware settings (namely Tesla K80 GPU and 20G Ethernet), and a batch size of 1024.

The diminishing returns when using more GPUs to train a single model are evident in Figure 6. Additionally, there is a tradeoff between using resources to train a model faster to reduce latency versus evaluating more configurations to increase throughput. Using the predicted tradeoff curves generated using Paleo, we can automatically limit the number of GPUs per configuration to control efficiency relative to perfect linear scaling, e.g., if the desired level of efficiency is at least 75%, then we would limit the number of GPUs per configuration for Inception to at most 16 GPUs.

5.3 Resource Allocation

Whereas research clusters often require users to specify the number of workers requested and allocate workers in a first-in-first-out (FIFO) fashion, this scheduling mechanism is poorly suited for production settings for two main reasons. First, as we discuss below in the context of ASHA, machine learning workflows can have variable resource requirements over the lifetime of a job, and forcing users to specify static resource requirements can result in suboptimal cluster utilization.
Second, FIFO scheduling can result in poor sharing of cluster resources among users, as a single large job could saturate the cluster and block all other user jobs.\r\nWe address these issues with a centralized fair-share scheduler that adaptively allocates resources over the lifetime of each job. Such a scheduler must both (i) determine the appropriate amount of parallelism for each individual job, and (ii) allocate computational resources across all user jobs. In the context of an ASHA workload, the scheduler automatically determines the maximum resource requirement at any given time based on the inputs to ASHA and the parallel scaling profile determined by Paleo. Then, the scheduler allocates cluster resources by considering the resource requirements of all jobs while maintaining fair allocation across users. We describe each of these components in more detail below.\r\nAlgorithm level resource allocation. Recall that we propose to use the number of configurations, n, as a stopping criterion for ASHA in production settings. Crucially, this design decision limits the maximum degree of parallelism for an ASHA job. If n is the number of desired configurations for a given ASHA bracket and \u03ba the maximum allowable training parallelism, e.g., as determined by Paleo, then at initialization, the maximum parallelism for the bracket is n\u03ba. We maintain a stack of training tasks S that is populated initially with all n configurations for the bottom rung. The top task in S is popped off whenever a worker requests a task, and promotable configurations are added to the top of S when tasks complete. As ASHA progresses, the maximum parallelism is adaptively defined as \u03ba|S|. Hence, an adaptive worker allocation schedule that relies on \u03ba|S| would improve cluster utilization relative to a static allocation scheme, without adversely impacting performance.\r\nCluster level resource allocation. Given the maximum degree of parallelism for any ASHA job, the scheduler then allocates resources uniformly across all jobs while respecting these maximum parallelism limits. We allow for an optional priority weighting factor so that certain jobs can receive a larger ratio of the total computational resources. Resource allocation is performed using a water-filling scheme where any allocation above the maximum resource requirements for a job is distributed evenly to the remaining jobs.\r\nFor concreteness, consider a scenario in which we have a cluster of 32 GPUs shared between a group of users. When a single user is running an ASHA job with 8 configurations in S_1 and a maximum training parallelism of \u03ba_1 = 4, the scheduler will allocate all 32 GPUs to this ASHA job. When another user submits an ASHA job with a maximum parallelism of \u03ba_2|S_2| = 64, the central scheduler will then allocate 16 GPUs to each user. This simple scenario demonstrates how our central scheduler allows jobs to benefit from maximum parallelism when the computing resources are available, while maintaining fair allocation across jobs in the presence of resource contention.\r\n5.4 Reproducibility in Distributed Environments\r\nReproducibility is critical in production settings to instill trust during the model development process; foster collaboration and knowledge transfer across teams of users; and allow for fault tolerance and iterative refinement of models. However, ASHA introduces two primary reproducibility challenges, each of which we describe below.\r\nPausing and restarting configurations. There are many sources of randomness when training machine learning models; some sources can be made deterministic by setting the random seed, while others related to GPU floating-point computations and CPU multi-threading are harder to avoid without performance ramifications. Hence, reproducibility when resuming promoted configurations requires carefully checkpointing all stateful objects pertaining to the model. At a minimum this includes the model weights, model optimizer state, random number generator states, and data generator state. We provide a checkpointing solution that facilitates reproducibility in the presence of stateful variables and seeded random generators. The availability of deterministic GPU floating-point computations is dependent on the deep learning framework, but we allow users to control for all other sources of randomness during training.\r\nAsynchronous promotions. To allow for full reproducibility of ASHA, we track the sequence of all promotions made within a bracket. This sequence fixes the nondeterminism from asynchrony, allowing subsequent replay of the exact promotions made in the original run. Consequently, we can reconstruct the full state of a bracket at any point in time, i.e., which configurations are on which rungs and which training tasks are in the stack.\r\nTaken together, reproducible checkpoints and full bracket states allow us to seamlessly resume hyperparameter tuning jobs when crashes happen and allow users to request the evaluation of more configurations if desired. For ASHA, refining hyperparameter selection by resuming an existing bracket is highly beneficial, since a wider rung gives better empirical estimates of the top 1/\u03b7 configurations.\r\n6 CONCLUSION\r\nIn this paper, we addressed the problem of developing a production-quality system for hyperparameter tuning by introducing ASHA, a theoretically principled method for simple and robust massively parallel hyperparameter optimization. We presented empirical results demonstrating that ASHA outperforms state-of-the-art methods Fabolas, PBT, BOHB, and Vizier in a suite of hyperparameter tuning benchmarks. Finally, we provided systems level solutions to improve the effectiveness of ASHA that are applicable to existing systems that support our algorithm.\r\nREFERENCES\r\nAgarwal, A., Duchi, J., Bartlett, P. L., and Levrard, C. Oracle inequalities for computationally budgeted model selection. In COLT, 2011.\r\nAkiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.\r\nBergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for hyper-parameter optimization. In NIPS, 2011.\r\nDomhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.\r\nDvoretzky, A., Kiefer, J., and Wolfowitz, J. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27:642\u2013669, 1956.\r\net al., D. M. Announcing tensorflow 0.8 now with distributed computing support!, 2016. URL https://research.googleblog.com/2016/04/announcing-tensorflow-08-now-with.html.\r\nFalkner, S., Klein, A., and Hutter, F. Bohb: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pp. 1436\u20131445, 2018.\r\nFeurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In NIPS, 2015.\r\nGinsbourger, D., Le Riche, R., and Carraro, L. Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pp. 131\u2013162. Springer, 2010.\r\nGolovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google vizier: A service for black-box optimization. In KDD, 2017.\r\nGonz\u00e1lez, J., Zhenwen, D., Hennig, P., and Lawrence, N. Batch bayesian optimization via local penalization. In AISTATS, 2016.\r\nGoyal, P., Doll\u00e1r, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv:1706.02677, 2017.\r\nGy\u00f6rgy, A. and Kocsis, L. Efficient multi-start strategies for local search algorithms. JAIR, 41, 2011.\r\nHutter, F., Hoos, H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, 2011.\r\nJaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., and Kavukcuoglu, K. Population based training of neural networks. arXiv:1711.09846, 2017.\r\nJamieson, K. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, 2015.\r\nKandasamy, K., Krishnamurthy, A., Schneider, J., and P\u00f3czos, B. Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, 2018.\r\nKarnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In ICML, 2013.\r\nKim, J., Kim, M., Park, H., Kusdavletov, E., Lee, D., Kim, A., Kim, J., Ha, J., and Sung, N. CHOPT: Automated hyperparameter optimization framework for cloud-based machine learning platforms. arXiv:1810.03527, 2018.\r\nKlein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. Fast bayesian optimization of machine learning hyperparameters on large datasets. AISTATS, 2017a.\r\nKlein, A., Falkner, S., Springenberg, J., and Hutter, F. Learning curve prediction with bayesian neural networks. In ICLR, 2017b.\r\nKotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K. Auto-weka 2.0: Automatic model selection and hyperparameter optimization in weka. The Journal of Machine Learning Research, 18(1):826\u2013830, 2017.\r\nKrizhevsky, A. Learning multiple layers of features from tiny images. In Technical report, Department of Computer Science, University of Toronto, 2009.\r\nKrizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.\r\nKrueger, T., Panknin, D., and Braun, M. Fast cross-validation via sequential testing. In JMLR, 2015.\r\nLi, A., Spyra, O., Perel, S., Dalibard, V., Jaderberg, M., Gu, C., Budden, D., Harley, T., and Gupta, P. A generalized framework for population based training. arXiv:1902.01894, 2019.\r\nLi, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. Proc. of ICLR, 17, 2017.\r\nLi, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1\u201352, 2018. URL http://jmlr.org/papers/v18/16-558.html.\r\nLiaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., and Stoica, I. Tune: A research platform for distributed model selection and training. In ICML AutoML Workshop, 2018.\r\nMarcus, M., Marcinkiewicz, M., and Santorini, B. Building a large annotated corpus of english: The penn treebank. Computational Linguistics, 19(2):313\u2013330, 1993.\r\nMerity, S., Keskar, N., and Socher, R. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyyGPP0TZ.\r\nNetzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\r\nQi, H., Sparks, E. R., and Talwalkar, A. Paleo: A performance model for deep neural networks. In ICLR, 2017.\r\nSabharwal, A., Samulowitz, H., and Tesauro, G. Selecting near-optimal learners via incremental data allocation. In AAAI, 2016.\r\nSzegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv:1409.4842, 2014.\r\nWu, J. and Frazier, P. The parallel knowledge gradient method for batch bayesian optimization. In NIPS, 2016.\r\nXiao, W., Bhardwaj, R., Ramjee, R., Sivathanu, M., Kwatra, N., Han, Z., Patel, P., Peng, X., Zhao, H., Zhang, Q., et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 595\u2013610, 2018.\r\nYou, Y., Gitman, I., and Ginsburg, B. Scaling SGD batch size to 32k for imagenet training. arXiv:1708.03888, 2017.\r\nYou, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. 100-epoch ImageNet Training with AlexNet in 24 Minutes. arXiv:1709.0501, 2017.\r\nZaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.\r\nSermanet, P., Chintala, S., and LeCun, Y. 
Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.\r\nShah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In NIPS, 2015.\r\nSnoek, J., Larochelle, H., and Adams, R. Practical bayesian optimization of machine learning algorithms. In NIPS, 2012.\r\nSrinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.\r\nSwersky, K., Snoek, J., and Adams, R. Multi-task bayesian optimization. In NIPS, 2013.\r\nSwersky, K., Snoek, J., and Adams, R. P. Freeze-thaw bayesian optimization. arXiv:1406.3896, 2014.\r\nA APPENDIX\r\nAs part of our supplementary material, we (1) compare the impact of stragglers and dropped jobs on synchronous SHA and ASHA, (2) present the comparison to Fabolas in the sequential setting, and (3) provide additional details for the empirical results shown in Section 4.\r\nA.1 Comparison of Synchronous SHA and ASHA\r\nWe use simulated workloads to evaluate the impact of stragglers and dropped jobs on synchronous SHA and ASHA. For our simulated workloads, we run synchronous SHA with \u03b7 = 4, r = 1, R = 256, and n = 256, and ASHA with the same values and the maximum early-stopping rate s = 0. Note that BOHB (Falkner et al., 2018), one of the competitors we empirically compare to in Section 4, is also susceptible to stragglers and dropped jobs since it uses synchronous SHA as its parallelization scheme but leverages Bayesian optimization to perform adaptive sampling.\r\nFor these synthetic experiments, we assume that the expected training time for each job is the same as the allocated resource. We simulate stragglers by multiplying the expected training time by (1 + |z|), where z is drawn from a normal distribution with mean 0 and a specified standard deviation. We simulate dropped jobs by assuming that there is a given probability p that a job will be dropped at each time unit; hence, for a job with a runtime of 256 units, the probability that it is not dropped is (1 \u2212 p)^256.\r\nFigure 7 shows the number of configurations trained to completion (left) and the time required before one configuration is trained to completion (right) when running synchronous SHA and ASHA using 25 workers. For each combination of training time standard deviation and drop probability, we simulate ASHA and synchronous SHA 25 times and report the average. As can be seen in Figure 7a, ASHA trains many more configurations to completion than synchronous SHA when the standard deviation is high; we hypothesize that this is one reason ASHA performs significantly better than synchronous SHA and BOHB for the second benchmark in Section 4.2. Figure 7b shows that ASHA returns a configuration trained for the maximum resource R much faster than synchronous SHA when there is high variability in training time (i.e., stragglers) and high risk of dropped jobs. Although ASHA is more robust than synchronous SHA to stragglers and dropped jobs on these simulated workloads, we nonetheless compare against synchronous SHA in Section 4.4 and show that ASHA performs better.\r\nA.2 Comparison with Fabolas in Sequential Setting\r\nKlein et al. (2017a) showed that Fabolas can be over an order of magnitude faster than existing Bayesian optimization methods. Additionally, the empirical studies presented in Klein et al. (2017a) suggest that Fabolas is faster than Hyperband at finding a good configuration. We conducted our own experiments to compare Fabolas with Hyperband on the following tasks:\r\n1. Tuning an SVM using the same search space as Klein et al. (2017a).\r\n2. Tuning a convolutional neural network (CNN) with the same search space as Li et al. (2017) on CIFAR-10 (Krizhevsky, 2009).\r\n3. Tuning a CNN on SVHN (Netzer et al., 2011) with varying number of layers, batch size, and number of filters (see Appendix A.4 for more details).\r\nIn the case of the SVM task, the allocated resource is the number of training data points, while for the CNN tasks, the allocated resource is the number of training iterations.\r\nWe note that Fabolas was specifically designed for data points as the resource, and hence is not directly applicable to tasks (2) and (3). However, freeze-thaw Bayesian optimization (Swersky et al., 2014), which was specifically designed for models that use iterations as the resource, is known to perform poorly on deep learning tasks (Domhan et al., 2015). Hence, we believe Fabolas to be a reasonable competitor for tasks (2) and (3) as well, despite the aforementioned shortcoming.\r\nWe use the same evaluation framework as Klein et al. (2017a), where the best configuration, also known as the incumbent, is recorded through time and the test error is calculated in an offline validation step. Following Klein et al. (2017a), the incumbent for Hyperband is taken to be the configuration with the lowest validation loss and the incumbent for Fabolas is the configuration with the lowest predicted validation loss on the full dataset. Moreover, for these experiments, we set \u03b7 = 4 for Hyperband.\r\nNotably, when tracking the best performing configuration for Hyperband, we consider two approaches. We first consider the approach proposed in Li et al. (2018) and used by Klein et al. (2017a) in their evaluation of Hyperband. In this variant, which we refer to as \u201cHyperband (by bracket),\u201d the incumbent is recorded after the completion of each SHA bracket. We also consider a second approach where we record the incumbent after the completion of each rung of SHA to make use of intermediate validation losses, similar to what we propose for ASHA (see discussion in Section 3.3 for details). We will refer to Hyperband using this accounting scheme as \u201cHyperband (by rung).\u201d Interestingly, by leveraging these intermediate losses, we observe that Hyperband actually outperforms Fabolas.\r\nIn Figure 8, we show the performance of Hyperband, Fabolas, and random search. Our results show that Hyperband\r\n[Figure 7 plot residue omitted. Panels are labeled by training-time standard deviation (\u201ctrain std\u201d); x-axis: drop probability; y-axes: \u201c# configurations trained for R\u201d (left group) and \u201ctime until first configuration trained for R\u201d (right group); legend: ASHA, SHA.]\r\n(a) Average number of configurations trained on R resource.\r\n
(b) Average time before a configuration is trained on R resource.\r\nFigure 7. Simulated workloads comparing impact of stragglers and dropped jobs. The number of configurations trained for R resource (left) is higher for ASHA than synchronous SHA when the standard deviation is high. Additionally, the average time before a configuration is trained for R resource (right) is lower for ASHA than for synchronous SHA when there is high variability in training time (i.e., stragglers). Hence, ASHA is more robust to stragglers and dropped jobs than synchronous SHA since it returns a completed configuration faster and returns more configurations trained to completion.\r\n(by rung) is competitive with Fabolas at finding a good configuration and will often find a better configuration than Fabolas with less variance. Note that Hyperband loops through the brackets of SHA, ordered by decreasing early-stopping rate; the first bracket finishes when the test error for Hyperband (by bracket) drops. Hence, most of the progress made by Hyperband comes from the bracket with the most aggressive early-stopping rate, i.e., bracket 0.\r\nA.3 Experiments in Section 4.1 and Section 4.2\r\nWe use the usual train/validation/test splits for CIFAR-10, evaluate configurations on the validation set to inform algorithm decisions, and report test error. These experiments were conducted using g2.2xlarge instances on Amazon AWS.\r\nFor both benchmark tasks, we run SHA and BOHB with n = 256, \u03b7 = 4, s = 0, and set r = R/256, where R = 30000 iterations of stochastic gradient descent. Hyperband loops through 5 brackets of SHA, moving from bracket s = 0, r = R/256 to bracket s = 4, r = R. We run ASHA and asynchronous Hyperband with the same settings as the synchronous versions. We run PBT with a population size of 25, which is between the recommended 20\u201340 (Jaderberg et al., 2017). Furthermore, to help PBT evolve from a good set of configurations, we randomly sample configurations until at least half of the population performs above random guessing.\r\nWe implement PBT with truncation selection for the exploit phase, where the bottom 20% of configurations are replaced with a uniformly sampled configuration from the top 20% (both weights and hyperparameters are copied over). Then, the inherited hyperparameters pass through an exploration phase where 3/4 of the time they are either perturbed by a factor of 1.2 or 0.8 (discrete hyperparameters are perturbed to one of the two adjacent choices), and 1/4 of the time they are randomly resampled. Configurations are considered for exploitation/exploration every 1000 iterations, for a total of 30 rounds of adaptation. For the experiments in Section 4.2, to maintain 100% worker efficiency for PBT while enforcing that all configurations are trained within 2000 iterations of each other, we spawn new populations of 25 whenever a job is not available from existing populations.\r\nVanilla PBT is not compatible with hyperparameters that change the architecture of the neural network, since inherited weights are no longer valid once those hyperparameters are perturbed. To adapt PBT for the architecture tuning task, we fix hyperparameters that affect the architecture in the explore stage. Additionally, we restrict configurations to be trained within 2000 iterations of each other so that a fair comparison is made when selecting configurations to exploit. If we do not\r\n[Figure 8 plot residue omitted. Panels: \u201cSVM on vehicle\u201d and \u201cSVM on MNIST\u201d; y-axis: Test Error; legend: Hyperband (by rung), Hyperband (by bracket), Fabolas, Random.]\r\n
       Random\r\n                    0.20                                                           0.2\r\n                    0.15                                                           0.1\r\n                    0.100    100   200   300    400   500   600   700    800       0.00    100    200   300   400   500    600   700   800\r\n                                         Duration (Minutes)                                            Duration (Minutes)\r\n                    0.40  CIFAR10 Using Small Cuda Convnet Model                 SVHN Trained Using Small CNN Architecture Tuning Task0.200\r\n                                                       Hyperband (by rung)                                           Hyperband (by rung)\r\n                                                       Hyperband (by bracket)    0.175                               Hyperband (by bracket)\r\n                    0.35                               Fabolas                   0.150                               Fabolas\r\n                                                       Random                                                        Random\r\n                    0.30                                                         0.125\r\n                                                                                 0.100\r\n                   Test Error0.25                                               Test Error0.075\r\n                    0.20                                                         0.050\r\n                                                                                 0.025\r\n                    0.150        500      1000      1500      2000      2500     0.0000        500       1000      1500      2000      2500\r\n                                         Duration (Minutes)                                            Duration (Minutes)\r\n               Figure 8. Sequential Experiments (1 worker) with Hyperband running synchronous SHA. 
Hyperband (by rung) records the incumbent\r\n               after the completion of a SHA rung, while Hyperband (by bracket) records the incumbent after the completion of an entire SHA bracket.\r\n               Theaverage test error across 10 trials of each hyperparameter optimization method is shown in each plot. Dashed lines represent min and\r\n               maxrangesforeachtuning method.\r\n                    Hyperparameter         Type             Values                A.4   Experimental Setup for the Small CNN\r\n                                                          6   7   8  9                  Architecture Tuning Task\r\n                       batch size         choice        {2 ,2 ,2 ,2 }\r\n                       # of layers        choice           {2,3,4}                This benchmark tunes a multiple layer CNN network with\r\n                       # of \ufb01lters        choice       {16,32,48,64}              the hyperparameters shown in Table 1. This search space\r\n                                                              \u22124     \u22121\r\n                    weight init std 1   continuous     log [10   , 10   ]         was used for the small architecture task on SVHN (Sec-\r\n                                                                \u22123\r\n                    weight init std 2   continuous       log [10   , 1]           tion A.2) and CIFAR-10 (Section 4.2). The # of layers\r\n                                                                \u22123\r\n                    weight init std 3   continuous       log [10   , 1]           hyperparameter indicate the number of convolutional layers\r\n                                                                \u22125\r\n                      l  penalty 1      continuous       log [10   , 1]\r\n                       2                                                          before two fully connected layers. 
The # of \ufb01lters indicates\r\n                                                                \u22125\r\n                      l  penalty 2      continuous       log [10   , 1]\r\n                       2                                                          the # of \ufb01lters in the CNN layers with the last CNN layer\r\n                                                               \u22123     2\r\n                      l  penalty 3      continuous      log [10   , 10 ]\r\n                       2                                                          having 2 \u00d7# \ufb01lters. Weights are initialized randomly from\r\n                                                               \u22125     1\r\n                      learning rate     continuous      log [10   , 10 ]          a Gaussian distribution with the indicated standard devi-\r\n                                                                                  ation. There are three sets of weight init and l penalty\r\n               Table 1. Hyperparameters for small CNN architecture tuning task.                                                       
2\r\n                                                                                  hyperparameters; weight init 1 and l penalty 1 apply to the\r\n                                                                                                                       2\r\n                                                                                  convolutional layers, weight init 2 and l penalty 2 to the\r\n                                                                                                                             2\r\n                                                                                  \ufb01rst fully connected layer, and weight init 3 and l penalty\r\n                                                                                                                                      2\r\n                                                                                  3tothelast fully connected layer. Finally, the learning rate\r\n               impose this restriction, PBT will be biased against con\ufb01gu-        hyperparameter controls the initial learning rate for SGD.\r\n               rations that take longer to train, since it will be comparing      All models use a \ufb01xed learning rate schedule with the learn-\r\n               these con\ufb01gurations with those that have been trained for          ing rate decreasing by a factor of 10 twice in equally spaced\r\n               moreiterations.                                                    intervals over the training window. This benchmark is run\r\n                                                      ASystemforMassivelyParallelHyperparameterTuning\r\n                                  LSTM with DropConnect on PTB                         overlap at the end. 
We then trained the best con\ufb01guration\r\n                       70                                                              foundbyASHAformoreepochsandreachedvalidationand\r\n                                                                      PBT              test perplexities of 60.2 and 58.1 respectively before \ufb01ne-\r\n                                                                      ASHA\r\n                       68                                             BOHB             tuning and 58.7 and 56.3 after \ufb01ne-tuning. For reference,\r\n                                                                      SHA              Merity et al. (2018) reported validation and test perplexi-\r\n                       66                                                              ties respectively of 60.7 and 58.8 without \ufb01ne-tuning and\r\n                                                                                       60.0 and 57.3 with \ufb01ne-tuning. This demonstrates the ef-\r\n                       64                                                              fectiveness of ASHA in the large-scale regime for modern\r\n                      Validation Perplexity                                            hyperparameter optimization problems.\r\n                       62\r\n                                                                                            Hyperparameter              Type                 Values\r\n                       60 0     200    400    600    800    1000   1200   1400                learning rate          continuous           log [10,100]\r\n                                           Duration (Minutes)                                 dropout (rnn)          continuous            [0.15,0.35]\r\n                Figure 9. Modern LSTMbenchmarkwithDropConnect(Merity                         dropout (input)         continuous             [0.3,0.5]\r\n                et al., 2018) using 16 GPUs. 
The average across 5 trials is shown,       dropout (embedding)         continuous            [0.05,0.2]\r\n                with dashed lines indicating min/max ranges.                                dropout (output)         continuous             [0.3,0.5]\r\n                                                                                         dropout (dropconnect)       continuous             [0.4,0.6]\r\n                                                                                              weight decay           continuous      log [0.5e \u2212 6,2e \u2212 6]\r\n                on the SVHN dataset (Netzer et al., 2011) following Ser-                        batch size             discrete            [15,20,25]\r\n                manet et al. (2012) to create the train, validation, and test                   time steps             discrete            [65,70,75]\r\n                splits.                                                                Table 2. Hyperparameters for 16 GPU near state-of-the-art LSTM\r\n                A.5    Experimental Setup for Neural Architecture                      task.\r\n                       Search Benchmarks                                               A.7    Experimental Setup for Large-Scale Benchmarks\r\n                For NAS benchmarks evaluated in Section 4.3, we used\r\n                the same search space as that considered by ? for de-                        Hyperparameter            Type             Values\r\n                signing CNN and RNN cells. Following ?, we sample                               batch size            discrete          [10,80]\r\n                architectures from the associated search space randomly                       # of time steps         discrete          [10,80]\r\n                and train them using the same hyperparameter settings                       # of hidden nodes         discrete        [200,1500]\r\n                as that used by ? in the evaluation stage. 
We refer the                        learning rate        continuous      log [0.01,100.]\r\n                reader to the following code repository for more details:                       decay rate          continuous        [0.01,0.99]\r\n                https://github.com/liamcli/darts_asha.                                        decay epochs            discrete           [1,10]\r\n                                                                                              clip gradients        continuous           [1,10]\r\n                A.6    TuningModernLSTMArchitectures                                       dropout probability      continuous          [0.1,1.]\r\n                Asafollowuptotheexperiment in Section 4.3, we consider                      weight init range       continuous       log [0.001,1]\r\n                a search space for language modeling that is able to achieve                    Table 3. Hyperparameters for PTB LSTM task.\r\n                near state-of-the-art performance. Our starting point was\r\n                the work of Merity et al. (2018), which introduced a near              The hyperparameters for the LSTM tuning task compar-\r\n                state-of-the-art LSTM architecture with a more effective               ing ASHAtoVizieronthePennTreeBank(PTB)dataset\r\n                regularization scheme called DropConenct. We constructed               presented in Section 4.4 is shown in Table 3. Note that\r\n                asearchspacearoundtheircon\ufb01gurationandranASHAand                       all hyperparameters are tuned on a linear scale and sam-\r\n                PBT,eachwith16GPUSononep2.16xlargeinstance                             pled uniform over the speci\ufb01ed range. The inputs to the\r\n                on AWS. The hyperparameters that we considered along                   LSTMlayer are embeddings of the words in a sequence.\r\n                with their associated ranges are shown in Table 2.          
           Thenumberofhiddennodeshyperparameter refers to the\r\n                Then, for ASHA, SHA, and BOHB we used \u03b7 = 4, r = 1                     number of nodes in the LSTM. The learning rate is decayed\r\n                epoch, R = 256 epochs, and s = 0. For PBT, we use a                    by the decay rate after each interval of decay steps. Finally,\r\n                population size to 20, a maximum resource of 256 epochs,               the weight initialization range indicates the upper bound of\r\n                and perform explore/exploit every 8 epochs using the same              the uniform distribution used to initialize all weights. The\r\n                settings as the previous experiments.                                  other hyperparameters have their standard interpretations\r\n                                                                                       for neural networks. The default training (929k words) and\r\n                Figure 9 shows that while PBT performs better initially,               test (82k words) splits for PTB are used for training and\r\n                ASHAsooncatches up and \ufb01nds a better \ufb01nal con\ufb01gura-                    evaluation (Marcus et al., 1993). 
We de\ufb01ne resources as\r\n                tion; in fact, the min/max ranges for ASHA and PBT do not              the number of training records, which translates into the\r\n                      ASystemforMassivelyParallelHyperparameterTuning\r\n       number of training iterations after accounting for certain\r\n       hyperparameters.\r\n", "award": [], "sourceid": 94, "authors": [{"given_name": "Liam", "family_name": "Li", "institution": "Carnegie Mellon University"}, {"given_name": "Kevin", "family_name": "Jamieson", "institution": "U Washington"}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": "Google Research"}, {"given_name": "Ekaterina", "family_name": "Gonina", "institution": "Google"}, {"given_name": "Jonathan", "family_name": "Ben-tzur", "institution": "Determined AI"}, {"given_name": "Moritz", "family_name": "Hardt", "institution": "UC Berkeley"}, {"given_name": "Benjamin", "family_name": "Recht", "institution": "UC Berkeley"}, {"given_name": "Ameet", "family_name": "Talwalkar", "institution": "CMU"}]}