{"title": "Parmac: Distributed Optimisation Of Nested Functions, With Application To Learning Binary Autoencoders", "book": "Proceedings of Machine Learning and Systems", "page_first": 276, "page_last": 288, "abstract": "Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such \"nested\" functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate between training single layers and updating the coordinates. It has the advantage that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradients, so it works with nondifferentiable layers. We describe ParMAC, a distributed-computation model for MAC. This trains on a dataset distributed across machines while limiting the amount of communication so it does not obliterate the benefit of parallelism. ParMAC works on a cluster of machines with a circular topology and alternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communicated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data shuffling, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC and its parallel speedup, and implement ParMAC using MPI to learn binary autoencoders for fast image retrieval, achieving nearly perfect speedups in a 128-processor cluster with a training set of 100 million images.", "full_text": "PARMAC:DISTRIBUTEDOPTIMISATION OF NESTED FUNCTIONS,\r\n                                     WITHAPPLICATION TO LEARNING BINARY AUTOENCODERS\r\n                                                                                \u00b4                         \u02dc \u00b4   1                           1\r\n                                                                    Miguel A. Carreira-Perpinan                     MehdiAlizadeh\r\n                                                                                               ABSTRACT\r\n                           Many powerful machine learning models are based on the composition of multiple processing layers, such as\r\n                           deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such\r\n                           \u201cnested\u201d functions is the method of auxiliary coordinates (MAC). MAC introduces an auxiliary coordinate for\r\n                           each data point in order to decouple the nested model into independentsubmodels. This decomposesthe optimi-\r\n                           sation into steps that alternate between training single layers and updating the coordinates. It has the advantage\r\n                           that it reuses existing single-layer algorithms, introduces parallelism, and does not need to use chain-rule gradi-\r\n                           ents, so it works with nondifferentiablelayers. We describe ParMAC, a distributed-computationmodelforMAC.\r\n                           This trains on a dataset distributed across machines while limiting the amount of communication so it does not\r\n                           obliterate the bene\ufb01t of parallelism. ParMAC works on a cluster of machines with a circular topology and al-\r\n                           ternates two steps until convergence: one step trains the submodels in parallel using stochastic updates, and the\r\n                           other trains the coordinates in parallel. Only submodel parameters, no data or coordinates, are ever communi-\r\n                           cated between machines. ParMAC exhibits high parallelism, low communication overhead, and facilitates data\r\n                           shuf\ufb02ing, load balancing, fault tolerance and streaming data processing. We study the convergence of ParMAC\r\n                           anditsparallelspeedup,andimplementParMACusingMPItolearnbinaryautoencodersforfastimageretrieval,\r\n                           achieving nearly perfect speedups in a 128-processor cluster with a training set of 100 million images.\r\n                    1     INTRODUCTION                                                                      imately equal. Faults become more frequent as the num-\r\n                                                                                                            ber of machines increases, particularly if they are inexpen-\r\n                    Serial computing has reached a plateau and parallel, dis-                               sive machines. Machines may be heterogeneous and dif-\r\n                    tributed architectures are becoming widely available, from                              fer in CPU and memory; this is the case with initiatives\r\n                    machines with a few cores to cloud computing with 1000s                                 such as SETI@home (Anderson et al., 2002), which may\r\n                    of machines. The combination of powerful nested mod-                                    become an important source of distributed computation in\r\n                    els with large datasets is a key ingredient to solve dif\ufb01cult                           the future. Big data applications have additional restric-\r\n                    problems in machine learning, computer vision and other                                 tions. The size of the data means it cannot be stored on\r\n                    areas, and it underlies recent successes in deep learning                               a single machine, so distributed-memory architectures are\r\n                    (Hinton et al., 2012; Le et al., 2012; Dean et al., 2012).                              necessary. Sending data between machines is prohibitive\r\n                    Unfortunately, parallel computation is not easy, and many                               becauseofthesize ofthe data and the highcommunication\r\n                    good serial algorithms do not parallelise well. The cost of                             costs. In some applications, more data is collected than can\r\n                    communicating (through the memory hierarchy or a net-                                   be stored, so data must be regularly discarded. In others,\r\n                    work) greatly exceeds the cost of computing, both in time                               such as sensor networks, limited battery life and computa-\r\n                    and energy, and will continue to do so for the foresee-                                 tional power imply that data must be processed locally.\r\n                    able future. Thus, good parallel algorithms must minimise\r\n                    communication and maximise computation per machine,                                     In this paper, we focus on machine learning models of the\r\n                                                                                                            form y = F              (: : : F (F (x)):::), i.e., consisting of a\r\n                    while creating suf\ufb01ciently many subproblems (ideally in-                                                  K+1            2    1\r\n                    dependent) to bene\ufb01t from as many machines as possible.                                 nested mapping from the input x to the output y. Such\r\n                    The load (in runtime) on each machine should be approx-                                 nestedmodelsinvolvemultipleparameterisedlayersofpro-\r\n                                                                                                            cessing and include deep neural nets, cascades for object\r\n                         1EECS, School of Engineering, University of California,\r\n                                                                            \u00b4                               recognition in computer vision or for phoneme classi\ufb01ca-\r\n                    Merced, USA. Correspondence to: Miguel A. Carreira-Perpin\u02dca\u00b4n                           tion in speech processing, wrapper approaches to classi\ufb01-\r\n                    <mcarreira-perpinan@ucmerced.edu>.                                                      cation or regression, and various combinations of feature\r\n                    Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA,                            extraction/learning and preprocessing prior to some learn-\r\n                    2019. Copyright 2019 by the author(s).                                                  ing task. Nested and hierarchical models are ubiquitous\r\n                                                ParMAC:DistributedOptimisationof Nested Functions\r\n              in machine learning because they provide a way to con-         entdescent(SGD)(Bottou,2010),coordinatedescent(CD)\r\n              struct complexmodelsbythecompositionofsimplelayers.            (Wright, 2016) or the alternating direction method of mul-\r\n              However,training nested models is dif\ufb01cult even in the se-     tipliers (ADMM) (Boyd et al., 2011). This has resulted in\r\n              rial case because function composition generally produces      several variations of parallel SGD (Bertsekas, 2011;Zinke-\r\n              nonconvex functions, which makes gradient-based optimi-        vich et al., 2010; Niu et al., 2011), parallel CD (Bradley\r\n              sation dif\ufb01cult and slow, and sometimes inapplicable (e.g.     et al., 2011; Richta\u00b4rik & Taka\u00b4c\u02c7, 2013; Liu & Wright, 2015)\r\n              with nonsmoothor discrete layers).                             and parallel ADMM (Boyd et al., 2011; Ouyang et al.,\r\n              Our starting point is a recently proposed technique to         2013;Zhang&Kwok,2014).\r\n              train nested models, the method of auxiliary coordinates       Little work has addressed nonconvex models. Most of it\r\n              (MAC) (Carreira-Perpin\u02dca\u00b4n & Wang, 2012; 2014).       This     has focused on deep nets (Dean et al., 2012; Le et al.,\r\n              reformulates the optimisation into an iterative procedure      2012).   For example, Google\u2019s DistBelief (Dean et al.,\r\n              that alternates training submodels independently with co-      2012) uses asynchronous parallel SGD (with gradients for\r\n              ordinating them. It introduces signi\ufb01cant model and data       the full model computed with backpropagation)to achieve\r\n              parallelism, can often train the submodels using exist-        data parallelism, and some form of model parallelism. The\r\n              ing algorithms, and has convergence guarantees with dif-       latter is achieved by carefully partitioning the neural net\r\n              ferentiable functions to a local stationary point, while it    into pieces and allocating them to machines to compute\r\n              also applies with nondifferentiable or even discrete lay-      gradients.  This is dif\ufb01cult to do and requires a careful\r\n              ers, such as binary autoencoders (Carreira-Perpin\u02dca\u00b4n &        matchoftheneuralnetstructure(numberoflayersandhid-\r\n              Raziperchikolaei, 2015). MAC has been applied to various       den units, connectivity, etc.) to the target hardware. Also,\r\n              nested models (Carreira-Perpin\u02dca\u00b4n & Wang, 2014; Wang          parallel SGD can diverge with nonconvex models, which\r\n              &Carreira-Perpin\u02dca\u00b4n, 2014; Carreira-Perpin\u02dca\u00b4n & Raziper-     requires heuristics to make sure we average replica mod-\r\n              chikolaei, 2015; Raziperchikolaei & Carreira-Perpin\u02dca\u00b4n,       els that are close in parameter space and thus associated\r\n              2016;Carreira-Perpin\u02dca\u00b4n&Vladymyrov,2015),andseveral           with the same optimum. Although this has managed to\r\n              variations of it have been proposed (e.g. Lee et al., 2015;    train huge nets on huge datasets by using tens of thou-\r\n              Taylor et al., 2016; Jaderberg et al., 2017; Askari et al.,    sandsofCPUcores,thespeedupsachievedwereverymod-\r\n              2018; Ororbia et al., 2018). However, the original papers      est. Other work has used similar techniques but for GPUs\r\n              proposing MAC (Carreira-Perpin\u02dca\u00b4n & Wang, 2012; 2014)         (Coatesetal., 2013;Seide et al., 2014). At present, Tensor-\r\n              did not address how to run MAC on a distributed com-           Flow does data parallelism automatically, but more com-\r\n              puting architecture, where communication between ma-           plex formsofparallelism mustbe programmedby theuser.\r\n              chinesisfarcostlierthancomputation. Thispaperproposes          Finally, there also exist speci\ufb01c approximation techniques\r\n              ParMAC, a parallel, distributed framework to learn nested      for certain types of large-scale machine learning prob-\r\n              models using MAC, analyses its parallel speedup and con-       lems,suchasspectralproblems,usingtheNystro\u00a8mformula\r\n              vergence,implementsit in MPIfortheproblemoflearning            or other landmark-based methods (Williams & Seeger,\r\n              binaryautoencoders,anddemonstratesitsabilitytotrainon          2001; Bengio et al., 2004; Drineas & Mahoney, 2005;\r\n              large datasets and achieve large speedups on a distributed     Talwalkar et al., 2008; Vladymyrov & Carreira-Perpin\u02dca\u00b4n,\r\n              cluster.                                                       2013;2016).\r\n              2    RELATED WORK                                              ParMAC (and MAC) is speci\ufb01cally designed for nested\r\n                                                                             models, which are typically nonconvex and include deep\r\n              Distributed optimisation and large-scale machine learning      nets and many other models, some of which have nondif-\r\n              havebeensteadilygaininginterest in recent years, with the      ferentiable layers. As we describe below, ParMAC has the\r\n              development of parallel computation abstractions tailored      advantages of being simple and relatively independent of\r\n              (or applicable to) to machine learning, such as Spark (Za-     the target hardware, while achieving high speedups.\r\n              haria et al., 2010), GraphLab (Low et al., 2012), Petuum\r\n              (Xing et al., 2015) or TensorFlow (Abadi et al., 2015),        3    OPTIMISING NESTED MODELS USING\r\n              which have the goal of making cloud computing easily                AUXILIARY COORDINATES (MAC)\r\n              available to train machine learning models. Most work has\r\n              centred on convex optimisation, particularly when the ob-      Many optimisation problems in machine learning involve\r\n              jective functionhastheformofempiricalriskminimisation          mathematically\u201cnested\u201dfunctionsoftheformF(x;W) =\r\n              (data \ufb01tting term plus regulariser) (Cevher et al., 2014).     F     (: : : F (F (x;W );W ):::;W           ) with param-\r\n                                                                               K+1       2   1       1     2         K+1\r\n              This includes many important models in machine learn-          eters W, such as deep nets. Such problems are tradition-\r\n              ing, such as linear regression, LASSO, logistic regression     ally optimised using methodsbased on gradients computed\r\n              orSVMs. Suchworkistypicallybasedonstochasticgradi-             using the chain rule. However, such gradients may some-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ParMAC:DistributedOptimisationof Nested Functions\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               times be inconvenientto use, or may not exist (e.g. if some                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         lowing equality-constrained problem:\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               of the layers are nondifferentiable, as with binary autoen-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            N                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             \u001a                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           L\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               coders). Also, they are hard to parallelise, because of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   X                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2                                                                                                                                                                                                                                                           z =h(x )\u2208{0;1} ;\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         min                                                                                                                                                                                                                                                                                                          kx \u2212f(z )k s.t.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             n                                                                                                                                                                                                                                                                                                  n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          (2)\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               inherent sequentiality in the chain rule.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        The method of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   h;f;Z                                                                                                                                                                                                                                                                                                                                                                                                            n                                                                                                                                                                                                                                                                  n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             n=1;:::;N:\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               auxiliary coordinates (MAC) (Carreira-Perpin\u02dca\u00b4n & Wang,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    n=1\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               2012;2014)isdesignedtooptimisenestedmodelswithout                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Notethecodesarebinary. We nowapplyapenaltymethod\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               using chain-rule gradients while introducing parallelism.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           to bring the equality constraints into the objective func-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The idea is to break nested functional relationships judi-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          tion. The best method generally uses the augmented La-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ciously by introducing new variables (the auxiliary coor-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           grangian (Nocedal & Wright, 2006; Carreira-Perpin\u02dca\u00b4n &\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               dinates) as equality constraints.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        These are then solved                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Wang, 2012; 2014), but for simplicity of notation we de-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               by optimising a penalised function using alternating opti-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          scribe here the quadratic penalty, which is identical ex-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               misation over the original parameters (which we call the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            cept it lacks the Lagrange multiplier parameters. This min-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Wstep) and over the coordinates (which we call the Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                imises the following objective while progressively increas-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               step). The result is a coordination-minimisation (CM) al-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ing \u00b5, so the constraints are eventually satis\ufb01ed:\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               gorithm: the minimisation (W) step updates the param-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           N\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               eters by splitting the nested model into independent sub-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         X\u0010                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 2\u0011\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   E (h;f;Z;\u00b5) =                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           kx \u2212f(z )k +\u00b5kz \u2212h(x )k\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               models and training them using existing algorithms, and                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Q                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     n                                                                                                                                                                                                                                                                  n                                                                                                                                                                                                                                                                                                                                                                                                        n                                                                                                                                                                                                                                                                                         n\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              n=1\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               the coordination(Z) step ensuresthat correspondinginputs\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               and outputs of submodels eventually match. MAC algo-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                s.t. z                                                                                                                                                                                                                                           \u2208 {0;1}L, n = 1;:::;N. Finally, we apply alter-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       n\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               rithms have been developed for several nested models so                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             nating optimisation over Z and W = (h;f). This gives the\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               far: deep nets (Carreira-Perpin\u02dca\u00b4n & Wang, 2014), low-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             following steps:\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               dimensional SVMs (Wang & Carreira-Perpin\u02dca\u00b4n, 2014),                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Coordinates Over Z for \ufb01xed (h;f), this is a binary op-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               binary autoencoders (Carreira-Perpin\u02dca\u00b4n & Raziperchiko-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  timisation on NL variables, but it separates into N\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               laei, 2015), af\ufb01nity-based loss functions for binary hashing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              independent optimisations each on only L variables,\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               (Raziperchikolaei & Carreira-Perpin\u02dca\u00b4n, 2016) and para-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               metric nonlinear embeddings (Carreira-Perpin\u02dca\u00b4n & Vla-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   withtheformofabinaryproximaloperator(wherewe\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         omittheindexn): min kx\u2212f(z)k2+\u00b5kz\u2212h(x)k2\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               dymyrov, 2015). Although this paper proposes and analy-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     z\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         s.t. z \u2208 {0;1}L. This can be solved approximately by\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               ses ParMACin general, our MPI implementationis for the\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               particular case of binary autoencoders. These de\ufb01ne a non-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                alternating optimisation over bits.\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               convex nondifferentiable problem, yet its MAC algorithm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Submodels Over W = (h;f) for \ufb01xed Z, we obtain L +\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               is simple and effective. We brie\ufb02y describe it next.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      D independent problems: for each of the L single-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               3.1                                                                                                                                                                                       MACAlgorithmforBinaryAutoencoders                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               bit hash functions (which try to predict Z optimally\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         from X), each solvable by \ufb01tting a linear SVM; and\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Abinaryautoencoder(BA)isausualautoencoderbutwith                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          for each of the D linear decoders in f (which try to\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               a binary code layer. It consists of an encoder h(x) that                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  reconstruct X optimally from Z), each a linear least-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               maps a real vector x \u2208 RD onto a binary code vector                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       squares problem.\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               with L < D bits, z \u2208 {0;1}L, and a linear decoder f(z)\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               whichmapszbacktoRD inanefforttoreconstructx. We                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Theusermustchooseascheduleforthe penaltyparameter\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               will call h a binary hash function (see later). Let us write                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        \u00b5(sequence of values 0 < \u00b51 < \u00b7\u00b7\u00b7 < \u00b5\u221e). This should\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               h(x) =   (Ax) (A includes a bias by having an extra di-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             increase slowly enough that the binary codes can change\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               mension x = 1 for each x) where A \u2208 RL\u00d7(D+1) and                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    considerably and explore better solutions before the con-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              straints are satis\ufb01ed and the algorithm stops. With BAs,\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                (t) is a step function applied elementwise, i.e.,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    (t) = 1                                                                                                                                                                                                                                                                                                                                                                                                       MAC stops for a \ufb01nite value of \u00b5, which occurs when-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               if t \u2265 0 and   (t) = 0 otherwise. Given a dataset of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ever Z does not change compared to the previous Z step.\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               D-dimensional patterns X = (x ;:::;x ), our objective\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         1                                                                                                                                                                                                                                                                                N                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This gives a practical stopping criterion. Carreira-Perpin\u02dca\u00b4n\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               function, which involves the nested model y = f(h(x)), is\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               the usual least-squares reconstruction error:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       &Raziperchikolaei (2015) give proofs of these statements\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                N                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  and further details about the algorithm. Fig. 1 gives the\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 E (h;f)= Xkx \u2212f(h(x ))k2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  (1)                                                                                                                                                                                                                                                    MACalgorithmforBAs.\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              BA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       n                                                                                                                                                                                                                                                                                                                                                               n\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                n=1\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Optimising this nonconvex, nonsmooth function is NP-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The BA was proposed as a way to learn good binary\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               hard(Carreira-Perpin\u02dca\u00b4n& Raziperchikolaei,2015). Where                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             hash functions for fast, approximate information retrieval\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               the gradients do exist wrt A they are zero, so optimisa-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            (Carreira-Perpin\u02dca\u00b4n & Raziperchikolaei, 2015).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Binary\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               tion of h using chain-rule gradients does not apply. We                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             hashing (Grauman & Fergus, 2013) has emerged in recent\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               introduceas auxiliary coordinatesthe outputs of h, i.e., the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        years as an effective way to do fast, approximate nearest-\r\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                               codes for each of the N input patterns, and obtain the fol-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         neighbour searches in image databases. The real-valued,\r\n                                                         ParMAC:DistributedOptimisationof Nested Functions\r\n                 high-dimensional image vectors are mapped onto a binary                   In the rest of the paper, some readers may \ufb01nd this analogy\r\n                 space with L bits and the search is performed there us-                   useful and think of EM for Gaussian mixtures instead of\r\n                 ing Hammingdistancesata vastly faster speed and smaller                   MAC, replacing \u201csubmodels\u201d and \u201cauxiliary coordinates\u201d\r\n                 memory (e.g. N = 109 points with D = 500 take 2 TB,                       in MAC with \u201cGaussians\u201d and \u201cposterior probabilities\u201d in\r\n                 but only 8 GB using L = 64 bits, which easily \ufb01ts in                      EM,respectively.\r\n                 RAM).AsshownbyCarreira-Perpin\u02dca\u00b4n&Raziperchikolaei\r\n                 (2015),training BAs with MAC beats approximateoptimi-                     4    PARMAC: APARALLEL, DISTRIBUTED\r\n                 sation approaches such as relaxing the codes or the step                       COMPUTATION MODEL FOR MAC\r\n                 function in the encoder, and yields state-of-the-art binary\r\n                 hashfunctionshinunsupervisedproblems,improvingover                        A speci\ufb01c MAC algorithm depends on the model and\r\n                 established approachessuch as iterative quantisation (ITQ)                objective function and on how the auxiliary coordinates\r\n                 (Gong et al., 2013). We focus mostly on linear hash func-                 are introduced.      We can achieve steps that are closed-\r\n                 tions because these are, by far, the most used type of hash               form, convex, nonconvex, binary, or others. However, we\r\n                 functions in the literature of binary hashing, due to the fact            will assume the following always hold: (1) Separability\r\n                 that computing the binary codes for a test image must be                  over data points. In the Z step, the N subproblems for\r\n                 fast at run time.                                                         z ;:::;z      are independent, one per data point.             Each\r\n                                                                                            1         N\r\n                                                                                           z step depends on the current model. (2) Separability\r\n                 3.2   MACinGeneral                                                         n\r\n                                                                                           over submodels. In the W step, there are M indepen-\r\n                 With a nested function with K layers, we can introduce                    dent submodels, where M depends on the problem. For\r\n                 auxiliary coordinates at each layer. For example, with a                  example, M is the number of hidden units in a deep net,\r\n                 neural net, this decouples the weight vector of every hid-                or the number of hash functions and linear decoders in a\r\n                 den unit in the W step, which can be solved as a logistic                 BA. Each submodel depends on all the data and coordi-\r\n                 regression (see Carreira-Perpin\u02dca\u00b4n & Alizadeh, 2016). For                nates. We now show how to turn this into a distributed,\r\n                 a large net with a large dataset, this affords an enormous                low-communicationParMAC algorithm.\r\n                 potential for parallel computation.                                       ThebasicideainParMACisasfollows. Withlargedatasets\r\n                 3.3   MACandEM                                                            in distributed systems, it is imperative to minimise data\r\n                                                                                           movement over the network because the communication\r\n                 MACisverysimilartoexpectation-maximisation(EM)ata                         time generally far exceeds the computation time in modern\r\n                 conceptuallevel. We brie\ufb02yexplainthishere;seeCarreira-                    architectures. In MAC we have 3 types of data: the origi-\r\n                 Perpin\u02dca\u00b4n (2019b) for more details. EM (McLachlan & Kr-                  nal training data of inputs X and outputs Y, the auxiliary\r\n                 ishnan, 2008)applies generally to many probabilistic mod-                 coordinates Z, and the model parameters (the submodels).\r\n                 els. The resulting algorithm can be very different (e.g. EM               Usually, the latter type is far smaller. In ParMAC, we never\r\n                 for Gaussian mixtures vs EM for hidden Markov models),                    communicate training or coordinate data; each machine\r\n                 but it always alternates two steps that conceptually do the               keeps a disjoint portion of (X;Y;Z) corresponding to a\r\n                 following. TheEstepupdatesinparalleltheposteriorprob-                     subset of the points. Only model parameters are communi-\r\n                 abilities.  This separates over data points and is like the               cated, during the W step, following a circular topology1,\r\n                 Z step in MAC, where the posterior probabilities are the                  whichimplicitly implements a stochastic optimisation. The\r\n                 auxiliary coordinates, and where the step may be in closed                modelparametersarethehashfunctionshandthedecoder\r\n                 formorrequireoptimisation,dependingon themodel. The                       f for BAs, and the weight vector wh of each hidden unit h\r\n                 Mstepupdates in parallel the \u201csubmodels\u201d. For a mixture                   for deep nets. Let us see this in detail (refer to \ufb01g. 2).\r\n                 with M components, these are the M Gaussians (means,                      Assume we have P identical processing machines, each\r\n                 covariances, proportions). This separates over submodels\r\n                 andisliketheWstepinMAC.ForBAs,thesubmodelsare                             with its own memory and CPU, connected through a net-\r\n                 the L encoders (linear SVMs) and the D decoders (linear                   work in a circular unidirectional topology. Each machine\r\n                 regressors); for a neural net, each weight vector of a hid-               stores a subset of the data points and corresponding coor-\r\n                                                                                           dinates (x ;y ;z ) such that the subsets are disjoint and\r\n                 den unit is a submodel (a logistic regressor). For Gaussian                           n    n   n\r\n                 mixtures, the M step can be done exactly in one \u201cepoch\u201d                   their union is the entire data. Before the Z step starts, each\r\n                 because it is a simple average. For MAC, it usually re-                   machine contains all the (just updated) submodels. This\r\n                 quires optimisation, and so multiple epochs. In fact, Par-                means that in the Z step each machine processes its auxil-\r\n                                                                                           iary coordinates {z } independently of all other machines,\r\n                 MACappliestoEMbyusinge = 1epoch: intheW step,                                                   n\r\n                 the Gaussians visit each machine circularly and (their aver-              i.e., no communication occurs. The W step is more sub-\r\n                 ages) are updated on its data; in the Z step, each machine                tle.  At the beginning of the W step, each machine will\r\n                 updates its posterior probabilities.                                          1Wediscuss other topologies in section 4.4.\r\n                                                  ParMAC:DistributedOptimisationof Nested Functions\r\n                    input X       =(x ;:::;x ),L \u2208 N\r\n                            D\u00d7N         1       N                                                      X trainingpoints\r\n                    Initialise Z     =(z ;:::;z ) \u2208 {0;1}LN\r\n                               L\u00d7N        1       N\r\n                    for \u00b5 = 0 < \u00b51 < \u00b7\u00b7\u00b7 < \u00b5\u221e                                                          Z auxiliarycoordinates\r\n                      parforl = 1;:::;L                                      Wstep: h\r\n                        h (\u00b7) \u2190 \ufb01t SVM to (X;Z )                                                       h: RD \u2192{0;1}L,\r\n                         l                        l\u00b7\r\n                      parford = 1;:::;D                                       Wstep: f                 h=(h1;:::;hL)\r\n                        f (\u00b7) \u2190 least-squares \ufb01t to (Z;X )                                                   encoders(hash function)\r\n                         d                                d\u00b7\r\n                      parforn = 1;:::;N                                           Zstep                     L      D\r\n                                                               2                    2                  f: R \u2192R ,\r\n                        z \u2190argmin              L kx \u2212f(z )k +\u00b5kz \u2212h(x )k\r\n                         n            z \u2208{0;1}      n       n           n       n\r\n                                       n                                                               f = (f ;:::;f )\r\n                      if no change in Z and Z = h(X) then stop                                                1       D\r\n                    return h, Z = h(X)                                                                       decoders\r\n               Figure 1. MAC algorithm for binary autoencoders. \u201cparfor\u201d indicates a for loop whose iterations are carried out in parallel. The steps\r\n               over h and f can be run in parallel as well.\r\n               contain all the submodels and its portion of the data and       by having a submodel do e consecutive passes within each\r\n               (just updated) coordinates. Each submodel must have ac-         machine\u2019s data. This reduces the amount of shuf\ufb02ing, but\r\n               cess to the entire data and coordinates in order to update      shouldnotbeaproblemifthedataarerandomlydistributed\r\n               itself and, since the data cannot leave its home machine,       over machines.\r\n               the submodel must go to the data. We achieve this in the\r\n               circular topology with an asynchronous processing, as fol-      4.1   Extensions of ParMAC\r\n               lows. Each machinekeepsaqueueofsubmodelstobepro-\r\n               cessed, and repeatedly performs the following operations:       Data shuf\ufb02ing, which improves the SGD convergence\r\n               extract a submodel from the queue, process it on its data       speed, can be achieved without data movement by access-\r\n               and send it to the machine\u2019s successor (which will insert it    ing the local data in random order at each epoch (within-\r\n               in its queue). If the queue is empty, the machine waits until   machine),andbyrandomisingthecirculartopologyateach\r\n               it is nonempty. The queue of each machine is initialised        epoch(across-machine). Loadbalancingissimplebecause\r\n               with a portionM=P ofsubmodelsassociatedwiththatma-              the workinbothWandZstepsisproportionaltothenum-\r\n               chine (e.g. in \ufb01g. 2, machine 1\u2019s queue contains submodels      ber of data points N. Hence, if the processing power of\r\n                                                                               machine p is proportional to \u03b1      > 0, we allocate to it\r\n               1\u20133, machine 2 submodels 4\u20136, etc.). Each submodel car-                                           p\r\n                                                                               N\u03b1 =(\u03b1 +\u00b7\u00b7\u00b7 + \u03b1 ) data points. Streaming, i.e., dis-\r\n               ries a counter that is initially 1 and increases every time it      p    1            P\r\n               visits a machine. When it reaches P, the submodelhas vis-       carding old data and adding new data during training, can\r\n               ited all machines in sequence and has completed an epoch.       be done by adding/removing data within-machine, or by\r\n               We repeat this for e epochs and, to ensure all machines         adding/removingmachinesandupdatingthecirculartopol-\r\n               have all \ufb01nal submodels before starting the Z step, we run      ogy. Fault tolerance is possible because we can still learn\r\n               a communication-onlyepoche+1(withoutcomputation),               a good model even if we lose the data from a machine that\r\n               wheresubmodelssimplymovefrommachinetomachine.                   fails, and because in the W step we can revert to older\r\n                                                                               copies of the lost submodels residing in other machines.\r\n               Since each submodel is updated as soon as it visits a ma-       SeefurtherdetailsinCarreira-Perpin\u02dca\u00b4n&Alizadeh(2016).\r\n               chine, rather than computing the exact gradient once it has\r\n               visited all machines and then take a step, the W step is re-    4.2   ATheoreticalModeloftheParallelSpeedup\r\n               ally carrying out stochastic steps for each submodel. For\r\n               example, if the update is done by a gradient step, we are       We can estimate the runtime of the W and Z steps as-\r\n               actually implementing stochastic gradient descent (SGD)         suming there are M independent submodels of the same\r\n               where the minibatches are of size N=P (or smaller, if we        size in the W step, using e epochs, on a dataset with\r\n               subdivideamachine\u2019sdataportionintominibatches,which             N training points, distributed over P identical machines\r\n               shouldbetypicallythecase in practice). From this point of       (each with N=P points).       Let tW be the computation\r\n                                                                                                                   r\r\n               view,wecanregardtheWstepasdoingSGDoneachsub-                    time per submodel and data point in the W step, tZ the\r\n                                                                                                                                      r\r\n               model in parallel by having each submodel visit the mini-       computation time per data point in the Z step, and tW\r\n                                                                                                                                         c\r\n               batches in each machine.                                        the communication time per submodel in the W step.\r\n                                                                               Then the runtime of the W and Z steps is TW(P) =\r\n               As described, and as implemented in our experiments, the        \u2308M=P\u2309(tWN +tW)Pe + \u2308M=P\u2309tWP and TZ(P) =\r\n                                                                                         r  P     c                   c\r\n               entire modelparametersarecommunicatede+1timesina                MNtZ,respectively, and the total runtime per iteration is\r\n               MACiteration if running e epochs in the W step. We can              P r\r\n                                                                               T(P) = TW(P)+TZ(P). Hencetheparallelspeedupis\r\n               also run e epochs with only 2 rounds of communication\r\n                                                   ParMAC:DistributedOptimisationof Nested Functions\r\n                                w                               w                               w                                w\r\n                                 h                                h                               h                               h\r\n                          1                              13                              25                              37\r\n                          2                              14                              26                              38\r\n                          3                              15                              27                              39\r\n                          4                              16                              28                               40\r\n                   l      5                              17                              29                              41\r\n                   e\r\n                   d      6                              18                              30                               42\r\n                   o      7                              19                              31                              43\r\n                   M      8                              20                              32                               44\r\n                          9                              21                              33                              45\r\n                         10                              22                              34                               46\r\n                         11                              23                              35                              47\r\n                         12                              24                              36                              48\r\n                            x   y    z                      x    y   z                      x    y   z                      x    y    z\r\n                             n    n   n                      n    n   n                      n    n   n                       n    n   n\r\n                       1                               11                              21                              31\r\n                        2                              12                              22                              32\r\n                       3                               13                              23                              33\r\n                    a   4                              14                              24                              34\r\n                    t\r\n                    a\r\n                    D\r\n                       10                              20                              30                              40\r\n                           Machine1                         Machine2                        Machine3                        Machine4\r\n               Figure 2. ParMAC model with P = 4 machines, M = 12 submodels \u201cwh\u201d and N = 40 data points. Submodels h, h + M, h + 2M\r\n               and h + 3M are copies of submodel h, but only one of them is the most currently updated. At the end of the W step all copies are\r\n               identical.\r\n               (see details in Carreira-Perpin\u02dca\u00b4n, 2019a):                       Eq.(3)alsoshowsthatwecanleavethespeedupunchanged\r\n                                                                                  bytradingoffdatasetsizeandcomputation/communication\r\n                              T(1)              \u03c1   1    MP                       times, as long as one of these holds: NtW and NtZ re-\r\n                    S(P) =           =            \u2308M=P\u2309                   (3)                                                  r           r\r\n                                         1  2                  1                  main constant; or N=tW remains constant; or tW=tW and\r\n                              T(P)      NP +\u03c12P +\u03c11\u2308M=P\u2309M                                                 c                          r    c\r\n                                                                                  tZ=tW remain constant.\r\n                       \u03c1 =tZ=(e+1)tW; \u03c1 =etW=(e+1)tW                      (4)      r   c\r\n                        1     r           c     2      r           c\r\n                                             W Z                 W                In the BA, we have submodels of different size: encoders\r\n                         \u03c1 = \u03c11 +\u03c12 = (etr +tr)=(e+1)tc                   (5)     of size D and decoders of size L < D. We can model\r\n               where \u03c1, \u03c1 and \u03c1 are ratios of computation vs communi-             this by \u201cgrouping\u201d the D decoders into L groups of D=L\r\n                           1      2                                               decoders each, resulting in M = 2L equal-size submod-\r\n               cation, dependent on the optimisation algorithm in the W           els (assuming the ratio of computation and communication\r\n               and Z steps, and on the performace of the distributed sys-         times of decoder vs encoder is L=D < 1).\r\n               temandlibraries (MPI in our implementation).\r\n               Hence, if P \u2264 M and M is divisible by P we have                    4.3   ConvergenceofParMAC\r\n               S(P) = P=(1 + P ) and if P > M we have S(P) =\r\n                                   \u03c1N                                             The only approximation that ParMAC makes to the orig-\r\n               \u03c1M=(\u03c1 + \u03c1 M + P). In practice, typically we have\r\n                       2      1 P     N                                           inal MAC algorithm is using SGD in the W step. Since\r\n               \u03c1 \u226a 1 (because communication dominates computation                 we can guarantee convergence of SGD under certain con-\r\n               in current architectures) and \u03c12N \u226b 1 (large dataset). If          ditions(e.g.Robbins-Monroschedules),wecanrecoverthe\r\n               we take P \u226a \u03c12N, then S(P) \u2248 P if P \u2264 M and                        originalconvergenceguaranteesforMACtoalocalstation-\r\n               S(P)\u2248\u03c1M=(\u03c1 +\u03c1 M)ifP >M.Hence,thespeedupis\r\n                                 2   1 P                                          ary point with differentiable layers (see details in Carreira-\r\n               nearlyperfectif using fewer machinesthansubmodels,and\r\n                                        \u2217                 p                       Perpin\u02dca\u00b4n & Alizadeh, 2016). This convergence guarantee\r\n               otherwise it peaks at S1 = \u03c1M=(\u03c12 + 2         \u03c11M=N) > M\r\n                           \u2217     \u221a                                                is independent of the number of models and processors.\r\n               for P = P =         \u03c11MN > M anddecreases thereafter.\r\n                           1                                                      With nondifferentiable layers, the convergence properties\r\n               Thisaffordsverylargespeedupsforlargedatasetsandlarge\r\n               models. This theoretical speedup matches well our mea-             of MAC(and ParMAC) are not well known. In particular,\r\n               sured ones (see the experiments section), and can be used          for the binary autoencoder the encoding layer is discrete\r\n               to determine optimal values for the number of machines P           and the problem is NP-hard. While convergence guaran-\r\n               to use in practice (subjectto additionalconstraints, e.g. cost     tees are important theoretically, in practical applications\r\n               of the machines).                                                  withlargedatasetsinadistributedsettingonetypicallyruns\r\n                                                   ParMAC:DistributedOptimisationof Nested Functions\r\n               SGDforjust a few epochs, even one or less than one (i.e.,          MPI Bsend blocks until the buffer is copied to the MPI\r\n               we stop SGD before passing through all the data). This             internal memory; after that, the MPI library takes care of\r\n               typically reduces the objective function to a good enough          sending the data.\r\n               value as fast as possible, since each pass over the data is        The code snippet in \ufb01gure 3 shows the main steps of the\r\n               very costly. In our experiments, 1\u20132 epochs in the W step          ParMAC algorithm for the BA. All the functions starting\r\n               makeParMACverysimilartoMACusinganexactstep.                        with MPI are API calls from the MPI library. As with\r\n                                                                                  all MPI programs, we start the code by initialising the\r\n               4.4   Circular vs Parameter-ServerTopologies                       MPIenvironmentand end by \ufb01nalising it. To receive data\r\n                                                                                                             3\r\n               WealsoconsideredimplementingParMAC(intheWstep)                     we use the synchronous , blocking MPI receive function\r\n               using a parameter-server (PS) topology rather than a circu-        MPI Recv. The process calling this blocks until the data\r\n               lar one, but the latter is better. To see this, focus on how a     arrives. To send data we use the buffered blocking version\r\n               single submodel m \u2208 {1;:::;M} is processed (since dif-             of the MPI send functions, MPI Bsend. This requiresthat\r\n               ferent submodels are processed independently and in par-           we allocate enough memory and attach it to the system in\r\n               allel in either topology). With a PS we do parallel SGD on         advance. The process calling MPI Bsend blocks until the\r\n               m,i.e., each worker runs SGD on its own replica of m for           buffer is copied to the MPI internal memory; after that, the\r\n               a while, sends it to the PS, and this broadcasts an \u201caver-         MPI library takes care of sending the data appropriately.\r\n               age\u201d m back to the workers, asynchronously. The circular           The bene\ufb01t of using this version of send is that the pro-\r\n               topology does true SGD directly on m, with no replicas.            grammercansendmessageswithoutworryingaboutwhere\r\n               We can show (Carreira-Perpin\u02dca\u00b4n, 2019a) the runtime per           they are buffered, so the code is simpler.\r\n               iteration using a PS is equal to that of the circular topology\r\n               only if the server can communicate with P workers simul-           Distributed-memorycluster          WeusedGeneralComput-\r\n               taneously (rather than sequentially), otherwise it is slower.      ing NodesfromtheUCSDTritonSharedComputingClus-\r\n               The reason is the PS has more communication. Also im-              ter (TSCC), available to the public for a fee. Each node\r\n               portantly, parallel SGD converges more slowly than true            contains 2 8-core Intel Xeon E5-2670 processors (16 cores\r\n               SGDandis dif\ufb01cult to apply if the W step is nonconvex.             in total), 64GB RAM (4GB/processor) and a 500GB hard\r\n               Finally, the PS needs extra machine(s) to act as parameter         drive. The nodes are connected through a 10GbE network.\r\n               server(s). Considering now all M submodels, the funda-             We used up to P = 128 processors. Carreira-Perpin\u02dca\u00b4n &\r\n               mental difference between both topologies is in how they           Alizadeh (2016)give detailed specs as well as experiments\r\n               employtheavailable parallelism: the circular topology up-          in a shared-memorymachine.\r\n               dates submodels directly and communicates them, while\r\n               the PS updates replicas (of each submodel), communicates           Datasets     We have used 3 well-known colour image re-\r\n               themandaveragesthem.                                               trieval benchmarks. (1) CIFAR (Krizhevsky, 2009) con-\r\n               It may be possible to use other topologies that do true SGD        tains 60000 images (N = 50000training and 10000test),\r\n               onthesubmodelsbutwedidnotexplorethem.                              representedbyD =320GISTfeatures. (2)SIFT-1M(Je\u00b4gou\r\n                                                                                      3Note that the word \u201csynchronous\u201d here does not refer to\r\n               5    EXPERIMENTS2                                                  how we process the different submodels, which as we stated ear-\r\n                                                                                  lier are not synchronised to start or end at speci\ufb01c clock ticks,\r\n               MPI implementation of ParMAC for BAs. We have                      hence are processed asynchronously with respect to each other.\r\n               used C/C++, the GSL and BLAS libraries for mathemat-               The word \u201csynchronous\u201d here refers to MPI\u2019s handling of an\r\n               ical operations, and the Message Passing Interface (MPI)           individual receive function.  This can be done either by call-\r\n               (Gropp et al., 1999) for interprocess communication. MPI           ing MPI Recv, which will block until the data is received (syn-\r\n               is a widely used framework for high-performance parallel           chronous blocking function), as in the pseudocode in \ufb01g. 3; or\r\n                                                                                  bycallingMPI Irecv(asynchronous nonblockingfunction)fol-\r\n               computing, available in multiple platforms. It is particu-         lowedbyaMPI Wait,whichwillblockuntilthedataisreceived,\r\n               larly suitable for ParMAC because of its support of the            like this:\r\n               SPMD (single program, multiple data) model. In MPI,                    MPI Irecv(receivebuffer, commbuffsize,\r\n               processesindifferentmachinescommunicatethroughmes-                       MPI CHAR, MPI ANY SOURCE, MODEL MSG TAG,\r\n               sages. Toreceivedata,weusethesynchronousblockingre-                      MPI COMM WORLD, &recvRequest);\r\n               ceive function MPI Recv; the process calling this blocks               MPI Wait(&recvRequest, &recvStatus);\r\n               until the data arrives. To send data we use the buffered           Both options are equivalent for our purpose, which is to en-\r\n               blocking send function MPI Bsend. We allocate enough               sure we receive the submodel before starting to train it.  The\r\n               memory and attach it to the system. The process calling            MPI Irecv/MPI Wait option is slightly more \ufb02exible in that\r\n                                                                                  it would allow us to do some additional processing between\r\n                   2Our implementation is available in the authors\u2019 webpage.      MPI IRecv and MPI Wait and possibly achieve some perfor-\r\n                                                                                  mance gain.\r\n                                                                     ParMAC:DistributedOptimisationof Nested Functions\r\n                    MPI Init(&argc, &argv);                      // initialise MPI execution environment       (forvaryingvaluesofR);incaseoftieddistances,weplace\r\n                    MPI Comm rank(MPI COMM WORLD, &mpirank);\r\n                    MPI Comm size(MPI COMM WORLD, &mpisize);                                                   the query as top rank. All these measures are computed of-\r\n                    loadsettings();                                       // \u00b5, epochs, dataset path, etc.     \ufb02ine once the BA is trained. Carreira-Perpin\u02dca\u00b4n & Alizadeh\r\n                    loaddatasets();                           // datasets and initial auxiliary coordinates\r\n                    initializelayers();                                     // initialise f, h and Z steps     (2016)giveadditional measures and experiments.\r\n                    // allocate big enough buffer for MPI Bsend\r\n                    MPI Pack size(commbuffsize, MPI CHAR, MPI COMM WORLD,\r\n                      &mpi attach buff size);                                                                  Models and their parameters                     We use BAs with linear\r\n                    mpi attach buff = malloc(totalsubmodelcount*                                               encoders(linearSVM)exceptwithSIFT-1B,wherewealso\r\n                      (mpi attach buff size+MPI BSEND OVERHEAD));\r\n                    MPI Buffer attach(mpi attach buff, mpi attach buff size);                                  use kernel SVMs. The decoder is always linear. We set\r\n                    for (iter=1 to length(\u00b5)) {                                                                L=16bits(hashfunctions)for CIFAR and SIFT-1M and\r\n                      // begin W-step                                                                          L = 64 bits for SIFT-1B. We initialise the binary codes\r\n                      visitedsubmodels = 0;                                                                    from truncated PCA ran on a subset of the training set\r\n                      // each process visits all the submodels, epochs + 1 times\r\n                      while (visitedsubmodels <= totalsubmodelcount*epochs) {                                  (small enough that it \ufb01ts in one processor). To train the\r\n                        // stepcounterindicates howfar trained each submodel is                                encoder (L SVMs) and decoder (D linear mappings) with\r\n                        if (stepcounter > 0) {                       // not 1st submodel? wait to receive\r\n                          // MPI Recvblocks until requested data is available                                  stochastic optimisation, we used the SGD code from (Bot-\r\n                          MPI Recv(receivebuffer, commbuffsize,                                                tou&Bousquet,2008),usingitsdefaultparametersettings.\r\n                            MPI CHAR, MPI ANY SOURCE, MODEL MSG TAG,\r\n                                              MPI COMM WORLD, &recvStatus);                                    The SGD step size is tuned automatically in each iteration\r\n                          savesubmodel(receivebuffer);                                                         by examining the \ufb01rst 1000 datapoints. We use a multi-\r\n                        }\r\n                        if (stepcounter < epochs*mpisize) {                           // not in last round     plicative \u00b5 schedule \u00b5i = \u00b50ai where the initial value \u00b50\r\n                          switch(submodeltype)                       // train submodel according to type       and the factor a > 1 are tuned of\ufb02ine in a trial run using\r\n                          case \u2019SVM\u2019: HtrainSGD();\r\n                          case \u2019linlayer\u2019: FtrainSGD();                                                        a small subset of the data. For CIFAR we use \u00b50 = 0.005\r\n                        }                                                                                      and a = 1.2 over 26 iterations (i = 0;:::;25). For SIFT-\r\n                        if (stepcounter < (ringepochs+1)*mpisize) {\r\n                          // pick the successor process from the lookup table                                  1Mand SIFT-1B we use \u00b50 = 10\u22124 and a = 2 over 10\r\n                          successor = next in lookuptable();                                                   iterations.\r\n                          loadsubmodel(sendbuffer);\r\n                          MPI Bsend(sendbuffer, taskbufsize*sizeof(double),\r\n                            MPI CHAR, successor, MODEL MSG TAG, MPI COMM WORLD);                               5.1     Effect of Stochastic Steps in the W Step\r\n                        }\r\n                        visitedsubmodels++;\r\n                      }                                                                                        Fig. 4 shows the effect on the precision on CIFAR of vary-\r\n                      // end W-step                                                                            ing the number of epochs within the W step and shuf-\r\n                      // begin Z-step                                                                          \ufb02ing the data as a function of the number of processors P.\r\n                      updateZ();                                        // optimise auxiliary coordinates      As the number of epochs increases, the W step is solved\r\n                      // end Z-step\r\n                    }                                                                                          more exactly (8 epochs is practically exact in this data).\r\n                    MPI Buffer detach(&mpi attach buff, &mpi attach buff size);                                Fewer epochs, even just one, cause only a small degrada-\r\n                    free(mpi attach buff);                                  // free the allocated memory       tion. The reason is that, although these are relatively small\r\n                    MPI Finalize();                             // terminate MPI execution environment         datasets, theycontainsuf\ufb01cientredundancethatfewepochs\r\n                                                                                                               aresuf\ufb01cienttodecreasetheerrorconsiderably. Thisisalso\r\n                    Figure 3. Binary autoencoder ParMAC algorithm (fragment),                                  helped by the accumulated effect of epochs over MAC it-\r\n                    showing important MPI calls.                                                               erations. Running more epochs increases the runtime and\r\n                                                                                                               lowers the parallel speedup in this particular model, be-\r\n                    et al., 2011a) contains N = 106 training and 104 test im-                                  cause we use few bits (L = 16) and therefore few submod-\r\n                    ages, each represented by D = 128 SIFT features. (3)                                       els (M = 2L = 32) compared to the number of machines\r\n                    SIFT-1B (Je\u00b4gou et al., 2011a) has three subsets: 109 base                                 (up to P = 128), so the W step has less parallelism. The\r\n                    vectors where the search is performed, N = 108 learning                                    positiveeffectofdatashuf\ufb02ingintheWstepisclear: shuf-\r\n                    vectors used to train the model and 104 query vectors.                                     \ufb02ing generally increases the precision with no increase in\r\n                                                                                                               runtime.\r\n                    Performancemeasures RegardingthequalityoftheBA                                             5.2     Speedup\r\n                    and hash functions learnt, we report the retrieval precision\r\n                    (%) in the test set using as true neighbours the K near-                                   ThefundamentaladvantageofParMACanddistributedop-\r\n                    est images in Euclidean distance in the original space, and                                timisation in general is the ability to train on datasets that\r\n                    as retrieved neighbours in the binary space we use the k                                   donot\ufb01t in a single machine, and the reduction in runtime\r\n                    nearest images in Hamming distance. We set (K;k) =                                         because of parallel processing. Fig. 5 shows the \u201cstrong\r\n                    (1000,100) for CIFAR and (10000,10000) for SIFT-1M.                                        scaling\u201d speedups achieved, as a function of the number of\r\n                    For SIFT-1B, as suggested by the dataset creators, we re-                                  machines P for \ufb01xed problem size (dataset and model), in\r\n                    porttherecall@R:theaveragenumberofqueriesforwhich                                          CIFAR and SIFT-1M (N = 50K and 1M training points,\r\n                    the nearest neighbour is ranked within the top R positions                                 respectively). Even though these datasets and especially\r\n                                                           ParMAC:DistributedOptimisationof Nested Functions\r\n                                                       P =1,differente                                      e = 8, different P\r\n                                                                                                                                              \r\n                                            34\r\n                                         n  32\r\n                                         o\r\n                                         i                           1 epoch                                    1 machine\r\n                                         s\r\n                                         i\r\n                                         c                           2 epochs                                   32 machines\r\n                                         e  30\r\n                                         r\r\n                                         p                           8 epochs                                   64 machines\r\n                                            28                       1 epoch shuffled                           1 machine shuffled\r\n                                                                     2 epochs shuffled                          32 machines shuffled\r\n                                                                     8 epochs shuffled                          64 machines shuffled\r\n                                            26                                                   \r\n                                              0          500        1000       1500             0               10               20\r\n                                                                 runtime                                    ParMACiteration\r\n                                                                       Figure 4. Precision in CIFAR dataset.\r\n                 the number of independent submodels (M = 2L = 32 ef-                         used L = 64 hash functions (M = 128 submodels): linear\r\n                 fectivesubmodelsofthesamesize,asdiscussedearlier)are                         SVMsasbefore,andkernelSVMs. Thesehave\ufb01xedGaus-\r\n                 relatively small, the speedupswe achieveare nearly perfect                   sian radial basis functions (2000 centres picked at random\r\n                 for P \u2264 M and hold very well for P > M up to the                             from the training set and bandwidth \u03c3 = 160), so the only\r\n                 maximum number of machines we used (P = 128 in the                           trainable parameters are the weights, and the MAC algo-\r\n                 distributed system). The speedups \ufb02atten as the number of                    rithm does not change except that it operates on a 2000-\r\n                 W-step epochs (and consequently the amount of commu-                         dimensional input vector of kernel values, instead of the\r\n                 nication) increases, because for this experiment the bottle-                 128 SIFT features. We use e = 2 epochs with shuf\ufb02ing.\r\n                 neck is the W step, whose parallelisation ability (i.e., the                 All these decisions were based on trials on a subset of the\r\n                 number of concurrent processes) is limited by M = 2L                         training dataset. We initialised the binary codes from trun-\r\n                 (the Z step has N independent processes and is never a                       cated PCA trained on a subset of size 1M (recall@R=100:\r\n                 bottleneck, since N is very large). However, as noted ear-                   55.2%), which gave results comparable to the baseline in\r\n                 lier, using 1 to 2 epochs gives a good enough result, very                   (Je\u00b4gou et al., 2011b).\r\n                 close to doing an exact W step. The runtime for SIFT-1M                      WeranParMAConthewholetrainingsetinthedistributed\r\n                 onP =128machineswith1epochwas12minutesandits                                 system with 128 processors for 6 iterations and achieved\r\n                 speedup 100\u00d7. This is particularly remarkable given that                     a recall@R=100 of 61.5% in 29 hours (linear SVM) and\r\n                 the original, nested model did not have model parallelism.                   66.1% in 83 hours (kernel SVM). Using a scaled-down\r\n                 Fig. 5 also shows the speedups predicted by our theoreti-                    model and training set, we estimated that training in one\r\n                 cal model. We set the parameters e and N to their known                      machine (with enough RAM to hold the data and param-\r\n                 values, and M = 2L = 32 for CIFAR and SIFT-1M and                            eters) would take months. The theoretical speedup (\ufb01g. 5\r\n                 M=2L=128forSIFT-1B.Forthetimeparameters,we                                   right plot, using the same parameters as in SIFT-1M), is\r\n                 set tW = 1 to \ufb01x the time units, and we set tW and tZ                        nearly perfect (note the plot goes up to P = 1024 ma-\r\n                       r                                                    c          r\r\n                 by trial and error to achieve a reasonably good \ufb01t to the                    chines, even though our experiments are limited to P =\r\n                 experimental speedups: tW = 104 for both datasets, and                       128). This is because M is quite larger and N is much\r\n                                                 c\r\n                 tZ = 200 for CIFAR and 40 for SIFT-1M. Although these                        larger than in the previous datasets.\r\n                  r\r\n                 are fudgefactors, they are in rough agreement with the fact\r\n                 that communicatinga weightvectoroverthenetworkis or-                         6     DISCUSSION\r\n                 ders of magnitude slower than updating it with a gradient\r\n                 step, and that for BAs the Z step is quite slower than the                   Developingparallel,distributedoptimisationalgorithmsfor\r\n                 Wstepbecauseofthebinaryoptimisationit involves.                              nonconvexproblemsinmachinelearningischallenging,as\r\n                                                                                              shownbyrecenteffortsby large teams of researchers. One\r\n                 5.3    Large-ScaleExperiment                                                 important advantage of ParMAC is its simplicity. Data and\r\n                                                                                              modelparallelismarise naturallythanksto the introduction\r\n                 SIFT-1Bisoneofthelargestdatasets,ifnotthelargestone,                         of auxiliary coordinates. The corresponding optimisation\r\n                 that are publiclyavailable for comparingnearest-neighbour                    subproblems can often be solved reusing existing code as\r\n                 search algorithms with known ground-truth (i.e., precom-                     a black box (as with the SGD training of SVMs and lin-\r\n                 putedexactEuclideandistancesforeachquerytoitsknear-                          ear mappings in the BA). A circular topology is suf\ufb01cient\r\n                 est vectors in the base set). The training set contains N =                  to achieve a low communication between machines. There\r\n                 100M vectors, each consisting of 128 SIFT features. We\r\n                                                                   ParMAC:DistributedOptimisationof Nested Functions\r\n                                               CIFAR                                                      SIFT-1M                                             SIFT-1B\r\n                      ) 80                                                               100                                               \r\n                      .          1 epoch                                                          1 epoch\r\n                      t\r\n                      p          2 epochs                                                 80      8 epochs\r\n                      x 60\r\n                      e          3 epochs\r\n                      (\r\n                      )          4 epochs\r\n                      P                                                                   60\r\n                      ( 40       8 epochs\r\n                      S                                                                                                                                   too long to run\r\n                      p                                                                   40\r\n                      u\r\n                      d 20\r\n                      e                                                                   20\r\n                      e\r\n                      p\r\n                      s\r\n                          0                                                                 0 \r\n                            1           32           64            96          128           1         32         64         96        128\r\n                        80                                                               100                                                    1024                                      \r\n                      )          1 epoch                                                          1 epoch                                                  1 epoch\r\n                      .                                                                           2 epochs                                                 2 epochs\r\n                      h          2 epochs                                                 80\r\n                      t\r\n                      ( 60       3 epochs                                                         3 epochs                                        768      3 epochs\r\n                      )\r\n                      P          4 epochs                                                 60      4 epochs                                                 4 epochs\r\n                      (                                                                           8 epochs                                                 8 epochs\r\n                      S 40       8 epochs                                                                                                         512\r\n                      p                                                                   40\r\n                      u\r\n                      d\r\n                      e 20                                                                                                                        256\r\n                      e                                                                   20\r\n                      p\r\n                      s\r\n                          0                                                                 0                                                        1 \r\n                            1           32           64            96          128           1         32         64         96        128            1      256      512     768 1024\r\n                                       numberofmachinesP                                            numberofmachinesP                                   numberofmachinesP\r\n                    Figure 5. Speedup S(P) as a function of the number of machines P (top: experiment, bottom: theory). The dataset size and number of\r\n                                                                                  6                                  8\r\n                    submodels (N;M)is(50000,32) for CIFAR, (10 ;32) for SIFT-1M and (10 ;128) for SIFT-1B.\r\n                    is no close coupling between the model structure and the                                coders. MAC creates parallelism by introducing auxiliary\r\n                    distributed system architecture. This makes ParMAC suit-                                coordinates for each data point to decouple nested terms in\r\n                    able for architectures as different as supercomputers, data                             the objective function. ParMAC is able to translate the par-\r\n                    centres or even IoT devices.                                                            allelism inherent in MAC into a distributed system by 1)\r\n                    Further improvements can be made in speci\ufb01c problems.                                   using data parallelism, so that each machine keeps a por-\r\n                    For example, we may have more parallelisation or less de-                               tion of the original data and its corresponding auxiliary co-\r\n                    pendencies (e.g. the weights of hidden units in layer k of a                            ordinates; and 2) using model parallelism, so that indepen-\r\n                    neural net depend only on auxiliary coordinates in layers k                             dent submodels visit every machine in a circular topology,\r\n                    and k +1). This may reduce the communication in the W                                   effectively executing epochs of a stochastic optimisation,\r\n                    step, by sending to a given machine only the model portion                              without the need for a parameter server and therefore no\r\n                    it needs, or by allocating cores within a multicore machine                             communicationbottlenecks. Theconvergencepropertiesof\r\n                    accordingly. The W and Z step optimisations can make                                    MACremainessentially unaltered in ParMAC. The paral-\r\n                    use of further parallelisation by GPUs or by distributed                                lel speedup can be theoretically predicted to be nearly per-\r\n                    convex optimisation algorithms. Many more re\ufb01nements                                    fect whenthenumberofsubmodelsiscomparableorlarger\r\n                    can be done, such as storing or communicating reduced-                                  than the number of machines, and to eventually saturate as\r\n                    precision values with little effect of the accuracy. In this                            one continues to increase the number of machines, and in-\r\n                    paper, we havekeptourimplementationassimpleaspossi-                                     deed this was con\ufb01rmed in our experiments. ParMAC also\r\n                    ble, because our goal was to understand the parallelisation                             makesit easy to account for data shuf\ufb02ing, load balancing,\r\n                    speedups of ParMAC in a setting as general as possible,                                 streaming and fault tolerance. Hence, we expect that Par-\r\n                    rather than trying to achieve the very best performance for                             MACcouldbeabasicbuildingblock,in combinationwith\r\n                    a particular dataset, model or distributed system.                                      other techniques, for the distributed optimisation of nested\r\n                                                                                                            modelsin big data settings.\r\n                    7     CONCLUSION                                                                        ACKNOWLEDGMENTS\r\n                    We have proposed ParMAC, a distributed model for the                                    Work supported by a Google Faculty Research Award and\r\n                    method of auxiliary coordinates for training nested, non-                               byNSFawardIIS\u20131423515. We thankDongLi (UCMer-\r\n                    convexmodelsingeneral,analyseditsparallelspeedupand                                     ced)fordiscussionsaboutMPIandperformanceevaluation\r\n                    convergence, and demonstrated it with an MPI-based im-                                  on parallel systems and Quoc Le (Google) for discussions\r\n                    plementation for a particular case, to train binary autoen-                             about Google\u2019s DistBelief system.\r\n                                               ParMAC:DistributedOptimisationof Nested Functions\r\n                                                                                                   \u00b4\r\n              REFERENCES                                                    Carreira-Perpin\u02dca\u00b4n, M. A. Theoretical speedup of the Par-\r\n              Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,       MACmodelfordistributedoptimisation of nested func-\r\n                 Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M.,   tions. arXiv, 2019a.\r\n                 Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Is-                            \u00b4\r\n                                                                            Carreira-Perpin\u02dca\u00b4n, M. A.   The EM algorithm and the\r\n                 ard, M., Jia, Y., Jozefowicz, R., Kaiser, \u0141., Kudlur, M.,    Method of Auxiliary Coordinates: Similarities and dif-\r\n                 Levenberg, J., Mane\u00b4, D., Monga, R., Moore, S., Mur-         ferences. arXiv, 2019b.\r\n                 ray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B.,                         \u00b4\r\n                 Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Va-  Carreira-Perpin\u02dca\u00b4n, M. A. and Alizadeh, M. ParMAC: Dis-\r\n                 sudevan, V., Vie\u00b4gas, F., Vinyals, O., Warden, P., Watten-   tributed optimisation of nested functions, with applica-\r\n                 berg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow:       tion to learning binary autoencoders. arXiv:1605.09114,\r\n                 Large-scalemachinelearningonheterogeneoussystems,            May302016.\r\n                 2015. TensorFlowWhitePaper.                                                       \u00b4\r\n                                                                            Carreira-Perpin\u02dca\u00b4n, M. A. and Raziperchikolaei, R. Hash-\r\n                                                                              ing with binary autoencoders.     In Proc. of the 2015\r\n              Anderson, D. P., Cobb, J., Korpela, E., Lebofsky, M.,           IEEEComputerSocietyConf.ComputerVisionandPat-\r\n                 and Werthimer, D.    SETI@home: An experiment in             tern Recognition(CVPR\u201915),pp.557\u2013566,Boston,MA,\r\n                 public-resourcecomputing. Comm.ACM,45(11):56\u201361,             June 7\u201312 2015.\r\n                 November2002.\r\n                                                                                                    \u00b4\r\n                                                                            Carreira-Perpin\u02dca\u00b4n, M. A. and Vladymyrov, M.     A fast,\r\n              Askari, A., Negiar, G., Sambharya, R., and Ghaoui, L. E.        universal algorithm to learn parametric nonlinear em-\r\n                 Lifted neural networks.    arXiv:1805.01532, June 21         beddings. In Cortes, C., Lawrence, N. D., Lee, D. D.,\r\n                 2018.                                                        Sugiyama, M., and Garnett, R. (eds.), Advances in Neu-\r\n              Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O.,        ral Information Processing Systems (NIPS), volume 28,\r\n                 Le Roux, N., and Ouimet, M. Out-of-sample exten-             pp. 253\u2013261.MITPress, Cambridge,MA, 2015.\r\n                                                                                                   \u00b4\r\n                 sions for LLE, Isomap, MDS, Eigenmaps, and spectral        Carreira-Perpin\u02dca\u00b4n, M. A. and Wang, W. Distributed op-\r\n                 clustering. In Thrun, S., Saul, L. K., and Scho\u00a8lkopf, B.    timization of deeply nested systems. arXiv:1212.5921,\r\n                 (eds.), Advances in Neural Information Processing Sys-       December242012.\r\n                 tems (NIPS), volume 16. MIT Press, Cambridge, MA,                                 \u00b4\r\n                 2004.                                                      Carreira-Perpin\u02dca\u00b4n, M. A. and Wang, W. Distributed op-\r\n                                                                              timization of deeply nested systems. In Kaski, S. and\r\n              Bertsekas, D. P.  Incremental gradient, subgradient, and        Corander, J. (eds.), Proc. of the 17th Int. Conf. Arti\ufb01cial\r\n                 proximalmethodsforconvexoptimization: A survey. In           Intelligence and Statistics (AISTATS 2014), pp. 10\u201319,\r\n                 Sra, S., Nowozin, S., and Wright, S. J. (eds.), Optimiza-    Reykjavik, Iceland, April 22\u201325 2014.\r\n                 tion for Machine Learning. MIT Press, 2011.                Cevher, V., Becker, S., and Schmidt, M. Convex optimiza-\r\n              Bottou, L. Large-scale machine learning with stochastic         tion for big data: Scalable, randomized, and parallel al-\r\n                 gradient descent.  In Proc. 19th Int. Conf. Computa-         gorithms for big data analytics. IEEE Signal Processing\r\n                 tional Statistics (COMPSTAT 2010),pp. 177\u2013186,Paris,         Magazine,31(5):32\u201343,September2014.\r\n                 France, August 22\u201327 2010.                                 Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B.,\r\n                                                                              and Ng, A. Deep learning with COTS HPC systems. In\r\n              Bottou, L. and Bousquet, O. The tradeoffs of large scale        Dasgupta,S.andMcAllester,D.(eds.),Proc. ofthe30th\r\n                 learning.  In Platt, J. C., Koller, D., Singer, Y., and      Int. Conf. Machine Learning (ICML 2013), pp. 1337\u2013\r\n                 Roweis, S. (eds.), Advances in Neural Information Pro-       1345, Atlanta, GA, June 16\u201321 2013.\r\n                 cessing Systems (NIPS), volume 20, pp. 161\u2013168. MIT\r\n                 Press, Cambridge, MA, 2008.                                Dean,J., Corrado,G., Monga,R., Chen,K., Devin,M., Le,\r\n                                                                              Q., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang,\r\n              Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.    K., and Ng, A. Largescale distributed deep networks. In\r\n                 Distributed optimization and statistical learning via the    Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger,\r\n                 alternating direction method of multipliers. Foundations     K.Q.(eds.),AdvancesinNeuralInformationProcessing\r\n                 andTrends in Machine Learning, 3(1):1\u2013122,2011.              Systems (NIPS), volume 25, pp. 1232\u20131240. MIT Press,\r\n              Bradley,J., Kyrola,A.,Bickson,D.,andGuestrin,C. Paral-          Cambridge,MA,2012.\r\n                 lel coordinate descent for l1-regularized loss minimiza-   Drineas, P. and Mahoney, M. W. On the Nystro\u00a8m method\r\n                 tion. In Getoor, L. and Scheffer, T. (eds.), Proc. of        for approximating a Gram matrix for improved kernel-\r\n                 the 28th Int. Conf. Machine Learning (ICML 2011), pp.        based learning. J. Machine Learning Research, 6:2153\u2013\r\n                 321\u2013328,Bellevue, WA, June 28 \u2013 July 2 2011.                 2175, December2005.\r\n                                              ParMAC:DistributedOptimisationof Nested Functions\r\n              Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. It-    Liu, J. and Wright, S. J. Asynchronous stochastic coor-\r\n                erative quantization: A Procrustean approach to learn-       dinate descent: Parallelism and convergence properties.\r\n                ing binary codes for large-scale image retrieval. IEEE       SIAMJ.Optimization,25(1):351\u2013376,2015.\r\n                Trans. Pattern Analysis and Machine Intelligence, 35\r\n                (12):2916\u20132929,December2013.                              Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola,\r\n                                                                             A., and Hellerstein, J. M.  Distributed GraphLab: A\r\n              Grauman, K. and Fergus, R. Learning binary hash codes          framework for machine learning and data mining in the\r\n                for large-scale image search. In Cipolla, R., Battiato, S.,  cloud.  Proc. VLDB Endowment, 5(8):716\u2013727, April\r\n                andFarinella,G.(eds.),MachineLearningforComputer             2012.\r\n                Vision, pp. 49\u201387. Springer-Verlag, 2013.\r\n                                                                          McLachlan,G. J. and Krishnan, T. The EM Algorithm and\r\n              Gropp, W., Lusk, E., and Skjellum, A.       Using MPI:         Extensions. Wiley Series in Probability and Mathemati-\r\n                Portable Parallel Programming with the Message-              cal Statistics. John Wiley & Sons, second edition, 2008.\r\n                Passing Interface. MIT Press, second edition, 1999.\r\n                                                                          Niu, F., Recht, B., Re\u00b4, C., and Wright, S. J. HOGWILD!:\r\n              Hinton, G., Deng, L., Yu, D., Dahl, G., rahman Mohamed,        A lock-free approach to parallelizing stochastic gradi-\r\n                A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,       ent descent. In Shawe-Taylor, J., Zemel, R. S., Bartlett,\r\n                Sainath, T. N., and Kingsbury, B. Deep neural networks       P., Pereira, F., and Weinberger, K. Q. (eds.), Advances\r\n                for acoustic modeling in speech recognition: The shared      in Neural Information Processing Systems (NIPS), vol-\r\n                views of four research groups. IEEE Signal Processing        ume 24, pp. 693\u2013701. MIT Press, Cambridge, MA,\r\n                Magazine,29(6):82\u201397,November2012.                           2011.\r\n              Jaderberg,M.,Czarnecki,W.M.,Osindero,S.,Vinyals,O.,         Nocedal, J. and Wright, S. J.   Numerical Optimization.\r\n                Graves, A., Silver, D., and Kavukcuoglu, K. Decoupled        Springer Series in Operations Research and Financial\r\n                neural interfaces using synthetic gradients. In Precup,      Engineering. Springer-Verlag, New York, second edi-\r\n                D.andTeh,Y.W.(eds.),Proc.ofthe34thInt.Conf.Ma-               tion, 2006.\r\n                chine Learning (ICML 2017), pp. 1627\u20131635, Sydney,\r\n                Australia, August 6\u201311 2017.                              Ororbia, A. G., Mali, A., Kifer, D., and Giles, C. L. Con-\r\n                                                                             ducting credit assignment by aligning local representa-\r\n              Je\u00b4gou, H., Douze,M., andSchmid,C. Productquantization         tions. arXiv:1803.01834,July 12 2018.\r\n                for nearest neighbor search. IEEE Trans. Pattern Anal-\r\n                ysis and Machine Intelligence, 33(1):117\u2013128, January     Ouyang,H.,He,N.,Tran,L.,andGray,A. Stochasticalter-\r\n                2011a.                                                       nating direction method of multipliers. In Dasgupta, S.\r\n                                                                             and McAllester, D. (eds.), Proc. of the 30th Int. Conf.\r\n              Je\u00b4gou, H., Tavenard, R., Douze, M., and Amsaleg, L.           Machine Learning (ICML 2013), pp. 80\u201388, Atlanta,\r\n                Searching in one billion vectors: Re-rank with source        GA,June16\u2013212013.\r\n                coding. InProc.oftheIEEEInt.Conf.Acoustics,Speech\r\n                                                                                                                          \u00b4\r\n                and Sig. Proc. (ICASSP\u201911), pp. 861\u2013864, Prague,          Raziperchikolaei, R. and Carreira-Perpin\u02dca\u00b4n, M. A. Opti-\r\n                CzechRepublic, May22\u2013272011b.                                mizing af\ufb01nity-based binary hashing using auxiliary co-\r\n                                                                             ordinates. In Lee, D. D., Sugiyama, M., von Luxburg,\r\n              Krizhevsky, A. Learning multiple layers of features from       U., Guyon, I., and Garnett, R. (eds.), Advances in Neu-\r\n                tiny images. Master\u2019sthesis, Dept. ofComputerScience,        ral Information Processing Systems (NIPS), volume 29,\r\n                University of Toronto, April 8 2009.                         pp. 640\u2013648.MITPress, Cambridge,MA, 2016.\r\n              Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G.,     Richta\u00b4rik, P. and Taka\u00b4c\u02c7, M. Distributed coordinate descent\r\n                Chen, K., Dean, J., and Ng, A. Building high-level fea-      methodforlearningwithbigdata. arXiv:1310.2059,Oc-\r\n                tures using large scale unsupervised learning. In Lang-      tober 8 2013.\r\n                ford, J. and Pineau, J. (eds.), Proc. of the 29th Int. Conf.\r\n                Machine Learning (ICML 2012), Edinburgh, Scotland,        Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D.   1-bit\r\n                June 26 \u2013 July 1 2012.                                       stochastic gradient descent and its application to data-\r\n                                                                             parallel distributed training of speech DNNs. In Proc. of\r\n              Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. Dif-        Interspeech\u201914,Singapore, September 14\u201318 2014.\r\n                ference target propagation. In Appice, A., Rodrigues,\r\n                P. P., Costa, V. S., Soares, C., Gama, J., and Jorge, A.  Talwalkar, A., Kumar, S., and Rowley, H.     Large-scale\r\n                (eds.), Proc. of the 26th European Conf. Machine Learn-      manifold learning. In Proc. of the 2008 IEEE Computer\r\n                ing (ECML\u201315), pp. 498\u2013515, Porto, Portugal, Septem-         Society Conf. Computer Vision and Pattern Recognition\r\n                ber 7\u201311 2015.                                               (CVPR\u201908),Anchorage,AK,June23\u2013282008.\r\n                                                ParMAC:DistributedOptimisationof Nested Functions\r\n              Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A.,\r\n                 and Goldstein, T. Training neural networks without gra-\r\n                 dients: A scalable ADMM approach. In Balcan, M.-F.\r\n                 andWeinberger,K.Q.(eds.),Proc.ofthe33rdInt.Conf.\r\n                 Machine Learning (ICML 2016), pp. 2722\u20132731, New\r\n                 York, NY, June 19\u201324 2016.\r\n                                                             \u00b4\r\n              Vladymyrov, M. and Carreira-Perpin\u02dca\u00b4n, M. A.      Locally\r\n                 Linear Landmarks for large-scale manifold learning. In\r\n                 Blockeel, H., Kersting, K., Nijssen, S., and Zelezny\u00b4, F.\r\n                 (eds.), Proc. of the 24th European Conf. Machine Learn-\r\n                 ing (ECML\u201313), pp. 256\u2013271, Prague, Czech Republic,\r\n                 September23\u2013272013.\r\n                                                           \u00b4\r\n              Vladymyrov, M. and Carreira-Perpin\u02dca\u00b4n, M. A. The Varia-\r\n                 tionalNystro\u00a8mmethodforlarge-scalespectralproblems.\r\n                 In Balcan, M.-F. and Weinberger, K. Q. (eds.), Proc. of\r\n                 the 33rd Int. Conf. Machine Learning (ICML 2016), pp.\r\n                 211\u2013220,NewYork,NY,June19\u2013242016.\r\n                                                    \u00b4\r\n              Wang,W.andCarreira-Perpin\u02dca\u00b4n,M.A. Theroleofdimen-\r\n                 sionality reduction in classi\ufb01cation. In Brodley, C. E.\r\n                 and Stone, P. (eds.), Proc. of the 28th National Confer-\r\n                 ence on Arti\ufb01cial Intelligence (AAAI 2014), pp. 2128\u2013\r\n                 2134, QuebecCity, Canada, July 27\u201331 2014.\r\n              Williams, C. K. I. and Seeger, M.      Using the Nystro\u00a8m\r\n                 method to speed up kernel machines. In Leen, T. K.,\r\n                 Dietterich, T. G., and Tresp, V. (eds.), Advances in Neu-\r\n                 ral Information Processing Systems (NIPS), volume 13,\r\n                 pp. 682\u2013688.MITPress, Cambridge,MA, 2001.\r\n              Wright, S. J. Coordinate descent algorithms. Math. Prog.,\r\n                 151(1):3\u201334,June 2016.\r\n              Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S.,\r\n                 Zheng, X., Xie, P., Kumar, A., and Yu, Y. Petuum: A\r\n                 new platform for distributed machine learning on big\r\n                 data.  IEEE Trans. Big Data, 1(2):49\u201367, April\u2013June\r\n                 2015.\r\n              Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,\r\n                 and Stoica, I. Spark: Cluster computing with working\r\n                 sets. In Proc. 2nd USENIX Conf. Hot Topics in Cloud\r\n                 Computing(HotCloud2010),2010.\r\n              Zhang,R.andKwok,J. AsynchronousdistributedADMM\r\n                 algorithm for global variable consensus optimization. In\r\n                 Xing, E. P. and Jebara, T. (eds.), Proc. of the 31st Int.\r\n                 Conf. Machine Learning (ICML 2014), pp. 1701\u20131709,\r\n                 Beijing, China, June 21\u201326 2014.\r\n              Zinkevich, M., Weimer, M., Smola, A., and Li, L. Par-\r\n                 allelized stochastic gradient descent.   In Lafferty, J.,\r\n                 Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Cu-\r\n                 lotta, A. (eds.), Advancesin NeuralInformationProcess-\r\n                 ing Systems (NIPS), volume 23, pp. 2595\u20132603. MIT\r\n                 Press, Cambridge, MA, 2010.", "award": [], "sourceid": 152, "authors": [{"given_name": "Miguel A", "family_name": "Carreira-Perpinan", "institution": "UC Merced"}, {"given_name": "Mehdi", "family_name": "Alizadeh", "institution": "UC Merced"}]}