{"title": "Kernel Machines That Adapt To Gpus For Effective Large Batch Training", "book": "Proceedings of Machine Learning and Systems", "page_first": 360, "page_last": 373, "abstract": "Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively parallel computing resources such as GPUs. Increasing mini-batch size is a simple and direct way to utilize the parallel computing capacity. \r\nFor small batch an increase in batch size results in the proportional reduction in the training time, a phenomenon known as {\\it linear scaling}. \r\nHowever, increasing batch size beyond a certain value leads to no further improvement in training time. In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource.\r\nThe framework is designed for a class of classical kernel machines. It automatically modifies a standard kernel machine to output a mathematically equivalent prediction function, yet allowing for extended linear scaling, i.e., higher effective parallelization and faster training time on given hardware. \r\n\r\nThe resulting algorithms are accurate, principled and very fast. For example, using a single Titan Xp GPU, training on ImageNet with $1.3\\times 10^6$ data points and $1000$ labels takes under an hour, while smaller datasets, such as MNIST, take seconds. As the parameters are chosen analytically, based on the theoretical bounds, little tuning beyond selecting the kernel and the kernel parameter is needed, further facilitating the practical use of these methods.", "full_text": "KERNELMACHINESTHATADAPTTGPUO SFOR\r\n EFFECTIVE LARGE BATCH TRAINING\r\n Siyuan Ma1 MikhailBelkin1\r\n ABSTRACT\r\n Modern machine learning models are typically trained using Stochastic Gradient Descent (SGD) on massively\r\n parallel computing resources such as GPUs. 
Increasing mini-batch size is a simple and direct way to utilize the parallel computing capacity. For small batch an increase in batch size results in the proportional reduction in the training time, a phenomenon known as linear scaling. However, increasing batch size beyond a certain value leads to no further improvement in training time. In this paper we develop the first analytical framework that extends linear scaling to match the parallel computing capacity of a resource. The framework is designed for a class of classical kernel machines. It automatically modifies a standard kernel machine to output a mathematically equivalent prediction function, yet allowing for extended linear scaling, i.e., higher effective parallelization and faster training time on given hardware.\r\n
The resulting algorithms are accurate, principled and very fast. For example, using a single Titan Xp GPU, training on ImageNet with $1.3 \times 10^6$ data points and $1000$ labels takes under an hour, while smaller datasets, such as MNIST, take seconds. As the parameters are chosen analytically, based on the theoretical bounds, little tuning beyond selecting the kernel and the kernel parameter is needed, further facilitating the practical use of these methods.\r\n
1 INTRODUCTION\r\n
Modern machine learning models are trained using Stochastic Gradient Descent (SGD) on parallel computing resources such as GPUs. During training we aim to minimize the (wall clock) training time $T_{train}$ given a computational resource, e.g., a bank of GPUs. Although using a larger batch size $m$ improves resource utilization, it does not necessarily lead to a reduction in training time. Indeed, we can decompose the training time $T_{train}(m)$ into two parts,\r\n
$$T_{train}(m) = N_{epoch}(m) \times T_{epoch}(m)$$\r\n
where $N_{epoch}(m)$ is the number of training epochs required for convergence and $T_{epoch}(m)$ is the wall clock time to train for one epoch. It is easy to see that increasing $m$ always leads to higher resource utilization, thus decreasing $T_{epoch}(m)$. However, $N_{epoch}(m)$ may increase with $m$. In fact, for a general class of convex problems it can be shown (Ma et al., 2017) that $N_{epoch}(m)$ is approximately constant for $m$ no more than a certain critical size $m^*$ and $N_{epoch}(m) \propto m$ for $m > m^*$. On the other hand, $T_{epoch}(m)$ at best decreases proportionally to $1/m$ (footnote 1). Thus the training time is at least\r\n
$$T_{train}(m) \propto \begin{cases} 1/m, & \text{for } m \le m^* \\ \text{const}, & \text{for } m > m^* \end{cases}$$\r\n
In other words, we obtain linear speedup ("linear scaling") for batch sizes up to $m^*$, beyond which the training time cannot be improved by further increasing $m$. Furthermore, an important property of $m^*$ is its near independence from the number of training samples, as it is primarily determined by the model and the data distribution.\r\n
A similar relationship between the batch size and the training time has been observed empirically in training deep neural networks (Krizhevsky, 2014). A heuristic called the "linear scaling rule" has been widely used in deep learning practice (Goyal et al., 2017; You et al., 2017; Jia et al., 2018). Moreover, in parallel to the convex case analyzed in (Ma et al., 2017), recent work (Golmant et al., 2018; McCandlish et al., 2018) empirically demonstrates that $m^*$ is independent of the data size for deep neural networks.\r\n
Many best practices of modern large-scale learning (Goyal et al., 2017; You et al., 2017; Jia et al., 2018) start with estimating $m^*$ by either heuristic rules or experiments.\r\n
1 Department of Computer Science and Engineering, Ohio State University, Columbus, Ohio, United States. Correspondence to: Siyuan Ma, Mikhail Belkin.\r\n
Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).\r\n
1 Assuming perfect parallel computation.\r\n
The optimal training time for the model is then capped by the estimated batch size $m^*$, which is fixed given the model architecture and weights, as well as the learning task.\r\n
In this work we propose a principled framework (EigenPro 2.0) that increases $m^*$ for a class of models corresponding to classical kernel machines. Our framework modifies a kernel machine to output a mathematically equivalent prediction function, yet allowing for extended linear scaling adaptive to a (potentially) arbitrary parallel computational resource. Furthermore, the optimization parameter selection is analytic, making it easy and efficient to use in practice and appropriate for "interactive" exploratory machine learning and automatic model selection. The resulting algorithms show significant speedup for training on GPUs over the state-of-the-art methods and excellent generalization performance.\r\n
Kernel machines. Kernel machines are a powerful class of methods for classification and regression. Given the training data $\{(x_i, y_i), i = 1, \ldots, n\} \subset \mathbb{R}^d \times \mathbb{R}$ and a positive definite kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, kernel machines construct functions of the form $f(x) = \sum_i \alpha_i k(x, x_i)$. These methods are theoretically attractive, show excellent performance on small datasets, and are known to be universal learners, i.e., capable of approximating any function from data. However, making kernel machines fast and scalable to large data has been a challenging problem. Recent large scale efforts typically involved significant parallel computational resources, such as multiple (sometimes thousands of) AWS vCPUs (Tu et al., 2016; Avron et al., 2016) or super-computer nodes (Huang et al., 2014). Very recently, FALKON (Rudi et al., 2017) and EigenPro (Ma & Belkin, 2017) showed strong classification results on large datasets with much lower computational requirements, a few hours on a single GPU.\r\n
The main problem and our contribution. The main problem addressed in this paper is to minimize the training time for a kernel machine, given access to a parallel computational resource G. Our main contribution is that, given a standard kernel, we are able to learn a new data and computational resource dependent kernel that minimizes the resource time required for training without changing the mathematical solution for the original kernel. Our model for a computational resource G is based on a modern graphics processing unit (GPU), a device that allows for very efficient, highly parallel matrix multiplication (footnote 2). The outline of our approach is shown in the diagram on the right. We now outline the key ingredients.\r\n
2 For example, there are 3840 CUDA cores in Nvidia GTX Titan Xp (Pascal).\r\n
The interpolation framework. In recent years we have seen that inference methods, notably neural networks, that interpolate or nearly interpolate the training data generalize very well to test data (Zhang et al., 2016). It has been observed in (Belkin et al., 2018) that minimum norm kernel interpolants, i.e., functions of the form $f(x) = \sum_i \alpha_i k(x, x_i)$ such that $f(x_i) = y_i$, achieve optimal or near optimal generalization performance. While the mathematical foundations of why interpolation produces good test results are not yet fully understood, the simplicity of the framework can be used to accelerate and scale the training of classical kernel methods, while improving their test accuracy. Indeed, constructing these interpolating functions is conceptually and mathematically simple, requiring approximately solving a single system of linear equations with a unique solution, the same for both regression and classification. Significant computational savings and, when necessary, regularization (Yao et al., 2007) are provided by early stopping, i.e., stopping iterations well before numerical convergence, once successive iterations fail to improve validation error.\r\n
Adaptivity to data and computational resource: choosing optimal batch size and step size for SGD. We will train kernel methods using Stochastic Gradient Descent (SGD), a method which is well-suited to modern GPUs and has shown impressive success in training neural networks. Importantly, in the interpolation framework, the dependence of convergence on the batch size and the step size can be derived analytically, allowing for full analysis and automatic parameter selection.\r\n
We first note that in the parallel model each iteration of SGD (essentially a matrix multiplication) takes the same time for any mini-batch size up to $m_G^{max}$, defined as the mini-batch size at which the parallel capacity of the resource G is fully utilized. It is shown in (Ma et al., 2017) that in the interpolation framework convergence per iteration (using the optimal step size) improves nearly linearly as a function of the mini-batch size $m$ up to a certain critical size $m^*(k)$ and rapidly saturates after that. The quantity $m^*(k)$ is related to the spectrum of the kernel. For kernels used in practice it is typically quite small, less than 10, due to their rapid eigenvalue decay.\r\n
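A small critical batch size can be checked numerically. The sketch below is illustrative only: it uses a Gaussian kernel on synthetic data and applies the formula $m^*(k) = \beta(K)/\lambda_1(K)$ from Section 2, under the notational assumption that $K$ denotes the kernel matrix normalized by $n$ (so that $\lambda_1$ stays $O(1)$ as $n$ grows).

```python
import numpy as np

# Estimate m*(k) = beta(K) / lambda_1(K) for a Gaussian kernel on
# synthetic data.  Assumption: K is the kernel matrix divided by n.

def gaussian_kernel(X, Z, bandwidth=3.0):
    # Pairwise k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2))
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
K = gaussian_kernel(X, X) / len(X)        # normalized kernel matrix

beta = 1.0                                 # k(x, x) = 1 for the Gaussian kernel
lam1 = np.linalg.eigvalsh(K)[-1]           # top eigenvalue
m_star = beta / lam1
# Rapid eigenvalue decay keeps lambda_1 large relative to beta, so the
# critical batch size comes out in the single digits here.
```

With these (arbitrary) sizes the estimate lands well below typical GPU-saturating batch sizes, which is the disparity the paper exploits.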
Yet, depending on the number of data points, features and labels, a modern GPU can handle mini-batches of size 1000 or larger. This disparity presents an opportunity for major improvements in the efficiency of kernel methods. In this paper we show how to construct a data and resource adaptive kernel $k_G$ by modifying the spectrum of the kernel using the EigenPro algorithm (Ma & Belkin, 2017). The resulting iterative method with the new kernel has similar or better convergence per iteration than the original kernel $k$ for small mini-batch sizes. However, its convergence keeps improving linearly up to much larger mini-batch sizes, matching $m_G^{max}$, the maximum that can be utilized by the resource G. Importantly, SGD for either kernel converges to the same interpolated solution.\r\n
Mini-batch SGD (used in our algorithm) has been the dominant technique in training deep models. There has been significant empirical evidence (Krizhevsky, 2014; You et al., 2017; Smith et al., 2017) showing that linearly scaling the step size with the mini-batch size up to a certain value leads to improved convergence. This phenomenon has been utilized to scale deep learning in distributed systems by adopting large mini-batch sizes (Goyal et al., 2017).\r\n
The advantage of our setting is that the optimal batch and step sizes can be analyzed and expressed analytically. Moreover, these formulas contain variables which can be explicitly computed and directly used for parameter selection in our algorithms. Going beyond batch size and step size selection, the theoretical interpolation framework allows us to construct new adaptive kernels such that the mini-batch size required for optimal convergence matches the capacity of the computational resource.\r\n
Thus, we aim to modify the kernel by constructing a kernel $k_G$ such that $m^*(k_G) = m_G^{max}$ without changing the optimal (interpolating) solution. This is shown schematically in Figure 1. We see that for small mini-batch sizes the convergence of the two kernels $k$ and $k_G$ is similar. However, values of $m > m^*(k)$ do not help the convergence of the original kernel $k$, while the convergence of $k_G$ keeps improving up to $m = m_G^{max}$, where the resource utilization is saturated.\r\n
Figure 1: Adaptive and original kernel.\r\n
Comparison to related work. In recent years there has been significant progress on scaling and accelerating kernel methods, including (Takáč et al., 2013; Huang et al., 2014; Lu et al., 2014; Tu et al., 2016; Avron et al., 2016; May et al., 2017). Most of these methods are able to scale to large data sets by utilizing major computational resources such as supercomputers or multiple (sometimes hundreds or thousands of) AWS vCPUs (footnote 3). Two recent methods which allow for high efficiency kernel training with a single CPU or GPU are EigenPro (Ma & Belkin, 2017) (used as a basis for the adaptive kernels in this paper) and FALKON (Rudi et al., 2017). The method developed in this paper is significantly faster than either of them, while achieving similar or better test set accuracy. Additionally, it is easier to use as much of the parameter selection is done automatically.\r\n
The paper is structured as follows: In Section 3, we present our main algorithm to learn a kernel that fully utilizes a given computational resource. In Section 4, we present an improved version of the EigenPro iteration used by the main algorithm. We then provide comparisons to state-of-the-art kernel methods on several large datasets in Section 5. We further discuss exploratory machine learning in the context of our method.\r\n
For empirical results on real datasets, parallel to the schematic shown above, see Figure 2 in Section 5.\r\n
We construct and implement these kernels (see github.com/EigenPro/EigenPro2 for the code), and show how to analytically choose parameters, including the batch size and the step size. As a secondary contribution of this work we develop an improved version of EigenPro (Ma & Belkin, 2017), significantly reducing the memory requirements and making the computational overhead over standard SGD negligible.\r\n
3 See http://aws.amazon.com/ec2 for details.\r\n
2 SETUP\r\n
We start by briefly discussing the basic setting and kernel methods used in this paper.\r\n
Kernel interpolation. We are given $n$ labeled training points $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$. We consider a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ (Aronszajn, 1950) corresponding to a positive definite kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. There is a unique (minimum norm) interpolated solution in $\mathcal{H}$ of the form\r\n
$$f^*(\cdot) = \sum_{i=1}^n \alpha_i^* k(x_i, \cdot), \quad \text{where } (\alpha_1^*, \ldots, \alpha_n^*)^T = K^{-1} (y_1, \ldots, y_n)^T$$\r\n
Here $K$ denotes the $n \times n$ kernel matrix, $K_{ij} = k(x_i, x_j)$. It is easy to check that $\forall i \; f^*(x_i) = y_i$.\r\n
Remark 2.1 (Square loss). While the interpolated solution $f^*$ in $\mathcal{H}$ does not depend on any loss function, it is the unique minimizer in $\mathcal{H}$ of the empirical square loss $L(f) \triangleq \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2$.\r\n
Gradient descent. It can be shown that the gradient descent iteration for the empirical square loss in the RKHS $\mathcal{H}$ is given by\r\n
$$f \leftarrow f - \eta \cdot \frac{2}{n} \sum_{i=1}^n (f(x_i) - y_i)\, k(x_i, \cdot) \quad (1)$$\r\n
Mini-batch SGD. Instead of calculating the gradient with $n$ training points, each SGD iteration updates the solution $f$ using $m$ subsamples $(x_{t_1}, y_{t_1}), \ldots, (x_{t_m}, y_{t_m})$,\r\n
$$f \leftarrow f - \eta \cdot \frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i})\, k(x_{t_i}, \cdot) \quad (2)$$\r\n
It is equivalent to randomized coordinate descent (Leventhal & Lewis, 2010) for $K\alpha = y$ on $m$ coordinates of $\alpha$,\r\n
$$\alpha_{t_i} \leftarrow \alpha_{t_i} - \eta \cdot \frac{2}{m}\, \{f(x_{t_i}) - y_{t_i}\} \quad \text{for } i = 1, \ldots, m \quad (3)$$\r\n
EigenPro iteration (Ma & Belkin, 2017). To achieve faster convergence, the EigenPro iteration performs a spectral modification of the kernel operator $\mathcal{K}(f) \triangleq \frac{1}{n} \sum_{i=1}^n \langle k(x_i, \cdot), f \rangle_{\mathcal{H}}\, k(x_i, \cdot)$ using the operator\r\n
$$P(f) \triangleq f - \sum_{i=1}^q \left(1 - \frac{\lambda_{q+1}}{\lambda_i}\right) \langle e_i, f \rangle_{\mathcal{H}}\, e_i \quad (4)$$\r\n
where $\lambda_1 \ge \cdots \ge \lambda_n$ are the ordered eigenvalues of $\mathcal{K}$ and $e_i$ is the eigenfunction corresponding to $\lambda_i$. The iteration uses $P$ to rescale a (stochastic) gradient in $\mathcal{H}$,\r\n
$$f \leftarrow f - \eta \cdot P\left\{\frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i})\, k(x_{t_i}, \cdot)\right\} \quad (5)$$\r\n
Remark 2.2 (Data adaptive kernel for fast optimization). The EigenPro iteration for target function $y$ and kernel $k$ is equivalent to Richardson iteration / randomized (block) coordinate descent for the linear system $K_P \alpha = y_P$, where $y_P \triangleq (Pf^*(x_1), \ldots, Pf^*(x_n))^T$. Here $K_P$ is the kernel matrix corresponding to a data-dependent kernel $k_P$. When $n \to \infty$, it has the following expansion according to Mercer's theorem,\r\n
$$k_P(x, z) = \sum_{i=1}^q \lambda_{q+1}\, e_i(x) e_i(z) + \sum_{i=q+1}^\infty \lambda_i\, e_i(x) e_i(z) \quad (6)$$\r\n
For $n < \infty$, it is a modification of the original kernel $k$,\r\n
$$k_P(x, z) = P\{k(x, \cdot)\}(z) \approx k(x, z) - \sum_{i=1}^q (\lambda_i - \lambda_{q+1})\, e_i(x) e_i(z)$$\r\n
Remark 2.3 (Preconditioned linear system / gradient descent). $K_P \alpha = y_P$ is equivalent to the preconditioned linear system $PK\alpha = Py$, where $P$ is a left matrix preconditioner related to $P$. Accordingly, $P$ is the operator preconditioner for the preconditioned (stochastic) gradient descent (5).\r\n
Critical mini-batch size as effective parallelism. Theorem 4 in (Ma et al., 2017) shows that for the mini-batch iteration (2) with kernel $k$ there is a data-dependent batch size $m^*(k)$ such that\r\n
• Convergence per iteration improves linearly with increasing batch size $m$ for $m \le m^*(k)$ (using the optimal constant step size).\r\n
• Training with any batch size $m > m^*(k)$ leads to the same convergence per iteration as training with $m^*(k)$, up to a small constant factor.\r\n
We can calculate $m^*(k)$ explicitly using the kernel matrix $K$ (depending on the data),\r\n
$$m^*(k) = \frac{\beta(K)}{\lambda_1(K)} \quad \text{where } \beta(K) \triangleq \max_{i=1,\ldots,n} k(x_i, x_i)$$\r\n
For any shift invariant kernel $k$, after normalization, we have $\beta(K) = \max_{i=1,\ldots,n} k(x_i, x_i) \equiv 1$.\r\n
Abstraction for parallel computational resources. To construct a resource adaptive kernel, we consider the following abstraction for a given computational resource G:\r\n
• $C_G$: parallel capacity of G, i.e., the number of parallel operations that is required to fully utilize the computing capacity of G.\r\n
• $S_G$: internal resource memory of G.\r\n
To fully utilize G, one SGD / EigenPro iteration must execute at least $C_G$ operations using less than $S_G$ memory. In this paper, we primarily adapt kernels to GPU devices. For a GPU G, $S_G$ equals the size of its dedicated memory and $C_G$ is proportional to the number of computing cores (e.g., 3840 CUDA cores in Titan Xp). Note that for computational resources like clusters and supercomputers, we need to take into account additional factors such as network bandwidth.\r\n
3 MAIN ALGORITHM\r\n
Our main algorithm aims to reduce the training time by constructing a data/resource adaptive kernel for any given kernel function $k$ to fully utilize a computational resource G. Its detailed workflow (EigenPro 2.0) is presented in the diagram on the right. Specifically, we use the following steps:\r\n
Step 1. Calculate the resource-dependent mini-batch size $m_G^{max}$ to fully utilize resource G.\r\n
Step 2. Identify the parameters and construct a new kernel $k_G$ such that $m^*(k_G) = m_G^{max}$.\r\n
Step 3. Select the optimal step size and train using improved EigenPro (see Section 4).\r\n
Note that due to properties of the EigenPro iteration, training with this adaptive kernel converges to the same solution as the original kernel. To calculate $m_G^{max}$ for 100% resource utilization, we first estimate the operation parallelism and memory usage of one EigenPro iteration. The improved version of the EigenPro iteration (introduced in Section 4) makes the computation and memory overhead over standard SGD negligible (see Table 1). Thus we assume that EigenPro has the same complexity as standard SGD per iteration.\r\n
Cost of one EigenPro iteration with batch size $m$. We consider training data $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}^l$, $i = 1, \ldots, n$. Here each feature vector $x_i$ is $d$-dimensional, and each label $y_i$ is $l$-dimensional.\r\n
• Computational cost. It takes $(d + l) \cdot m \cdot n$ operations to perform one SGD iteration on $m$ points as in Iteration (2). These computations reduce to matrix multiplication and can be done in parallel.\r\n
• Space usage. It takes $d \cdot n$ memory to store the training data (as kernel centers) and $l \cdot n$ memory to maintain the model weights. Additionally we need to store an $m \cdot n$ kernel matrix for the prediction on the mini-batch. In total, we need $(d + l + m) \cdot n$ memory.\r\n
We can now calculate $m_G^{max}$ for the parallel computational resource G with parameters $C_G, S_G$ introduced in Section 2.\r\n
Step 1: Determining batch size $m_G^{max}$ for 100% resource utilization. We first define two mini-batch notations:\r\n
• $m_{C_G}$: batch size for fully utilizing the parallelism in G, such that $(d + l) \cdot m_{C_G} \cdot n \approx C_G$.\r\n
• $m_{S_G}$: batch size for maximum memory usage of G, such that $(d + l + m_{S_G}) \cdot n \approx S_G$.\r\n
To best utilize G without exceeding its memory, we set $m_G^{max} = \min\{m_{C_G}, m_{S_G}\}$. Note that in practice, it is more important to fully utilize the memory, so that $m_G^{max} \lesssim m_{S_G}$.\r\n
Step 2: Learning the kernel $k_G$ given $m_G^{max}$. Next, we show how to construct $k_G = k_{P_q}$ using the EigenPro iteration such that $m^*(k_G) = m_G^{max}$. The corresponding $q$ is defined as\r\n
$$q \triangleq \max\, \{i \in \mathbb{N} \text{ s.t. } m^*(k_{P_i}) \le m_G^{max}\} \quad (7)$$\r\n
To compute $q$, recall that $m^*(k_{P_q}) = \frac{\beta(K_{P_q})}{\lambda_1(K_{P_q})}$, where $K_{P_q}$ is the kernel matrix corresponding to the kernel function $k_{P_q}$. Using the definitions of $P_q$ and $\lambda_i$ in Section 2, we have\r\n
$$\lambda_1(K_{P_q}) = \lambda_{q+1}(K)$$\r\n
$$\beta(K_{P_q}) \approx \max_{i=1,\ldots,n} k_{P_q}(x_i, x_i) = \max_{i=1,\ldots,n} \left\{k(x_i, x_i) - \sum_{j=1}^q (\lambda_j - \lambda_{q+1})\, e_j(x_i)^2\right\}$$\r\n
In practice, $\beta(K_{P_q})$ can be accurately estimated using the maximum of $k_{P_q}(x, x)$ on a small number of subsamples. Similarly, we can estimate $\lambda_q(K)$ on a subsample kernel matrix. Knowing the approximate top eigenvalues of $K$ allows us to efficiently compute $m^*(k_{P_p})$ for each $p$, thus allowing us to choose $q$ from (7).\r\n
Step 3: Training with the adaptive kernel $k_G = k_{P_q}$. We use the learned kernel $k_G$ with improved EigenPro (Section 4). Its optimization parameters (batch and step size) are calculated as follows:\r\n
$$m = m_G^{max}, \quad \eta = \frac{m_G^{max}}{\beta(K_G)}$$\r\n
Claim (Acceleration). Using the adaptive kernel $k_G$ decreases the resource time required for training (assuming an idealized model of the GPU and workload) over the original kernel $k$ by a factor of\r\n
$$\text{acceleration of } k_G \text{ over } k = \frac{\beta(K)}{\beta(K_G)} \cdot \frac{m_G^{max}}{m^*(k)}$$\r\n
See Appendix C for the derivation and a discussion. We note that empirically $\beta(K_G) \approx \beta(K)$, while $\frac{m_G^{max}}{m^*(k)}$ is between 50 and 500, which is in line with the acceleration observed in practice.\r\n
Remark 3.1 (Choice of $q$). Note that it is not important to select $q$ exactly, according to Eq. 7. In fact, choosing $k_{P_p}$ for any $p > q$ allows for the same acceleration as $k_{P_q}$ as long as the mini-batch size is chosen to be $m_G^{max}$ and the step size is chosen accordingly. Thus, we can choose any value $p > q$ for our adaptive kernel $k_{P_p}$. However, choosing $p$ larger than $q$ incurs an additional computation cost as $p$ eigenvalues and eigenvectors of $K$ need to be approximated accurately. In particular, larger subsample sizes (see Section 4) may be needed for approximating the eigenvectors.\r\n
Algorithm 1 Improved EigenPro iteration (double coordinate block descent)\r\n
Input: kernel function $k(x, z)$, EigenPro parameter $q$, mini-batch size $m$, step size $\eta$, size of fixed coordinate block $s$\r\n
initialize the model parameter $\alpha = (\alpha_1, \ldots, \alpha_n)^T \leftarrow 0$\r\n
subsample coordinate indices $r_1, \ldots, r_s \in \{1, \ldots, n\}$ for constructing $P_q$, which form the fixed coordinate block $\alpha_r \triangleq (\alpha_{r_1}, \ldots, \alpha_{r_s})^T$\r\n
compute the top-$q$ eigenvalues $\Sigma \triangleq \mathrm{diag}(\tilde{\lambda}_1, \ldots, \tilde{\lambda}_q)$ and corresponding eigenvectors $V \triangleq (\tilde{e}_1, \ldots, \tilde{e}_q)$ of the subsample kernel matrix $K_s = [k(x_{r_i}, x_{r_j})]_{i,j=1}^s$\r\n
for $t = 1, \ldots$ do\r\n
1. sample a mini-batch $(x_{t_1}, y_{t_1}), \ldots, (x_{t_m}, y_{t_m})$\r\n
2. calculate predictions on the mini-batch: $f(x_{t_j}) = \sum_{i=1}^n \alpha_i k(x_i, x_{t_j})$ for $j = 1, \ldots, m$\r\n
3. update the sampled coordinate block corresponding to the mini-batch, $\alpha_t \triangleq (\alpha_{t_1}, \ldots, \alpha_{t_m})$: $\alpha_t \leftarrow \alpha_t - \eta \cdot \frac{2}{m}\, (f(x_{t_1}) - y_{t_1}, \ldots, f(x_{t_m}) - y_{t_m})^T$\r\n
4. evaluate the feature map $\Phi(\cdot)$ on the mini-batch features $x_{t_1}, \ldots, x_{t_m}$: $\Phi(x) \triangleq (k(x_{r_1}, x), \ldots, k(x_{r_s}, x))^T$\r\n
5. update the fixed coordinate block $\alpha_r$ to apply $P_q$: $\alpha_r \leftarrow \alpha_r + \eta \cdot \frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i}) \cdot V D V^T \Phi(x_{t_i})$, where $D \triangleq (1 - \lambda_{q+1} \cdot \Sigma^{-1})\, \Sigma^{-1}$\r\n
end for\r\n
4 IMPROVED EIGENPRO ITERATION USING NYSTRÖM EXTENSION\r\n
In this section, we present an improvement of the EigenPro iteration originally proposed in (Ma & Belkin, 2017). We significantly reduce the memory overhead of EigenPro over standard SGD and nearly eliminate the computational overhead per iteration. The improvement is based on an efficient representation of the preconditioner $P_q$ using the Nyström extension.\r\n
We start by recalling the EigenPro iteration in the RKHS and its preconditioner constructed from the top-$q$ eigensystem $\lambda_i, e_i$ of the kernel operator $\mathcal{K}$:\r\n
$$f \leftarrow f - \eta \cdot P_q\left\{\frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i})\, k(x_{t_i}, \cdot)\right\}, \quad \text{where } P_q(f) = f - \sum_{i=1}^q \left(1 - \frac{\lambda_{q+1}}{\lambda_i}\right) \langle e_i, f \rangle_{\mathcal{H}}\, e_i$$\r\n
The key to constructing the above iteration is to obtain an accurate and computationally efficient approximation of $\lambda_i, e_i$ such that $\mathcal{K} e_i \approx \lambda_i e_i$. The original EigenPro iteration learns an approximate $e_i$ of the form $\sum_{j=1}^n w_j k(x_j, \cdot)$. In contrast, our improved version of EigenPro uses only a small number of subsamples $x_{r_1}, \ldots, x_{r_s}$ to learn an $e_i$ of the form $\sum_{j=1}^s w_j k(x_{r_j}, \cdot)$. This compact representation ($s$ versus $n$) nearly eliminates the per-iteration overhead of EigenPro over SGD. Importantly, there is no associated accuracy reduction as this is the same subset used in the original EigenPro to approximate $P_q$.\r\n
Next, we show how to approximate $\lambda_i, e_i$. We first consider a related linear system for subsamples $x_{r_1}, \ldots, x_{r_s} \in \mathbb{R}^d$: $K_s \tilde{e} = \tilde{\lambda} \tilde{e}$, where $K_s \triangleq [k(x_{r_i}, x_{r_j})]_{i,j=1}^s$ is a subsample kernel matrix and $\tilde{\lambda}, \tilde{e}$ is its eigenvalue/eigenvector.\r\n
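The subsample idea can be sketched numerically. The snippet below is illustrative only (Gaussian kernel, synthetic data, kernel matrices normalized by their number of rows as a notational assumption): it checks that the subsample spectrum tracks the full spectrum, and extends a subsample eigenvector to all points through the kernel feature map, exactly reproducing it on the subsample.

```python
import numpy as np

# Nystrom sketch: estimate the top eigensystem of the full (normalized)
# kernel matrix from a small subsample kernel matrix K_s, then extend a
# subsample eigenvector to every point via the kernel feature map.

def gaussian_kernel(X, Z, bandwidth=3.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
n, s, q = 1000, 250, 5
X = rng.normal(size=(n, 10))
idx = rng.choice(n, size=s, replace=False)      # subsample indices

lam_full = np.linalg.eigvalsh(gaussian_kernel(X, X) / n)[::-1][:q]
Ks = gaussian_kernel(X[idx], X[idx]) / s         # s x s subsample kernel matrix
lam_s, E = np.linalg.eigh(Ks)
lam_sub = lam_s[::-1][:q]
# The subsample spectrum tracks the full spectrum, which is what makes the
# cheap preconditioner built from K_s accurate.

# Extension of the top subsample eigenvector e to any point x:
# e(x) = (1 / (s * lam)) * sum_j e_j * k(x, x_rj); this reproduces e
# exactly on the subsample and approximates the eigenvector elsewhere.
e1, l1 = E[:, -1], lam_s[-1]
e1_ext = gaussian_kernel(X, X[idx]) @ e1 / (s * l1)
```

The extension costs only an $n \times s$ kernel evaluation, which is the source of the $s$-versus-$n$ savings discussed above.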
This rank-$s$ linear system is in fact a discretization of $\mathcal{K} e_i = \lambda_i e_i$ in the RKHS. The two eigensystems, $\tilde{\lambda}_i, \tilde{e}_i$ and $\lambda_i, e_i$, are connected through the Nyström extension. Specifically, the Nyström extension of $\tilde{e}_i$ on subsamples $x_{r_1}, \ldots, x_{r_s}$ approximates $e_i$ as follows:\r\n
$$e_i(\cdot) \approx \frac{1}{s \lambda_i} \sum_{j=1}^s e_i(x_{r_j})\, k(x_{r_j}, \cdot)$$\r\n
Evaluating both sides on $x_{r_1}, \ldots, x_{r_s}$, we have\r\n
$$\lambda_i \approx \frac{\tilde{\lambda}_i}{s}, \quad e_i(\cdot) \approx \frac{\sqrt{s}}{\tilde{\lambda}_i}\, \tilde{e}_i^T \Phi(\cdot)$$\r\n
where $\Phi(\cdot) \triangleq (k(x_{r_1}, \cdot), \ldots, k(x_{r_s}, \cdot))^T$ is a kernel feature map. Thus we approximate the top-$q$ eigensystem of $\mathcal{K}$ using the top-$q$ eigensystem of $K_s$. These (low-rank) approximations further allow us to apply $P_q$ for an efficient EigenPro iteration on the mini-batch $(x_{t_1}, y_{t_1}), \ldots, (x_{t_m}, y_{t_m})$,\r\n
$$f \leftarrow f - \eta \cdot \frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i})\, k(x_{t_i}, \cdot) + \eta \cdot \frac{2}{m} \sum_{i=1}^m (f(x_{t_i}) - y_{t_i}) \cdot \Phi(x_{t_i})^T V D V^T \Phi(\cdot) \quad (8)$$\r\n
where $D \triangleq (1 - \lambda_{q+1} \cdot \Sigma^{-1})\, \Sigma^{-1}$, with $\Sigma \triangleq \mathrm{diag}(\tilde{\lambda}_1, \cdots, \tilde{\lambda}_q)$ and $V \triangleq (\tilde{e}_1, \cdots, \tilde{e}_q)$ the top-$q$ eigensystem of $K_s$.\r\n
Recalling that $f = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$, the above iteration can be executed by updating two coordinate blocks of the parameter vector $\alpha$, as in Algorithm 1.\r\n
Datasets. We reduce multiclass labels to multiple binary labels. For image datasets including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009), and SVHN (Netzer et al., 2011), color images are first transformed to grayscale images. We then rescale the range of each feature to [0,1]. For ImageNet (Deng et al., 2009), we use the top 500 PCA components of convolutional features extracted from Inception-ResNet-v2 (Szegedy et al., 2017). For TIMIT (Garofolo et al., 1993), we normalize each feature by z-score.\r\n
Choosing the size of the fixed coordinate block $s$. We choose $s$ according to the size of the training data, $n$. When $n \le 10^5$, we choose $s = 2 \cdot 10^3$; when $n > 10^5$, we choose $s = 1.2 \cdot 10^4$.\r\n
5.1 Comparison to state-of-the-art kernel methods\r\n
In Table 2, we compare our method to the state-of-the-art kernel methods on several large datasets. For all datasets, our method is significantly faster than other methods while still achieving better or similar results. Moreover, our method uses only a single GPU while many state-of-the-art kernel methods use much less accessible computing resources.\r\n
Among all the compared methods, FALKON (Rudi et al., 2017) and EigenPro (Ma & Belkin, 2017) stand out for their competitive performance and fast training on a single GPU.\r\n
Computation/memory per iteration. In Algorithm 1, the cost of each iteration relates to updating two coordinate blocks. Notably, Steps 2-3 are exactly standard SGD. Thus the overhead of our method comes from Steps 4-5. We compare our improved EigenPro to the original EigenPro and to standard SGD in Table 1. We see that the overhead of the original EigenPro (in bold) scales with the data size $n$.\r\n
In contrast, improved EigenPro depends only on the GPU. Notably, our method still achieves 5X-6X accelera-\r\n \ufb01xed coordinate block size s which is independent of n. tion over FALKON and 5X-14X acceleration over Eigen-\r\n Hence, when n becomes large, the overhead of our itera- Pro with mostly better accuracy. Importantly, our method\r\n tion becomesnegligible(bothincomputationandmemory) has the advantage of automatically inferring parameters for\r\n compared to the cost of SGD. optimization. In contrast, parameters related to optimiza-\r\n Computation Memory tion for FALKON and EigenPro need to be selected by\r\n s\u00b7mq s\u00b7q cross-validation.\r\n Improved EigenPro ss \u00b7\u00b7 mqmq + n \u00b7 m(d + l) ss \u00b7\u00b7 qq + n \u00b7 (m + d + l)\r\n n\u00b7mq n\u00b7q\r\n Original EigenPro nn \u00b7\u00b7 mqmq + n \u00b7 m(d + l) nn \u00b7\u00b7 qq + n \u00b7 (m + d + l)\r\n SGD n\u00b7m(d+l) n\u00b7(m+d+l) 5.2 Convergence comparison to SGD and EigenPro\r\n Table 1: Overhead over SGD is bolded. n: training data size, In Figure 2, we train three kernel machines with Eigen-\r\n m: batch size, d: feature dim., s: \ufb01xed coordinate block size, q: Pro 2.0, standard SGD and EigenPro (Ma & Belkin, 2017)\r\n EigenPro parameter, l: number of labels. for various batch sizes. The step sizes for SGD and Eigen-\r\n Proaretunedforbestperformance. ThestepsizeforEigen-\r\n To give a realistic example, for many of our experiments Pro 2.0 is computed automatically according to Section 3.\r\n 6 4\r\n n = 10 , while s is chosen to be 10 . We typically have Consistent with the schematic Figure 1 in the introduction,\r\n 3\r\n d,m of the same order of magnitude 10 , while q and l \u21e4\r\n 2 the original kernel k has a critical batch size m (k) of size\r\n around 10 . This results in overhead of EigenPro of less 4 and 6 respectively, which is too small to fully utilize the\r\n than 1% over SGD for both computation and memory. parallel computingcapacityoftheGPUdevice. 
Incontrast,\r\n our adaptive kernel kG has a much larger critical batch size\r\n \u21e4\r\n 5EXPERIMENTALEVALUATION m (kG) \u21e1 6500, which leads to maximum GPU utiliza-\r\n tion. We see that EigenPro 2.0 signi\ufb01cantly outperforms\r\n Computing resource. We run all experiments on a single original EigenPro due to better resource utilization and pa-\r\n workstation equipped with 128GB main memory, two Intel rameter selection, as well as lower overhead (see Table 1).\r\n Xeon(R) E5-2620 processors, and one Nvidia GTX Titan\r\n Xp(Pascal) GPU.\r\n Kernel machines that adapt to GPUs for effective large batch training\r\n Table 2: Comparison of EigenPro 2.0 and state-of-the-art kernel methods\r\n EigenPro 2.0 Results of Other Methods\r\n Dataset Size (use 1 GTX Titan Xp)\r\n error GPUtime resource time error reference\r\n 4.8 h on 0.70% EigenPro (Ma & Belkin, 2017)\r\n 6 1GTXTitanX\r\n MNIST 6.7 \u21e510 0.72% 19m 1.1 h on 0.72% PCG(Avronetal., 2016)\r\n 1344 AWSvCPUs\r\n less than 37.5 hours 0.85% (Lu et al., 2014)\r\n on1TeslaK20m\r\n \u2020 6 - 19.9% Inception-ResNet-v2 (Szegedy et al., 2017)\r\n ImageNet 1.3 \u21e510 20.6% 40m 4hon 20.7% FALKON(Rudietal.,2017)\r\n 1Tesla K40c\r\n 3.2 h on 31.7% EigenPro (Ma & Belkin, 2017)\r\n 1GTXTitanX\r\n 1.1 \u00b7 106 31.7% 24m 1.5 h on 32.3% FALKON(Rudietal.,2017)\r\n TIMIT\u2021 / 2 \u00b7 106 (3 epochs) 1Tesla K40c\r\n 512 IBM 33.5% Ensemble (Huang et al., 2014)\r\n Blue Gene/Q cores\r\n 32.1% 8m 7.5 h on 33.5% BCD(Tuetal.,2016)\r\n (1 epoch) 1024 AWSvCPUs\r\n multiple AWS 32.4% DNN(Mayetal.,2017)\r\n g2.2xlarge instances\r\n multiple AWS 30.9% SparseKernel (May et al., 2017)\r\n g2.2xlarge instances (use learned features)\r\n 6mon 19.8% EigenPro (Ma & Belkin, 2017)\r\n 6 1GTXTitanX\r\n SUSY 4\u00b710 19.7% 58s 4mon 19.6% FALKON(Rudietal.,2017)\r\n 1Tesla K40c\r\n 36mon \u21e120% Hierarchical (Chen et al., 2016)\r\n IBMPOWER8\r\n 
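To make the preconditioned iteration of Eq. (8) concrete, here is a minimal NumPy sketch of the idea on synthetic data. It is illustrative only, not the authors' TensorFlow implementation: for clarity it uses the exact top-q eigensystem of the full kernel matrix in place of the Nyström subsample approximation, and all names, sizes, and the step-size choice are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X, Z, bw=2.0):
    # Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 * bw^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw * bw))

# Toy kernel regression: f(x) = sum_i alpha_i k(x_i, x), fit by mini-batch SGD.
n, d, q, m = 300, 5, 10, 30
X = rng.normal(size=(n, d))
y = np.sin(X.sum(axis=1, keepdims=True))

K = gauss_kernel(X, X)
lam, E = np.linalg.eigh(K)                 # eigensystem of K (ascending)
lam, E = lam[::-1], E[:, ::-1]             # make it descending
lam_q = lam[q]                             # clip level (the (q+1)-th eigenvalue)
Dq = 1.0 - lam_q / lam[:q]                 # shrink factors for the top-q directions
Eq = E[:, :q]

# The preconditioner P = I - Eq diag(Dq) Eq^T clips the top-q spectrum to lam_q,
# so a step size ~1/lam_q (instead of ~1/lam_1) is stable.
eta = 1.0 / lam_q
alpha = np.zeros((n, 1))
mse0 = float(np.mean((K @ alpha - y) ** 2))
for _ in range(300):
    b = rng.choice(n, size=m, replace=False)
    r = K[b] @ alpha - y[b]                # mini-batch residuals f(x_t) - y_t
    step = np.zeros((n, 1))
    step[b] = r                            # standard SGD coordinate block
    step -= Eq @ (Dq[:, None] * (Eq[b].T @ r))  # preconditioner correction block
    alpha -= eta * (2.0 / m) * step
mse1 = float(np.mean((K @ alpha - y) ** 2))
```

In EigenPro 2.0 the correction block is instead supported on the $s$ fixed subsamples via the Nyström extension, which is what makes the per-iteration overhead independent of $n$.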
\u2020OurmethodusestheconvolutionalfeaturesfromInception-ResNet-v2andFalkonusestheconvolutionalfeaturesfromInception-v4.\r\n Both neural network models are presented in (Szegedy et al., 2017) and show nearly identical performance.\r\n \u2021 There are two sampling rates for TIMIT, which result in two training sets of different sizes.\r\n 5 4 5 4\r\n (a) MNIST (10 subsamples), stop when train mse < 1 \u00b7 10 (b) TIMIT (10 subsamples), stop when train mse < 2 \u00b7 10\r\n Figure 2: Time to converge with different batch sizes and optimal step sizes\r\n 5.3 Batch size and GPU utilization quired per iteration for a pure sequential machine would\r\n Thenumberofoperationrequiredforoneiteration of SGD scale linearly with batch size. On the other hand an ideal\r\n is linear in the batch size. Thus we expect that time re- parallel device with no overhead requires the same amount\r\n of time to process any mini-batch. In Figure 3a, we show\r\n Kernel machines that adapt to GPUs for effective large batch training\r\n (a) Time per training iteration of different batch sizes on actual (b) Time per training epoch on GPU with different sizes of train\r\n 5\r\n and ideal devices (TIMIT, n =10, d =440) set (n, which is also the model size) and batch that \ufb01t into the\r\n GPUmemory\r\n Figure 3: Time per iteration / epoch for training with different batch sizes\r\n howthetraining time per iteration for actual GPU depends close to real time. One of the advantages of our approach is\r\n on the batch size. We see that for small batch sizes time the combination of its speed on small and medium datasets\r\n per iteration is nearly constant, like that of an ideal parallel using standard hardware together with the automatic opti-\r\n device, and start to increase for larger batches. 
mization parameter selection.\r\n 4 5\r\n Note that in addition to time per iteration we need to con- Wedemonstratethisonseveralsmallerdatasets(10 \u21e0 10\r\n sider the overhead associated to each iteration. Larger points) using a Titan Xp GPU (see Table 3). We see that in\r\n batch sizes incur less overhead per epoch. This phe- every case training takes no more than 15 seconds, mak-\r\n nomenon is known in the systems literature as Amdahl\u2019s ing multiple runs for parameter and feature selection easily\r\n law (Rodgers, 1985). In Figure 3b we show GPU time per feasible.\r\n epoch for different model (training set) size (n). We see For comparison, we also provide timings for LibSVM,\r\n consistent speed-ups by increasing mini-batch size across a popular and widely used kernel library (Chang & Lin,\r\n model sizes up to maximum GPU utilization. 2011) and ThunderSVM (Wen et al., 2018), a fast GPU\r\n 5.4 \u201cInteractive\u201d training for exploratory machine implementation for LibSVM. We show the results for Lib-\r\n SVM4 and ThunderSVM using the same kernel with the\r\n learning same parameter. We stopped iteration of our method when\r\n the accuracy on test exceeded that of LibSVM, which our\r\n Dataset Size Feature EigenPro ThunderSVM LibSVM method was able to achieve on every dataset. While not\r\n (GPU) (GPU) (CPU) intended as a comprehensive evaluation, the bene\ufb01ts of our\r\n 5\r\n TIMIT 1\u00b710 440 15s 480s 1.6 h 5\r\n 4 method for typical data analysis tasks are evident. Fast\r\n SVHN 7\u00b710 1024 13s 142s 3.8 h\r\n 4\r\n MNIST 6\u00b710 784 6s 31s 9m training along with the \u201cworry-free\u201d optimization create\r\n 4\r\n CIFAR-10 5\u00b710 1024 8s 121s 3.4 h an \u201cinteractive/responsive\u201d environment for using kernel\r\n Table 3: Comparing training time of kernel machines methods in machine learning. 
Furthermore, the choice of\r\n kernel (e.g., Laplacian or Gaussian) and its single band-\r\n widthparameterisusuallyfarsimplerthanthemultiplepa-\r\n Most practical tasks of machine learning require multiple rameters involved in the selection of architecture in neural\r\n trainingrunsforparameterandfeatureselection,evaluating networks.\r\n appropriateness of data or features to a given task, and var- 4Weusethesvmpackageinscikit-learn 0.19.0.\r\n ious other exploratory purposes. While using hours, days 5Our algorithm is still much faster than LibSVM when run-\r\n or even months of machine time may be necessary to im- ning on CPU. For example, training on datasets shown in Table 3\r\n prove on the state of the art in large-scale certain problems, takes between one and three minutes.\r\n it is too time-consuming and expensive for most data anal-\r\n ysis work. Thus, it is very desirable to train classi\ufb01ers in\r\n Kernel machines that adapt to GPUs for effective large batch training\r\n 5.5 Practical Techniques for Accelerating Inference While our paper deals with kernel machines, similar ideas\r\n Wewould like to point out two simple and practical tech- are applicable to a much broader class of learning architec-\r\n niques to accelerate and simplify kernel training. The use tures including deep neural networks.\r\n of the Laplacian kernel is not common in the literature and The algorithms developed in this paper allow for very fast\r\n in our opinion deserves more attention. While PCA is fre- kernel learning on smaller datasets and easy scaling to sev-\r\n quently used to speed up training (and sometimes to im- eral million data points using a modern GPU. It is likely\r\n prove the test results), it is useful to state the technique ex- that moreeffective memorymanagementtogetherwithbet-\r\n 7\r\n plicitly. ter hardware wouldallowscalingupto10 datapointswith\r\n reasonable training time. Going beyond that to 108 or more\r\n Choice of kernel function. 
In many cases Laplace (expo- data points using multi-GPU setups is the next natural step\r\n x z\r\n kxxzzk\r\n x z \r\n nential) kernel k(xx,zz)=e producesresults compa- for kernel methods.\r\n rable or better than those for the more standard Gaussian\r\n kernel. Moreover the Laplacian kernel has several practi- ACKNOWLEDGEMENTS\r\n cal advantages over the Gaussian (consistent with the \ufb01nd-\r\n ings reported in (Belkin et al., 2018)). (1) Laplacian gen- We thank Raef Bassily for discussions and helpful com-\r\n erally requires fewer epochs for training to obtain the same ments and Alex Lee for running ThunderSVM compar-\r\n \u21e4\r\n quality result. (2) The batch value m is typically larger isons. We thank Lorenzo Rosasco and Luigi Carratino for\r\n for the Laplacian kernel allowing for more effective paral- sharing preprocessed ImageNet features. We used a Titan\r\n lelization. (3) Test performance for the Laplacian kernel is Xp GPU provided by Nvidia. We acknowledge \ufb01nancial\r\n empirically more robust to the bandwidth parameter , sig- support from NSF.\r\n ni\ufb01cantly reducing the need for careful parameter tuning to\r\n achieve optimal performance. REFERENCES\r\n Dimensionality reduction by PCA. Recall that the pri- Aronszajn,N. Theoryofreproducingkernels. Transactions\r\n mary cost of one EigenPro iteration is n \u00b7 md for the num- of the American mathematical society, 68(3):337\u2013404,\r\n ber of operations and n \u00b7 (m + d) for memory where d is 1950.\r\n the number of features. Thus reducing the dimension of\r\n the features results in signi\ufb01cant computational savings. It Avron, H., Clarkson, K., and Woodruff, D. Faster ker-\r\n is often possible to signi\ufb01cantly reduce dimensionality of nel ridge regression using sketching and precondition-\r\n the data without perceptibly changing classi\ufb01cation (or re- ing. 
arXiv preprint arXiv:1611.03220, 2016.\r\n gression) accuracy by applying the Principal Components Belkin, M., Ma, S., and Mandal, S. To understand deep\r\n Analysis (PCA). For example, using PCA to reduce the learning we need to understand kernel learning. arXiv\r\n feature dimensionality from 1536 to 500 for ImageNet de- preprint arXiv:1802.01396, 2018.\r\n creases the accuracy by less than 0.2%. Chang, C.-C. and Lin, C.-J. Libsvm: a library for support\r\n 6CONCLUSIONANDFUTURE vector machines. ACM transactions on intelligent sys-\r\n tems and technology (TIST), 2(3):27, 2011.\r\n DIRECTIONS Chen,J.,Avron,H.,andSindhwani,V. Hierarchicallycom-\r\n Best practices for training modern large-scale models are positional kernels for scalable nonparametric learning.\r\n concerned with linear scaling. Most of the work is based arXiv preprint arXiv:1608.00860, 2016.\r\n on an implicit but widely held assumption that the limit Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and\r\n of linear scaling, m\u21e4, cannot be controlled or changed in Fei-Fei, L. Imagenet: A large-scale hierarchical image\r\n practice. In contrast, this paper shows that the limit of lin- database. In Computer Vision and Pattern Recognition,\r\n ear scaling can be analytically and automatically adapted 2009. CVPR 2009. IEEE Conference on, pp. 248\u2013255.\r\n to a given computing resource. This \ufb01nding adds a new di- IEEE, 2009.\r\n mension for potential improvements in training large-scale Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G.,\r\n models. and Pallett, D. S. Darpa timit acoustic-phonetic conti-\r\n The main technical contribution of this paper is a new nous speech corpus cd-rom. NIST speech disc, 1-1.1,\r\n learning framework(EigenPro2.0)thatextendslinearscal- 1993.\r\n ing to match the parallel capacity of a computational re- Golmant, N., Vemuri, N., Yao, Z., Feinberg, V., Gho-\r\n source. 
The framework is based on extracting limited lami, A., Rothauge, K., Mahoney, M. W., and Gonza-\r\n second order information to modify the optimization pro- lez, J. On the computational inef\ufb01ciency of large batch\r\n cedure without changing the learned predictor function. sizes for stochastic gradient descent. arXiv preprint\r\n arXiv:1811.12941, 2018.\r\n Kernel machines that adapt to GPUs for effective large batch training\r\n \u00b4\r\n Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., timal large scale kernel method. In Advances in Neural\r\n Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, Information Processing Systems, pp. 3891\u20133901, 2017.\r\n K. Accurate, large minibatch sgd: Training imagenet in Smith, S. L., Kindermans, P.-J., and Le, Q. V. Don\u2019t decay\r\n 1 hour. arXiv preprint arXiv:1706.02677, 2017. the learning rate, increase the batch size. arXiv preprint\r\n Huang, P.-S., Avron, H., Sainath, T. N., Sindhwani, V., and arXiv:1711.00489, 2017.\r\n Ramabhadran, B. Kernel methods match deep neural Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A.\r\n networksontimit. InICASSP,pp.205\u2013209.IEEE,2014. Inception-v4, inception-resnet and the impact of residual\r\n Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., connections on learning. In AAAI, volume 4, pp. 12,\r\n Xie, L., Guo, Z., Yang, Y., Yu, L., et al. Highly scal- 2017.\r\n abledeeplearningtrainingsystemwithmixed-precision: \u00b4 \u00b4\r\n Takac, M., Bijral, A. S., Richtarik, P., and Srebro, N. Mini-\r\n Training imagenet in four minutes. arXiv preprint batch primal and dual methods for SVMs. In ICML (3),\r\n arXiv:1807.11205, 2018. pp. 1022\u20131030, 2013.\r\n Krizhevsky, A. One weird trick for parallelizing convolu- Tu, S., Roelofs, R., Venkataraman, S., and Recht, B. Large\r\n tional neural networks. arXiv preprint arXiv:1404.5997, scale kernel learning using block coordinate descent.\r\n 2014. arXiv preprint arXiv:1602.05310, 2016.\r\n Krizhevsky, A. 
and Hinton, G. Learning multiple layers of Wen, Z., Shi, J., Li, Q., He, B., and Chen, J. Thunder-\r\n features from tiny images. Master\u2019s thesis, University of svm: a fast svm library on gpus and cpus. The Journal\r\n Toronto, 2009. of Machine Learning Research (JMLR), 19(1):797\u2013801,\r\n LeCun,Y.,Bottou,L.,Bengio,Y.,andHaffner,P. Gradient- 2018.\r\n based learning applied to document recognition. In Pro- Yao, Y., Rosasco, L., and Caponnetto, A. On early stop-\r\n ceedings of the IEEE, pp. 2278\u20132324, 1998. ping in gradient descent learning. Constructive Approx-\r\n Leventhal, D. and Lewis, A. S. Randomized methods for imation, 26(2):289\u2013315, 2007.\r\n linear constraints: convergence rates and conditioning. You, Y., Gitman, I., and Ginsburg, B. Large batch\r\n Mathematics of Operations Research, 35(3):641\u2013654, training of convolutional networks. arXiv preprint\r\n 2010. arXiv:1708.03888, 2017.\r\n Lu, Z., May, A., Liu, K., Garakani, A. B., Guo, D., Bellet, Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals,\r\n A., Fan, L., Collins, M., Kingsbury, B., Picheny, M., and O. Understandingdeeplearningrequiresrethinkinggen-\r\n Sha, F. How to scale up kernel methods to be as good as eralization. arXiv preprint arXiv:1611.03530, 2016.\r\n deepneuralnets. arXiv preprint arXiv:1411.4000, 2014.\r\n Ma,S.andBelkin, M. Diving into the shallows: a compu-\r\n tational perspective on large-scale shallow learning. In\r\n AdvancesinNeuralInformationProcessingSystems,pp.\r\n 3781\u20133790, 2017.\r\n Ma, S., Bassily, R., and Belkin, M. The power of\r\n interpolation: Understanding the effectiveness of sgd\r\n in modern over-parametrized learning. arXiv preprint\r\n arXiv:1712.06559, 2017.\r\n May, A., Garakani, A. 
B., Lu, Z., Guo, D., Liu, K., Bellet,\r\n A., Fan, L., Collins, M., Hsu, D., Kingsbury, B., et al.\r\n Kernel approximation methods for speech recognition.\r\n arXiv preprint arXiv:1701.03577, 2017.\r\n McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D.\r\n An empirical model of large-batch training. arXiv\r\n preprint arXiv:1812.06162, 2018.\r\n Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and\r\n Ng, A. Reading digits in natural images with unsuper-\r\n vised feature learning. In NIPS workshop, volume 2011,\r\n pp. 4, 2011.\r\n Rodgers, D. P. Improvements in multiprocessor system de-\r\n sign. In SIGARCH, 1985.\r\n Rudi, A., Carratino, L., and Rosasco, L. Falkon: An op-\r\n Kernel machines that adapt to GPUs for effective large batch training\r\n Appendices When step size \u2318 is chosen optimally, we can apply The-\r\n orem 4 in (Ma et al., 2017) to bound its convergence per\r\n iteration toward the optimal (interpolating) solution f\u21e4 as\r\n ADATASETS follows:\r\n h \u21e4 2 i \u21e4 h \u21e4 2 i\r\n Wereduce multiclass labels to multiple binary labels. For E kft f k \uf8ffgK(m)\u00b7E kft1f k\r\n K K\r\n image datasets including MNIST (LeCun et al., 1998), Hereg\u21e4 (m)isakernel-dependentupperboundonthecon-\r\n CIFAR-10(Krizhevsky&Hinton,2009),andSVHN(Net- K\r\n zer et al., 2011), color images are \ufb01rst transformed to vergence rate.\r\n grayscale images. We then rescale the range of each fea- The fastest (up to a small constant factor) convergence\r\n ture to [0,1]. For ImageNet (Deng et al., 2009), we use rate per iteration is obtained when using mini-batch size\r\n the top 800 PCA components of some convolutional fea- \u21e4 (K)\r\n m (K)= (or larger). Kernels used in prac-\r\n tures extracted from Inception-ResNet-v2 (Szegedy et al., 1(K)n(K)\r\n 2017). For TIMIT (Garofolo et al., 1993), we normalize tice, such as Gaussian kernels, have rapid eigendecay (Ma\r\n each feature by z-score. &Belkin, 2017), i.e., 1(K) n(K). 
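The critical batch-size formula above can be checked numerically on a small synthetic sample. This is a sketch under our own toy setup (the sizes, bandwidth, and variable names are ours); for a Gaussian kernel $k(x, x) = 1$, so $\beta(K) = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, bw = 300, 10, 3.0
X = rng.normal(size=(n, d))

# Normalized kernel matrix K_ij = k(x_i, x_j) / n for a Gaussian kernel.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * bw * bw)) / n

lam = np.linalg.eigvalsh(K)               # eigenvalues, ascending
lam_1, lam_n = lam[-1], max(lam[0], 0.0)  # guard tiny negative round-off
beta = np.max(np.diag(K)) * n             # beta(K) = max_i k(x_i, x_i) = 1 here

m_star = beta / (lam_1 - lam_n)           # critical batch size m*(K)
# With rapid eigendecay (lam_1 >> lam_n) this is essentially beta / lam_1.
```

The rapid-eigendecay simplification is exactly what the next step of the derivation uses.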
Hence we have

$$m^*(k) \approx \frac{\beta(K)}{\lambda_1(K)}$$

Thus we can write an accurate approximation of the convergence rate $g_K^*(m^*(K))$ as follows:

$$\epsilon_K^* \triangleq g_K^*(m^*(K)) = 1 - \frac{m^*(K)\,\lambda_n(K)}{\beta(K) + (m-1)\,\lambda_1(K)} \approx 1 - \frac{\lambda_n(K)}{\lambda_1(K)} \cdot \frac{1}{1 + (m-1)\,\frac{\lambda_n(K)}{\beta(K)}}$$

We now observe that $\beta(K) = \max_{i=1,\ldots,n} k(x_i, x_i) \ge \mathrm{tr}(K)$. Hence for mini-batch sizes $m$ much smaller than $n$ we have

$$(m-1)\,\frac{\lambda_n(K)}{\beta(K)} \le (m-1)\,\frac{\lambda_n(K)}{\mathrm{tr}(K)} \le \frac{m-1}{n} \ll 1$$

That allows us to write

$$\epsilon_K^* \approx 1 - \frac{\lambda_n(K)}{\lambda_1(K)}$$

B SELECTION OF KERNEL AND ITS BANDWIDTH

We use the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and the Laplace kernel $k(x, y) = \exp(-\|x - y\|/\sigma)$ in our experiments. Note that the kernel bandwidth σ is selected through cross-validation on a small subsampled dataset. In Table 4, we report the kernel and the bandwidth selected for each dataset to achieve the best performance. We also report the parameters that are calculated automatically using our method. Note that in practice we choose a value q (in parentheses) that is larger than the q corresponding to $m_G$. Increasing q appears to lead to faster convergence. We use a simple heuristic to automatically obtain such a q based on the eigenvalues and the size of the fixed coordinate block.⁶

⁶ For SUSY we directly specify a large q for optimal performance.

Table 4: Selected kernel bandwidth and corresponding optimization parameters

Dataset    Size of (Subsampled)   Kernel      Bandwidth   Train    Calculated Parameters
           Train Set                                      epochs   q (adjusted q)   m = m_G   η
MNIST      1·10^6                 Gaussian    5           4        93 (330)         735       379
TIMIT      1.1·10^6               Laplacian   15          3        52 (128)         682       343
ImageNet   1.3·10^6               Gaussian    16          1        2 (321)          294       149
SUSY       6·10^5                 Gaussian    4           1        106 (850)        1687      849

C ANALYSIS OF ACCELERATION

Claim (Acceleration). Using the adaptive kernel $k_G$ decreases the resource time required for training over the original kernel $k$ by a factor of

$$a \approx \frac{\beta(K)}{\beta(K_G)} \cdot \frac{m_G^{max}}{m^*(k)}$$

We will now give a derivation of this acceleration factor $a$, based on the analysis of SGD in the interpolating setting in (Ma et al., 2017). As before, let $(x_1, y_1), \ldots, (x_n, y_n)$ be the data, and let $K$ be the corresponding (normalized) kernel matrix $K_{ij} = k(x_i, x_j)/n$. We start by recalling the SGD iteration in the kernel setting for a mini-batch of size $m$, $(x_{t_1}, y_{t_1}), \ldots, (x_{t_m}, y_{t_m})$:

$$f \leftarrow f - \eta \cdot \frac{2}{m} \sum_{i=1}^{m} \bigl(f(x_{t_i}) - y_{t_i}\bigr)\, k(x_{t_i}, \cdot)$$

We will now apply the formula for $\epsilon_K^*$ above to the adaptive kernel $k_G$. Recall that its corresponding kernel matrix $K_G$ modifies the top-q eigenspectrum of $K$ such that

$$\lambda_i(K_G) = \begin{cases} \lambda_q(K) & \text{if } i \le q \\ \lambda_i(K) & \text{if } i > q \end{cases}$$

Thus the convergence rate for $k_G$ is

$$\epsilon_{K_G}^* \approx 1 - \frac{\lambda_n(K_G)}{\lambda_1(K_G)} = 1 - \frac{\lambda_n(K)}{\lambda_q(K)}$$

Next, we compare the number of iterations needed to converge to error ε using the original kernel $k$ and the adaptive kernel $k_G$. First, we see that for kernel $k$ it takes $t = \log\epsilon / \log\epsilon_K^*$ iterations to go below error ε, in the sense that

$$\mathbb{E}\bigl[\|f_t - f^*\|_K^2\bigr] \le \epsilon \cdot \mathbb{E}\bigl[\|f_0 - f^*\|_K^2\bigr]$$

Notice that $\lambda_n(K) \le \mathrm{tr}(K)/n = 1/n$ for a normalized kernel matrix $K$.
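Before completing the calculation, the iteration-count comparison can be illustrated on a synthetic spectrum. This is a toy numerical sketch (the decaying spectrum and all names are our own); the clipping mirrors the definition of $\lambda_i(K_G)$ above:

```python
import numpy as np

lam = 0.9 ** np.arange(50)            # synthetic rapidly decaying spectrum, lam_1 = 1
q = 10
lam_G = np.minimum(lam, lam[q - 1])   # lam_i(K_G): top-q eigenvalues clipped to lam_q

eps = 1e-3
# Iterations to reach error eps: log(eps) / log(eps*), with eps* ~ 1 - lam_n / lam_1.
iters_k = np.log(eps) / np.log(1 - lam[-1] / lam[0])
iters_kG = np.log(eps) / np.log(1 - lam_G[-1] / lam_G[0])

speedup = iters_k / iters_kG          # roughly lam_1(K) / lam_q(K)
```

The ratio of the two iteration counts comes out close to $\lambda_1(K)/\lambda_q(K)$, which is the factor derived next.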
Thus for large $n$, we have

$$\frac{\log\epsilon}{\log\epsilon_K^*} = \frac{\log\epsilon}{\log\bigl(1 - \lambda_n(K)/\lambda_1(K)\bigr)} \approx |\log\epsilon| \cdot \frac{\lambda_1(K)}{\lambda_n(K)}$$

In other words, the number of iterations needed to converge with kernel $k$ is proportional to $\lambda_1(K)/\lambda_n(K)$. By the same token, to achieve accuracy ε, the adaptive kernel $k_G$ needs $\log\epsilon / \log\epsilon_{K_G}^* \approx |\log\epsilon| \cdot \lambda_q(K)/\lambda_n(K)$ iterations. Therefore, to achieve accuracy ε, training with the adaptive kernel $k_G$ needs $\lambda_q(K)/\lambda_1(K)$ times as many iterations as training with the original kernel $k$.

To unpack the meaning of the ratio $\lambda_q(K)/\lambda_1(K)$, we rewrite it as

$$\frac{\lambda_q(K)}{\lambda_1(K)} = \frac{\lambda_1(K_G)}{\lambda_1(K)} = \frac{\beta(K_G)}{\beta(K)} \cdot \frac{m^*(K)}{m^*(K_G)} = \frac{\beta(K_G)}{\beta(K)} \cdot \frac{m^*(K)}{m_G^{max}}$$

Recall that by the assumptions made in the paper, (1) any iteration for kernel $K_G$ with mini-batch size $m \le m_G^{max}$ requires the same amount of resource time to complete on $G$, and (2) iterations of kernels $K$ and $K_G$ require the same resource time for any such $m$ (negligible overhead). Since $m^*(K) \le m^*(K_G) \approx m_G^{max}$, we see that one iteration of batch size $m^*(K)$ and one iteration of batch size $m^*(K_G)$ take the same amount of time for either kernel. We thus conclude that the adaptive kernel accelerates over the original kernel by a factor of approximately

$$a \approx \frac{\beta(K)}{\beta(K_G)} \cdot \frac{m_G^{max}}{m^*(K)}$$

Remark. Notice that our analysis is based on using upper bounds for convergence. While these bounds are tight ((Ma et al., 2017), Theorem 3), there is no guarantee of tightness for the specific data and choice of kernel used in practice. Remarkably, the values of the parameters obtained by using these bounds work very well in practice. Moreover, the acceleration predicted theoretically closely matches the acceleration observed in practice.

D ARTIFACT APPENDIX

D.1 Abstract

This artifact contains the TensorFlow implementation of EigenPro 2.0 (from github.com/EigenPro/EigenPro2) and a Python script for running examples using public datasets. It can validate the functionality of our method and support the results in Table 2 of our SysML'2019 paper: Learning kernels that adapt to GPUs.

D.2 Artifact check-list (meta-information)

• Algorithm: EigenPro iteration.
• Program: Python code.
• Data set: Publicly available image datasets.
• Run-time environment: Ubuntu 16.04 with CUDA (≥ 8.0) and the GPU Computing SDK installed.
• Hardware: Any GPU with compute capability ≥ 3.0 (tested GPU: Nvidia Titan Xp (Pascal)).
• Publicly available?: Yes.
• Code licenses (if publicly available)?: MIT License.
• Archived (provide DOI)?: doi.org/10.5281/zenodo.2574996

D.3 Description

D.3.1 How delivered

EigenPro 2.0 is an open source library under the MIT license and is hosted with code, API specifications, usage instructions, and design documentation on Github.

D.3.2 Hardware dependencies

EigenPro 2.0 requires an NVIDIA GPU with compute capability ≥ 3.0.

D.3.3 Software dependencies

EigenPro 2.0 requires CUDA (≥ 8.0), TensorFlow (≥ 1.2.1), and Keras (tested version: 2.0.8). EigenPro 2.0 has been tested on Ubuntu 16.04 and Windows 10.

D.3.4 Datasets

All datasets are publicly available. The MNIST dataset used in this artifact will be automatically downloaded and preprocessed by the included Python script. Users can also download the dataset directly from yann.lecun.com/exdb/mnist.

D.4 Installation

The Python-based EigenPro 2.0 can be used directly out of the package.

D.5 Experiment workflow

Below are the steps to download the code and run the experiments.

• Download the code from Github.

$ git clone \
> https://github.com/EigenPro/EigenPro2.git
$ cd EigenPro2

• Run the test code. Note that the value of mem_gb needs to be set to the size of the available GPU memory.

$ python run_mnist.py --kernel=Gaussian \
> --s=5 --mem_gb=12 --epochs 1 2 3 4 5

D.6 Evaluation and expected result

The expected results include the automatically calculated hyperparameters for optimization and the runtime, as well as the classification error (in %) and mean squared error (l2) for both the training set and the test set (val in the result).