{"title": "MotherNets: Rapid Deep Ensemble Learning", "book": "Proceedings of Machine Learning and Systems", "page_first": 199, "page_last": 215, "abstract": "Ensembles of deep neural networks significantly improve generalization accuracy. However, training neural network ensembles requires a large amount of computational resources and time. State-of-the-art approaches either train all networks from scratch leading to prohibitive training cost or generate ensembles by training a monolithic architecture resulting in lower diversity and accuracy. We propose MotherNets to address these shortcomings: A MotherNet captures the structural similarity across different members of a deep neural network ensemble. To train an ensemble, we first train a single or a small set of MotherNets and subsequently, their function is transferred to all members of the ensemble. Then, we continue to train the ensemble networks, which converge significantly faster compared to training from scratch. MotherNets can handle ensembles with diverse architectures by clustering ensemble networks of similar architecture and training a separate MotherNet for every cluster. MotherNets also use clustering to balance the accuracy vs. training cost tradeoff. We show that compared to state-of-the-art approaches such as Snapshot ensembles, knowledge distillation, and TreeNets, MotherNets can achieve better accuracy given the same time budget or alternatively that MotherNets can achieve the same accuracy as state-of-the-art approaches at a fraction of the training time.  Overall, we demonstrate that MotherNets bring not only performance and accuracy improvements but a new powerful way to balance the training cost vs. accuracy tradeoff and we verify these benefits over numerous state-of-the-art neural network architectures. ", "full_text": "                                          MOTHERNETS: RAPID DEEP ENSEMBLE LEARNING\r\n                                         AbdulWasay1 BrianHentschel1 YuzeLiao1 SanyuanChen1 StratosIdreos1\r\n                                                                                      ABSTRACT\r\n                        Ensembles of deep neural networks signi\ufb01cantly improve generalization accuracy. However, training neural\r\n                        network ensembles requires a large amount of computational resources and time. State-of-the-art approaches\r\n                        either train all networks from scratch leading to prohibitive training cost that allows only very small ensemble sizes\r\n                        in practice, or generate ensembles by training a monolithic architecture, which results in lower model diversity\r\n                        and decreased prediction accuracy. We propose MotherNets to enable higher accuracy and practical training cost\r\n                        for large and diverse neural network ensembles: A MotherNet captures the structural similarity across some or\r\n                        all members of a deep neural network ensemble which allows us to share data movement and computation costs\r\n                        across these networks. We \ufb01rst train a single or a small set of MotherNets and, subsequently, we generate the\r\n                        target ensemble networks by transferring the function from the trained MotherNet(s). Then, we continue to train\r\n                        these ensemble networks, which now converge drastically faster compared to training from scratch. 
MotherNets handle ensembles with diverse architectures by clustering ensemble networks of similar architecture and training a separate MotherNet for every cluster. MotherNets also use clustering to control the accuracy vs. training cost tradeoff. We show that compared to state-of-the-art approaches such as Snapshot Ensembles, Knowledge Distillation, and TreeNets, MotherNets provide a new Pareto frontier for the accuracy-training cost tradeoff. Crucially, training cost and accuracy improvements continue to scale as we increase the ensemble size (2 to 3 percent reduced absolute test error rate and up to 35 percent faster training compared to Snapshot Ensembles). We verify these benefits over numerous neural network architectures and large data sets.

1 INTRODUCTION

Neural network ensembles. Various applications increasingly use ensembles of multiple neural networks to scale the representational power of their deep learning pipelines. For example, deep neural network ensembles predict relationships between chemical structure and reactivity (Agrafiotis et al., 2002), segment complex images with multiple objects (Ju et al., 2017), and are used in zero-shot as well as multiple choice learning (Guzman-Rivera et al., 2014; Ye & Guo, 2017). Further, several winners and top performers on the ImageNet challenge are ensembles of neural networks (Lee et al., 2015a; Russakovsky et al., 2015).

Figure 1. MotherNets establish a new Pareto frontier for the accuracy-training time tradeoff as well as navigate this tradeoff. (Plot: test error rate vs. training time; baselines shown are Snapshot Ensembles, Bagging, Knowledge Distillation, TreeNets, and Independent (full data) training; the MotherNets curve spans g = 1 to g = ensemble size, and the number of clusters g navigates this tradeoff.)
Ensembles function as collections of experts and have been shown, both theoretically and empirically, to improve generalization accuracy (Dietterich, 2000; Drucker et al., 1993; Granitto et al., 2005; Huggins et al., 2016; Ju et al., 2017; Lee et al., 2015a; Russakovsky et al., 2015; Xu et al., 2014). For instance, by combining several image classification networks on the CIFAR-10, CIFAR-100, and SVHN data sets, ensembles can reduce the misclassification rate by up to 20 percent, e.g., from 6 percent to 4.5 percent for ensembles of ResNets on CIFAR-10 (Huang et al., 2017a; Ju et al., 2017).

The growing training cost. Training ensembles of multiple deep neural networks takes a prohibitively large amount of time and computational resources. Even on high-performance hardware, a single deep neural network may take several days to train and this training cost grows linearly with the size of the ensemble as every neural network in the ensemble needs to be trained (Szegedy et al., 2015; He et al., 2016; Huang et al., 2017b;a). This problem persists even in the presence of multiple machines. This is because the holistic cost of training, in terms of buying or renting out these machines through a cloud service provider, still increases linearly with the ensemble size. The rising training cost is a bottleneck for numerous applications, especially when it is critical to quickly incorporate new data and to achieve a target accuracy. For instance, in one use case, where deep learning models are applied to detect Diabetic Retinopathy (a leading cause of blindness), newly labelled images become available every day. Thus, incorporating new data in the neural network models as quickly as possible is crucial in order to enable more accurate diagnosis for the immediately next patient (Gulshan et al., 2016).

¹Harvard School of Engineering and Applied Sciences. Correspondence to: Abdul Wasay <awasay@seas.harvard.edu>.
Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

Table 1. Existing approaches to train ensembles of deep neural networks are limited in speed, accuracy, diversity, and size.

                    Fast train.   High acc.   Diverse arch.   Large size
  Full data             ✗             ✓             ✓             ✗
  Bagging               ∼             ✗             ✓             ✗
  Knowledge Dist.       ∼             ✗             ✓             ✗
  TreeNets              ∼             ∼             ✗             ✗
  Snapshot Ens.         ✓             ✓             ✗             ✗
  MotherNets            ✓             ✓             ✓             ✓

Problem 1: Restrictive ensemble size. Due to this prohibitive training cost, researchers and practitioners can only feasibly train and employ small ensembles (Szegedy et al., 2015; He et al., 2016; Huang et al., 2017b;a). In particular, neural network ensembles contain drastically fewer individual models when compared with ensembles of other machine learning methods. For instance, random decision forests, a popular ensemble algorithm, often have several hundred individual models (decision trees), whereas state-of-the-art ensembles of deep neural networks consist of around five networks (He et al., 2016; Huang et al., 2017b;a; Oshiro et al., 2012; Szegedy et al., 2015). This is restrictive since the generalization accuracy of an ensemble increases with the number of well-trained models it contains (Oshiro et al., 2012; Bonab & Can, 2016; Huggins et al., 2016). Theoretically, for best accuracy, the size of the ensemble should be at least equal to the number of class labels, of which there could be thousands in modern applications (Bonab & Can, 2016).

Additional problems: Speed, accuracy, and diversity. Typically, every deep neural network in an ensemble is initialized randomly and then trained individually using all training data (full data), or by using a random subset of the training data (i.e., bootstrap aggregation or bagging) (Ju et al., 2017; Lee et al., 2015a; Moghimi & Vasconcelos, 2016). This requires a significant amount of processing time and computing resources that grow linearly with the ensemble size.

To alleviate this linear training cost, two techniques have been recently introduced that generate a k-network ensemble from a single network: Snapshot Ensembles and TreeNets. Snapshot Ensembles train a single network and use its parameters at k different points of the training process to instantiate k networks that will form the target ensemble (Huang et al., 2017a). Snapshot Ensembles vary the learning rate in a cyclical fashion, which enables the single network to converge to k local minima along its optimization path. TreeNets also train a single network but this network is designed to branch out into k sub-networks after the first few layers. Effectively every sub-network functions as a separate member of the target ensemble (Lee et al., 2015b).

While these approaches do improve training time, they also come with two critical problems. First, the resulting ensembles are less accurate because they are less diverse compared to using k different and individually trained networks. Second, these approaches cannot be applied to state-of-the-art diverse ensembles. Such ensembles may contain arbitrary neural network architectures with structural differences to achieve increased accuracy (for instance, such as those used in the ImageNet competitions (Lee et al., 2015a; Russakovsky et al., 2015)).

Knowledge Distillation provides a middle ground between separate training and ensemble generation approaches (Hinton et al., 2015). With Knowledge Distillation, an ensemble is trained by first training a large generalist network and then distilling its knowledge to an ensemble of small specialist networks that may have different architectures (by training them to mimic the probabilities produced by the larger network) (Hinton et al., 2015; Li & Hoiem, 2017). However, this approach results in limited improvement in training cost as distilling knowledge still takes around 70 percent of the time needed to train from scratch. Even then, the ensemble networks are still closely tied to the same large network that they are distilled from. The result is significantly lower accuracy and diversity when compared to ensembles where every network is trained individually (Hinton et al., 2015; Li & Hoiem, 2017).

MotherNets. We propose MotherNets, which enable rapid training of large feed-forward neural network ensembles. The core benefits of MotherNets are depicted in Table 1. MotherNets provide: (i) lower training time and better generalization accuracy than existing fast ensemble training approaches and (ii) the capacity to train large ensembles with diverse network architectures.

Figure 2 depicts the core intuition behind MotherNets: A MotherNet is a network that captures the maximum structural similarity between a cluster of networks (Figure 2 Step (1)). An ensemble may consist of one or more clusters; one MotherNet is constructed per cluster. Every MotherNet is trained to convergence using the full data set (Figure 2 Step (2)).
Then, every target network in the ensemble is hatched from its MotherNet using function-preserving transformations (Figure 2 Step (3)), ensuring that knowledge from the MotherNet is transferred to every network. The ensemble networks are then trained. They converge significantly faster compared to training from scratch (within tens of epochs).

Figure 2. MotherNets train an ensemble of neural networks by first training a set of MotherNets and transferring the function to the ensemble networks. The ensemble networks are then further trained, converging significantly faster than training individually. (Panels: specifications of ensemble networks, the MotherNet, and the hatched networks. Step 1: Construct the MotherNet per cluster to capture the largest structural commonality (shown in bold). Step 2: Train the MotherNet using the entire data set. This allows us to "share epochs" amongst ensemble networks. Step 3: Hatch ensemble networks by function-preserving transformations and further train.)

The core technical intuition behind the MotherNets design is that it enables us to "share epochs" between the ensemble networks. At a lower level, what this means is that the networks implicitly share part of the data movement and computation costs that manifest during training over the same data. This design draws intuition from systems techniques such as "shared scans" in data systems where many (analytical) queries share data movement and computation for part of a scan over the same data (Harizopoulos et al., 2005; Zukowski et al., 2007; Qiao et al., 2008; Arumugam et al., 2010; Candea et al., 2011; Giannikis et al., 2012; Psaroudakis et al., 2013; Giannikis et al., 2014; Kester et al., 2017).

Accuracy-training time tradeoff. MotherNets do not train each network individually but instead "source" all networks from the same set of "seed" networks. This introduces some reduction in diversity and accuracy compared to an approach that trains all networks independently. There is no way around this. In practice, there is an intrinsic tradeoff between ensemble accuracy and training time. All existing approaches are affected by this and their design decisions effectively place them at a particular balance within this tradeoff (Guzman-Rivera et al., 2014; Lee et al., 2015a; Huang et al., 2017a).

We show that MotherNets strike a superior balance between accuracy and training time than all existing approaches. In fact, we show that MotherNets establish a new Pareto frontier for this tradeoff and that we can navigate this tradeoff. To achieve this, MotherNets cluster ensemble networks (taking into account both the topology and the architecture class) and train a separate MotherNet for each cluster. The number of clusters used (and thus the number of MotherNets) is a knob that helps navigate the training time vs. accuracy tradeoff. Figure 1 depicts visually the new tradeoff achieved by MotherNets.

Contributions. We describe how to construct MotherNets in detail and how to trade accuracy for speed. Then, through a detailed experimental evaluation with diverse data sets and architectures, we demonstrate that MotherNets bring three benefits: (i) MotherNets establish a new Pareto frontier of the accuracy-training time tradeoff, providing up to 2 percent better absolute test error rate compared to fast ensemble training approaches at comparable or less training cost. (ii) MotherNets allow robust navigation of this new Pareto frontier of the tradeoff between accuracy and training time. (iii) MotherNets enable scaling of neural network ensembles to large sizes (100s of models) with practical training cost and increasing accuracy benefits.

We provide a web-based interactive demo as an additional resource to help in understanding the training process in MotherNets: http://daslab.seas.harvard.edu/mothernets/.

2 RAPID ENSEMBLE TRAINING

Definition: MotherNet.
Given a cluster of k neural networks C = {N_1, N_2, ..., N_k}, where N_i denotes the i-th neural network in C, the MotherNet M_c is defined as the largest network from which all networks in C can be obtained through function-preserving transformations. MotherNets divide an ensemble into one or more such network clusters and construct a separate MotherNet for each.

Constructing a MotherNet for fully-connected networks. Assume a cluster C of fully-connected neural networks. The input and the output layers of M_c have the same structure as all networks in C, since they are all trained for the same task. M_c is initialized with as many hidden layers as the shallowest network in C. Then, we construct the hidden layers of M_c one-by-one going from the input to the output layer. The structure of the i-th hidden layer of M_c is the same as the i-th hidden layer of the network in C with the least number of parameters at the i-th layer. Figure 2 shows an example of how this process works for a toy ensemble of two three-layered and one four-layered neural networks. Here, the MotherNet is constructed with three layers. Every layer has the same structure as the layer with the least number of parameters at that position (shown in bold in Figure 2 Step (1)). In Appendix A we also include a pseudo-code description of this algorithm.
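To make the layer-wise rule concrete, the following is a minimal sketch of the construction for fully-connected clusters (the paper's own pseudo-code is in Appendix A). It is illustrative only: it assumes each network is summarized by a list of hidden-layer widths and uses layer width as a stand-in for the per-layer parameter count.

```python
def build_fc_mothernet(cluster):
    """Construct MotherNet hidden-layer sizes for a cluster of fully-connected networks.

    `cluster` is a list of networks, each given as a list of hidden-layer widths,
    e.g. [512, 256, 128]. Input and output layers are shared by construction and omitted.
    """
    # The MotherNet has as many hidden layers as the shallowest network in the cluster.
    depth = min(len(net) for net in cluster)
    # The i-th hidden layer mirrors the smallest i-th hidden layer found in the cluster.
    return [min(net[i] for net in cluster) for i in range(depth)]


# Toy ensemble: two three-layer networks and one four-layer network (cf. Figure 2).
ensemble = [[512, 256, 128], [384, 384, 256], [512, 512, 256, 128]]
print(build_fc_mothernet(ensemble))  # -> [384, 256, 128]
```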
Constructing a MotherNet for convolutional networks. Convolutional neural network architectures consist of blocks of one or more convolutional layers separated by pooling layers (He et al., 2016; Shazeer et al., 2017; Simonyan & Zisserman, 2015; Szegedy et al., 2015). These blocks are then followed by another block of one or more fully-connected layers. For instance, VGGNets are composed of five blocks of convolutional layers separated by max-pooling layers, whereas DenseNets consist of four blocks of densely connected convolutional layers. For convolutional networks, we construct the MotherNet M_c block-by-block instead of layer-by-layer. The intuition is that deeper or wider variants of such networks are created by adding or expanding layers within individual blocks instead of adding them all at the end of the network. For instance, VGG-C (with 16 convolutional layers) is obtained by adding one layer to each of the last three blocks of VGG-B (with 13 convolutional layers) (Simonyan & Zisserman, 2015). To construct the MotherNet for every block, we select as many convolutional layers to include in the MotherNet as the network in C with the least number of layers in that block. Every layer within a block is constructed such that it has the least number of filters and the smallest filter size of any layer at the same position within that block. An example of this process is shown in Figure 3. Here, we construct a MotherNet for three convolutional neural networks block-by-block. For instance, in the first block, we include one convolutional layer in the MotherNet having the smallest filter width and the least number of filters (i.e., 3 and 32 respectively). In Appendix A we also include a pseudo-code description of this algorithm.

Figure 3. Constructing a MotherNet for convolutional neural networks block-by-block. For each layer, we select the layer with the least number of parameters from the ensemble networks (shown in bold rectangles) (Notation: <filter_width>:<filter_number>). (Panels: a cluster of ensemble networks C (Net 1, Net 2, Net 3) and the resulting MotherNet M_c; blocks of convolutional layers are separated by pooling layers.)
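The block-by-block rule can be written as the same positional minimum, now over (filter width, filter count) pairs inside each block. The sketch below is our illustration (not the Appendix A pseudo-code) and assumes every network in the cluster has the same number of blocks, with each block described as a list of (filter_width, num_filters) tuples, mirroring Figure 3's <filter_width>:<filter_number> notation.

```python
def build_conv_mothernet(cluster):
    """Construct MotherNet blocks for a cluster of convolutional networks.

    Each network is a list of blocks; each block is a list of
    (filter_width, num_filters) tuples. Pooling layers between blocks are
    implicit and assumed identical across the cluster.
    """
    num_blocks = len(cluster[0])
    mothernet = []
    for b in range(num_blocks):
        blocks = [net[b] for net in cluster]
        # Keep as many layers as the network with the fewest layers in this block ...
        depth = min(len(block) for block in blocks)
        # ... and give each kept layer the smallest filter width and fewest filters
        # found at that position across the cluster.
        mothernet.append([(min(block[i][0] for block in blocks),
                           min(block[i][1] for block in blocks))
                          for i in range(depth)])
    return mothernet


# Three small two-block networks described as (filter_width, num_filters) tuples.
net1 = [[(3, 64), (3, 64)], [(3, 64), (3, 64), (3, 64)]]
net2 = [[(5, 64), (3, 32)], [(5, 64), (1, 64), (3, 64)]]
net3 = [[(3, 72), (3, 64)], [(3, 64), (1, 64), (3, 32), (3, 64)]]
print(build_conv_mothernet([net1, net2, net3]))
# -> [[(3, 64), (3, 32)], [(3, 64), (1, 64), (3, 32)]]
```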
Constructing MotherNets for ensembles of neural networks with different sizes and topologies. By construction, the overall size and topology (sequence of layer sizes) of a MotherNet is limited by the smallest network in its cluster. If we were to assign a single cluster to all networks in an ensemble that has a large difference in size and topology between the smallest and the largest networks, there will be a correspondingly large difference between at least one ensemble network and the MotherNet. This may lead to a scenario where the MotherNet only captures an insignificant amount of commonality. This would negatively affect performance as we would not be able to share significant computation and data movement costs across the ensemble networks. This property is directly correlated with the size of the MotherNet.

In order to maintain the ability to share costs in diverse ensembles, we partition such an ensemble into g clusters, and for every cluster, we construct and train a separate MotherNet. To perform this clustering, the m networks in the ensemble E = {N_1, N_2, ..., N_m} are represented as vectors E_v = {V_1, V_2, ..., V_m} such that V_i^j stores the size of the j-th layer in N_i. These vectors are zero-padded to a length of max({|N_1|, |N_2|, ..., |N_m|}) (where |N_i| is the number of layers in N_i). For convolutional neural networks, these vectors are created by first creating similarly zero-padded sub-vectors per block and then concatenating the sub-vectors to get the final vector. In this case, to fully represent convolutional layers, V_i^j stores a 2-tuple of filter sizes and number of filters. Given a set of vectors E_v, we create g clusters using the balanced K-means algorithm while minimizing the Levenshtein distance between the vector representation of networks in a cluster and its MotherNet (Levenshtein, 1966; MacQueen, 1967). The Levenshtein or the edit distance between two vectors is the minimum number of edits – insertions, deletions, or substitutions – needed to transform one vector to another. By minimizing this distance, we ensure that, for every cluster, the ensemble networks can be obtained from their cluster's MotherNet with the minimal amount of edits constrained on g. During every iteration of the K-means algorithm, instead of computing centers of candidate clusters, we construct MotherNets corresponding to every cluster. Then, we use the edit distance between these MotherNets and all networks to perform cluster reassignments.
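A minimal sketch of this clustering step, under simplifying assumptions: each network is reduced to its layer-size vector, the "center" of a candidate cluster is the positional-minimum MotherNet from above, and a plain (unbalanced) K-means-style loop stands in for the balanced variant the paper cites; zero-padding is also omitted since the edit distance already handles different lengths. Helper names are ours.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two layer-description sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete x
                                     dp[j - 1] + 1,    # insert y
                                     prev + (x != y))  # substitute x with y
    return dp[-1]

def fc_mothernet(cluster):
    """Positional-minimum MotherNet of a cluster of layer-width vectors."""
    depth = min(len(net) for net in cluster)
    return tuple(min(net[i] for net in cluster) for i in range(depth))

def cluster_ensemble(networks, g, iters=20, seed=0):
    """Assign each network to one of g clusters whose 'centers' are MotherNets."""
    rng = random.Random(seed)
    assignment = [rng.randrange(g) for _ in networks]
    mothers = []
    for _ in range(iters):
        groups = [[n for n, a in zip(networks, assignment) if a == c] for c in range(g)]
        # Instead of computing means, build a MotherNet per candidate cluster ...
        mothers = [fc_mothernet(grp) if grp else rng.choice(networks) for grp in groups]
        # ... and reassign every network to the MotherNet it is cheapest to edit into.
        assignment = [min(range(g), key=lambda c: edit_distance(net, mothers[c]))
                      for net in networks]
    return assignment, mothers

nets = [(512, 256), (512, 256, 128), (640, 320, 160), (1024, 512, 256, 128)]
print(cluster_ensemble(nets, g=2))
```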
Constructing MotherNets for ensembles of diverse architecture classes. An individual MotherNet is built for a cluster of networks that belong to a single architecture class. Each architecture class has the property of function-preserving navigation. This is to say that given any member of this class, we can build another member of this class with more parameters but having the same function. Multiple types of neural networks fall under the same architecture class (Cai et al., 2018). For instance, we can build a single MotherNet for ensembles of AlexNets, VGGNets, and Inception Nets as well as one for DenseNets and ResNets. To handle scenarios when an ensemble contains members from diverse architecture classes, i.e., we cannot navigate the entire set of ensemble networks in a function-preserving manner, we build a separate MotherNet for each class (or a set of MotherNets if each class also consists of networks of diverse sizes).

Overall, the techniques described in the previous paragraphs allow us to create g MotherNets for an ensemble, being able to capture the structural similarity across diverse networks both in terms of architecture and topology. We now describe how to train an ensemble using one or more MotherNets to help share the data movement and computation costs amongst the target ensemble networks.

Training Step 1: Training the MotherNets. First, the MotherNet for every cluster is trained from scratch using the entire data set until convergence. This allows the MotherNet to learn a good core representation of the data. The MotherNet has fewer parameters than any of the networks in its cluster (by construction) and thus it takes less time per epoch to train than any of the cluster networks.

Training Step 2: Hatching ensemble networks. Once the MotherNet corresponding to a cluster is trained, the next step is to generate every cluster network through a sequence of function-preserving transformations that allow us to expand the size of any feed-forward neural network, while ensuring that the function (or mapping) it learned is preserved (Chen et al., 2016). We call this process hatching and there are two distinct approaches to achieve this: Net2Net increases the capacity of the given network by adding identity layers or by replicating existing weights (Chen et al., 2016). Network Morphism, on the other hand, derives sufficient and necessary conditions that when satisfied will extend the network while preserving its function and provides algorithms to solve for those conditions (Wei et al., 2016; 2017).

In MotherNets, we adopt the first approach, i.e., Net2Net. Not only is it conceptually simpler but in our experiments we observe that it serves as a better starting point for further training of the expanded network as compared to Network Morphism. Overall, function-preserving transformations are readily applicable to a wide range of feed-forward neural networks including VGGNets, ResNets, FractalNets, DenseNets, and Wide ResNets (Chen et al., 2016; Wei et al., 2016; 2017; Huang et al., 2017b). As such, MotherNets is applicable to all of these different network architectures. In addition, designing function-preserving transformations is an active area of research and better transformation techniques may be incorporated in MotherNets as they become available.

Hatching is a computationally inexpensive process that takes negligible time compared to an epoch of training (Wei et al., 2016). This is because generating every network in a cluster through function-preserving transformations requires at most a single pass on layers in its MotherNet.

Training Step 3: Training hatched networks. To explicitly add diversity to the hatched networks, we randomly perturb their parameters with Gaussian noise before further training. This breaks symmetry after hatching and it is a standard technique to create diversity when training ensemble networks (Hinton et al., 2015; Lee et al., 2015b; Wei et al., 2016; 2017). Further, adding noise forces the hatched networks to be in a different part of the hypothesis space from their MotherNets.

The hatched ensemble networks are further trained, converging significantly faster compared to training from scratch. This fast convergence is due to the fact that by initializing every ensemble network through its MotherNet, we placed it in a good position in the parameter space and we need to explore only a relatively small region instead of the whole parameter space. We show that hatched networks typically converge in a very small number of epochs.
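Hatching plus the symmetry-breaking perturbation can be illustrated on a single fully-connected layer with a Net2Net-style (Net2WiderNet) widening. This is our own sketch, not the authors' code, and the noise scale is an arbitrary placeholder: new units replicate existing ones and outgoing weights are rescaled so the widened layer computes exactly the same function before noise is added.

```python
import numpy as np

def hatch_wider_layer(W_in, b, W_out, new_width, noise_std=0.01, rng=None):
    """Net2WiderNet-style widening of one hidden layer from width n to new_width.

    W_in: (fan_in, n) weights into the layer, b: (n,) biases, W_out: (n, fan_out)
    weights out of it. Returns widened (W_in', b', W_out') computing the same
    function, then adds small Gaussian noise to break symmetry (Training Step 3).
    """
    rng = rng or np.random.default_rng(0)
    n = W_in.shape[1]
    # Existing units map to themselves; extra units replicate randomly chosen ones.
    mapping = np.concatenate([np.arange(n), rng.integers(0, n, new_width - n)])
    counts = np.bincount(mapping, minlength=n).astype(float)

    W_in_new = W_in[:, mapping]
    b_new = b[mapping]
    W_out_new = W_out[mapping, :] / counts[mapping][:, None]  # keep outputs unchanged

    noise = lambda a: a + rng.normal(0.0, noise_std, a.shape)
    return noise(W_in_new), noise(b_new), noise(W_out_new)

# Sanity check: with noise_std=0 the widened network matches the original exactly.
rng = np.random.default_rng(1)
W_in, b, W_out = rng.normal(size=(8, 4)), rng.normal(size=4), rng.normal(size=(4, 3))
W1, b1, W2 = hatch_wider_layer(W_in, b, W_out, new_width=6, noise_std=0.0, rng=rng)
x = rng.normal(size=8)
print(np.allclose(np.maximum(x @ W_in + b, 0) @ W_out,
                  np.maximum(x @ W1 + b1, 0) @ W2))  # True
```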
We experimented with both full data and bagging to train hatched networks. We use full data because, given the small number of epochs needed for the hatched networks, bagging does not offer any significant advantage in speed while it hurts accuracy.

Accuracy-training time tradeoff. MotherNets can navigate the tradeoff between accuracy and training time by controlling the number of clusters g, which in turn controls how many MotherNets we have to train independently from scratch. For instance, on one extreme, if g is set to m, then every network in E will be trained independently, yielding high accuracy at the cost of higher training time. On the other extreme, if g is set to one, then all ensemble networks have a shared ancestor and this process may yield networks that are not as diverse or accurate; however, the training time will be low.
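A back-of-the-envelope way to see how g moves training cost, under assumed (not measured) per-epoch costs and epoch counts: g MotherNets are trained from scratch while all m ensemble networks are only fine-tuned after hatching. The constants below are placeholders for illustration; at the extreme g = m the fine-tuning term would disappear, since the "MotherNets" are then the ensemble networks themselves.

```python
def rough_training_cost(m, g, full_epochs=100, finetune_epochs=10,
                        mothernet_epoch_cost=0.8, network_epoch_cost=1.0):
    """Crude cost model: g MotherNets trained to convergence plus m hatched
    networks fine-tuned for a few epochs. All constants are illustrative."""
    return (g * full_epochs * mothernet_epoch_cost
            + m * finetune_epochs * network_epoch_cost)

for g in (1, 2, 5, 10):
    print(g, rough_training_cost(m=10, g=g))
```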
Table 2. We experiment with ensembles of various sizes and neural network architectures.

Ensemble | Member networks | Param. | SE alternative | Param.
V5 | VGG13, 16, 16A, 16B, and 19 from the VGGNet paper (Simonyan & Zisserman, 2015) | 682M | VGG-16 × 5 | 690M
D5 | Two variants of DenseNet-40 (with 12 and 24 convolutional filters per layer) and three variants of DenseNet-100 (with 12, 16, and 24 filters per layer) (Huang et al., 2017b) | 17M | DenseNet-60 × 5 | 17.3M
R10 | Two variants each of ResNet 20, 32, 44, 56, and 110 from the ResNet paper (He et al., 2016) | 327M | R-56 × 10 | 350M
V25 | 25 variants of VGG-16 with distinct architectures created by progressively varying one layer from VGG-16 in one of three ways: (i) increasing the number of filters, (ii) increasing the filter size, or (iii) applying both (i) and (ii) | 3410M | VGG-16 × 25 | 3450M
V100 | 100 variants of VGG-16 created as described above | 13640M | VGG-16 × 100 | 13800M

MotherNets expose g as a tuning knob. As we show in our experimental analysis, MotherNets achieve a new Pareto frontier for the accuracy-training cost tradeoff which is a well-defined convex space. That is, with every step in increasing g (and consequently the number of independently trained MotherNets), accuracy does get better at the cost of some additional training time, and vice versa. Conceptually this is shown in Figure 1. This convex space allows robust and predictable navigation of the tradeoff. For example, unless one needs the best accuracy or the best training time (in which case the choice is simply the extreme values of g), they can start with a single MotherNet and keep adding MotherNets in small steps until the desired accuracy is achieved or the training time budget is exhausted. This process can further be fine-tuned using known hyperparameter tuning methods such as Bayesian optimization, training on sampled data, or learning trajectory sampling (Goodfellow et al., 2016).
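The navigation loop just described can be sketched as follows. Here train_ensemble(g) is a hypothetical helper standing in for clustering, MotherNet training, hatching, and further training; all names and thresholds are illustrative rather than taken from the paper's implementation.

```python
def choose_g(train_ensemble, max_g, accuracy_target, time_budget_hrs):
    """Start from a single MotherNet (g=1) and add clusters until the accuracy
    target is met or the training-time budget is exhausted. `train_ensemble(g)`
    is a stand-in that clusters the ensemble into g groups, trains the MotherNets,
    hatches, and continues training; it returns (test_accuracy, hours_spent)."""
    spent = 0.0
    for g in range(1, max_g + 1):
        accuracy, hours = train_ensemble(g)
        spent += hours
        if accuracy >= accuracy_target or spent >= time_budget_hrs:
            return g, accuracy
    return max_g, accuracy
```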
Parallel training. MotherNets create a new schedule for “sharing epochs” amongst the networks of an ensemble, but the actual process of training in every epoch remains unchanged. As such, state-of-the-art approaches for distributed training such as parameter servers (Dean et al., 2012) and asynchronous gradient descent (Gupta et al., 2016; Iandola et al., 2016) can be applied to fully utilize as many machines as are available during any stage of MotherNets' training.

Fast inference. MotherNets can also be used to improve inference time by keeping the MotherNet parameters shared across the hatched networks. We describe this idea in Appendix C.

3 EXPERIMENTAL ANALYSIS

We demonstrate that MotherNets enable a better training time-accuracy tradeoff than existing fast ensemble training approaches across multiple data sets and architectures. We also show that MotherNets make it more realistic to utilize large neural network ensembles.

Baselines. We compare against five state-of-the-art methods spanning both techniques that train all ensemble networks individually, i.e., Full Data (FD) and Bagging (BA), as well as approaches that generate ensembles by training a single network, i.e., Knowledge Distillation (KD), Snapshot Ensembles (SE), and TreeNets (TN).

Evaluation metrics. We capture both the training cost and the resulting accuracy of an ensemble. For the training cost, we report the wall clock time as well as the monetary cost of training on the cloud. For ensemble test accuracy, we report the test error rate under the widely used ensemble-averaging method (Van der Laan et al., 2007; Guzman-Rivera et al., 2012; 2014; Lee et al., 2015b). Experiments with alternative inference methods (e.g., super learner and voting (Ju et al., 2017)) showed that the method we use does not affect the overall results in terms of comparing the training algorithms.
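As a concrete reading of this inference rule, here is a small NumPy sketch of ensemble averaging and the resulting test error rate; the array shapes and names are assumptions for the example.

```python
import numpy as np

def ensemble_average_predict(member_probs):
    """member_probs: (m, n, c) softmax outputs of m members over n examples and c
    classes; ensemble averaging averages the distributions and predicts the argmax."""
    return member_probs.mean(axis=0).argmax(axis=1)

def test_error_rate(member_probs, labels):
    """labels: (n,) ground-truth class ids; returns the error rate in percent."""
    return 100.0 * (ensemble_average_predict(member_probs) != labels).mean()
```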
Ensemble networks. We experiment with ensembles of various convolutional architectures such as VGGNets, ResNets, Wide ResNets (for experiments with Wide ResNets, see Appendix E), and DenseNets. Ensembles of these architectures have been extensively used to evaluate fast ensemble training approaches (Lee et al., 2015a; Huang et al., 2017a). Each of these ensembles is composed of networks having diverse architectures, as described in Table 2.

To provide a fair comparison with SE (where the snapshots have to be from the same network architecture), we create snapshots having a comparable number of parameters to each of the ensembles described above. The comparable alternatives we used for SE are also summarized in Table 2. For TN, we varied the number of shared layers and found that sharing the 3 initial layers provides the best accuracy. This is similar to the optimal proportion of shared layers in the TreeNets paper (Lee et al., 2015a). TN is not applicable to DenseNets or ResNets as it is designed only for networks without skip-connections (Lee et al., 2015a). We omit comparison with TN for such ensembles.

Training setup. For all training approaches we use stochastic gradient descent with a mini-batch size of 256 and batch normalization. All weights are initialized by sampling from a standard normal distribution. Training data is randomly shuffled before every training epoch. The learning rate is set to 0.1 with the exception of DenseNets. For DenseNets, we use a learning rate of 0.1 to train MotherNets and 0.01 to train hatched networks. This is in line with the learning rate decay used in the DenseNets paper (Huang et al., 2017b). For FD, KD, TN, and MotherNets, we stop training if the training accuracy does not improve for 15 epochs. For SE, we use the optimized training setup proposed in the original paper (Huang et al., 2017a), starting with an initial learning rate of 0.2 and then training every snapshot for 60 epochs.
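A minimal sketch of the early-stopping rule above (stop once the monitored training accuracy has not improved for 15 consecutive epochs); the class and method names are ours, not taken from the paper's code.

```python
class EarlyStopper:
    """Stop when the monitored metric (here, training accuracy) has not improved
    for `patience` consecutive epochs; 15 is the value used for FD, KD, TN, and MN."""
    def __init__(self, patience=15):
        self.patience = patience
        self.best = float('-inf')
        self.stale_epochs = 0

    def should_stop(self, metric):
        if metric > self.best:
            self.best, self.stale_epochs = metric, 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```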
Data sets. We experiment with a diverse array of data sets: SVHN, CIFAR-10, and CIFAR-100 (Krizhevsky, 2009; Netzer et al.). The SVHN data set is composed of images of house numbers and has ten class labels. There are a total of 99K images. We use 73K for training and 26K for testing. The CIFAR-10 and CIFAR-100 data sets have 10 and 100 class labels respectively, corresponding to various images of everyday objects. There are a total of 60K images – 50K training and 10K test images.

Hardware platform. All experiments are run on the same server with an Nvidia Tesla V100 GPU.

3.1 Better training time-accuracy tradeoff

We first show how MotherNets strike an overall superior accuracy-training time tradeoff when compared to existing fast ensemble training approaches.

[Figure 4 omitted: five panels of test error rate (%) vs. training time (hrs) for (a) V5 (C-10), (b) D5 (C-10), (c) R10 (C-10), (d) V25 (C-100), and (e) V25 (SVHN), comparing the single model, FD, KD, TN, SE, and MotherNets with different g.] Figure 4. MotherNets provide consistently better accuracy-training time tradeoff when compared with existing fast ensemble training approaches across various data sets, architectures, and ensemble sizes.

Figure 4 shows results across all our test data sets and ensemble networks. All graphs in Figure 4 depict the tradeoff between training time needed versus accuracy achieved. The core observation from Figure 4 is that across all data sets and networks, MotherNets help establish a new Pareto frontier of this tradeoff. The different versions of MotherNets shown in Figure 4 represent different numbers of clusters used (g). When g=1, we use a single MotherNet, optimizing for training time, while when g becomes equal to the ensemble size, we optimize for accuracy (effectively this is equal to FD as every network is trained independently in its own cluster).

The horizontal line at the top of each graph indicates the accuracy of the best-performing single model in the ensemble trained from scratch. This serves as a benchmark and, in the vast majority of cases, all approaches do improve over a single model even when they have to sacrifice accuracy to improve training time. MotherNets is consistently and significantly better than that benchmark.

Next we discuss each individual training approach and how it compares to MotherNets.

MotherNets vs. KD, TN, and BA. MotherNets (with g=1) is 2× to 4.2× faster than KD and results in up to 2 percent better test accuracy. KD suffers in terms of accuracy because its ensemble networks are more closely tied to the base network as they are trained from the output of the same network. KD's higher training cost is because distilling is expensive. Every network starts from scratch and is trained on the data set using a combination of the empirical loss and the loss from the output of the teacher network. We observe that distilling a network still takes around 60 to 70 percent of the time required to train it using just the empirical loss.
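For reference, a common way to write the distillation objective just described (Hinton et al., 2015) combines the empirical loss with a loss against the teacher's softened output; the weighting α and temperature T below are standard hyperparameters that the text does not specify, so this is an illustrative form rather than the exact objective used here:

\mathcal{L}_{KD} = \alpha \, \mathcal{L}_{CE}\big(y, \sigma(z_s)\big) + (1 - \alpha) \, T^{2} \, \mathrm{KL}\big(\sigma(z_t / T) \,\|\, \sigma(z_s / T)\big),

where z_s and z_t are the student and teacher logits and σ is the softmax.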
To achieve comparable accuracy to MotherNets (with g=1), TN requires up to 3.8× more training time on V5. In the same time budget, MotherNets can train with g=4, providing over one percent reduction in test error rate. The higher training time of TN is due to the fact that it combines several networks together to create a monolithic architecture with various branches. We observe that training this takes significant time per epoch as well as requires more epochs to converge. Moreover, TN does not generalize to neural networks with skip-connections.

Figure 4 does not show results for BA because it is an outlier. BA takes on average 73 percent of the time FD needs to train but results in significantly higher test error rate than any of the baseline approaches, including the single model. Compared to BA, MotherNets is on average 3.6× faster and results in significantly better accuracy – up to 5.5 percent lower absolute test error rate. These observations are consistent with past studies that show how BA is ineffective when training deep neural networks as it reduces the number of unique data items seen by individual networks (Lee et al., 2015a).

Overall, the low test error rate of MotherNets when compared to KD, TN, and BA stems from the fact that transferring the learned function from MotherNets to the target ensemble networks provides a good starting point as well as introduces regularization for further training. This also allows hatched ensemble networks to converge significantly faster, resulting in overall lower training time.

Training time breakdown. To better understand where the time goes during the training process, Figure 5 provides the time breakdown per ensemble network. We show this for the D5 ensemble and compare MotherNets (with g=1) with the individual training approaches FD, BA, and KD. While other approaches spend significant time training each network, MotherNets can train these networks very quickly after having trained the core MotherNet (black part in the MotherNets stacked bar in Figure 5). We observe similar time breakdowns across all ensembles in our experiments.

[Figure 5 omitted: per-network training time (hrs.) for FD, KD, BA, and MN on the D5 ensemble, stacked by members DN1-DN5 and the MotherNet.] Figure 5. MotherNets train ensemble networks significantly faster after having trained the MotherNet (shown in black).

3.2 MotherNets vs. SE and scaling to large ensembles

Across all experiments in Figure 4, SE is the closest baseline to MotherNets. In effect, SE is part of the very same Pareto frontier defined by MotherNets in the accuracy-training cost tradeoff. That is, it represents one more valid point that can be useful depending on the desired balance. For example, in Figure 4a (for V5 CIFAR-10), SE sacrifices nearly one percent in test error rate compared to MotherNets (with g=1) for a small improvement in training cost. We observe similar trends in Figures 4c and 4d. In Figure 4b, SE achieves a balance that is in between MotherNets with one and two clusters. However, when training V25 on SVHN (Figure 4e), SE is in fact outside the Pareto frontier as it is both slower and achieves worse accuracy.

Overall, MotherNets enable drastic improvements in either accuracy or training time compared to SE by being able to control and navigate the tradeoff between the two.

Oracle accuracy. Also, Table 3 shows that MotherNets enable better oracle test accuracy when compared with SE across all our experiments. This is the accuracy if an oracle were to pick the prediction of the most accurate network in the ensemble per test element (Guzman-Rivera et al., 2012; 2014; Lee et al., 2015b). Oracle accuracy is an upper bound for the accuracy that any ensemble inference technique could achieve. This metric is also used to evaluate the utility of ensembles when they are applied to solve Multiple Choice Learning (MCL) problems (Guzman-Rivera et al., 2014; Lee et al., 2016; Brodie et al., 2018).

Table 3. MotherNets (with g=1) give better oracle test accuracy compared to Snapshot Ensembles.

   | V5 (C10) | D5 (C10) | R10 (C10) | V25 (C100) | V25 (SVHN)
MN | 96.71    | 97.43    | 98.61     | 87.5       | 97.17
SE | 96.03    | 96.91    | 97.11     | 86.9       | 97.3
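A small NumPy sketch of how this metric can be computed from hard per-member predictions; the function name and array shapes are illustrative.

```python
import numpy as np

def oracle_accuracy(member_preds, labels):
    """member_preds: (m, n) predicted class ids from m ensemble members on n test
    examples; labels: (n,). The oracle picks, per test element, any member that is
    correct, so the ensemble counts as correct if at least one member is correct."""
    hits = (member_preds == labels[None, :]).any(axis=0)
    return hits.mean()
```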
Scaling to very large ensembles. As we discussed before, large ensembles help improve accuracy and thus, ideally, we would like to scale neural network ensembles to a large number of models, as happens for other ensembles such as random forests (Oshiro et al., 2012; Bonab & Can, 2016; 2017). Our previous results were for small to medium ensembles of 5, 10, or 25 networks. We now show that when it comes to larger ensembles, MotherNets dominate SE in both how accuracy and training time scale.

Figure 6 shows results as we increase the number of networks up to a hundred variants of VGGNets trained on CIFAR-10. For every point in Figure 6, k indicates the number of networks. For MotherNets we plot results for the time-optimized version with g=1, as well as with g=8.

[Figure 6 omitted: test error rate (%) vs. training time (hrs) for SE, MN (g=1), and MN (g=8) at ensemble sizes k = 5, 10, 25, 50, and 100.] Figure 6. As the size of the ensemble grows, MotherNets scale better than SE both in terms of training time and accuracy achieved.

Figure 6 shows that as the size of the ensemble grows, MotherNets scale much better in terms of training time. Toward the end (for 100 networks), MotherNets train more than 10 hours faster (out of 40 total hours needed for SE). The training time of MotherNets grows at a much smaller rate because, once the MotherNet has been trained, it takes 40 percent less time to train a hatched network than the time it takes to train one snapshot.

In addition, Figure 6 shows that MotherNets do not only scale better in terms of training time, but also scale better in terms of accuracy.
As we add more networks to the ensemble, MotherNets keep improving the error rate by nearly 2 percent while SE actually becomes worse by more than 0.5 percent. The declining accuracy of SE as the size of the ensemble increases has also been observed in the past, where increasing the number of snapshots above six results in degradation in performance (Huang et al., 2017a).

Finally, Figure 6 shows that different cluster settings for MotherNets allow us to achieve different performance balances while still providing robust and predictable navigation of the tradeoff. In this case, with g=8 accuracy improves consistently across all points (compared to g=1) at the cost of extra training time.

3.3 Improving cloud training cost

One approach to speed up training of large ensembles is to utilize more than one machine. For example, we could train k individual networks in parallel using k machines. While this does save time, the holistic cost in terms of energy and resources spent is still linear in the ensemble size.

One proxy for capturing the holistic cost is to look at the amount of money one has to pay on the cloud for training a given ensemble. In our next experiment, we compare all approaches using this proxy. Figure 7 shows the cost (in USD) of training on four cloud instances across two cloud service providers: (i) M1, which maps to AWS P2.xlarge and Azure NC6, and (ii) M2, which maps to AWS P3.2xlarge and Azure NCv3. M1 is priced at USD 0.9 per hour and M2 is priced at USD 3.06 per hour for both cloud service providers (Amazon, 2019; Microsoft, 2019).

[Figure 7 omitted: training cost (USD) on instance types M1 (top) and M2 (bottom) for V25 SVHN, V25 C100, and R10 C10 under the training methods MN (g=1), SE, KD, BA, and FD.] Figure 7. Training cost (USD).

Training time-optimized MotherNets provide a significant reduction in training cost (up to 3×) as they can train a very large ensemble in a fraction of the training time compared to other approaches.

3.4 Diversity of model predictions

Next, we analyze how the diversity of ensembles produced by MotherNets compares with SE and FD.

Ensembles and predictive diversity. Theoretical results suggest that ensembles of models perform better when the models' predictions on a single example are less correlated. This is true under two assumptions: (i) models have equal correct classification probability and (ii) the ensemble uses majority vote for classification (Krogh & Vedelsby, 1994; Rosen, 1996; Kuncheva & Whitaker, 2003). Under ensemble averaging, no analytical proof that negative correlation reduces error rate exists, but lower correlation between models can be used to create a smaller upper bound on incorrect classification probability. More precise statements and their proofs are given in Appendix B.

Rapid ensemble training methods. For MotherNets, as well as for all other compared techniques for ensemble training, the training procedure binds the models together to decrease training time. This can have two negative effects compared to independent training of models:

1. by changing the model's architecture or training pattern, the technique affects each model's prediction quality (the model's marginal prediction accuracy suffers);

2. by sharing layers (TN), attempted softmax values (KD), or training epochs (SE, MN), the training technique creates positive correlations between model errors.

We compare here the magnitude of these two effects for MotherNets and Snapshot Ensembles when compared to independent training of each model on CIFAR-10 using V5.

Individual model quality. For both SE and MN, the individual model accuracy drops, but the effect is more pronounced in SE than MN. The mean misclassification percentage of the individual models for V5 using FD, MN, and SE is 8.1%, 8.4%, and 9.8% respectively. The poor performance of SE in this area is due to its difficulty in consistently hitting performant local minima, either because it overfits to the training data when trained for a long time or because its early snapshots need to be far away from the final optimum to encourage diversity.

Model variance. Our goal in assessing variance is to see how the training procedure affects how models in the ensemble correlate with each other on each example. To do this, we train each of the five models in V5 five times under MN, SE, and FD. Letting Y_ij be the softmax value of the correct class on test example j using model i, we then estimate Var(Y_ij) for each i, j and Cov(Y_ij, Y_i'j) for each i, i', j with i ≠ i' using the sample variance and covariance. To get a single number for a model, instead of one for each test example, we then average across all test examples, i.e., Cov(Y_i, Y_i') = (1/n) Σ_{j=1}^{n} Cov(Y_ij, Y_i'j). For total variance numbers for the ensemble, we perform the same procedure on the ensemble average Ȳ_j = (1/5) Σ_{i=1}^{5} Y_ij.
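A hedged NumPy sketch of this estimation, assuming the softmax values of the correct class have already been collected into an array indexed by training run, ensemble member, and test example; the names and shapes are illustrative.

```python
import numpy as np

def model_covariances(Y):
    """Y has shape (r, m, n): softmax value of the correct class over r independent
    training runs, m ensemble members, and n test examples."""
    r, m, n = Y.shape
    Yc = Y - Y.mean(axis=0, keepdims=True)              # center over runs per (member, example)
    cov = np.einsum('rin,rjn->ijn', Yc, Yc) / (r - 1)   # per-example sample (co)variances, (m, m, n)
    pairwise = cov.mean(axis=2)                         # Cov(Y_i, Y_i'): average over test examples
    ensemble_avg = Y.mean(axis=1)                       # ensemble average per run, shape (r, n)
    ensemble_var = ensemble_avg.var(axis=0, ddof=1).mean()
    return pairwise, ensemble_var
```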
[Figure 8 omitted: three 5×5 matrices of pairwise model variances/covariances on V5 for Full Data, MotherNets, and Snapshot Ensembles.] Figure 8. MotherNets (with g=1) train ensembles with lower model covariances compared to Snapshot Ensembles.

Figure 8 shows the results. As expected, independent training between the models in FD makes their corresponding covariance 0 and provides the greatest overall variance reduction for the ensemble, with ensemble variance at 0.0051. For both SE and MN, the covariance of separate models is non-zero at around 0.009 per pair of models; however, it is also significantly less than the variance of a single model. As a result, both MN and SE provide significant variance reduction compared to a single model. Whereas a single model has variance around 0.026, MN and SE provide ensemble variance of 0.0125 and 0.0130 respectively.

Takeaways. Since both SE and MN train nearly as fast as a single model, they provide variance reduction in prediction at very little training cost. Additionally, for MN, at the cost of higher training time, one can create more clusters and thus make the training of certain models independent of each other, zeroing out many of the covariance terms and reducing the overall ensemble variance. When compared to each other, MN with g=1 and SE have similar variance numbers, with MN slightly lower, but MotherNets have a substantial increase in individual model accuracy when compared to Snapshot Ensembles. As a result, the overall ensemble performs better.

Additional results. We demonstrate in Appendix C how MotherNets can improve inference time by 2×. In Appendix D, we show how the relative behavior of MotherNets remains the same when training using multiple GPUs. Finally, in Appendix E we provide experiments with Wide ResNets and demonstrate how MotherNets provide a better accuracy-training time tradeoff when compared with Fast Geometric Ensembles.

4 RELATED WORK

In this section, we briefly survey additional (but orthogonal) ensemble training techniques beyond Snapshot Ensembles, TreeNets, and Knowledge Distillation.

Parameter sharing. Various related techniques share parameters between different networks during ensemble training and, in doing so, improve training cost. One interpretation of techniques such as Dropout and Swapout is that, during training, they create several networks with shared weights within a single network. Then, they implicitly ensemble them during inference (Wan et al., 2013; Srivastava et al., 2014; Huang et al., 2016; Singh et al., 2016; Huang et al., 2017a). Our approach, on the other hand, captures the structural similarity in an ensemble where members have different and explicitly defined neural network architectures, and trains for it. Overall, this enables us to effectively combine well-known architectures together within an ensemble. Furthermore, implicit ensemble techniques (e.g., dropout and swapout) can be used as training optimizations to improve upon the accuracy of individual networks trained in MotherNets (Srivastava et al., 2014; Singh et al., 2016).

Efficient deep network training. Various algorithmic techniques target fundamental bottlenecks in the training process (Niu et al., 2011; Brown et al., 2016; Bottou et al., 2018). Others apply system-oriented techniques to reduce memory overhead and data movement (De Sa & Feldman, 2017; Jain et al., 2018). Recently, specialized hardware is being developed to improve the performance, parallelism, and energy consumption of neural network training (Prabhakar et al., 2016; De Sa & Feldman, 2017; Jouppi et al., 2017). All techniques that improve the training efficiency of individual neural networks are orthogonal to MotherNets and in fact directly compatible. This is because MotherNets does not make any changes to the core computational components of the training process. In our experiments, we do utilize some of the widely applied training optimizations such as batch normalization and early stopping. The advantage that MotherNets bring on top of these approaches is that we can further reduce the total number of epochs that are required to train an ensemble. This is because a set of MotherNets will train for the structural similarity present in the ensemble once.

5 CONCLUSION

We present MotherNets, which enable training of large and diverse neural network ensembles while being able to navigate a new Pareto frontier with respect to accuracy and training cost. The core intuition behind MotherNets is to reduce the number of epochs needed to train an ensemble by capturing the structural similarity present in the ensemble and training for it once.

Acknowledgments. We thank the reviewers for their valuable feedback. We also thank Chang Xu for building the web-based demo and all DASlab members for their help. This work was partially funded by Tableau, Cisco, and the Harvard Data Science Institute.
Arumugam, S., Dobra, A., Jermaine, C. M., Pansare, N., and Perez, L. The DataPath system: A data-centric analytic processing engine for large data warehouses. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 519-530, 2010.

Bonab, H. R. and Can, F. A theoretical framework on the ideal number of classifiers for online ensembles in data streams. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016.

Bonab, H. R. and Can, F. Less is more: A comprehensive framework for the number of components of ensemble classifiers. IEEE Transactions on Neural Networks and Learning Systems, 2017.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2), 2018.

Brodie, M., Tensmeyer, C., Ackerman, W., and Martinez, T. Alpha model domination in multiple choice learning. In IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.

Brown, K. J., Lee, H., Rompf, T., Sujeeth, A. K., De Sa, C., Aberger, C. R., and Olukotun, K. Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns. In Proceedings of the International Symposium on Code Generation and Optimization, 2016.

Cai, H., Chen, T., Zhang, W., Yu, Y., and Wang, J. Efficient architecture search by network transformation. In AAAI Conference on Artificial Intelligence, 2018.

Candea, G., Polyzotis, N., and Vingralek, R. Predictable performance and high query concurrency for data analytics. The VLDB Journal, 20(2):227-248, 2011.

Chen, T., Goodfellow, I. J., and Shlens, J. Net2Net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations (ICLR), 2016.

De Sa, C. and Feldman, M. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Annual International Symposium on Computer Architecture (ISCA), 2017.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.

Dietterich, T. G. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, 2000.

Drucker, H., Schapire, R., and Simard, P. Improving performance in neural networks using a boosting algorithm. In Advances in Neural Information Processing Systems, 1993.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, 2018.

Giannikis, G., Alonso, G., and Kossmann, D. SharedDB: Killing one thousand queries with one stone. Proceedings of the VLDB Endowment, 5(6):526-537, 2012.

Giannikis, G., Makreshanski, D., Alonso, G., and Kossmann, D. Shared workload optimization. Proceedings of the VLDB Endowment, 7(6):429-440, 2014.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Granitto, P. M., Verdes, P. F., and Ceccatto, H. A. Neural network ensembles: Evaluation of aggregation algorithms. Artificial Intelligence, 163(2), 2005.

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22), 2016.

Gupta, S., Zhang, W., and Wang, F. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In IEEE International Conference on Data Mining (ICDM), 2016.

Guzman-Rivera, A., Batra, D., and Kohli, P. Multiple choice learning: Learning to produce multiple structured outputs. In Advances in Neural Information Processing Systems, 2012.

Guzman-Rivera, A., Kohli, P., Batra, D., and Rutenbar, R. Efficiently enforcing diversity in multi-output structured prediction. In Artificial Intelligence and Statistics, 2014.

Harizopoulos, S., Shkapenyuk, V., and Ailamaki, A. QPipe: A simultaneously pipelined relational query engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 383-394, 2005.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, 2016.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get m for free. In 5th International Conference on Learning Representations (ICLR), 2017a.

Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017b.

Huggins, J., Campbell, T., and Broderick, T. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: Near-linear acceleration of deep neural network training on compute clusters. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In IEEE Annual International Symposium on Computer Architecture, 2018.

Jouppi, N. P. et al. In-datacenter performance analysis of a tensor processing unit. In Annual International Symposium on Computer Architecture (ISCA), 2017.

Ju, C., Bibaut, A., and van der Laan, M. J. The relative performance of ensemble methods with deep convolutional neural networks for image classification. CoRR, abs/1704.01664, 2017.

Kester, M. S., Athanassoulis, M., and Idreos, S. Access path selection in main-memory optimized data systems: Should I scan or should I probe? In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 715-730, 2017.

Keuper, J. and Preundt, F.-J. Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability. In 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 19-26. IEEE, 2016.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

Krogh, A. and Vedelsby, J. Neural network ensembles, cross validation and active learning. In International Conference on Neural Information Processing Systems, 1994.

Kuncheva, L. I. and Whitaker, C. J. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2), 2003.

Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D. J., and Batra, D. Why M heads are better than one: Training a diverse ensemble of deep networks. CoRR, abs/1511.06314, 2015a.

Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D. J., and Batra, D. Why M heads are better than one: Training a diverse ensemble of deep networks. CoRR, abs/1511.06314, 2015b.

Lee, S., Prakash, S. P. S., Cogswell, M., Ranjan, V., Crandall, D., and Batra, D. Stochastic multiple choice learning for training diverse deep ensembles. In Advances in Neural Information Processing Systems, 2016.

Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. 1966.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

MacQueen, J. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, 1967.

Microsoft. Pricing - Windows virtual machines | Microsoft Azure. https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/, 2019. (Accessed on 05/16/2019).

Moghimi, M. and Vasconcelos, N. Boosted convolutional neural networks. In Proceedings of the British Machine Vision Conference, 2016.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning.

Niu, F., Recht, B., Ré, C., and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.

Oshiro, T. M., Perez, P. S., and Baranauskas, J. A. How many trees in a random forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 2012.

Prabhakar, R., Koeplinger, D., Brown, K. J., Lee, H., De Sa, C., Kozyrakis, C., and Olukotun, K. Generating configurable hardware from parallel patterns. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 651-665, 2016.

Psaroudakis, I., Athanassoulis, M., and Ailamaki, A. Sharing data and work across concurrent analytical queries. Proceedings of the VLDB Endowment, 6(9):637-648, 2013.

Qiao, L., Raman, V., Reiss, F., Haas, P. J., and Lohman, G. M. Main-memory scan sharing for multi-core CPUs. Proceedings of the VLDB Endowment, 1(1):610-621, 2008.

Rosen, B. E. Ensemble learning using decorrelated neural networks. Connection Science, 1996.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 2015.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Singh, S., Hoiem, D., and Forsyth, D. Swapout: Learning an ensemble of deep architectures. In Advances in Neural Information Processing Systems, 2016.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

Van der Laan, M. J., Polley, E. C., and Hubbard, A. E. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.

Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, 2013.

Wei, T., Wang, C., Rui, Y., and Chen, C. W. Network morphism. In International Conference on Machine Learning, 2016.

Wei, T., Wang, C., and Chen, C. W. Modularized morphing of neural networks. CoRR, abs/1701.03281, 2017.

Xu, L., Ren, J. S., Liu, C., and Jia, J. Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, 2014.

Ye, M. and Guo, Y. Self-training ensemble networks for zero-shot image recognition. Knowledge-Based Systems, 123:41-60, 2017.

Zagoruyko, S. and Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference, 2016.

Zukowski, M., Héman, S., Nes, N. J., and Boncz, P. A. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 723-734, 2007.
APPENDIX

A ALGORITHMS FOR CONSTRUCTING MOTHERNETS

We outline algorithms for constructing the MotherNet given a cluster of neural networks. We describe the algorithms for both fully-connected and convolutional neural networks.

Fully-Connected Neural Networks. Algorithm A describes how to construct the MotherNet for a cluster of fully-connected neural networks. We proceed layer-by-layer, selecting the layer with the least number of parameters at every position.

Algorithm A Constructing the MotherNet for fully-connected neural networks
Input: E: ensemble networks in one cluster;
Initialize: M: empty MotherNet;
// set input/output layer sizes
M.input.num_param ← E[0].input.num_param;
M.output.num_param ← E[0].output.num_param;
M.num_layers ← getShallowestNetwork(E).num_layers;
// set hidden layer sizes
for i ← 0 ... M.num_layers-1 do
    M.layers[i].num_param ← getMin(E, i);
return M;
// Get the min. size layer at posn
Function getMin(E, posn)
    min ← E[0].layers[posn].num_param;
    for j ← 0 ... len(E)-1 do
        if E[j].layers[posn].num_param < min then
            min ← E[j].layers[posn].num_param;
    return min;
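To make the layer-by-layer minimum concrete, the following is a minimal Python sketch of the fully-connected construction in Algorithm A. It assumes each ensemble network is summarized only by its list of hidden-layer widths; the input and output sizes, which the pseudocode copies from the ensemble, are omitted here, and all names are illustrative rather than taken from the MotherNets codebase.

# Minimal sketch of Algorithm A (fully-connected case); names are illustrative only.
def build_mothernet_fc(ensemble_layer_widths):
    """ensemble_layer_widths: one list of hidden-layer sizes per ensemble network."""
    # The MotherNet is as shallow as the shallowest ensemble network ...
    num_layers = min(len(widths) for widths in ensemble_layer_widths)
    # ... and at every position it takes the smallest layer found across the cluster.
    return [min(widths[i] for widths in ensemble_layer_widths) for i in range(num_layers)]

# Example: a cluster of three fully-connected networks of different depth and width.
ensemble = [[512, 256, 128], [384, 384], [640, 320, 160, 80]]
print(build_mothernet_fc(ensemble))   # -> [384, 256]

The same two rules (shallowest depth, smallest layer per position) carry over to the convolutional case below, applied per block and per filter configuration.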
Convolutional Neural Networks. Algorithm B provides a detailed strategy to construct the MotherNet for a cluster of convolutional neural networks. We proceed block-by-block, where each block is composed of multiple convolutional layers. The MotherNet has as many blocks as the network with the least number of blocks. Then, for every block, we proceed layer-by-layer and construct the MotherNet layer at every position as follows: First, we compute the least number of convolutional filters and the smallest convolutional filter size at that position across all ensemble networks. Let these be F_min and S_min respectively. Then, in the MotherNet, we include a convolutional layer with F_min filters of size S_min at that position.

Algorithm B Constructing the MotherNet for convolutional neural networks block-by-block.
Input: E: ensemble of convolutional networks in one cluster;
Initialize: M: empty MotherNet;
// set input/output layer sizes and number of blocks
M.input.num_param ← E[0].input.num_param;
M.output.num_param ← E[0].output.num_param;
M.num_blocks ← getShallowestNetwork(E).num_blocks;
// set hidden layers block-by-block
for k ← 0 ... M.num_blocks-1 do
    M.block[k].num_hidden ← getShallowestBlockAt(E, k).num_hidden; // select the shallowest block
    for i ← 0 ... M.block[k].num_hidden-1 do
        M.block[k].hidden[i].num_filters, M.block[k].hidden[i].filter_size ← getMin(E, k, i);
return M;
// Get minimum number of filters and filter size at posn
Function getMin(E, blk, posn)
    min_num_filters ← E[0].block[blk].hidden[posn].num_filters;
    min_filter_size ← E[0].block[blk].hidden[posn].filter_size;
    for j ← 0 ... len(E)-1 do
        if E[j].block[blk].hidden[posn].num_filters < min_num_filters then
            min_num_filters ← E[j].block[blk].hidden[posn].num_filters;
        if E[j].block[blk].hidden[posn].filter_size < min_filter_size then
            min_filter_size ← E[j].block[blk].hidden[posn].filter_size;
    return min_num_filters, min_filter_size;

B MODEL COVARIANCE AND ENSEMBLE PREDICTIVE ACCURACY

We can analyze how model covariance affects ensemble performance by using Chebyshev's inequality to bound the chance that a model predicts an example incorrectly. By showing that lower covariance between models makes this bound smaller, we give an intuitive reason why ensembles with lower covariance between models perform better. The proof also shows that the average model's predictive accuracy is important. Finally, no assumptions need to be made for the proof to hold: the individual models can be of different quality and have different chances of getting each example correct.

Given a fixed training dataset, let $Y_i$ be the softmax value of model $i$ in the ensemble for the correct class, and let $\hat{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i$ be the ensemble's average softmax value on the correct class. Both are random variables, with the randomness of $\hat{Y}$ and $Y_i$ coming through the randomness of neural network training. Under the mild assumption that $E[\hat{Y}] > \frac{1}{2}$, so that a one-vs-all softmax classifier would say on average that the correct class is more likely, Chebyshev's inequality bounds the probability of an incorrect prediction. Namely, the correct prediction is made with certainty if $\hat{Y} \geq \frac{1}{2}$, and so the probability of an incorrect prediction is at most

$$P\left(|\hat{Y} - E[\hat{Y}]| \geq E[\hat{Y}] - \tfrac{1}{2}\right) \leq \frac{\mathrm{Var}(\hat{Y})}{\big(E[\hat{Y}] - \tfrac{1}{2}\big)^{2}}.$$

From the form of the equation, we immediately see that keeping the average model accuracy $E[Y_i]$ high is important, and that degradation in model quality can offset reductions in variance. Since the variance of $\hat{Y}$ decomposes into $\frac{1}{m^{2}}\big(\sum_{i=1}^{m} \mathrm{Var}(Y_i) + \sum_{i \neq i'} \mathrm{Cov}(Y_i, Y_{i'})\big)$, low model covariance keeps the variance of the ensemble low, and models that have high covariance with other models provide little benefit to the ensemble.
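As a quick numerical illustration of this bound (hypothetical numbers, not an experiment from the paper), the NumPy snippet below fixes the per-model mean and variance and varies only the pairwise correlation between models; the Chebyshev bound on the ensemble's error probability shrinks as the covariance drops, matching the analysis above.

# Hypothetical illustration of the Chebyshev bound for an averaged ensemble.
import numpy as np

def chebyshev_error_bound(mean_y, cov_matrix):
    """Bound P(incorrect) <= Var(Y_hat) / (E[Y_hat] - 0.5)^2 for the averaged ensemble."""
    m = len(mean_y)
    ens_mean = np.mean(mean_y)
    ens_var = cov_matrix.sum() / m**2          # Var(Y_hat) = (1/m^2) * sum_ij Cov(Y_i, Y_j)
    assert ens_mean > 0.5, "bound only applies when E[Y_hat] > 1/2"
    return ens_var / (ens_mean - 0.5) ** 2

m = 5
means = np.full(m, 0.7)                         # every model is right on average (E[Y_i] = 0.7)
var_i = 0.02                                    # per-model variance (made up)

for rho in (0.0, 0.8):                          # pairwise correlation between models
    cov = np.full((m, m), rho * var_i)
    np.fill_diagonal(cov, var_i)
    print(f"correlation {rho}: error bound <= {chebyshev_error_bound(means, cov):.3f}")
# Lower covariance between models yields a tighter (smaller) bound on the error probability.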
C SHARED-MOTHERNETS

We now explain how MotherNets can also improve the efficiency of ensemble inference.

Ensemble inference. Inference in an ensemble of neural networks proceeds as follows: First, the data item (e.g., an image or a feature vector) is passed through every network in the ensemble. These forward passes produce multiple predictions, one for every network in the ensemble. The prediction of the ensemble is then computed by combining the individual predictions using some averaging or voting function. As the size of the ensemble grows, the inference cost in terms of memory and time increases linearly: for every additional ensemble network, we need to maintain its parameters as well as perform an additional forward pass.

Shared-MotherNets. We introduce shared-MotherNets to reduce the inference time and memory requirement of ensembles trained through MotherNets. In shared-MotherNets, after the process of hatching (step 2 from §2), the parameters originating from the MotherNet are incrementally trained in a shared manner. This yields a neural network ensemble with a single copy of the MotherNet parameters, reducing both inference time and memory requirement.

Constructing a shared-MotherNet. Given an ensemble E of K hatched networks (i.e., those networks that are obtained from a trained MotherNet), we construct a shared-MotherNet S as follows: First, S is initialized with K input and output layers, one for every hatched network. This allows S to produce as many as K predictions. Then, every hidden layer of S is constructed one-by-one, going from the input to the output layer and consolidating all neurons across all of E that originate from the MotherNet. To consolidate a MotherNet neuron at layer l_i, we first reduce the K copies of that neuron (across all K hatched networks) to a single copy. All inputs to the neuron that may originate from various other neurons in layer l_{i-1} across different hatched networks are added together. The output of this consolidated neuron is then forwarded to all neurons in the next layer l_{i+1} (across all hatched networks) which were connected to the consolidated neuron.

Figure A. To construct a shared-MotherNet, parameters originating from the MotherNet are combined together in the ensemble (hatched ensemble networks on the left, shared MotherNet on the right; ensemble parameters vs. shared parameters).

Figure A shows an example of how this process works for a simple ensemble of three hatched networks. The filled circles represent neurons originating from the MotherNet and the colored circles represent neurons from the ensemble networks. To construct the shared-MotherNet (shown on the right), we go layer-by-layer consolidating MotherNet neurons.

The shared-MotherNet is then trained incrementally. This proceeds similarly to step 3 from §2; however, now, through the shared-MotherNet, the neurons originating from the MotherNet are trained jointly. This results in an ensemble that has K outputs, but some parameters between the networks are shared instead of being completely independent. This reduces the overall number of parameters, improving both the speed and the memory requirement of inference.

Memory reduction. Assume an ensemble E = {N_0, N_1, ..., N_{K-1}} of K neural networks (where N_i denotes a neural network architecture in the ensemble with |N_i| parameters) and its MotherNet M. The number of parameters in the ensemble is reduced by a factor of $\chi$ given by:

$$\chi = 1 - \frac{K\,|M|}{\sum_{i=0}^{K-1} |N_i|}.$$
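As a worked example of this reduction factor, the short snippet below evaluates $\chi$ for a hypothetical three-network ensemble; the parameter counts are made up for illustration and are not measurements from the paper.

# Worked example of the parameter-reduction factor chi (hypothetical counts).
ensemble_params = [1_200_000, 1_500_000, 1_800_000]   # |N_i| for a 3-network ensemble (made up)
mothernet_params = 900_000                            # |M| (made up)

K = len(ensemble_params)
chi = 1 - (K * mothernet_params) / sum(ensemble_params)
print(f"parameter reduction factor chi = {chi:.2f}")   # 1 - 2.7M / 4.5M = 0.40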
Results. Figure B shows how shared-MotherNets improve inference time for an ensemble of 5 variants of VGGNet as described in Table 1. This ensemble is trained on the CIFAR-10 data set. We report both the overall ensemble test error rate and the inference time per image. We see an improvement of 2× with negligible loss in accuracy. This improvement comes from the reduced number of parameters in shared-MotherNets, which requires less computation at inference time, and it scales with the ensemble size.

Figure B. Shared MotherNets improve inference time by 2× for the V5 ensemble. Figure C. MotherNets continue to improve training cost in parallel settings (V5). Figure D. MotherNets is able to utilize multiple GPUs effectively, scaling better than SE. Figure E. MotherNets outperform FGE on Wide ResNet ensembles.

D PARALLEL TRAINING

Deep learning pipelines rely on clusters of multiple GPUs to train computationally-intensive neural networks. MotherNets continue to improve training time in such cases, when an ensemble is trained on more than one GPU. We show this experimentally.

To train an ensemble of multiple networks, we queue all networks that are ready to be trained and assign them to available GPUs in the following fashion: If there are at least as many ready networks as free GPUs, we assign a separate network to every GPU. If the number of ensemble networks available to be trained is less than the number of idle GPUs, then we assign one network to multiple GPUs, dividing the idle GPUs equally between networks. In such cases, we adopt data parallelism to train a network across multiple machines (Dean et al., 2012).

We train on a cluster of 8 Nvidia K80 GPUs and vary the number of available GPUs from 1 to 8. The training hyperparameters are the same as described in Section 3. Figure C and Figure D show the time to train the V5 and D5 ensembles respectively across FD, SE, and MotherNets. We observe that compared to Snapshot Ensembles, MotherNets (g=1) scale better as we increase the number of GPUs. The reason is that after the MotherNet has been trained, the rest of the ensemble networks are all ready to be trained. They can then be trained in a way that minimizes communication overhead by assigning them to as distinct a set of GPUs as possible. Snapshot Ensembles, on the other hand, are generated one after the other. In a parallel setting this boils down to training a single network across multiple GPUs, which incurs communication overhead that increases as the number of GPUs increases (Keuper & Preundt, 2016).
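The following is a minimal sketch (not the authors' implementation) of the GPU-assignment policy described above; the network names and GPU identifiers are placeholders.

# Sketch of the assignment policy: one GPU per network when networks are plentiful,
# otherwise idle GPUs are divided as evenly as possible among the remaining networks.
def assign_networks_to_gpus(ready_networks, gpu_ids):
    assignment = {}                       # network -> list of GPUs it trains on
    if len(ready_networks) >= len(gpu_ids):
        for net, gpu in zip(ready_networks, gpu_ids):
            assignment[net] = [gpu]       # one distinct GPU per network; the rest stay queued
    else:
        per_net = len(gpu_ids) // len(ready_networks)
        for i, net in enumerate(ready_networks):
            assignment[net] = gpu_ids[i * per_net:(i + 1) * per_net]  # data parallelism per network
        for j, gpu in enumerate(gpu_ids[per_net * len(ready_networks):]):
            assignment[ready_networks[j]].append(gpu)                 # spread any leftover GPUs
    return assignment

print(assign_networks_to_gpus(["hatched_0", "hatched_1", "hatched_2"], list(range(8))))

Hatched networks share no training dependencies once the MotherNet is trained, which is why they can be mapped to disjoint sets of GPUs and trained with minimal cross-GPU communication.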
E IMPROVING OVER FAST GEOMETRIC ENSEMBLES

Now we compare against Fast Geometric Ensembles (FGE), a technique closely related to Snapshot Ensembles (SE) (Huang et al., 2017a; Garipov et al., 2018). FGE also trains a single neural network architecture and saves the network's parameters at various points of its training trajectory. In particular, FGE uses a cyclical geometric learning rate schedule to explore various regions of a neural network's loss surface that have a low test error (Garipov et al., 2018). As the learning rate reaches its lowest value in a cycle, FGE saves 'snapshots' of the network parameters. These snapshots are then used in an ensemble.

We compare MotherNets to FGE using an ensemble of Wide Residual Networks trained on CIFAR-10 and CIFAR-100 (Zagoruyko & Komodakis, 2016). Our experiment consists of an ensemble with six WRN-28-10 networks. For MotherNets, we use six variants of this architecture having different numbers of filters and filter widths. For FGE, all networks in the ensemble are the same, as required by the approach. For a fair comparison, the number of parameters is kept identical between the two approaches. We use the same training hyperparameters as discussed in the FGE paper and train for a full training budget of 200 epochs. MotherNets is allocated the same training budget: the MotherNet is trained for 140 epochs and every ensemble network is trained for 10 epochs after hatching. The experimental hardware is the same as outlined in Section 3. Figure E shows that, for an identical training budget, MotherNets is more accurate than FGE across both data sets.
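For intuition on the snapshot mechanism that SE and FGE rely on, here is a generic sketch of a cyclical learning rate where one snapshot is recorded at the lowest point of every cycle. It uses a simple cosine decay as a stand-in and is not the exact geometric schedule used by FGE (Garipov et al., 2018); all constants are illustrative.

# Generic sketch of cycle-based snapshotting (illustrative, not the FGE schedule).
import math

def cyclical_lr(step, cycle_len, lr_max=0.1, lr_min=0.001):
    """Cosine-style decay from lr_max to lr_min within each cycle of cycle_len steps."""
    t = (step % cycle_len) / (cycle_len - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

snapshots = []
for step in range(300):
    lr = cyclical_lr(step, cycle_len=100)
    # ... one training step with learning rate `lr` would go here ...
    if step % 100 == 99:                 # lowest point of the cycle: save a snapshot
        snapshots.append(("snapshot_at_step", step, round(lr, 4)))
print(snapshots)                          # three snapshots, one per cycle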