{"title": "MLPerf Training Benchmark", "book": "Proceedings of Machine Learning and Systems", "page_first": 336, "page_last": 349, "abstract": "Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve training throughput actually increase time to solution, (2) training is stochastic and time to solution has high variance, and (3) the software and hardware systems are so diverse that they cannot be fairly benchmarked with the same binary, code, or even hyperparameters. We present MLPerf, a machine learning benchmark that overcomes these challenges. We quantitatively evaluate the efficacy of MLPerf in driving community progress on performance and scalability across two rounds of results from multiple vendors.", "full_text": "                                                MLPERFTRAININGBENCHMARK\r\n                Peter Mattson1 Christine Cheng2 Cody Coleman3 GregDiamos4 PauliusMicikevicius5 DavidPatterson16\r\n                Hanlin Tang2 Gu-YeonWei7 PeterBailis3 VictorBittorf1 David Brooks7 DehaoChen1 DebojyotiDutta8\r\n                  Udit Gupta7 KimHazelwood9 AndrewHock10 XinyuanHuang8 AtsushiIke11 BillJia9 DanielKang3\r\n                    DavidKanter12 NaveenKumar1 JefferyLiao13 GuokaiMa2 DeepakNarayanan3 TayoOguntebi1\r\n                     GennadyPekhimenko1415 LillianPentecost7 Vijay Janapa Reddi7 Taylor Robie1 Tom St. John16\r\n                TsuguchikaTabaru11 Carole-JeanWu9 LingjieXu17 MasafumiYamazaki11 CliffYoung1 MateiZaharia3\r\n                                                                    ABSTRACT\r\n                   Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive\r\n                   evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique\r\n                   benchmarking challenges absent from other domains: optimizations that improve training throughput can increase\r\n                   the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware\r\n                   systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is dif\ufb01cult.\r\n                   Wetherefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively\r\n                   evaluates MLPerf\u2019s ef\ufb01cacy at driving performance and scalability improvements across two rounds of results\r\n                   from multiple vendors.\r\n               1   INTRODUCTION                                               Corporation (SPEC) for Unix servers (Dixit, 1991) and\r\n              Machine learning (ML) has revolutionized numerous do-           the Transaction Processing Performance Council (TPC) for\r\n              mains, including computer vision (Krizhevsky et al., 2012),     transaction processing anddatabases(Council,2005). These\r\n              language processing (Devlin et al., 2018; Radford et al.,       organizations helped develop and maintain benchmarks that\r\n              2019), speech recognition (Hinton et al., 2012), and gam-       their respective communities then embraced. 
Their success\r\n              ing (Silver et al., 2018; Mnih et al., 2013; Chan, 2018).       inspired the formation of MLPerf, a consortium of commer-\r\n              Muchofthisprogressowestodeeplearning(DL),whichin-               cial and academic organizations, to design a comprehensive\r\n              volves training of large deep-neural-network (DNN) models       benchmark suite for DL.\r\n              on massive data sets. To keep up with this growing com-         Unlike other computational workloads, DL allows a range\r\n              putational demand, hardware and software systems have           of statistical, hardware, and software optimizations that can\r\n              garnered sizable investments (Amodei & Hernandez, 2018).        change the mathematical semantics of the underlying opera-\r\n              As the number of hardware and software systems for DL           tors. Although these optimizations can boost performance\r\n              training increases (Paszke et al., 2017; Abadi et al., 2016;    (i.e., training speed), some change the learning dynamics\r\n              Chenetal., 2015; Jia et al., 2014; Jouppi et al., 2017; Chen    and affect the \ufb01nal model\u2019s quality (i.e., accuracy). Even\r\n              et al., 2018; Markidis et al., 2018; Intel, 2019), so does      accommodating different system scales (e.g., varying the\r\n              the need for a comprehensive benchmark. History shows           number of chips) requires changing hyperparameters, po-\r\n              that benchmarks accelerate progress (Hennessy & Patterson,      tentially affecting the amount of computation necessary to\r\n              2011); for example, breakthroughs in microprocessor and         reach a particular quality target. By contrast, other com-\r\n              relational-database systems in the 1980s inspired industry      pute benchmarks can evaluate systems through targeted\r\n              consortiums to create Standard Performance Evaluation           microbenchmarks.\r\n                  1       2     3                   4           5             DLisalsointrinsically approximate and stochastic, allow-\r\n                  Google Intel Stanford University Landing AI NVIDIA          ing multiple equally correct solutions\u2014unlike conventional\r\n              6University of California, Berkeley 7Harvard University 8Cisco\r\n              9Facebook 10Cerebras 11Fujitsu 12Real World Technologies        computing, which tends to allow just one correct solution.\r\n              13Synopsys 14University of Toronto 15Vector Institute 16Tesla   As a result, implementations and training times can vary\r\n              17Alibaba.   Correspondence to:   Peter Mattson <petermatt-     while the \ufb01nal quality remains the same. Since it is ap-\r\n              son@google.com>.                                                proximate, DL requires careful de\ufb01nition of equally valid\r\n              Proceedings of the 3rd MLSys Conference, Austin, TX, USA,       solution classes and the appropriate degrees of freedom.\r\n              2020. Copyright 2020 by the author(s).                          Prior work has varied in granularity but has either left the\r\n                                                               MLPerfTrainingBenchmark\r\n               above challenges unaddressed or lacked critical workloads           \u2022 Establish rules that ensure submissions are equivalent\r\n               representative of modern ML. 
Microbenchmarks such as                   to these reference implementations and use equivalent\r\n               DeepBench(Baidu,2017)areaffordabletorunandenablea                      hyperparameters.\r\n               fair comparison of competing systems by isolating hardware\r\n               and software from statistical optimizations, but they fail to       \u2022 Establish timing rules to minimize the effects of\r\n               re\ufb02ect the complexity of real workloads and have limited               stochasticity when comparing results.\r\n               utility. Although throughput benchmarks like Fathom and             \u2022 Make submission code open source so that the ML\r\n               TBD(Adolf et al., 2016; Zhu et al., 2018; Google, 2017)                and systems communities can study and replicate the\r\n               evaluate full model architectures across a broad range of              results.\r\n               tasks to better re\ufb02ect the diversity and complexity of real\r\n               workloads,theylimitmodelarchitectureandtraininginnova-              \u2022 Formworkinggroupstokeepthebenchmarksuiteup\r\n               tions that advance the state-of-the-art. DAWNBench (Cole-              to date.\r\n               manetal., 2017) measures end-to-end training time, subject\r\n               to a quality threshold (i.e., time to train), and it accommo-    Therest of the paper is organized as follows. In \u00a7 2, we dis-\r\n               dates innovative solutions (i.e., new model architectures        cuss the main challenges to benchmarks for DL training, as\r\n               and training techniques, such as progressive resizing and        well as related prior work. In \u00a7 3, we review the benchmarks\r\n               cyclic learning rates). It additionally collects source code     in our suite, the time-to-train metric, and quality thresholds.\r\n               to promote reproducibility. DAWNBench\u2019s \ufb02exibility, how-         In \u00a7 4, we describe the submission, review, and reporting of\r\n               ever, also made it dif\ufb01cult to draw fair comparisons between     results for the various categories. Finally, in \u00a7 5 and \u00a7 6, we\r\n               hardware and software platforms. MLPerf builds on the            reviewprogressbetweenthe\ufb01rsttwoMLPerfbenchmarking\r\n               strengths of prior work; it combines a broad set of bench-       rounds, along with future work directions.\r\n               marks like Fathom or TBD, an end-to-end training metric\r\n               like DAWNBench,andthebackingofabroadconsortium                   2    BACKGROUND\r\n               like SPEC.\r\n               MLPerfaimstocreate a representative benchmark suite for          Webegin by describing in \u00a7 2.1 the unique challenges of\r\n               MLthat fairly evaluates system performance to meet \ufb01ve           benchmarking ML relative to other compute tasks (Don-\r\n               high-level goals:                                                garra, 1988; Council, 2005) and then review prior ML-\r\n                 \u2022 Enable fair comparison of competing systems while            benchmarking efforts in \u00a7 2.2.\r\n                    still encouraging ML innovation.                            2.1   UniqueChallengesofBenchmarkTraining\r\n                 \u2022 Accelerate ML progress through fair and useful mea-          MLbenchmarkingfacesuniquechallenges relative to other\r\n                    surement.      
                                             compute benchmarks, such as LINPACK (Dongarra, 1988)\r\n                                                                                and SPEC(Dixit, 1991), that necessitate an end-to-end ap-\r\n                 \u2022 Enforce reproducibility to ensure reliable results.          proach. After an ML practitioner selects a data set, opti-\r\n                 \u2022 Serve both the commercial and research communities.          mizer, and DNN model, the system trains the model to its\r\n                                                                                state-of-the-art quality (e.g., Top-1 accuracy for image clas-\r\n                 \u2022 Keepbenchmarking effort affordable so all can partici-       si\ufb01cation). Provided the system meets this requirement, the\r\n                    pate.                                                       practitioner can make different operation, implementation,\r\n                                                                                and numerical-representation choices to maximize system\r\n               This paper focuses on the design and rationale for the           performance\u2014that is, how fast the training executes. Thus,\r\n               MLPerfTraining benchmark (a related MLPerf Inference             an MLperformance benchmark must ensure that systems\r\n               benchmarkisbeyondthepresentscope). AlthoughpriorML               under test achieve state-of-the-art quality while providing\r\n               benchmarking efforts (Coleman et al., 2017; Adolf et al.,        suf\ufb01cient \ufb02exibility to accommodate different implemen-\r\n               2016; Google, 2017; Baidu, 2017; Zhu et al., 2018) each          tations. This tradeoff between quality and performance is\r\n               contributed to meeting one or more of the above goals, we        challenging because multiple factors affect both the \ufb01nal\r\n               created MLPerf to address all of them holistically, build-       quality and the time to achieve it.\r\n               ing on the lessons learned from these efforts. To this end,\r\n               MLPerfTraining does the following:                               2.1.1   Effect of Optimizations on Quality\r\n                 \u2022 Establish a comprehensive benchmark suite that covers        Although many optimizations immediately improve tradi-\r\n                    diverse applications, DNN models, and optimizers.           tional performance metrics such as throughput, some can\r\n                                                                                decreasethe\ufb01nalmodelquality,aneffectthatisonlyobserv-\r\n                 \u2022 Create reference implementations of each benchmark           able by running an entire training session. For example, the\r\n                    to precisely de\ufb01ne models and training procedures.          accuracy difference between single-precision training and\r\n                                                                 MLPerfTrainingBenchmark\r\n               lower-precision training only emerges in later epochs (Zhu                20\r\n               et al., 2016). 
Across several representation and training choices, the validation-error curves may only separate after tens of epochs, and some numerical representations never match the final validation error of full-precision training (lower validation error directly corresponds to higher accuracy: accuracy = 1 − error_validation). Thus, even though microbenchmarks (Baidu, 2017; Chetlur et al., 2014) can assess an optimization's performance impact, a complete training session is necessary to determine the quality impact and whether the model achieves the desired accuracy. Owing to the introduction of systems with varying numerics (Abadi et al., 2016; Banner et al., 2018; Köster et al., 2017; Micikevicius et al., 2018) and performance optimizations, ML benchmarks must include accuracy metrics.

2.1.2   Effect of Scale on Time to Train

ML training on large distributed systems with many processors typically involves data parallelism and large minibatches to maximize system utilization and minimize training time. In turn, these large minibatches require adjustments to optimizer parameters, such as the learning rate (Krizhevsky, 2014; Goyal et al., 2017). Together, these changes affect the learning dynamics and can alter the number of iterations required to achieve the target accuracy. For example, MLPerf v0.5 ResNet-50 takes about 64 epochs to reach the target Top-1 accuracy of 74.9% at a minibatch size of 4K,¹ whereas a minibatch size of 16K can require more than 80 epochs to reach the same accuracy, increasing computation by 30%. Larger minibatches, however, permit efficient scaling to larger distributed systems, reducing the time to train the model.
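As a rough back-of-the-envelope check on those numbers (a sketch only: it uses the ImageNet training-set size given in § 3.1.1 and treats 80 epochs as a lower bound for the 16K case), the larger minibatch costs more total computation but needs far fewer sequential optimizer steps:

    # Approximate arithmetic for the ResNet-50 example above.
    images_per_epoch = 1_281_167                  # ImageNet training set (~1.28M images)

    epochs_4k,  batch_4k  = 64, 4096
    epochs_16k, batch_16k = 80, 16384             # "more than 80 epochs": 80 is a lower bound

    steps_4k  = epochs_4k  * images_per_epoch // batch_4k    # ~20,000 optimizer steps
    steps_16k = epochs_16k * images_per_epoch // batch_16k   # ~6,300 optimizer steps

    print(epochs_16k / epochs_4k - 1)   # >= 0.25, consistent with the ~30% extra computation quoted above
    print(steps_4k / steps_16k)         # ~3.2x fewer sequential steps at the larger minibatch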
The tradeoffs between system size, minibatch size, and learning dynamics present another challenge for a DL-focused performance benchmark.

2.1.3   Run-to-Run Variation

DNN training involves many stochastic influences that manifest in substantial run-to-run variation (Choromanska et al., 2015; Gori & Tesi, 1992; Auer et al., 1996; Coleman et al., 2019). Different training sessions for the same model using the same hyperparameters can yield slightly different accuracies after a fixed number of epochs. Alternatively, different training sessions can take a different number of epochs to reach a given target accuracy. For example, Figure 1 shows the number of epochs needed to reach target accuracy for two MLPerf v0.5 benchmarks using reference implementations and default batch sizes. Several factors contribute to this variation, such as application behavior (e.g., random weight initialization and random data traversal) and system characteristics (e.g., profile-driven algorithm selection and the non-associative nature of floating-point addition). Large distributed-training tasks can involve asynchronous updates, altering the gradient-accumulation order. These variations make it hard to reliably compare system performance.

Figure 1. Training epochs to reach the target quality for the MLPerf v0.5 NCF (a) and MiniGo (b) benchmarks. Each experiment uses identical hyperparameters except for the random seed. For MiniGo, we observed considerable variability across runs even when fixing the random seed (same color).

2.1.4   Diverse Software

Multiple ML software frameworks have emerged, each of which executes similar but distinct computations owing to various implementations and constraints (Abadi et al., 2016; Paszke et al., 2017; Chen et al., 2015; Jia et al., 2014). Software frameworks and the underlying math libraries employ different algorithms to implement the same operation. For example, convolutional and fully connected layers, two compute-intensive operators prevalent in modern DNN models, typically use cache blocking to exploit processor memory hierarchies. Different block sizes and processing orders (which optimize for different hardware), although algebraically equivalent, yield slightly divergent results. In addition, operators can execute using various algorithms. For example, convolution layers can be executed using a variety of algorithms, including GEMM-based and transform-based (e.g., FFT or Winograd) variants. In fact, the cuDNN v7.6 library provides roughly 10 algorithms for the forward pass of a convolutional layer,² some of which vary in tiling or blocking choices depending on the hardware. Although mathematically equivalent, different implementations will produce different numerical results, as floating-point representations have finite precision.

¹ Source: MLPerf v0.5 results (https://mlperf.org/training-results-0-5).
² Source: cuDNN (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide).
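To illustrate the finite-precision point above with a minimal, self-contained sketch (the values are arbitrary), summing the same contributions in two different orders already changes the low-order bits of the result, so algebraically equivalent implementations cannot be expected to agree bitwise:

    import random

    random.seed(0)
    values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
              for _ in range(10000)]

    forward_sum = sum(values)              # one accumulation order
    reverse_sum = sum(reversed(values))    # same values, opposite order

    print(forward_sum, reverse_sum)
    print("difference:", abs(forward_sum - reverse_sum))
    # Floating-point addition is not associative, so the two results typically
    # differ in the last bits even though the operands are identical.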
Additionally, frameworks occasionally implement the same function in mathematically different ways. For example, modern training frameworks implement stochastic gradient descent with momentum in two ways:

    momentum = α · momentum + η · (∂L/∂w),    w = w − momentum        (1)

    momentum = α · momentum + (∂L/∂w),        w = w − η · momentum    (2)

The Caffe framework (Jia et al., 2014) implements the first approach, whereas PyTorch (Paszke et al., 2017) and TensorFlow (Abadi et al., 2016) implement the second. These approaches differ mathematically if the learning rate η changes during training, a common technique. Although this difference is tiny in many cases, it can hinder training convergence for larger minibatches (a short code sketch contrasting the two update rules appears later in this section).

Variations also arise owing to the frameworks' programming interface. For example, PyTorch and TensorFlow interpret asymmetric padding differently, complicating the task of porting model weights between them. Data-augmentation pipelines across frameworks can also apply image augmentations (e.g., crop, zoom, and rotation) in different orders. Although ONNX (Bai et al., 2019), TVM (Chen et al., 2018), and similar emerging tools enable interoperability of model architectures across frameworks, their support remains limited. Moreover, ML systems involve a range of optimizations that extend beyond the model architecture, such as preprocessing, precision, and communication methods. Benchmarks must accommodate the wide diversity of deployed systems despite this lack of a standard way to specify every training aspect.

2.2   Prior Work

Prior ML benchmarks vary in granularity and scope. Microbenchmarks such as DeepBench (Baidu, 2017) measure kernel-level operations that appear in commonly deployed models. Benchmarking such low-level operations fails to address the challenges associated with numerical precision, hyperparameter choices, and system scale, which we described in the previous section. Furthermore, it neither captures the end-to-end application, nor accounts for memory- and cache-hierarchy effects across layers and operations, nor measures the data preprocessing that deep learning commonly employs.

Several benchmarks are defined at the granularity of entire DNN models. Fathom and Google TF Benchmarks (Adolf et al., 2016; Google, 2017) provide a reference suite of DNN models that span a wide application space, but they specifically measure model throughput and fail to account for accuracy. Similarly, TBD (Training Benchmarks for DNNs) (Zhu et al., 2018) profiles training on GPUs (but not other architectures) across diverse workloads, measuring characteristics such as memory and hardware utilization. Our benchmark builds on the diversity of applications in these projects while also capturing the quality and performance tradeoffs.

DAWNBench (Coleman et al., 2017) was the first multi-entrant benchmark competition to use "time to train" (originally called time to accuracy) to measure the end-to-end performance of deep-learning systems; it allowed optimizations across model architectures, optimization procedures, software frameworks, and hardware platforms. Our benchmark follows a similar approach but handles more-diverse tasks (§ 3.1), and it uses important rules and mechanisms in the Closed division (§ 4.2.1) to enable fair comparisons of hardware and software systems.

Several other benchmarks are under development. AI Matrix measures workloads at different granularities (microbenchmarks, layer-wise benchmarks, end-to-end model benchmarks, and synthetic benchmarks) (aim).
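To make the two momentum formulations of § 2.1.4 (Equations 1 and 2) concrete, here is a minimal sketch in plain Python (scalar weight, arbitrary synthetic gradients; not drawn from any framework's actual implementation) that applies both update rules to the same weight with identical gradients and a learning-rate drop partway through:

    import math

    def sgd_momentum_v1(w, m, grad, lr, alpha=0.9):
        # Equation 1 (Caffe-style): the learning rate scales the gradient inside the buffer.
        m = alpha * m + lr * grad
        return w - m, m

    def sgd_momentum_v2(w, m, grad, lr, alpha=0.9):
        # Equation 2 (PyTorch/TensorFlow-style): the learning rate scales the whole update.
        m = alpha * m + grad
        return w - lr * m, m

    w1 = w2 = 1.0
    m1 = m2 = 0.0
    for step in range(20):
        lr = 0.1 if step < 10 else 0.01        # a learning-rate drop partway through training
        grad = math.sin(step)                  # identical synthetic gradient fed to both variants
        w1, m1 = sgd_momentum_v1(w1, m1, grad, lr)
        w2, m2 = sgd_momentum_v2(w2, m2, grad, lr)

    print(w1, w2)  # equal (up to rounding) while lr is constant; the trajectories drift apart after the drop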
Deep500, al-\r\n               asymmetric padding differently, complicating the task of        though not a benchmark, provides a software framework\r\n               porting model weights between them. Data-augmentation           for measuring DL-training performance (Ben-Nun et al.,\r\n               pipelines across frameworks can also apply image augmen-        2019).\r\n               tations (e.g., crop, zoom, and rotation) in different orders.\r\n               Although ONNX (Bai et al., 2019), TVM (Chen et al.,             3    MLPERFTRAININGBENCHMARK\r\n               2018), and similar emerging tools enable interoperability       WenowpresenttheMLPerfTrainingbenchmark,detailing\r\n               of model architectures across frameworks, their support re-     the workloads (\u00a7 3.1), timing rules (\u00a7 3.2), quality-threshold\r\n               mains limited. Moreover, ML systems involve a range of          choices (\u00a7 3.3), and reference implementations and hyper-\r\n               optimizations that extend beyond the model architecture,        parameters (\u00a7 3.4).\r\n               such as preprocessing, precision, and communication meth-\r\n               ods. Benchmarks must accommodate the wide diversity             3.1   BenchmarkSuite\r\n               of deployed systems despite this lack of a standard way to\r\n               specify every training aspect.                                  To create a fair and useful benchmark suite for modern\r\n                                                                               ML workloads, we curated a representative set of tasks\r\n               2.2  Prior Work                                                 from several major ML areas, including vision, language,\r\n               Prior ML benchmarks vary in granularity and scope. Mi-          recommendation, and reinforcement learning. Our selec-\r\n               crobenchmarks such as DeepBench (Baidu, 2017) measure           tion of benchmarks was primarily based on commercial\r\n               kernel-level operations that appear in commonly deployed        and research relevance, representing diverse compute mo-\r\n               models. Benchmarking such low-level operations fails to         tifs. To establish relevance, we relied on feedback from the\r\n               address the challenges associated with numerical precision,     tens of commercial and academic organizations that support\r\n               hyperparameter choices, and system scale, which we de-          MLPerf. To keep the suite affordable, we selected a com-\r\n               scribed in the previous section. Furthermore, it neither cap-   pact but representative set of seven benchmarks, which we\r\n               tures the end-to-end application, nor accounts for memory-      describe below and summarize in Table 1. Although these\r\n               and cache-hierarchy effects across layers and operations,       benchmarks already cover a wide range of research and\r\n               nor measures the data preprocessing that deep learning com-     industrial tasks, we are continuously exploring additional\r\n               monlyemploys.                                                   ones to keep the suite relevant to the ML community (\u00a7 6).\r\n                                                                 MLPerfTrainingBenchmark\r\n                                                           Table 1. 
MLPerf Training v0.5 benchmarks.\r\n                                Benchmark                    Dataset                 Model                    Quality Threshold\r\n                             Image classi\ufb01cation             ImageNet            ResNet-50 v1.5             74.9%Top-1accuracy\r\n                                                        (Deng et al., 2009)     (MLPerf, 2019b)\r\n                              Object detection             COCO2017             SSD-ResNet-34                     21.2 mAP\r\n                                (lightweight)            (Lin et al., 2014)     (Liu et al., 2016)\r\n                          Instance segmentation and        COCO2017              MaskR-CNN                     37.7 Box min AP,\r\n                        object detection (heavyweight)   (Lin et al., 2014)     (He et al., 2017a)            33.9 Mask min AP\r\n                                 Translation             WMT16EN-DE                  GNMT                      21.8 Sacre BLEU\r\n                                 (recurrent)               (WMT,2016)           (Wuetal., 2016)\r\n                                 Translation             WMT17EN-DE               Transformer                     25.0 BLEU\r\n                               (nonrecurrent)              (WMT,2017)         (Vaswani et al., 2017)\r\n                              Recommendation             MovieLens-20M                NCF                       0.635 HR@10\r\n                                                        (GroupLens, 2016)       (He et al., 2017b)\r\n                           Reinforcement learning               Go                  MiniGo            40.0%Professional move prediction\r\n                                                            (9x9 Board)         (MLPerf, 2019a)\r\n               3.1.1   Image Classi\ufb01cation                                        3.1.2   Object Detection and Segmentation\r\n               Image classi\ufb01cation is the most common task for evaluat-           Object detection and segmentation are crucial components\r\n               ing ML-system performance (Coleman et al., 2017; Adolf             of manyindustrialsystemsforrobotics, autonomousdriving,\r\n               et al., 2016; Zhu et al., 2018; Goyal et al., 2017; Jia et al.,    video analytics, and social networks. Object detection is a\r\n               2018; Mikami et al., 2018; Ying et al., 2018; Google, 2017;        regression task as opposed to a classi\ufb01cation task: it returns\r\n               Narayanan et al., 2019). A classi\ufb01er selects a class that          bounding-box coordinates for objects in a given image. Seg-\r\n               best describes the contents of a given image. Classi\ufb01cation        mentation assigns an object class to each input-image pixel.\r\n               modelarchitectures also serve as feature extractors for many       Although pretrained image-classi\ufb01cation models commonly\r\n               other computer-vision workloads, including object detec-           serve as the backbone (feature extractor) for DNN object de-\r\n               tion, captioning, and style transfer. We use the ILSVRC            tectors and segmenters, these DNN tasks differ from image\r\n               2012 ImageNet classi\ufb01cation data set, consisting of 1.28           classi\ufb01cation in their compute characteristics. 
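Bounding-box quality for these tasks is typically scored through intersection-over-union (IoU), the geometric primitive underlying the mAP thresholds listed in Table 1; a minimal sketch (boxes given as (x1, y1, x2, y2) corner coordinates, a convention assumed here for illustration):

    def iou(box_a, box_b):
        # Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    # A detection counts as a true positive when IoU exceeds a threshold
    # (COCO-style mAP averages average precision over IoU thresholds from 0.5 to 0.95).
    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 0.142857...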
Examples\r\n               million training images and 50,000 validation images (Deng         include additional layer types (upscaling, ROIalign, NMS,\r\n               et al., 2009). Our model-quality metric is the Top-1 accuracy      and sorting); moreover, the inputs have greater resolution.\r\n               onthe validation set.                                              MLPerf uses the 2017 COCO data set (Lin et al., 2014)\r\n               ResNet-50 is a residual network (He et al., 2016a;b); such         consisting of 118,000 training images and 5,000 validation\r\n               networks and their derivatives remain the state of the art         images. Model-quality measurement uses mAP for both\r\n               in image classi\ufb01cation, and system studies commonly use            detection and segmentation.\r\n               them (Goyal et al., 2017; Jia et al., 2018; Mikami et al.,         Mask R-CNN (He et al., 2017a) is a popular object-\r\n               2018; Ying et al., 2018; Sun et al., 2019). Several slightly       detection and instance-segmentation model for images. It\r\n               different ResNet-50 implementations appear in training-            has two stages: the \ufb01rst proposes regions of interest, and\r\n               framework repositories, preventing comparison of earlier           the second processes them to compute bounding boxes and\r\n               system-performance claims because of model differences.            segmentation masks. Mask R-CNN provides high-accuracy\r\n               Toensure meaningful system comparison, MLPerf uses the             results for these tasks, but at the cost of higher latency as\r\n               ResNet-50 v1.5 model, which performs addition after batch          well as greater compute and memory requirements. The\r\n               normalization, omits 1\u00d71convolutionfromtheskipconnec-              benchmarktraining uses images resized to 800 pixels on the\r\n               tion of the \ufb01rst residual block, and applies downsampling          shorter side and employs ResNet-50 as the backbone.\r\n               by the 3 \u00d7 3 convolutions. MLPerf also speci\ufb01es the appro-         Single Shot Detection (SSD) (Liu et al., 2016) serves in\r\n               priate parameter initialization, optimizer schedule, and data      real-time applications that require low-latency solutions.\r\n               augmentation.                                                      These applications include autonomous driving, robotics,\r\n                                                                                  and video analytics. Compared with Mask R-CNN (Huang\r\n                                                                                  et al., 2016) and other two-stage solutions, SSD trades speed\r\n                                                                                  for accuracy. Instead of full images, training uses 300\u00d7300\r\n                                                            MLPerfTrainingBenchmark\r\n              crops. We chose a ResNet-34 backbone to represent current     spending most of their time in computations unrelated to\r\n              real-time applications. ResNet-34 has a different residual-   ML. 
To measure quality, we calculate the percentage of\r\n              block structure than ResNet-50, increasing the diversity of   predicted moves that match human reference games.\r\n              computational motifs that MLPerf covers.\r\n                                                                            3.1.5  Recommendation\r\n              3.1.3  Translation                                            Recommendationsystemsareamajorcommercialworkload\r\n              Neural machine translation converts a sequence of words       for Internet companies (Naumov et al., 2019; Zhou et al.,\r\n              from the source language to a target language; many indus-    2018; Cheng et al., 2016). These workloads are character-\r\n              trial applications employ this technology. As is common in    ized by large embedding tables followed by linear layers.\r\n              translation research, we use the WMT English-to-German        Neural collaborative \ufb01ltering (NCF) (He et al., 2017b)\r\n              (EN-DE)dataset (WMT,2017),whichcontains about 4.5             was our choice for the benchmark. It is trained to predict\r\n              million sentence pairs. Our model-quality metric is the       user-item interactions. More so than for other tasks, this\r\n              Bilingual Evaluation Understudy Score (Bleu) score on the     recommender\u2019s compute characteristics depend on the data\r\n              Newstest2014 test set. We include two translation bench-      set. For example, the data set de\ufb01nes the embedding-table\r\n              marks to account for the two model architectures that trans-  size as well as the memory-access patterns. Thus, a repre-\r\n              lation and other sequence-data tasks often employ.            sentative data set is crucial to a representative benchmark.\r\n              Transformer (Vaswani et al., 2017) is an attention-based      Unfortunately, however, public data sets tend to be orders\r\n              model that achieves state-of-the-art language-translation     of magnitude smaller than industrial data sets. Although\r\n              quality. It consists of an encoder and decoder, each being    MLPerfv0.5adoptedtheMovieLens-20Mdataset(Grou-\r\n              a stack of six blocks. Every block comprises a multihead      pLens, 2016) for its NCF benchmark, v0.7 will employ a\r\n              attention layer and point-wise fully connected layers.        synthetically generated data set and benchmark while re-\r\n              GNMT (Wu et al., 2016) is a recurrent neural network          taining the characteristics of the original data (Belletti et al.,\r\n              (RNN) for language translation. Even though it achieves       2019)\r\n              lower accuracy than Transformer on the WMT English-to-        3.2  Time-to-Train Performance Metric\r\n              German data set, it appears in the suite to represent RNN\r\n              applications. These applications span numerous tasks, but     Toaddress the ML-benchmarking challenges of system op-\r\n              language-translation data sets and publications are more      timization and scale that we outlined in \u00a7 2.1.1 and \u00a7 2.1.2,\r\n              common, enabling clearer system comparison. GNMT is           MLPerf\u2019sperformancemetricisthetimetotraintoade\ufb01ned\r\n              the suite\u2019s only RNN. It consists of an eight-layer encoder   quality target. 
It incorporates both system speed and accu-\r\n              and an eight-layer decoder, each using 1,024 LSTM cells       racy and is most relevant to ML practitioners. As an end-to-\r\n              with skip connections.                                        end metric, it also captures the auxiliary operations neces-\r\n                                                                            sary for training such models, including data-pipeline and\r\n              3.1.4  Reinforcement Learning                                 accuracy calculations. The metric\u2019s generality enables ap-\r\n              Reinforcement learning (RL) is responsible for the recent     plication to reinforcement learning, unsupervised learning,\r\n              dramatic increase in compute demand (Amodei & Hernan-         generative adversarial networks, and other training schemes.\r\n              dez, 2018), and it serves in control systems. RL algorithms   Timetotrainovercomesthechallengesin\u00a72.1.1and\u00a72.1.2\r\n              can train agents (which includes neural networks) that rival  bypreventing submissions from using quality-reducing op-\r\n              humansatvideogames,go,andchess\u2014majormilestones                timizations while still allowing for extensive system-scale\r\n              in machine learning (Silver et al., 2018; Mnih et al., 2013;  and software-environment \ufb02exibility.\r\n              Chan, 2018). RL has a different computational pro\ufb01le than     3.2.1  Timing Rules\r\n              the other ML benchmarks: it generates training data through\r\n              exploration instead of relying on a predetermined data set.   We chose the timing requirements to ensure fair system\r\n              MiniGo (MLPerf, 2019a), inspired by AlphaGo (Silver           comparisons and to represent various training use cases.\r\n                                                                            Timing begins when the system touches any training or\r\n              et al., 2016; 2017; 2018), trains a single model that rep-    validation data, and it stops when the system achieves the\r\n              resents both value and policy functions for a 9 \u00d7 9 game      de\ufb01ned quality target on the validation data set.\r\n              board. Training uses self-play (simulated games) between      Weexcludefromtimingseveral components that can carry\r\n              agents to generate data; rather than using a simulator, it    substantial overhead and that are unrepresentative of real-\r\n              performs many forward passes through the model to gener-      world differences.\r\n              ate actions. We chose MiniGo to keep MLPerf more ML\r\n              oriented, since many other RL problems employ simulators      System initialization. Initialization, especially at large\r\n              (physics, video-game environments, etc.) to generate data,\r\n                                                               MLPerfTrainingBenchmark\r\n               scales, varies on the basis of cluster-administrator choices            100        Seed 1     Seed 3    Seed 5\r\n               and system-queue load. For example, it may involve run-                            Seed 2     Seed 4\r\n               ning diagnostics on each node before starting the training               75\r\n               job. Such overheads are unindicative of a system\u2019s training              50\r\n               capability, so we exclude them from timing.                         
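As a minimal illustration of these timing rules (a sketch only: the data-loading, model-building, training, and evaluation callables are hypothetical placeholders, and real submissions follow the full rule set rather than this loop), the timed region opens just before the first touch of training or validation data and closes when the validation target is first met; the second helper mirrors the run-aggregation rule given below in § 3.2.2:

    import time
    import statistics

    def timed_run(load_data, build_model, train_one_epoch, evaluate, target_quality):
        # System initialization (node diagnostics, cluster setup) happens before this
        # point and is excluded from the measurement.
        start = time.time()                       # clock starts before any data is touched
        train_set, val_set = load_data()
        model = build_model()                     # model creation (see the exclusions discussed below)
        while evaluate(model, val_set) < target_quality:
            train_one_epoch(model, train_set)
        return time.time() - start                # clock stops at the quality target

    def mlperf_result(run_times):
        # Drop the fastest and slowest runs, then report the arithmetic mean of the rest
        # (5 runs for vision tasks, 10 for the others, per Section 3.2.2).
        kept = sorted(run_times)[1:-1]
        return statistics.mean(kept)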
Model creation and initialization. Some frameworks can compile the model graph to optimize subsequent execution. This compilation time is insignificant for the longer training sessions when using industry-scale data sets. MLPerf, however, uses public data sets that are usually much smaller than industry ones. Therefore, large distributed systems can train some MLPerf benchmarks in minutes, making compilation times a substantial portion of the total time. To make benchmarks representative of training on the largest industrial data sets, we allow exclusion of up to 20 minutes of model-creation time. This limit ensures that MLPerf captures smaller training jobs, and it discourages submissions with compilation approaches that are too computationally and operationally expensive to use in practice.

Data reformatting. The raw input data commonly undergoes reformatting once and then serves in many subsequent training sessions. Reformatting examples include changing image-file formats and creating a database (e.g., LMDB, TFRecords, or RecordIO) for more-efficient access. Because these operations execute once for many training sessions, MLPerf timing excludes reformatting. But it prohibits any data processing or augmentation that occurs in training from moving to the reformatting stage (e.g., it prevents different crops of each image from being created and saved before the timed training stage).

3.2.2   Number of Timing Runs

To address the stochastic nature and resulting run-to-run variance of modern deep-learning methods described in § 2.1.3, MLPerf requires that submissions provide several runs of each benchmark to stabilize timing. We determined the number of runs, which varies among benchmarks, by studying the behavior of reference implementations. Vision tasks require 5 runs to ensure 90% of entries from the same system are within 5%; all other tasks require 10 runs to ensure 90% of entries from the same system are within 10%. MLPerf drops the fastest and slowest times, reporting the arithmetic mean of the remaining runs as the result.

3.3   Choice of Quality Thresholds

For each benchmark, we chose quality metrics near the state of the art for the corresponding model and data set (Table 1), basing our choice on experiments with the reference implementations. Some of these thresholds are slightly lower than results in the literature, enabling us to benchmark across software frameworks and to ensure that training sessions consistently achieve the quality metric. Although selecting a lower threshold that is achievable earlier in a training session reduces submission resources, we chose higher thresholds that require longer training sessions for two reasons: First, we must prevent optimizations from adversely affecting the final results (challenges described in § 2.1.1 and § 2.1.2). Second, we must minimize run-to-run variation, which tends to be much higher early in training. For example, Figure 2 shows accuracy for five training sessions of MLPerf v0.5's ResNet-50 v1.5 reference implementation, where the first 30 epochs exhibit considerably more noise.

Figure 2. Top-1 accuracy of MLPerf v0.5 ResNet-50 benchmark over 100 epochs for five runs (denoted by color) with identical hyperparameters but different random seeds. The dashed line indicates the quality target of 74.9% Top-1 accuracy. The early training phase exhibits much more variability than later phases.

3.4   References and Hyperparameters

MLPerf provides a reference implementation for each benchmark, using either the PyTorch or TensorFlow framework. References also include scripts or directions to download and preprocess public data sets.
References are not opti-\r\n               To address the stochastic nature and resulting run-to-run       mized for performance (meaning they should not be used\r\n               variance of modern deep-learning methods described in           for performance assessment or comparison), as their main\r\n               \u00a7 2.1.3, MLPerf requires that submissions provide several       purpose is to de\ufb01ne a concrete implementation of a bench-\r\n               runs of each benchmark to stabilize timing. We determined       mark model and training procedure. All submitters must\r\n               the number of runs, which varies among benchmarks, by           follow these references\u2014they may reimplement a bench-\r\n               studying the behavior of reference implementations. Vision      mark in their framework of choice as long as the DNN\r\n               tasks require 5 runs to ensure 90% of entries from the same     model and training operations are mathematically equiva-\r\n               system are within 5%; all other tasks require 10 runs to        lent to the reference. Furthermore, MLPerf uses reference\r\n               ensure 90%ofentries from the same system are within 10%.        implementations to establish the required quality thresholds.\r\n               MLPerf drops the fastest and slowest times, reporting the       MLPerfrules specify the modi\ufb01able hyperparameters (Ta-\r\n               arithmetic mean of the remaining runs as the result.            ble 2) as well as restrictions on their modi\ufb01cation. These\r\n               3.3  Choice of Quality Thresholds                               restrictions are intended to balance the need to tune for dif-\r\n                                                                               ferent systems with limiting the size of the hyperparamter\r\n               For each benchmark, we chose quality metrics near the state     search space to be fair to submitters with smaller compute\r\n               of the art for the corresponding model and data set (Table 1),  resources. For example, to accommodate a wide range of\r\n               basing our choice on experiments with the reference imple-      training-system scales, submissions must be able to adjust\r\n               mentations. Someofthesethresholdsareslightlylowerthan           the minibatch size used by SGD in order to showcase maxi-\r\n               results in the literature, enabling us to benchmark across      mumsystemef\ufb01ciency(this approach is similar in concept\r\n               software frameworks and to ensure that training sessions        to the Top500 LINPACK benchmark, which allows systems\r\n                                                                               to choose the problem size). To ensure that training still\r\n                                                              MLPerfTrainingBenchmark\r\n                       Table 2. MLPerf modi\ufb01able hyperparameters.              rounds of the MLPerf benchmark: v0.5 and v0.6. The time\r\n                                                                               between rounds is about a few months, allowing us to up-\r\n                        Model            Modi\ufb01ableHyperparmeters               date the suite after each one. 
Every round has a submission\r\n                                                                               and review period followed by publication of results.\r\n                   All that use SGD    Batch size, Learning-rate schedule\r\n                                                  parameters                   4.1   Submission and Review\r\n                    ResNet-50 v1.5\r\n                                         Maximumsamplespertraining             An MLPerf submission consists of a system description,\r\n                   SSD-ResNet-34                     patch                     training-session log \ufb01les, and all code and libraries required\r\n                    MaskR-CNN             Numberofimagecandidates              to reproduce the training sessions. All of this information is\r\n                                                                               publicly available on the MLPerf GitHubsite, along with the\r\n                                          Learning-rate decay function,        MLPerfresults, allowing for reproducibility and enabling\r\n                       GNMT             Learning rate, Decay start, Decay      the community to improve the results in subsequent rounds.\r\n                                       interval, Warmup function, Warmup       Asystemdescription includes both the hardware (number\r\n                                                     steps                     of nodes, processor and accelerator counts and types, stor-\r\n                                        Optimizer: Adam (Kingma & Ba,          age per node, and network interconnect) and the software\r\n                     Transformer       2015) or Lazy Adam, Learning rate,      (operating system as well as libraries and their versions).\r\n                                                Warmupsteps\r\n                                        Optimizer: Adam or Lazy Adam,          Atraining-session log \ufb01le contains a variety of structured\r\n                        NCF                  Learning rate, \u03b21, \u03b22             information including time stamps for important workload\r\n                    Go(9x9board)                                               stages, quality-metric evaluations at prescribed intervals,\r\n                                                                               and hyperparameter choices. These logs are the foundation\r\n                                                                               for analyzing results.\r\n               convergestotherequiredthreshold,otherhyperparameters\u2014           Before publishing results, submissions are peer-reviewed\r\n               such as the learning rate schedule\u2014may need adjustment to       for compliance with MLPerf rules. Submitters receive noti-\r\n               match. For example, a common ResNet training practice is        \ufb01cation of noncompliance, where applicable, and they may\r\n               to to increase the learning rate linearly with the minibatch    resubmit after addressing any such problems. Additionally,\r\n               size (Goyal et al., 2017). 
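A minimal sketch of that linear-scaling heuristic (illustrative values only; MLPerf itself constrains which hyperparameters may be tuned, not this particular formula):

    def scaled_learning_rate(base_lr, base_batch, batch, warmup_epochs=5, epoch=None):
        # Linearly scale the learning rate with the minibatch size (Goyal et al., 2017),
        # optionally ramping up over the first few epochs, as is common at large scale.
        lr = base_lr * batch / base_batch
        if epoch is not None and epoch < warmup_epochs:
            lr *= (epoch + 1) / warmup_epochs
        return lr

    # Example: a schedule tuned at batch 256 with lr 0.1, run at batch 4096.
    print(scaled_learning_rate(0.1, 256, 4096))           # 1.6
    print(scaled_learning_rate(0.1, 256, 4096, epoch=0))  # warmup: 0.32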
Although these hyperparameter        we permit some hyperparameter borrowing as described\r\n               searches are a common ML task, MLPerf\u2019s focus is on sys-        earlier during this period.\r\n               tem optimization rather than hyperparameter exploration         4.2   Reporting Results\r\n               andwedonotwanttopenalizesubmitterswhoareunableto\r\n               do extensive searches. Therefore we restrict hyperparamter      Each MLPerf submission has several labels: division (open\r\n               tuning to subset of all possible parameters and values.         or closed), category (available, preview, or research), and\r\n               Further, we allow \u201chyperparameter borrowing\u201d during the         system type (on-Premises or cloud).\r\n               post-submission review process in which one submitter           4.2.1  Submission Divisions\r\n               may adopt another submitter\u2019s hyperparamters for a spe-\r\n               ci\ufb01c benchmark and resubmit their result (with no other         MLPerf has two submission divisions: closed and open.\r\n               hardware or software changes allowed). In the \ufb01rst two          Both require that submissions employ the same data set and\r\n               rounds, hyperparameter borrowing was used successfully to       quality metric as the corresponding reference implementa-\r\n               improve several submissions indicating hyperparamters are       tion.\r\n               somewhatportable. Typically borrowing occured across sys-       Theclosed division is intended for direct system compari-\r\n               tems of similiar scale, but did result in convergence across    son, soit strives to ensure workload equivalencebyrequiring\r\n               different numerics (FP16, b\ufb02oat16, and FP32), architec-         that submissions be equivalent to reference implementations.\r\n               tures (CPU, GPU, and TPU), and software implementations         Equivalence includes mathematically identical model imple-\r\n               (TF, cuDNN, and MKL-DNN). MLPerf working groups re-             mentations, parameter initialization, optimizer and training\r\n               view the hyperparameter choices and requirements for each       schedules, and data processing and traversal. To ensure\r\n               benchmark round to account for advances in training ML          fairness, this division also restricts hyperparameter modi\ufb01-\r\n               models at scale.                                                cation.\r\n               4   BENCHMARKINGPROCESS                                         The open division is intended to encourage innovative so-\r\n                                                                               lutions of important practical problems and to encourage\r\n               Next, we outline the process for submission and review          hardware/software co-design. It allows submissions to em-\r\n               (\u00a7 4.1) and for reporting results (\u00a7 4.2) to account for inno-  ploy model architectures, optimization procedures, and data\r\n               vative solutions, availability, and scale. We have run two      augmentations that differ from the reference implementa-\r\n                                                             MLPerfTrainingBenchmark\r\n              tions.                                                                   
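The training-session log files described in § 4.1 are the raw material for this review. Purely for illustration, here is a sketch of the kind of structured, timestamped entries such a log might contain (the field names are invented for this example; the actual logging conventions are set by the MLPerf submission rules):

    import json
    import time

    def log_event(key, value=None, metadata=None):
        # Emit one structured log line: a timestamp, an event key, and an optional payload.
        record = {"time_ms": int(time.time() * 1000), "key": key,
                  "value": value, "metadata": metadata or {}}
        print(json.dumps(record))

    # Hypothetical events a run might record for later review:
    log_event("run_start")
    log_event("hyperparameters", {"batch_size": 4096, "learning_rate": 1.6})
    log_event("eval_accuracy", 0.749, {"epoch": 63})
    log_event("run_stop", metadata={"status": "success"})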
4.2.2   System Categories

To allow for a broad range of research and industry systems, we defined three submission categories: available, preview, and research. These categories encourage novel techniques and systems (e.g., from academic researchers), but they also distinguish between shipping products and proof-of-concept or early engineering samples.

The available category imposes requirements on both hardware and software availability. Hardware must be either available for third-party rental on a cloud service or, in the case of on-premises equipment, available for purchase. Supply and lead times for renting or purchasing should befit the system scale and company size. To ensure that benchmark submissions are widely consumable and to discourage benchmark-specific engineering, we also require that software in this category be versioned and supported for general use.

Preview systems contain components that meet the available-category criteria within 60 days of the submission date or by the next submission cycle, whichever is later. Any preview system must also be submitted to the available category by that time.

Research submissions contain components unintended for production. An example is an academic-research prototype designed as a proof of concept rather than a robust product. This category also includes systems that are built from production hardware and software but are larger in scale than available-category configurations.

Figure 3. Speedup in the fastest 16-chip entry from MLPerf version v0.5 to v0.6 for various benchmarks common to both (Figure 3a), along with quality-target increases (Figure 3b). Panel (a) plots the v0.5-to-v0.6 speedup for ResNet-50, SSD, Mask R-CNN, GNMT, and Transformer; panel (b) lists the quality targets:

    Model          Metric              v0.5          v0.6
    ResNet-50      Top-1 accuracy      74.9          75.9
    SSD            mAP                 21.2          23
    Mask R-CNN     Box/Mask min AP     37.7 / 39.9   Same
    GNMT           Sacre BLEU          21.8          24
    Transformer    BLEU                25            Same

4.2.4   Reporting Scores

An MLPerf results report provides the time to train for each benchmark. Although a single summary score that spans the entire suite may be desirable for system comparisons, it is unsuited to MLPerf for two main reasons. First, a summary score implies some weighting of individual benchmark scores. Given the diversity of system users and the
4.2.3  Reporting Scale

Modern ML training spans multiple orders of magnitude in system power draw and cost. Therefore, comparisons are more useful if the reported performance includes the scale. A common scale metric, such as cost or power, is not definable across a wide range of systems (cloud, on-premises, and preproduction), so it requires differentiation by system type.

In the first two MLPerf rounds, we included the system configuration (number of processors and/or accelerators) alongside the performance scores. For on-premises systems, future versions will include a power-measurement specification. For cloud systems, we derived a "cloud-scale" metric from the number of host processors, the amount of host memory, and the number and type of accelerators. We empirically verified that cloud scale correlates closely with cost across three major cloud providers. Reporting of these scale metrics was optional in MLPerf v0.5 and v0.6.

4.2.4  Reporting Scores

An MLPerf results report provides the time to train for each benchmark. Although a single summary score that spans the entire suite may be desirable for system comparisons, it is unsuited to MLPerf for two main reasons. First, a summary score implies some weighting of individual benchmark scores. Given the diversity of system users and the wide range of applications that MLPerf covers, no weighting scheme is universally representative. Second, a summary score becomes less meaningful if a submitter declines to report results on all benchmarks. Submitters can have multiple reasons for omitting some benchmarks: not all are practical at every system scale (for example, some models are untrainable at the minibatch sizes that the largest systems require for data-parallel training). Additionally, some processors may target only certain applications.
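The cloud-scale metric is described only qualitatively above, so the following is a hypothetical sketch of how host processors, host memory, and accelerators might be folded into one scale number; the weights and the accelerator table are invented purely for illustration and are not the formula MLPerf used.

```python
# Hypothetical "cloud-scale"-style metric. MLPerf derives its metric from
# host-processor count, host memory, and accelerator count/type, but the exact
# combination is not given here, so these coefficients are made up for illustration.
ACCELERATOR_WEIGHT = {"v100": 8.0, "tpu_v3": 8.0, "cpu_only": 0.0}  # assumed

def cloud_scale(host_processors: int, host_memory_gib: int,
                accelerators: int, accelerator_type: str) -> float:
    """Combine system resources into a single rough scale number."""
    return (1.0 * host_processors
            + 0.01 * host_memory_gib
            + ACCELERATOR_WEIGHT[accelerator_type] * accelerators)

# Example: compare two cloud instances by approximate scale.
print(cloud_scale(host_processors=64, host_memory_gib=488,
                  accelerators=8, accelerator_type="v100"))
```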
5  RESULTS

MLPerf, like all benchmarks, aims to encourage innovation through constructive competition; we measure progress by comparing results across submission rounds. We have conducted two MLPerf Training rounds thus far: v0.5 and v0.6. They were six months apart, and the underlying hardware systems were unchanged. Results from submissions that were unmodified or only slightly modified between rounds show that MLPerf is driving rapid performance and scaling improvement in both the implementations and the software stacks.

Figure 3 shows that between the two submission rounds, the best performance results for a 16-chip system increased by an average of 1.3x despite the higher quality targets. Figure 4 reveals that the number of chips necessary to produce the best overall performance result increased by an average of 5.5x. Some of this improvement owes to better benchmark implementations and some to rule changes, such as allowing the LARS (You et al., 2017) optimizer for large ResNet batches. But we believe submitters incorporated much of the performance and scaling improvements into the underlying software infrastructure and passed them on to users. We expect MLPerf to drive similar improvements through focused hardware innovation.

Figure 4. Number of chips necessary to produce the fastest time to solution for MLPerf versions v0.5 to v0.6 (bar chart over ResNet-50, SSD, Mask R-CNN, GNMT, and Transformer; y-axis is number of chips, 0 to 1,000). This number increased by as much as 5.5x.
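To show how a cross-round comparison like Figure 3a can be summarized, the sketch below computes per-benchmark speedups from time-to-train values and aggregates them with a geometric mean; the minute values are placeholders rather than actual MLPerf results, and the choice of mean is our assumption.

```python
# Sketch of summarizing a cross-round comparison: per-benchmark speedup is
# time(v0.5) / time(v0.6); the aggregate here uses a geometric mean (assumed).
from math import prod

times_v05 = {"resnet50": 80.0, "ssd": 60.0, "gnmt": 40.0}   # minutes, illustrative
times_v06 = {"resnet50": 60.0, "ssd": 45.0, "gnmt": 32.0}

speedups = {b: times_v05[b] / times_v06[b] for b in times_v05}
geo_mean = prod(speedups.values()) ** (1.0 / len(speedups))

for bench, s in speedups.items():
    print(f"{bench}: {s:.2f}x")
print(f"geometric-mean speedup: {geo_mean:.2f}x")
```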
6  CONCLUSIONS

MLPerf Training is a suite of ML benchmarks that represent both industrial and academic use cases. In addition to being the only widely used ML-training benchmark suite boasting such coverage, it has made the following contributions:

   • Precise definition of model architectures and training procedures for each benchmark. This feature enables system comparisons for equivalent workloads, whereas previous results often involved substantially different variants of a given model (for example, ResNet-50 has at least five variants).

   • Reference implementations and rule definitions to address the challenges unique to benchmarking ML training. These challenges include the stochastic nature of training processes, the necessity of training to completion to determine the quality impact of performance optimizations, and the need for workload variation at different system scales (§ 2.1).

Although MLPerf focuses on relative system performance, as the online results demonstrate, it also offers general lessons about ML and benchmarking:
   • Realistic data-set size is critical to ensuring realistic memory-system behavior; for example, the initial NCF data set was too small and could reside entirely in memory. Furthermore, when benchmarking data sets that are smaller than industrial scale, training time should exclude the startup time, which would be proportionally less in actual use.

   • Small hyperparameter changes can produce considerable performance changes. But, based on our experience with hyperparameter borrowing, hyperparameters are relatively portable at similar system scales, even across architectures, numerics, or software stacks.

   • Frameworks exhibit subtle optimizer-algorithm variations that affect convergence.

ML is an evolving field, however, and we have much more to learn. To keep pace, MLPerf establishes a process to maintain and update the suite. For example, MLPerf v0.6 includes several updates: the ResNet-50 benchmark added LARS (You et al., 2017), GNMT's model architecture improved to increase translation quality, and the MiniGo reference switched from Python to C++ to increase performance. The MLPerf organization welcomes input and contributions: https://mlperf.org/get-involved

ACKNOWLEDGEMENTS

In this section, we acknowledge all those who helped produce the first set of results or supported the overall benchmark development.

Intel: Cong Xu, Deng Xu, Feng Tian, Haihao Shen, Mingxiao Huang, Rachita Prem Seelin, Teng Lu, Xin Qiu, and Zhongyuan Wu.

Facebook: Maxim Naumov, Dheevatsa Mudigere, Mustafa Ozdal, Misha Smelyanskiy, Joe Spisak, Sy Choudhury, and Brian Gamidos.

Stanford: Work at Stanford received support in part from affiliate members and other Stanford DAWN project participants (Ant Financial, Facebook, Google, Infosys, NEC, and VMware) as well as Toyota Research Institute, Northrop Grumman, Cisco, SAP, NSF CAREER grant CNS-1651570, and NSF Graduate Research Fellowship grant DGE-1656518. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Harvard: Work at Harvard received partial support from the Applications Driving Architectures (ADA) Research Center, a JUMP Center cosponsored by SRC and DARPA, NSF CCF #1704834, and Intel Corporation. We would also like to thank Brandon Reagen.

University of Toronto: Work at the University of Toronto received partial support from an NSERC Discovery grant, the Canada Foundation for Innovation JELF grant, the Connaught Fund, and Huawei grants.

REFERENCES

AIMatrix. URL https://aimatrix.ai.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, volume 16, pp. 265-283, 2016.

Adolf, R., Rama, S., Reagen, B., Wei, G.-Y., and Brooks, D. Fathom: Reference Workloads for Modern Deep Learning Methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on, pp. 1-10. IEEE, 2016.

Amodei, D. and Hernandez, D. AI and Compute, 2018. URL https://blog.openai.com/ai-and-compute/.

Auer, P., Herbster, M., and Warmuth, M. K. Exponentially Many Local Minima for Single Neurons. In Advances in neural information processing systems, pp. 316-322, 1996.

Bai, J., Lu, F., Zhang, K., et al. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2019.
Baidu. DeepBench: Benchmarking Deep Learning Operations on Different Hardware. https://github.com/baidu-research/DeepBench, 2017.

Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable Methods for 8-bit Training of Neural Networks. In Advances in Neural Information Processing Systems, pp. 5145-5153, 2018.

Belletti, F., Lakshmanan, K., Krichene, W., Chen, Y.-F., and Anderson, J. Scalable Realistic Recommendation Datasets through Fractal Expansions. arXiv preprint arXiv:1901.08910, 2019.

Ben-Nun, T., Besta, M., Huber, S., Ziogas, A. N., Peter, D., and Hoefler, T. A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning. arXiv preprint arXiv:1901.10183, 2019.

Chan, B. OpenAI Five, Jun 2018. URL https://openai.com/blog/openai-five/.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. arXiv preprint arXiv:1512.01274, 2015.

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578-594, 2018.

Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., et al. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7-10. ACM, 2016.

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. CuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759, 2014.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The Loss Surfaces of Multilayer Networks. In Artificial Intelligence and Statistics, pp. 192-204, 2015.

Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. DAWNBench: An End-to-End Deep Learning Benchmark and Competition. NIPS ML Systems Workshop, 2017.

Coleman, C., Kang, D., Narayanan, D., Nardi, L., Zhao, T., Zhang, J., Bailis, P., Olukotun, K., Ré, C., and Zaharia, M. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark. ACM SIGOPS Operating Systems Review, 53(1):14-25, 2019.

Council, T. P. P. Transaction Processing Performance Council. Web site, http://www.tpc.org, 2005.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-scale Hierarchical Image Database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

Dixit, K. M. The SPEC Benchmarks. Parallel computing, 17(10-11):1195-1209, 1991.

Dongarra, J. The LINPACK Benchmark: An Explanation. In Proceedings of the 1st International Conference on Supercomputing, pp. 456-474, London, UK, 1988. Springer-Verlag. ISBN 3-540-18991-2. URL http://dl.acm.org/citation.cfm?id=647970.742568.

Google. TensorFlow Benchmarks. https://www.tensorflow.org/performance/benchmarks, 2017.
Gori, M. and Tesi, A. On the Problem of Local Minima in Backpropagation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (1):76-86, 1992.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.

GroupLens. MovieLens 20M Dataset, Oct 2016. URL https://grouplens.org/datasets/movielens/20m/.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual Networks. In European conference on computer vision, pp. 630-645. Springer, 2016b.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961-2969, 2017a.

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., and Chua, T.-S. Neural Collaborative Filtering. In Proceedings of the 26th international conference on world wide web, pp. 173-182. International World Wide Web Conferences Steering Committee, 2017b.

Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Elsevier, 2011.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Kingsbury, B., et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal processing magazine, 29, 2012.

Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., and Murphy, K. Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors, 2016.

Intel. BigDL: Distributed Deep Learning Library for Apache Spark, 2019. URL https://github.com/intel-analytics/BigDL.

Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., Xie, L., Guo, Z., Yang, Y., Yu, L., et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv preprint arXiv:1807.11205, 2018.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional Architecture for Fast Feature Embedding. In ACM International Conference on Multimedia, pp. 675-678. ACM, 2014.

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1-12. IEEE, 2017.

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. ICLR, 2015.

Krizhevsky, A. One Weird Trick for Parallelizing Convolutional Neural Networks, 2014.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in neural information processing systems, pp. 1097-1105, 2012.

Köster, U., Webb, T. J., Wang, X., Nassar, M., Bansal, A. K., Constable, W. H., Elibol, O. H., Gray, S., Hall, S., Hornof, L., Khosrowshahi, A., Kloss, C., Pai, R. J., and Rao, N. Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. NIPS, 2017.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pp. 740-755. Springer, 2014.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. SSD: Single Shot Multibox Detector. In European conference on computer vision, pp. 21-37. Springer, 2016.

Markidis, S., Der Chien, S. W., Laure, E., Peng, I. B., and Vetter, J. S. NVIDIA Tensor Core Programmability, Performance & Precision. arXiv preprint arXiv:1803.04014, 2018.

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed Precision Training. In Proceedings of the International Conference on Learning Representations, 2018.

Mikami, H., Suganuma, H., U-chupala, P., Tanaka, Y., and Kageyama, Y. Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash. arXiv preprint arXiv:1811.05233, 2018.

MLPerf. MLPerf Reference: MiniGo. https://github.com/mlperf/training/tree/master/reinforcement, 2019a.

MLPerf. MLPerf Reference: ResNet in TensorFlow. https://github.com/mlperf/training/tree/master/image_classification/tensorflow/official, 2019b.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, 2013.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1-15, 2019.

Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. Deep Learning Recommendation Model for Personalization and Recommendation Systems. arXiv preprint arXiv:1906.00091, 2019.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic Differentiation in PyTorch. 2017.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8), 2019.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature, 529(7587):484, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the Game of Go without Human Knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play. Science, 362(6419):1140-1144, 2018.

Sun, P., Feng, W., Han, R., Yan, S., and Wen, Y. Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes. arXiv preprint arXiv:1902.06855, 2019.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All You Need. In Advances in neural information processing systems, pp. 5998-6008, 2017.

WMT. First Conference on Machine Translation, 2016. URL http://www.statmt.org/wmt16/.

WMT. Second Conference on Machine Translation, 2017. URL http://www.statmt.org/wmt17/.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.

Ying, C., Kumar, S., Chen, D., Wang, T., and Cheng, Y. Image Classification at Supercomputer Scale. arXiv preprint arXiv:1811.06992, 2018.

You, Y., Gitman, I., and Ginsburg, B. Large Batch Training of Convolutional Networks. arXiv preprint arXiv:1708.03888, 2017.

Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep Interest Network for Click-through Rate Prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059-1068. ACM, 2018.

Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained Ternary Quantization. arXiv preprint arXiv:1612.01064, 2016.

Zhu, H., Akrout, M., Zheng, B., Pelegris, A., Jayarajan, A., Phanishayee, A., Schroeder, B., and Pekhimenko, G. Benchmarking and Analyzing Deep Neural Network Training. In 2018 IEEE International Symposium on Workload Characterization (IISWC), pp. 88-100. IEEE, 2018.
A  ARTIFACT APPENDIX

A.1  Abstract

This artifact description contains information about the complete workflow to reproduce Nvidia's v0.5 image classification submissions to MLPerf. We describe how to run this submission on a single-node DGX-1 system. More details for DGX-2 and multi-node systems are provided in the official MLPerf results repositories:

   • Nvidia's v0.5 ResNet-50 submissions

Results from other tasks and submitters are also available:

   • MLPerf v0.5 training results
   • MLPerf v0.6 training results

However, these results have not been independently verified for reproducibility.
Please see the MLPerf website (https://mlperf.org/) for the most up-to-date information and feel free to report issues on Github.

A.2  Artifact check-list (meta-information)

   • Algorithm: Image classification ResNet-50 CNN
   • Program: MLPerf (https://mlperf.org/)
   • Compilation: nvidia-docker
   • Model: ResNet-50 v1.5 (https://github.com/mlperf/training/tree/master/image_classification/tensorflow/official)
   • Dataset: ImageNet (http://image-net.org/)
   • Hardware: NVIDIA DGX-1 or DGX-2
   • Metrics: Time-to-Train: minutes to reach the accuracy threshold (74.9% Top-1 for v0.5)
   • Output: MLPerf-compliant log file with timestamps and evaluation accuracy. Execution ends once the accuracy threshold is reached.
   • Experiments: shell script included with the code (./run.sub)
   • How much disk space required (approximately)?: 300 GB
   • How much time is needed to prepare workflow (approximately)?: 2 hours
   • How much time is needed to complete experiments (approximately)?: 8 hours
   • Publicly available: Yes
   • Code licenses: Apache License 2.0
   • Workflow framework used?: MXNet
   • Archived (provide DOI)?: http://doi.org/10.5281/zenodo.3610717

A.3  Description

A.3.1  How to access

MLPerf v0.5 training results on Github: https://github.com/mlperf/training_results_v0.5.

A.4  Installation

See the README.md for Nvidia's v0.5 ResNet-50 submission: https://github.com/mlperf/training_results_v0.5/tree/master/v0.5.0/nvidia/submission/code/image_classification/mxnet/README.md.

A.5  Evaluation and expected result

Time-to-Train: 134.6 minutes.
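As a rough illustration of how the Time-to-Train figure above can be extracted from a run's output, the sketch below measures the span between a start event and the event that reports reaching the accuracy threshold; the line format and event names are hypothetical and do not reflect the actual MLPerf compliance-log schema.

```python
# Minimal sketch of computing time-to-train from a timestamped run log. The log
# format here (epoch-seconds plus event name per line) is hypothetical; real
# MLPerf compliance logs use their own structured format.
def time_to_train_minutes(log_lines):
    """Return minutes between the run-start event and the first event that
    reports the target accuracy being reached."""
    start = stop = None
    for line in log_lines:
        timestamp, event = line.split(maxsplit=1)
        if event.startswith("run_start") and start is None:
            start = float(timestamp)
        if event.startswith("target_accuracy_reached"):
            stop = float(timestamp)
            break
    if start is None or stop is None:
        raise ValueError("log does not contain both start and stop events")
    return (stop - start) / 60.0

log = ["1554000000.0 run_start",
       "1554008076.0 target_accuracy_reached top1=0.749"]
print(time_to_train_minutes(log))  # 134.6 minutes for this made-up log
```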
{"given_name": "Deepak", "family_name": "Narayanan", "institution": "Stanford"}, {"given_name": "Tayo", "family_name": "Oguntebi", "institution": "Google LLC"}, {"given_name": "Gennady", "family_name": "Pekhimenko", "institution": "University of Toronto"}, {"given_name": "Lillian", "family_name": "Pentecost", "institution": "Harvard University"}, {"given_name": "Vijay", "family_name": "Janapa Reddi", "institution": "Harvard University"}, {"given_name": "Taylor", "family_name": "Robie", "institution": "Google"}, {"given_name": "Tom", "family_name": "St John", "institution": "Tesla"}, {"given_name": "Carole-Jean", "family_name": "Wu", "institution": "Facebook AI"}, {"given_name": "Lingjie", "family_name": "Xu", "institution": "Alibaba"}, {"given_name": "Cliff", "family_name": "Young", "institution": "google.com"}, {"given_name": "Matei", "family_name": "Zaharia", "institution": "Stanford and Databricks"}]}