{"title": "Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc", "book": "Proceedings of Machine Learning and Systems", "page_first": 187, "page_last": 198, "abstract": "Graph neural networks (GNNs) have been demonstrated to be an effective model for learning tasks related to graph structured data.\nDifferent from classical deep neural networks which handle relatively small individual samples, GNNs process very large graphs, which must be partitioned and processed in a distributed manner.\nWe present Roc, a distributed multi-GPU framework for fast GNN training and inference on graphs.\nRoc is up to 4.6x faster than existing GNN frameworks on a single machine, and can scale to multiple GPUs on multiple machines.\nThis performance gain is mainly enabled by Roc's graph partitioning and memory management optimizations.\nBesides performance acceleration, the better scalability of Roc also enables the exploration of more sophisticated GNN architectures on large, real-world graphs.\nWe demonstrate that a class of GNN architectures significantly deeper and larger than the typical two-layer models can achieve new state-of-the-art classification accuracy on the widely used Reddit dataset.", "full_text": "IMPROVING THE ACCURACY, SCALABILITY, AND PERFORMANCE OF GRAPH NEURAL NETWORKS WITH ROC\r\n\r\nZhihao Jia^1, Sina Lin^2, Mingyu Gao^3, Matei Zaharia^1, Alex Aiken^1\r\n^1 Stanford University, ^2 Microsoft, ^3 Tsinghua University. Correspondence to: Zhihao Jia <zhihao@cs.stanford.edu>.\r\nProceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).\r\n\r\n
ABSTRACT\r\nGraph neural networks (GNNs) have been demonstrated to be an effective model for learning tasks related to graph structured data. Different from classical deep neural networks that handle relatively small individual samples, GNNs process very large graphs, which must be partitioned and processed in a distributed manner. We present ROC, a distributed multi-GPU framework for fast GNN training and inference on graphs. ROC is up to 4\u00d7 faster than existing GNN frameworks on a single machine, and can scale to multiple GPUs on multiple machines. This performance gain is mainly enabled by ROC\u2019s graph partitioning and memory management optimizations. Besides performance acceleration, the better scalability of ROC also enables the exploration of more sophisticated GNN architectures on large, real-world graphs. We demonstrate that a class of GNN architectures significantly deeper and larger than the typical two-layer models can achieve new state-of-the-art classification accuracy on the widely used Reddit dataset.\r\n\r\n
1 INTRODUCTION\r\nGraphs provide a natural way to represent real-world data with relational structures, such as social networks, molecular networks, and webpage graphs. Recent work has extended deep neural networks (DNNs) to extract high-level features from data sets structured as graphs, and the resulting architectures, known as graph neural networks (GNNs), have recently achieved state-of-the-art prediction performance across a number of graph-related tasks, including vertex classification, graph classification, and link prediction (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2019).\r\n\r\n
GNNs combine DNN operations (e.g., convolution and matrix multiplication) with iterative graph propagation: in each GNN layer, the activations of each vertex are computed with a set of DNN operations, using the activations of its neighbors from the previous GNN layer as inputs. Figure 1 illustrates the computation of one vertex (in red) in a GNN layer, which aggregates the activations from its neighbors (in blue), and then applies DNN operations to compute new activations of the vertex.\r\n\r\n
Figure 1. Computation of one vertex (in red) in a GNN layer by first aggregating its neighbors\u2019 activations (in blue), and then applying DNN operations. (Panels: Neighbor Aggregation; DNN Operations.)\r\n\r\n
Existing deep learning frameworks do not easily support GNN training and inference at scale. TensorFlow (Abadi et al., 2016), PyTorch (PyTorch), and Caffe2 (Caffe2) were originally designed to handle situations where the model and data collection can be large, but each sample of the collection is relatively small (e.g., a single image). These systems typically leverage data and/or model parallelism by partitioning the batch of input samples or the DNN models across multiple devices, such as GPUs, while each input sample is still stored on a single GPU and not partitioned. However, GNNs typically use small DNN models (a couple of layers) on very large and irregular input samples: graphs. These large graphs do not fit in a single device and so must be partitioned and processed in a distributed manner. Recent GNN frameworks such as DGL (DGL, 2018) and PyG (Fey & Lenssen, 2019) are implemented on top of PyTorch (PyTorch), and have the same scalability limitation. NeuGraph (Ma et al., 2019) stores intermediate GNN data in the host CPU DRAM to support multi-GPU training, but it is still limited to the compute resources of a single machine. AliGraph (Yang, 2019) is a distributed GNN framework on CPU platforms, which does not exploit GPUs for performance acceleration.\r\n\r\n
The current lack of system support has limited the potential application of GNN algorithms on large-scale graphs, and has also prevented the exploration of larger and more sophisticated GNN architectures. To alleviate these limitations, various sampling techniques (Hamilton et al., 2017; Ying et al., 2018) were introduced to first down-sample the original graphs before applying the GNN models, so that the data fit in a single device. Sampling allows existing frameworks to train on larger graphs at the cost of potential model accuracy loss (Hamilton et al., 2017).\r\n\r\n
In this paper, we propose ROC, a distributed multi-GPU framework for fast GNN training and inference on large-scale graphs. ROC leverages the compute resources of multiple GPUs on multiple compute nodes to train large GNN models on the full real-world graphs, achieving up to 4\u00d7 the performance of existing GNN frameworks. Despite its use of full graphs, ROC also achieves better time-to-accuracy performance compared to existing sampling techniques. Moreover, the better scalability allows ROC to easily support larger and more sophisticated GNNs than those possible in existing frameworks. To demonstrate ROC\u2019s scalability and improved accuracy, we design a class of deep GNN architectures by stacking multiple GCN layers (Kipf & Welling, 2016). By using significantly larger and deeper GNN architectures, we improve the classification accuracy over state-of-the-art sampling techniques by 1.5% on the widely used Reddit dataset (Hamilton et al., 2017).\r\n\r\n
To achieve these results, ROC tackles two significant system challenges for distributed GNN computation.\r\n\r\n
Graph partitioning. Real-world graphs could have arbitrary sizes and variable per-vertex computation loads, which are challenging to partition in a balanced way (Gonzalez et al., 2014; Zhu et al., 2016). GNNs mix compute-intensive DNN operations with data-intensive graph propagation, making it hard to statically compute a good load-balancing partitioning. Furthermore, GNN inference requires partitioning new input graphs that only run for a few iterations, such as predicting the properties of newly discovered proteins (Hamilton et al., 2017), in which case existing dynamic repartitioning approaches do not work well (Venkataraman et al., 2013). ROC uses an online linear regression model to optimize graph partitioning. During the training phase of a GNN architecture, ROC learns a cost model for predicting the execution time of performing a GNN operation on an input (sub)graph. To capture the runtime performance of a GNN operation, the cost model includes both graph-related features such as the number of vertices and edges in the graph, and hardware-related features such as the number of GPU memory accesses to perform the operation. During each training iteration of a GNN architecture, ROC computes a graph partitioning using the run time predictions from the cost model, and uses the graph partitioning to parallelize training. At the end of each training iteration, the actual run time of the subgraphs is sent back to the ROC graph partitioner, which updates the cost model by minimizing the difference between the actual and predicted run times. We show that this linear regression-based graph partitioner outperforms existing static and dynamic graph partitioning strategies by up to 1.4\u00d7.\r\n\r\n
Memory management. In GNNs, computing even a single vertex requires accessing a potentially large number of neighbor vertices that may span multiple GPUs and compute nodes. These data transfers have a high impact on overall performance. The framework thus must carefully decide in which device memory (CPU or GPU) to store each intermediate tensor, in order to minimize data transfer costs. The memory management is hard to optimize manually as the optimal strategy depends on the input graph size and topology as well as the device constraints such as memory capacity and communication bandwidth. We formulate the task of optimizing data transfers as a cost minimization problem, and introduce a dynamic programming algorithm to quickly find a globally optimal strategy that minimizes data transfers between CPU and GPU memories. We compare the ROC memory management algorithm with existing heuristic approaches (Ma et al., 2019), and show that ROC reduces data transfer costs between CPU and GPU by 2\u00d7.\r\n\r\n
Overall, compared to NeuGraph, ROC improves the runtime performance by up to 4\u00d7 for multi-GPU training on a single compute node. Beyond improved partitioning and memory management, ROC sees other smaller performance improvements from a more efficient distributed runtime (Jia et al., 2019) and the highly optimized kernels adopted from Lux for fast graph propagation on GPUs (Jia et al., 2017).\r\n\r\n
Besides performance acceleration, ROC also enables exact GNN computation on full original graphs without using sampling techniques, as well as the exploration of more sophisticated GNN architectures beyond the commonly used two-layer models. For large real-world graphs, we show that performing exact GNN computation on the original graphs and using larger and deeper GNN architectures can increase the model accuracy by up to 1.5% on the widely used Reddit dataset compared to existing sampling techniques.\r\n\r\n
To summarize, our contributions are:\r\n\r\n
\u2022 On the systems side, we present ROC, a distributed multi-GPU framework for fast GNN training and inference on large-scale graphs. ROC uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost.\r\n
\u2022 On the machine learning side, ROC removes the necessity of using sampling techniques for GNN training on large graphs, and also enables the exploration of more sophisticated GNN architectures. We demonstrate this potential by achieving new state-of-the-art classification accuracy on the Reddit dataset.\r\n\r\n
2 BACKGROUND AND RELATED WORK\r\n\r\n2.1 Graph Neural Networks\r\nA GNN takes graph-structured data as input, and learns a representation vector for each vertex in the graph. The learned representation can be used for down-stream tasks such as vertex classification, graph classification, and link prediction (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2019).\r\n\r\n
As shown in Figure 1, each GNN layer gathers the activations of the neighbor vertices from the previous GNN layer, and then updates the activations of the vertex, using DNN operations such as convolution or matrix multiplication. Formally, the computation in a GNN layer is:\r\n\r\n
$$a_v^{(k)} = \\mathrm{AGGREGATE}^{(k)}\\left(\\{ h_u^{(k-1)} \\mid u \\in N(v) \\}\\right) \\quad (1)$$\r\n
$$h_v^{(k)} = \\mathrm{UPDATE}^{(k)}\\left(a_v^{(k)}, h_v^{(k-1)}\\right) \\quad (2)$$\r\n\r\n
where $h_v^{(k)}$ is the learned activation of vertex v at the k-th layer, $h_v^{(0)}$ denotes the input features of v, and $N(v)$ denotes v\u2019s neighbors in the graph. For each vertex, AGGREGATE gathers the activations of its neighbors using an accumulation function such as average or summation. For each vertex v, UPDATE computes its new activations $h_v^{(k)}$ by combining its previous activations $h_v^{(k-1)}$ and the neighborhood aggregation $a_v^{(k)}$. The activations of the last layer $h_v^{(K)}$ capture the structural information for all neighbors within K hops of v, and can be used as the input for down-stream prediction tasks.\r\n\r\n
2.2 Related Work\r\nDistributed DNN training. In the terminology of Jia et al. (2019), DNN computations can be partitioned in the sample, operator, attribute and parameter dimensions for parallel and distributed execution. The vast majority of existing deep learning frameworks (Abadi et al., 2016; PyTorch) use the sample (i.e., data parallelism) and operator dimensions (i.e., model parallelism) to parallelize training, but some recent works exploit multiple dimensions (Jia et al., 2019). One of the key differences with GNNs is that partitioning in the attribute dimension (i.e., partitioning large individual samples) is necessary for supporting GNN training on large graphs. The lack of system support for parallelizing in the attribute dimension prevents most existing DNN frameworks from training GNNs on large graphs.\r\n\r\n
GNN frameworks. Most of the existing GNN frameworks, such as DGL (DGL, 2018) and PyG (Fey & Lenssen, 2019) that extend PyTorch (PyTorch), do not support graphs where the data cannot fit in a single device. NeuGraph (Ma et al., 2019) supports GNN computation on multiple GPUs in a single machine. AliGraph (Yang, 2019) is a distributed GNN framework but only uses CPUs rather than GPUs.\r\n\r\n
Sampling in GNNs. As discussed in Section 2.1, due to the highly connected nature of real-world graphs, computing $h_v^{(k)}$ may require accessing more data than the GPU memory capacity. A number of sampling techniques have been proposed to support GNN training on large graphs, by down-sampling the neighbors of each vertex (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2018). The sampling techniques can be formalized as follows.\r\n\r\n
$$a_v^{(k)} = \\mathrm{AGGREGATE}^{(k)}\\left(\\{ h_u^{(k-1)} \\mid u \\in \\hat{N}(v) \\}\\right) \\quad (3)$$\r\n\r\n
where $\\hat{N}(v)$ is the sampled subset of $N(v)$ with a size limit. For example, GraphSAGE (Hamilton et al., 2017) samples at most 25 neighbors for each vertex (i.e., $|\\hat{N}(v)| \\le 25$), while a vertex may actually have thousands of neighbors.\r\n\r\n
Our evaluation shows that existing sampling techniques come with potential model accuracy loss for large real-world graphs. This observation is consistent with previous work (Hamilton et al., 2017). ROC provides an orthogonal approach to support GNN training on large graphs. Any existing sampling technique can be additionally applied in ROC to further accelerate large-scale GNN training.\r\n\r\n
Graph frameworks and graph partitioning. A number of distributed graph processing frameworks (Malewicz et al., 2010; Gonzalez et al., 2014; Jia et al., 2017) have been proposed to accelerate data-intensive graph applications. These systems generally adopt the Gather-Apply-Scatter (GAS) (Gonzalez et al., 2012) vertex-centric programming model. GAS can naturally express the data propagation in GNNs, but cannot support many neural network operations. For example, computing the attention scores (Veli\u010dkovi\u0107 et al., 2018) between vertices not directly connected cannot be easily expressed in the GAS model.\r\n\r\n
Table 1 summarizes the graph partitioning strategies used in existing deep learning and graph processing frameworks. Deep learning frameworks (Abadi et al., 2016; Ma et al., 2019) typically partition data (e.g., tensors) equally across GPUs. On the other hand, graph processing frameworks use more complicated strategies to achieve load balance. For example, GraphX (Gonzalez et al., 2014) and Gemini (Zhu et al., 2016) statically partition input graphs by minimizing a heuristic objective function, such as the number of edges spanning different partitions. These simple objective functions can achieve good performance for data-intensive graph processing, but they do not work well for compute-intensive GNNs due to the highly varying per-vertex computation loads. Dynamic repartitioning (Venkataraman et al., 2013; Jia et al., 2017) exploits the iterative nature of many graph applications and rebalances the workload in each iteration based on the measured performance of previous iterations. This approach converges to a balanced workload distribution for GNN training, but is much less effective for inference, which computes the GNN model only once for each new graph. ROC uses an online-linear-regression-based algorithm to achieve balanced partitioning for both GNN training and inference, through jointly learning a cost model to predict the execution time of the GNN model on arbitrary graphs.\r\n\r\n
Table 1. The graph partitioning strategies used by different frameworks. Balanced training/inference indicates whether an approach can achieve balanced partitioning for GNN training/inference.\r\n
Frameworks              Partitioning Strategy   Balanced Training   Balanced Inference\r\n
TensorFlow, NeuGraph    Equal                   no                  no\r\n
GraphX, Gemini          Static                  no                  no\r\n
Presto, Lux             Dynamic                 yes                 no\r\n
ROC (ours)              Online learning         yes                 yes\r\n\r\n
3 ROC OVERVIEW\r\nFigure 2 shows an overview of ROC, which takes a GNN architecture and a graph as inputs, and distributes the GNN computations across multiple GPUs (potentially on different compute nodes) by partitioning the input graph into multiple subgraphs. Each GPU worker executes the GNN architecture on a subgraph, and communicates with CPU DRAM to obtain input tensors and save intermediate results. The communication is optimized by a per-GPU dynamic-programming-based memory manager (DPMM) to minimize data transfers between CPU and GPU memories.\r\n\r\n
Figure 2. ROC system overview. DPMM represents dynamic-programming-based memory manager. (Diagram components: GNN Architecture, Input Graph, Learning-based Graph Partitioner, Partitioned Subgraphs, Compute Nodes each with CPU DRAM and multiple DPMM/GPU pairs, Performance Measurements.)\r\n\r\n
ROC uses an online-linear-regression-based graph partitioner to address the unique load imbalance challenge of distributed GNN inference, where a trained GNN model is used to provide inference service on previously unseen graphs (Section 4). This problem exists today in real-world GNN inference services (Hamilton et al., 2017), and our partitioning technique improves the inference performance by up to 1.4\u00d7 compared to existing graph partitioning strategies. The graph partitioner is trained jointly with the training phase of the GNN architecture, and is also used to partition inference workloads on new input graphs that are not in the training dataset.\r\n\r\n
After graph partitioning, all subgraphs are sent to different GPUs to perform GNN computations in parallel. Instead of requiring all the intermediate results related to each subgraph to fit in GPU device memory, ROC uses the much larger CPU DRAM on the host machines to hold all the data, and treats the GPU memories as caches. Such a design allows us to support much larger GNN architectures and input graphs. However, transferring tensors between a GPU and the host DRAM has a major impact on runtime performance. ROC introduces a dynamic programming algorithm to quickly find a memory management strategy to minimize these data transfers (Section 5).\r\n\r\n
4 GRAPH PARTITIONER\r\nThe goal of the ROC graph partitioner is to discover a balanced partitioning for GNN training and inference on arbitrary input graphs, which is especially challenging for distributed inference on new graphs where no existing performance measurements are available. We introduce an online-linear-regression-based graph partitioner that takes the runtime performance measurements of previously processed graphs as training samples for a cost model, which is then used to predict performance on arbitrary new graphs and enable efficient partitioning.\r\n\r\n
We formulate graph partitioning for GNNs as an online learning task. The performance measurements on partitioned graphs are training samples. Each training iteration produces new data points, and the graph partitioner computes a balanced graph partitioning based on all existing data points.\r\n\r\n
4.1 Cost Model\r\nThe key component of the ROC graph partitioner is a cost model that predicts the execution time of computing a GNN layer on an arbitrary graph, which could be the whole or any subset of an input graph. Note that the cost model learns to predict the execution time of a GNN layer instead of an entire GNN architecture for two reasons. First, ROC exploits the composability of neural network architectures and the\r\n\r\n
Table 2. The vertex features used in the current cost model. 
The semantics of the features are described in Section 4.1. WS is the number of GPU threads in a warp, which is 32 for the V100 GPUs used in the experiments.\r\n
Feature   Definition                                                  Description\r\n
$x_1$     $1$                                                         the vertex itself\r\n
$x_2$     $|N(v)|$                                                    number of neighbors\r\n
$x_3$     $|C(v)|$                                                    continuity of neighbors\r\n
$x_4$     $\\sum_i \\lceil c_i(v) / WS \\rceil$                       # mem. accesses to load neighbors\r\n
$x_5$     $\\sum_i \\lceil c_i(v) \\times d_{in} / WS \\rceil$       # mem. accesses to load the activations of all neighbors\r\n\r\n
learned cost model can be directly applied to a variety of GNN architectures. Second, this approach allows ROC to gather much more training data in each training iteration. For a GNN architecture with N layers and P partitions, ROC collects (N \u00d7 P) training data points, while modeling the entire GNN architecture only provides P data points.\r\n\r\n
As collecting new training data points is expensive, requiring measuring GNN computations on GPU devices, we employ a simple linear regression model to minimize the number of trainable parameters. Our model assumes that the cost to perform a DNN operation on a vertex is linear in a collection of vertex features, such as number of neighbors, and the cost to run an arbitrary graph is the summation of the cost of all its vertices.\r\n\r\n
We formalize the cost for running a GNN layer l on an input graph G as follows.\r\n\r\n
features estimate the required memory accesses to GPU device memory. Recall that when multiple threads in a GPU warp issue memory references to consecutive memory addresses, the GPU automatically coalesces these references to a single memory access that is handled more efficiently. To describe continuity of a vertex\u2019s neighbors, we partition all neighbors of v as $C(v) = \\{c_1(v), \\ldots, c_{|C|}(v)\\}$, where each $c_i(v)$ is a range of consecutively numbered vertices. For example, for vertex $v_1$ with neighbors $\\{v_3, v_4, v_6, v_8\\}$, we have $c_1(v_1) = \\{v_3, v_4\\}$, $c_2(v_1) = \\{v_6\\}$, and $c_3(v_1) = \\{v_8\\}$. The feature $x_3(v)$ is the number of consecutive blocks in v\u2019s neighbors, which is 3 in the example. In addition, $x_4(v)$ and $x_5(v)$ estimate the number of GPU memory accesses to load all neighbors and their input activations.\r\n\r\n
The cost model can be easily extended to include new features to capture additional model- and hardware-specific information if needed.\r\n\r\n
4.2 Partitioning Algorithm\r\nUsing the learned cost model, the ROC graph partitioner computes a graph partitioning that achieves balanced workload distribution under the cost model.\r\n\r\n
ROC uses the graph partitioning strategy proposed by Lux (Jia et al., 2017) to maximize coalesced accesses to GPU device memory, which is critical to achieve optimized GPU performance. Each vertex in a graph is assigned a unique number between 0 and V \u2212 1, where V is the number of vertices in the graph. 
In ROC, each partition holds\r\n                                     X                                              consecutively numbered vertices, which allows us to use\r\n                                                                                    N\u22121numbers{p ,p ,...,p              }topartitionthegraphinto\r\n                       t(l,v)   =        wi(l)xi(v)                         (4)                         0  1       N\u22121\r\n                                       i                                            Nsubgraphswherethei-thsubgraphcontains all vertices\r\n                                     X              XX                              ranging from p       to p \u22121andtheir in-edges.\r\n                       t(l,G)   =         t(l,v) =           wixi(v)        (5)                      i\u22121     i\r\n                                     v\u2208G            v\u2208G i                           ROCpreprocesses an input graph by computing the partial\r\n                                = Xw Xx(v)=Xwx(G) (6)                               sumsofeachvertex feature, which allows ROC to estimate\r\n                                           i       i             i  i               the runtime performance of a subgraph in O(1) time. 
In\r\n                                       i     v\u2208G            i                       addition, ROC uses binary search to \ufb01nd a splitting point p\r\n                                                                                                                                                   i\r\n                where v denotes a vertex in the input graph G, wi(l) is a           in O(logV), and therefore computing balanced partitioning\r\n                trainable parameter for layer l, x (v) is the i-th feature of v,    onlytakesO(N logV)time,whereN andV arethenumber\r\n                                                   i\r\n                and xi(G) sums up the i-th feature of all vertices in G.            of partitions and input vertices, respectively.\r\n                Ourmodelminimizesthemeansquareerroroverallavail-                    5     MEMORYMANAGER\r\n                able data points.\r\n                                           N                                        Asdiscussed in Section 3, ROC performs all GNN computa-\r\n                                       1 X                       \u0001\r\n                          Loss(l) =            t(l,G ) \u2212 y(l,G ) 2          (7)     tions on GPUs to optimize runtime performance, but only\r\n                                      N              i          i                   requires all the GNN data to \ufb01t in the host CPU DRAM\r\n                                          i=1\r\n                where N is the total number of available data points for the        to support large GNN architectures and input graphs. The\r\n                GNNlayerl,andy(l,G )istheperformance measurement                    device memory of each GPU therefore only needs to cache\r\n                                         i                                          a subset of intermediate tensors, whose corresponding data\r\n                for the i-th data point.                                     
       transfers between CPU and GPU memories can be saved\r\n                Table 2 lists the vertex features used in the cost model;           to reduce communication cost. How to select this subset\r\n                x (v) and x (v) capture the computation workload associ-            of tensors to minimize the data transfers within the limited\r\n                 1           2                                                      GPUmemoryisacritical memory management problem.\r\n                ated with vertex v and its edges, respectively. The remaining\r\n                                   ImprovingtheAccuracy,Scalability, and Performance of Graph Neural Networks with ROC\r\n                                  \u2460                              \u2461                                                           Forward Processing\r\n                                    Gather                        Linear+ReLU                      \u2462 Linear                        \u2463\r\n                                   Forward                         Forward                          Forward                       softmax\r\n                                                                                                           \r\n                                                                                                           \r\n                              Gather                   Linear+ReLU                     Linear           \r\n                                    Backward                      \u2465 Backward                          \u2464Backward\r\n                                 \u2466                                                                                             Back Propagation\r\n                Figure 3. The computation graph of a toy 1-layer GIN architecture (Xu et al., 2019). A box represents an operation, and a circle represents\r\n                a tensor. Arrows indicate dependencies between tensors and operations. 
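Returning to Sections 4.1 and 4.2, the vertex features of Table 2 and the partial-sum plus binary-search partitioning can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ROC's implementation: the function names are hypothetical, the length of each run c_i(v) is used inside the ceilings of x_4 and x_5, the weights are assumed non-negative so that the cumulative cost is monotone, and each split point is taken to be the first vertex at which the cumulative predicted cost reaches i/N of the total, which is only one plausible balancing rule.

```python
from math import ceil

WS = 32  # threads per warp, as stated in the caption of Table 2


def vertex_features(neighbors, d_in):
    """Return [x_1, ..., x_5] of Table 2 for one vertex (neighbors sorted)."""
    runs = []  # C(v): maximal runs of consecutively numbered neighbors
    for u in neighbors:
        if runs and u == runs[-1][1] + 1:
            runs[-1][1] = u          # extend the current run
        else:
            runs.append([u, u])      # start a new run
    lengths = [hi - lo + 1 for lo, hi in runs]
    x4 = sum(ceil(n / WS) for n in lengths)
    x5 = sum(ceil(n * d_in / WS) for n in lengths)
    return [1, len(neighbors), len(runs), x4, x5]


def feature_prefix_sums(adj, d_in):
    """prefix[k][i] = sum of feature x_i over vertices 0..k-1."""
    prefix = [[0] * 5]
    for nbrs in adj:
        x = vertex_features(sorted(nbrs), d_in)
        prefix.append([p + xi for p, xi in zip(prefix[-1], x)])
    return prefix


def range_cost(prefix, weights, lo, hi):
    """Predicted cost of vertices lo..hi-1 (Eq. 6), in O(1) per query."""
    return sum(w * (b - a) for w, a, b in zip(weights, prefix[lo], prefix[hi]))


def balanced_splits(prefix, weights, num_parts):
    """Split points [0, p_1, ..., V]; one binary search per split point."""
    num_vertices = len(prefix) - 1
    total = range_cost(prefix, weights, 0, num_vertices)
    splits = [0]
    for i in range(1, num_parts):
        target = total * i / num_parts
        lo, hi = splits[-1], num_vertices
        while lo < hi:  # smallest p with cost(0, p) >= target
            mid = (lo + hi) // 2
            if range_cost(prefix, weights, 0, mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        splits.append(lo)
    splits.append(num_vertices)
    return splits
```

For the example vertex with neighbors {v_3, v_4, v_6, v_8}, vertex_features([3, 4, 6, 8], 64) reports three runs for x_3, matching the text. Keeping one partial sum per feature, rather than per predicted cost, means the same preprocessing can be reused whenever the weights w_i(l) are re-learned.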
In Figure 3, the gather operation performs neighborhood aggregation. The linear and the following ReLU are fused into a single operation as a common optimization in existing frameworks. h0 and g denote the input features and neighbors of all vertices, respectively. w_1 and w_2 are the weights of the two linear layers.

Table 3. All the valid states and their active tensors for the GNN architecture in Figure 3.

Valid State S            Active Tensors A(S)
{①}                      {g, a}
{①,②}                    {g, a, b, w_1}
{①,②,③}                  {g, a, b, h1, w_1, w_2}
{①,②,③,④}                {g, a, b, w_1, w_2, ∇L(h1)}
{①,②,③,④,⑤}              {g, a, b, w_1, ∇L(b)}
{①,②,③,④,⑤,⑥}            {g, a, ∇L(a)}
{①,②,③,④,⑤,⑥,⑦}          {}

The optimal strategy depends not only on the GPU device memory capacity and the sizes of the input graph and GNN tensors, but also on the topology of the GNN architecture, which determines the reuse distance for each tensor.

The page replacement algorithms for memory management in operating systems (Aho et al., 1971) assume pages are all the same size and that pages are accessed sequentially. Neither assumption holds for GNN computations since tensors generally have different sizes, and an operator may access multiple tensors simultaneously.

ROC formulates GPU memory management as a cost minimization problem: given an input graph, a GNN architecture, and a GPU device, find the subset of tensors to cache in the GPU memory that minimizes data transfers between the CPU and GPU. ROC introduces a dynamic programming algorithm to quickly find a globally optimal solution.

The key insight of the dynamic programming algorithm is that, at each stage of the computation, we only need to consider caching tensors that will be reused by future operations. For a GNN architecture G, we define a state S to be the set of operations that have already been performed in G. A state is valid only if the operations it contains preserve all the data dependencies in G, i.e., for any operation in S, all its predecessor operations in G must also be in S. Such a definition allows the valid states to capture all possible execution orderings of the operators in G. For each state S, we define its active tensors A(S) to be the set of tensors that were produced by the operations in S and will be consumed as inputs by the operations outside of S. Intuitively, A(S) captures all the tensors we can cache in the GPU to eliminate future data transfers at the stage S.

Figure 3 shows the computation graph of a toy 1-layer Graph Isomorphism Network (Xu et al., 2019), whose computation can be formalized as follows.

h_v^{(1)} = W_2 \times \mathrm{ReLU}( W_1 \times \sum_{u \in N(v)} h_u^{(0)} )    (8)

For this GNN architecture, all the valid states and their active tensors are listed in Table 3.

Since the valid states represent all the possible execution orderings of the GNN, we can use dynamic programming to compute the optimal memory management strategy associated with each execution state. Algorithm 1 shows the pseudocode. COST(S, T) computes the minimum data transfers required to compute all the operations in a state S, with T being the set of tensors cached in the GPU memory; T should be a subset of A(S). We reduce the task of computing COST(S, T) to smaller tasks by enumerating the last operation to perform in S (Line 11). The cost is the data transfers needed to perform this last operation (xfer in Line 15) plus the cost of the corresponding previous state (S′, T′). To improve performance, we leverage memoization to only evaluate COST(S, T) once for each (S, T) pair.

Time and space complexity. Overall, the time and space complexity of Algorithm 1 are O(S^2 T) and O(ST), respectively, where S is the number of possible execution states for a GNN architecture, and T is the maximum number of available tensor sets for a state. We observed that S and T are at most 16 and 4096 for all GNN architectures in our experiments, making it practical to use the dynamic programming algorithm to minimize data transfer cost.

Algorithm 1 A recursive dynamic programming algorithm for computing minimum data transfers. IN(o_i) and OUT(o_i) return the input and output tensors of the operation o_i, respectively, and size(T) returns the memory space required to save all tensors in T.

 1: Input: An input graph g, a GNN architecture G, and the GPU device memory capacity cap.
 2: Output: Minimum data transfers required to compute G on g within capacity cap.
 3: ▷ D is a database storing all computed COST functions.
 4:
 5: function COST(S, T)
 6:     if (S, T) ∈ D then
 7:         return D(S, T)
 8:     if S is ∅ then
 9:         return size(T)
10:     cost ← ∞
11:     for o_i ∈ S do
12:         if (S \ o_i) is a valid state then
13:             S′ ← S \ o_i
14:             T′ ← (T \ OUT(o_i)) ∩ A(S′)
15:             xfer ← size(IN(o_i) \ T′)
16:             if size(T ∪ IN(o_i) ∪ OUT(o_i)) ≤ cap then
17:                 cost = min{cost, COST(S′, T′) + xfer}
18:     D(S, T) ← cost
19:     return D(S, T)

6 IMPLEMENTATION

ROC is implemented on top of FlexFlow (Jia et al., 2019), a distributed multi-GPU runtime for high-performance DNN training. We extended FlexFlow in the following aspects to support efficient GNN computations. First, we have replaced the equal partitioning strategy in FlexFlow with a fine-grained partitioning interface that supports splitting tensors at arbitrary points. This extension is critical to efficient partitioning for GNN computations. Second, we have added a graph propagation engine to support neighborhood aggregation operations in GNNs, such as the gather operation in Figure 3. We have reused the highly optimized CUDA kernels in Lux (Jia et al., 2017) to perform graph propagation on GPUs. This allows ROC to directly benefit from all kernel-level optimizations in Lux.

7 EVALUATION

In this section, we aim to evaluate the following points:

  • Can ROC achieve comparable runtime performance compared to state-of-the-art GNN frameworks on a single GPU?
  • Can ROC improve the end-to-end performance of distributed GNN training and inference?
  • Can we improve the model accuracy on existing datasets by using larger and more sophisticated GNNs?

7.1 Experimental Setup

GNN architectures. We use three real-world GNN architectures to evaluate ROC. GCN is a widely used graph convolutional network for semi-supervised learning on graph-structured data (Kipf & Welling, 2016). GIN is provably the most expressive GNN architecture for the Weisfeiler-Lehman graph isomorphism test (Xu et al., 2019). CommNet consists of multiple cooperating agents that learn to communicate amongst themselves before taking actions (Sukhbaatar et al., 2016).

Datasets. We use four real-world graph datasets in our evaluation, listed in Table 4. Pubmed is a citation network dataset (Sen et al., 2008), containing sparse bag-of-words feature vectors for each document (i.e., vertex), and citation links between documents (i.e., edges). PPI contains a number of protein-protein interaction graphs, each of which represents a human tissue (Hamilton et al., 2017). Reddit is an online discussion forum dataset, with each node being a post, and each edge being a comment between posts (Hamilton et al., 2017). Amazon is the product dataset from Amazon (He & McAuley, 2016). Each node is a product, and each edge represents also-viewed information between products. The task is to categorize a product using its description and also-viewed relations.

Table 4. Graph datasets used in our evaluation.

Dataset    Vertices     Edges          Features   Labels
Pubmed     19,717       108,365        500        3
PPI        56,944       1,612,348      700        121
Reddit     232,965      114,848,857    602        41
Amazon     9,430,088    231,594,310    300        24

All experiments were performed on a GPU cluster with 4 compute nodes, each of which contains two Intel 10-core E5-2600 CPUs, 256GB DRAM, and four NVIDIA Tesla P100 GPUs. GPUs on the same node are connected with NVLink, and nodes are connected with 100Gb/s EDR Infiniband.

For each training experiment, the ROC graph partitioner learned a new cost model by only using performance measurements obtained during the single experiment. For each inference experiment, the graph partitioner used the learned cost model from the training phase on the same dataset. Unless otherwise stated, all experiments use the same training/validation/test splits as prior work (Hamilton et al., 2017; Kipf & Welling, 2016; He & McAuley, 2016). All training throughput and inference latency were measured by averaging 1,000 iterations.
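Learning a cost model from the measurements of a single experiment, as described in Section 7.1, amounts to minimizing the squared error of Eq. (7) for each GNN layer. The sketch below is a minimal pure-Python illustration under stated assumptions: it solves the normal equations with Gauss-Jordan elimination, which is a choice made here because the solver is not specified in the text, and fit_cost_model and predict are hypothetical names. Each row of features holds the summed features x_i(G_j) of one measured subgraph, and times holds the corresponding measurements y(l, G_j).

```python
def fit_cost_model(features, times):
    """Return weights w minimizing sum_j (w . x(G_j) - y_j)^2 for one layer."""
    k = len(features[0])
    # Augmented normal equations [X^T X | X^T y].
    a = [[sum(row[i] * row[j] for row in features) for j in range(k)]
         + [sum(row[i] * t for row, t in zip(features, times))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError('features are linearly dependent; collect more data points')
        a[col], a[pivot] = a[pivot], a[col]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def predict(weights, x):
    """Predicted execution time t(l, G) for summed features x (Eq. 6)."""
    return sum(w * xi for w, xi in zip(weights, x))
```

With only a handful of weights per layer, such a fit is cheap to repeat after every training iteration as new data points arrive, which is consistent with the motivation for a linear model given in Section 4.1.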
[Figure 4: training throughput (epochs/s) of TensorFlow, DGL, PyG, and ROC for GCN, GIN, and CommNet; panels: Pubmed, Reddit.]

Figure 4. End-to-end training throughput comparison between existing GNN frameworks and ROC on a single P100 GPU (higher is better).

[Figure 5: training throughput (epochs/s) of NeuGraph and ROC versus number of GPU devices, 1(1), 2(1), 4(1), 8(2), 16(4); panels: Reddit, Amazon.]

Figure 5. Training throughput comparison between NeuGraph and ROC using different numbers of GPUs (higher is better). Numbers in parentheses are the number of compute nodes used in the experiments.

7.2 Single-GPU Results

First, we compare the end-to-end training performance of ROC with existing GNN frameworks on a single GPU. Due to the small device memory on a single GPU, we limited these experiments to graphs that can fit in a single GPU.

Figure 4 shows the results among TensorFlow (Abadi et al., 2016), DGL (DGL, 2018), PyG (Fey & Lenssen, 2019), and ROC. We expected that ROC would be slightly slower than the other frameworks on a single GPU, since it writes the output tensors of each operator back to CPU DRAM for distributed computation, while other frameworks keep all tensors in a single GPU, and do not involve such data transfers. However, for these graphs, ROC reuses cached tensors on the GPU to minimize data transfers from DRAM to GPU, and overlaps the data transfers back to DRAM with subsequent GNN computations.

TensorFlow, DGL, and PyG were not able to run the Reddit dataset due to out-of-device-memory errors. ROC can still train Reddit on a single GPU, by using DRAM to save some of the intermediate tensors.

7.3 Multi-GPU Results

Second, we compare the end-to-end training performance of ROC with NeuGraph. NeuGraph supports GNN training across multiple GPUs on a single compute node.

A NeuGraph implementation is not yet available publicly, so we ran ROC using the same GPU version and software library versions cited in Ma et al. (2019) and directly compared with the performance numbers reported in the paper. We also disabled NVLink for this experiment to rule out the effect of NVLink, which was not used in Ma et al. (2019). We do not claim that these comparisons control for all possible differences as well as directly executing both systems on the same machine, but that preferred approach is simply not possible at this time.

Figure 5 shows the results. For experiments on a single compute node, ROC outperforms NeuGraph by up to 4×. The speedup is mainly because of the graph partitioning and memory management optimizations that are not available in NeuGraph. First, NeuGraph uses the equal vertex partitioning strategy that equally distributes the vertices across multiple GPUs. Section 7.6 shows that the linear regression-based graph partitioner in ROC improves training throughput by up to 1.4× compared to the equal vertex partitioning strategy. Second, NeuGraph uses a stream processing approach that partitions each GNN operation into multiple chunks, and sequentially streams each chunk along with its input data to GPUs. Therefore, it does not consider the memory management optimization used in ROC, and Section 7.7 shows that the ROC memory manager improves training throughput by up to 2×.

The remaining performance improvement is likely due to other aspects of ROC, such as the use of the highly optimized CUDA kernels in Lux for fast graph propagation, and the performance of the underlying Legion runtime (Bauer et al., 2012). However, we were not able to further investigate the performance difference due to the absence of a publicly available implementation of NeuGraph.

7.4 Comparison with Graph Sampling

We compare the training performance of ROC with state-of-the-art graph sampling approaches on the Reddit dataset. All frameworks use the same GCN model (Kipf & Welling, 2016). ROC performs full-batch training on the entire graph as in Kipf & Welling (2016), while GraphSAGE and FastGCN use mini-batch sampling with a batch size of 512.

Figure 6 shows the time-to-accuracy comparison on a single P100 GPU, where the x-axis shows the end-to-end training time for each epoch, and the y-axis shows the test accuracy of the current model at the end of each epoch. For

[Figure 6: test accuracy (0.75 to 0.95) versus time in seconds (0 to 180) for ROC, GraphSAGE, and FastGCN.]

Figure 6. Time-to-accuracy comparison between state-of-the-art sampling techniques and ROC on the Reddit dataset (Hamilton et al., 2017). All experiments used the same GCN model. ROC performed full-batch training on the entire graph, while GraphSAGE and FastGCN performed mini-batch sampling. Each dot indicates one training epoch for GraphSAGE and FastGCN, and five epochs for ROC.

[Figure 8: training throughput (epochs/s) versus number of GPUs, 1(1), 2(1), 4(1), 8(2), 16(4), for Equal Edge Partition, Equal Node Partition, and ROC.]

Figure 8. Training throughput comparison among different graph partitioning strategies on the Reddit dataset (higher is better). Numbers in parentheses are the number of compute nodes used.
7.5   DeeperandLargerGNNArchitectures\r\n                                                                                  ROC enables the exploration of larger and more sophisti-\r\n                                                                                  cated GNN architectures than those possible in existing\r\n                                                         96.9\r\n                         97                                                       frameworks. As a demonstration, we consider a class of\r\n                                                                                  deep GNNarchitectures formed by stacking multiple GCN\r\n                                                                                  layers (Kipf & Welling, 2016). We add residual connec-\r\n                         96\r\n                                                                                  tions (He et al., 2016) between subsequent GCN layers to\r\n                                                         GraphSAGE                facilitate training of deeper GNN architectures by allowing\r\n                         95                                                       to preserve information learned from previous layers.\r\n                                                       Original GCN\r\n                                                                                  Formally, each layer of our GNN is de\ufb01ned as follows.\r\n                                   2 GCN Layers\r\n                         94\r\n                                   3 GCN Layers\r\n                                                           FastGCN\r\n                                   4 GCN Layers                                              (\r\n                        Test Accuracy on Reddit (%)\r\n                         93                                                                     GCN(H(k))+H(k)            d(H(k+1)) = d(H(k))\r\n                   
          16     32      64    128     256    512              H(k+1) =\r\n                               Number of Activations Per Layer                                  GCN(H(k))+WH(k) d(H(k+1))6=d(H(k))\r\n               Figure 7. Test accuracy on the Reddit dataset using deeper and\r\n               larger GNN architectures. The dotted lines show the best test      where GCN is the original GCN layer (Kipf & Welling,\r\n               accuracy achieved by GraphSAGE (95.4%), FastGCN (93.7%),           2016), and d(\u00b7) is the number of activations in the input\r\n               and the original GCN architecture (94.7%), respectively.           tensor. When H(k) and H(k+1) have the same number of\r\n                                                                                  activations, we directly insert a residual connection between\r\n                                                                                  the two layers. When H(k) and H(k+1) have different num-\r\n                                                                                  bers of activations, we use a linear layer to transform H(k)\r\n               GraphSAGE and FastGCN, each dot indicates one train-               to the desired shape. This design allows us to add residual\r\n               ing epoch, while for ROC each dot represents \ufb01ve training          connections for all GCN layers.\r\n               epochs for simplicity. Note that GraphSAGE and FastGCN\r\n               can achieve relatively high accuracy within a few training         We increase the depth (i.e., number of GCN layers) and\r\n               epochs. For example, GraphSAGE achieves 93.4% test ac-             width (i.e., number of activations per layer) to obtain larger\r\n               curacy in two epochs. 
However, ROC requires around 20              and deeper GNN architectures beyond the commonly used\r\n               epochs to achieve the same test accuracy because ROC uses          2-layer GNNs. Figure 7 shows the accuracy achieved by our\r\n               full-batch training (following Kipf & Welling (2016)), and         GNNarchitectures on the Reddit dataset. The \ufb01gure shows\r\n               only updates parameters once per epoch, while existing sam-        that improved accuracy can be obtained by increasing the\r\n               pling approaches generally perform mini-batch training and         depth and width of a GNN architecture. As a result, our\r\n               have more frequent parameter updates. Even though ROC              GNNarchitectures achieve up to 96.9% test accuracy on\r\n               usesmoreepochs,itisstillasfastorfasterthanGraphSAGE                the Reddit dataset, outperforming state-of-the-art sampling\r\n               and FastGCN to any given level of accuracy.                        
techniques by 1.5%.\r\n                                    ImprovingtheAccuracy,Scalability, and Performance of Graph Neural Networks with ROC\r\n                                        Equal Edge Partition\r\n                             300                                                                        7.29                       1.01\r\n                                        Equal Node Partition                                         7                        1.0\r\n                             250\r\n                                        Roc                                                          6\r\n                                                                                                                              0.8\r\n                             200                                                                     5                                   0.63\r\n                                                                                                     4                        0.6              0.55\r\n                             150\r\n                                                                                                     3                        0.4\r\n                                                                                                              2.15\r\n                             100                                                                     2\r\n                                                                                                                    1.45      0.2\r\n                                                                                                     1\r\n                              50                                                                                              Per-epoch Run Time (s)\r\n                                                                                                     0  No    LRU   Roc       0.0  No   LRU   Roc\r\n              
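The residual layer defined in Section 7.5 can be illustrated with a minimal pure-Python sketch. It is not ROC's implementation. The helper names (`matmul`, `normalized_adjacency`, `gcn`, `residual_gcn_layer`) and the tiny three-vertex graph are hypothetical. The sketch assumes the GCN layer of Kipf & Welling (2016), ReLU applied to the normalized adjacency times H times W, and assumes the skip connection is added after the activation, which the text does not specify. Features are stored one row per vertex, so the linear transform written as W H^(k) in the paper becomes a right-multiplication here.

```python
# Illustrative sketch only: dense lists-of-rows matrices, no GPU, no partitioning.

def matmul(a, b):
    # Dense matrix product; both matrices are lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(edges, n):
    # D^{-1/2} (A + I) D^{-1/2} for an undirected graph with n vertices.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]

def gcn(a_hat, h, w):
    # GCN layer (Kipf & Welling, 2016): ReLU(A_hat @ H @ W).
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, x) for x in row] for row in z]

def residual_gcn_layer(a_hat, h, w_gcn, w_proj=None):
    # H_next = GCN(H) + H          when the width is unchanged,
    # H_next = GCN(H) + H @ W_proj when the layer changes the width.
    out = gcn(a_hat, h, w_gcn)
    if len(out[0]) == len(h[0]):
        skip = h
    else:
        if w_proj is None:
            raise ValueError('w_proj is required when the layer changes the width')
        skip = matmul(h, w_proj)  # linear layer that reshapes the skip path
    return [[x + y for x, y in zip(r_out, r_skip)] for r_out, r_skip in zip(out, skip)]

# Usage on a 3-vertex path graph 0 - 1 - 2 with made-up weights.
a_hat = normalized_adjacency([(0, 1), (1, 2)], 3)
h0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # 2 activations per vertex
w_same = [[0.5, -0.5], [0.25, 0.75]]           # keeps the width at 2
w_wide = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # widens to 3 activations
w_proj = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # projects the skip path 2 -> 3
h1 = residual_gcn_layer(a_hat, h0, w_same)
h2 = residual_gcn_layer(a_hat, h1, w_wide, w_proj)
print(len(h1[0]), len(h2[0]))
```

Because every layer has a skip path, either the identity or the projection, layers of different widths can be stacked freely, which is what allows depth and width to be varied independently in the experiments of Figure 7.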
Figure 9. End-to-end inference time for the test graphs in the PPI dataset (lower is better). The numbers were measured by averaging the inference time of the four test graphs. [Plot: end-to-end inference time in ms versus number of GPUs, 1(1), 2(1), 4(1), 8(2), and 16(4), for equal edge partition, equal node partition, and ROC.]

Figure 10. Performance comparison among different memory management strategies (lower is better). All numbers are measured by training GCN on the Reddit dataset on a single GPU. (a) Data transfers: per-epoch data transfers of 7.29 GB with no cache, 2.15 GB with LRU, and 1.45 GB with ROC. (b) Training time: per-epoch run time of 1.01 s with no cache, 0.63 s with LRU, and 0.55 s with ROC.

7.6 Graph Partitioning

To evaluate the linear regression-based graph partitioner in ROC, we compare the performance of the graph partitioning achieved by ROC with (1) equal vertex partitioning and (2) equal edge partitioning; (1) is used in NeuGraph to parallelize GNN training, and (2) has been widely used in previous graph processing systems. Figure 8 shows the training throughput comparison on different sets of GPUs. Neither of these baseline strategies performs as well as the ROC linear regression-based partitioner.

To evaluate the distributed inference performance on new graphs not used during training, we used the PPI dataset containing 24 protein graphs. Following prior work (Hamilton et al., 2017), we trained the GIN architecture on 20 graphs and measured the inference latency on the remaining four graphs, using the graph partitioner learned during training. Figure 9 shows that the learned cost model enables the graph partitioner to discover efficient partitioning on new graphs for inference services, reducing the inference latency by up to 1.2×. For the PPI graphs, the distributed inference across multiple compute nodes achieves worse performance than the inference on a single node, which is due to the small sizes of the inference graphs.

7.7 Memory Management

We evaluate the performance of the ROC memory manager by comparing it with (1) the streaming processing approach in NeuGraph that streams input data along with computation (i.e., no caching optimization) and (2) the least-recently-used (LRU) cache replacement policy.

Figure 10 shows the comparison results for training GCN on the Reddit dataset on a single GPU. The dynamic programming-based memory manager reduces the data transfers between GPU and DRAM by 1.4–5× and reduces the per-epoch training time by 1.2–2× compared with the baseline memory management strategies.

8 CONCLUSION

ROC is a distributed multi-GPU framework for high-performance and large-scale GNN training and inference. ROC partitions an input graph onto multiple GPUs on multiple compute nodes using an online-linear-regression-based strategy to achieve load balance, and coordinates optimized data transfers between GPU devices and host CPU memories with a dynamic programming algorithm. ROC increases the performance by up to 4× over existing GNN frameworks, and offers better scalability. The ability to process larger graphs and GNN architectures additionally enables model accuracy improvements. We achieve new state-of-the-art classification accuracy on the Reddit dataset by using significantly deeper and larger GNN architectures.

ACKNOWLEDGEMENT

This work was supported by NSF grant CCF-1409813, the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and is based on research sponsored by DARPA under agreement number FA84750-14-2-0006. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Cisco and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

Deep Graph Library: towards efficient and scalable deep learning on graphs. https://www.dgl.ai/, 2018.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.

Aho, A. V., Denning, P. J., and Ullman, J. D. Principles of optimal page replacement. Journal of the ACM (JACM), 18(1):80–93, 1971.

Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.

Caffe2. A New Lightweight, Modular, and Scalable Deep Learning Framework. https://caffe2.ai, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018.

Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. Powergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, 2012.

Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, 2014.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30. 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.

He, R. and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW’16. International World Wide Web Conferences Steering Committee, 2016.

Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M., and Aiken, A. A distributed multi-gpu system for fast graph processing. Proc. VLDB Endow., 11(3), November 2017.

Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning, SysML’19, 2019.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Ma, L., Yang, Z., Miao, Y., Xue, J., Wu, M., Zhou, L., and Dai, Y. Neugraph: Parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, 2019.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD’10, 2010.

PyTorch. Tensors and Dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org, 2017.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 29(3):93–93, 2008.

Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations, 2018.

Venkataraman, S., Bodzsar, E., Roy, I., AuYoung, A., and Schreiber, R. S. Presto: Distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, 2013.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

Yang, H. Aligraph: A comprehensive graph neural network platform. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD ’19, 2019.
doi: 10.1145/3292500.3340404. URL http://dx.doi.org/10.1145/3292500.3340404.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pp. 974–983, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219890. URL http://doi.acm.org/10.1145/3219819.3219890.

Zhu, X., Chen, W., Zheng, W., and Ma, X. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 2016.", "award": [], "sourceid": 83, "authors": [{"given_name": "Zhihao", "family_name": "Jia", "institution": "Stanford University"}, {"given_name": "Sina", "family_name": "Lin", "institution": "Microsoft"}, {"given_name": "Mingyu", "family_name": "Gao", "institution": "Tsinghua University"}, {"given_name": "Matei", "family_name": "Zaharia", "institution": "Stanford and Databricks"}, {"given_name": "Alex", "family_name": "Aiken", "institution": "Stanford University"}]}