{"title": "Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc", "book": "Proceedings of Machine Learning and Systems", "page_first": 187, "page_last": 198, "abstract": "Graph neural networks (GNNs) have been demonstrated to be an effective model for learning tasks related to graph structured data.\nDifferent from classical deep neural networks which handle relatively small individual samples, GNNs process very large graphs, which must be partitioned and processed in a distributed manner.\nWe present Roc, a distributed multi-GPU framework for fast GNN training and inference on graphs.\nRoc is up to 4x faster than existing GNN frameworks on a single machine, and can scale to multiple GPUs on multiple machines.\nThis performance gain is mainly enabled by Roc's graph partitioning and memory management optimizations.\nBesides performance acceleration, the better scalability of Roc also enables the exploration of more sophisticated GNN architectures on large, real-world graphs.\nWe demonstrate that a class of GNN architectures significantly deeper and larger than the typical two-layer models can achieve new state-of-the-art classification accuracy on the widely used Reddit dataset.", "full_text": " IMPROVING THE ACCURACY, SCALABILITY, AND PERFORMANCE OF GRAPH NEURAL NETWORKS WITH ROC

Zhihao Jia 1, Sina Lin 2, Mingyu Gao 3, Matei Zaharia 1, Alex Aiken 1

ABSTRACT

Graph neural networks (GNNs) have been demonstrated to be an effective model for learning tasks related to graph structured data. Different from classical deep neural networks that handle relatively small individual samples, GNNs process very large graphs, which must be partitioned and processed in a distributed manner. We present ROC, a distributed multi-GPU framework for fast GNN training and inference on graphs. ROC is up to 4× faster than existing GNN frameworks on a single machine, and can scale to multiple GPUs on multiple machines. 
This performance gain is mainly enabled by ROC's graph partitioning and memory management optimizations. Besides performance acceleration, the better scalability of ROC also enables the exploration of more sophisticated GNN architectures on large, real-world graphs. We demonstrate that a class of GNN architectures significantly deeper and larger than the typical two-layer models can achieve new state-of-the-art classification accuracy on the widely used Reddit dataset.

1 Stanford University, 2 Microsoft, 3 Tsinghua University. Correspondence to: Zhihao Jia.

Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. Copyright 2020 by the author(s).

1 INTRODUCTION

Graphs provide a natural way to represent real-world data with relational structures, such as social networks, molecular networks, and webpage graphs. Recent work has extended deep neural networks (DNNs) to extract high-level features from data sets structured as graphs, and the resulting architectures, known as graph neural networks (GNNs), have recently achieved state-of-the-art prediction performance across a number of graph-related tasks, including vertex classification, graph classification, and link prediction (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2019).

Figure 1. Computation of one vertex (in red) in a GNN layer by first aggregating its neighbors' activations (in blue), and then applying DNN operations.

GNNs combine DNN operations (e.g., convolution and matrix multiplication) with iterative graph propagation: in each GNN layer, the activations of each vertex are computed with a set of DNN operations, using the activations of its neighbors from the previous GNN layer as inputs. Figure 1 illustrates the computation of one vertex (in red) in a GNN layer, which aggregates the activations from its neighbors (in blue), and then applies DNN operations to compute new activations of the vertex.

Existing deep learning frameworks do not easily support GNN training and inference at scale. TensorFlow (Abadi et al., 2016), PyTorch (PyTorch), and Caffe2 (Caffe2) were originally designed to handle situations where the model and data collection can be large, but each sample of the collection is relatively small (e.g., a single image). These systems typically leverage data and/or model parallelism by partitioning the batch of input samples or the DNN models across multiple devices, such as GPUs, while each input sample is still stored on a single GPU and not partitioned. However, GNNs typically use small DNN models (a couple of layers) on very large and irregular input samples: graphs. These large graphs do not fit in a single device and so must be partitioned and processed in a distributed manner. Recent GNN frameworks such as DGL (DGL, 2018) and PyG (Fey & Lenssen, 2019) are implemented on top of PyTorch (PyTorch), and have the same scalability limitation. NeuGraph (Ma et al., 2019) stores intermediate GNN data in the host CPU DRAM to support multi-GPU training, but it is still limited to the compute resources of a single machine. AliGraph (Yang, 2019) is a distributed GNN framework on CPU platforms, which does not exploit GPUs for performance acceleration.

The current lack of system support has limited the potential application of GNN algorithms on large-scale graphs, and has also prevented the exploration of larger and more sophisticated GNN architectures. To alleviate these limitations, various sampling techniques (Hamilton et al., 2017; Ying et al., 2018) were introduced to first down-sample the original graphs before applying the GNN models, so that the data fit in a single device. Sampling allows existing frameworks to train larger graphs at the cost of potential model accuracy loss (Hamilton et al., 2017).

In this paper, we propose ROC, a distributed multi-GPU framework for fast GNN training and inference on large-scale graphs. ROC leverages the compute resources of multiple GPUs on multiple compute nodes to train large GNN models on the full real-world graphs, achieving up to a 4× performance improvement over existing GNN frameworks. Despite its use of full graphs, ROC also achieves better time-to-accuracy performance compared to existing sampling techniques. Moreover, the better scalability allows ROC to easily support larger and more sophisticated GNNs than those possible in existing frameworks. To demonstrate ROC's scalability and improved accuracy, we design a class of deep GNN architectures by stacking multiple GCN layers (Kipf & Welling, 2016). By using significantly larger and deeper GNN architectures, we improve the classification accuracy over state-of-the-art sampling techniques by 1.5% on the widely used Reddit dataset (Hamilton et al., 2017).

To achieve these results, ROC tackles two significant system challenges for distributed GNN computation.

Graph partitioning. Real-world graphs could have arbitrary sizes and variable per-vertex computation loads, which are challenging to partition in a balanced way (Gonzalez et al., 2014; Zhu et al., 2016). GNNs mix compute-intensive DNN operations with data-intensive graph propagation, making it hard to statically compute a good load-balancing partitioning. Furthermore, GNN inference requires partitioning new input graphs that only run for a few iterations, such as predicting the properties of newly discovered proteins (Hamilton et al., 2017), in which case existing dynamic repartitioning approaches do not work well (Venkataraman et al., 2013). ROC uses an online linear regression model to optimize graph partitioning. During the training phase of a GNN architecture, ROC learns a cost model for predicting the execution time of performing a GNN operation on an input (sub)graph. To capture the runtime performance of a GNN operation, the cost model includes both graph-related features such as the number of vertices and edges in the graph, and hardware-related features such as the number of GPU memory accesses to perform the operation. During each training iteration of a GNN architecture, ROC computes a graph partitioning using the run time predictions from the cost model, and uses the graph partitioning to parallelize training. At the end of each training iteration, the actual run time of the subgraphs is sent back to the ROC graph partitioner, which updates the cost model by minimizing the difference between the actual and predicted run times. We show that this linear regression-based graph partitioner outperforms existing static and dynamic graph partitioning strategies by up to 1.4×.

Memory management. In GNNs, computing even a single vertex requires accessing a potentially large number of neighbor vertices that may span multiple GPUs and compute nodes. These data transfers have a high impact on overall performance. The framework thus must carefully decide in which device memory (CPU or GPU) to store each intermediate tensor, in order to minimize data transfer costs. The memory management is hard to optimize manually, as the optimal strategy depends on the input graph size and topology as well as device constraints such as memory capacity and communication bandwidth. We formulate the task of optimizing data transfers as a cost minimization problem, and introduce a dynamic programming algorithm to quickly find a globally optimal strategy that minimizes data transfers between CPU and GPU memories. We compare the ROC memory management algorithm with existing heuristic approaches (Ma et al., 2019), and show that ROC reduces data transfer costs between CPU and GPU by 2×.

Overall, compared to NeuGraph, ROC improves the runtime performance by up to 4× for multi-GPU training on a single compute node. Beyond improved partitioning and memory management, ROC sees other smaller performance improvements from a more efficient distributed runtime (Jia et al., 2019) and the highly optimized kernels adopted from Lux for fast graph propagation on GPUs (Jia et al., 2017).

Besides performance acceleration, ROC also enables exact GNN computation on full original graphs without using sampling techniques, as well as the exploration of more sophisticated GNN architectures beyond the commonly used two-layer models. For large real-world graphs, we show that performing exact GNN computation on the original graphs and using larger and deeper GNN architectures can increase the model accuracy by up to 1.5% on the widely used Reddit dataset compared to existing sampling techniques.

To summarize, our contributions are:

- On the systems side, we present ROC, a distributed multi-GPU framework for fast GNN training and inference on large-scale graphs. ROC uses a novel online linear regression model to achieve efficient graph partitioning, and introduces a dynamic programming algorithm to minimize data transfer cost.

- On the machine learning side, ROC removes the necessity of using sampling techniques for GNN training on large graphs, and also enables the exploration of more sophisticated GNN architectures. We demonstrate this potential by achieving new state-of-the-art classification accuracy on the Reddit dataset.

2 BACKGROUND AND RELATED WORK

2.1 Graph Neural Networks

A GNN takes graph-structured data as input, and learns a representation vector for each vertex in the graph. The learned representation can be used for down-stream tasks such as vertex classification, graph classification, and link prediction (Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2019).

As shown in Figure 1, each GNN layer gathers the activations of the neighbor vertices from the previous GNN layer, and then updates the activations of the vertex, using DNN operations such as convolution or matrix multiplication. Formally, the computation in a GNN layer is:

  a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} \mid u \in N(v)\}\big)   (1)
  h_v^{(k)} = \mathrm{UPDATE}^{(k)}\big(a_v^{(k)}, h_v^{(k-1)}\big)   (2)

where h_v^{(k)} is the learned activation of vertex v at the k-th layer, h_v^{(0)} is the input features of v, and N(v) denotes v's neighbors in the graph. For each vertex, AGGREGATE gathers the activations of its neighbors using an accumulation function such as average or summation. For each vertex v, UPDATE computes its new activations h_v^{(k)} by combining its previous activations h_v^{(k-1)} and the neighborhood aggregation a_v^{(k)}. The activations of the last layer h_v^{(K)} capture the structural information for all neighbors within K hops of v, and can be used as the input for down-stream prediction tasks.

2.2 Related Work

Table 1. The graph partitioning strategies used by different frameworks. Balanced training/inference indicates whether an approach can achieve balanced partitioning for GNN training/inference.

  Frameworks              Partitioning Strategies   Balanced Training   Balanced Inference
  TensorFlow, NeuGraph    Equal                     -                   -
  GraphX, Gemini          Static                    -                   -
  Presto, Lux             Dynamic                   X                   -
  ROC (ours)              Online learning           X                   X

Distributed DNN training. In the terminology of Jia et al. (2019), DNN computations can be partitioned in the sample, operator, attribute, and parameter dimensions for parallel and distributed execution. The vast majority of existing deep learning frameworks (Abadi et al., 2016; PyTorch) use the sample (i.e., data parallelism) and operator dimensions (i.e., model parallelism) to parallelize training, but some recent works exploit multiple dimensions (Jia et al., 2019). One of the key differences with GNNs is that partitioning in the attribute dimension (i.e., partitioning large individual samples) is necessary for supporting GNN training on large graphs. The lack of system support for parallelizing in the attribute dimension prevents most existing DNN frameworks from training GNNs on large graphs.

GNN frameworks. Most of the existing GNN frameworks, such as DGL (DGL, 2018) and PyG (Fey & Lenssen, 2019) that extend PyTorch (PyTorch), do not support graphs where the data cannot fit in a single device. NeuGraph (Ma et al., 2019) supports GNN computation on multiple GPUs in a single machine. AliGraph (Yang, 2019) is a distributed GNN framework but only uses CPUs rather than GPUs.

Sampling in GNNs. As discussed in Section 2.1, due to the highly connected nature of real-world graphs, computing h_v^{(k)} may require accessing more data than the GPU memory capacity. A number of sampling techniques have been proposed to support GNN training on large graphs, by down-sampling the neighbors of each vertex (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2018). The sampling techniques can be formalized as follows:

  a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} \mid u \in \hat{N}(v)\}\big)   (3)

where \hat{N}(v) is the sampled subset of N(v) with a size limit. For example, GraphSAGE (Hamilton et al., 2017) samples at most 25 neighbors for each vertex (i.e., |\hat{N}(v)| ≤ 25), while a vertex may actually contain thousands of neighbors. Our evaluation shows that existing sampling techniques come with potential model accuracy loss for large real-world graphs. This observation is consistent with previous work (Hamilton et al., 2017). ROC provides an orthogonal approach to support GNN training on large graphs. Any existing sampling technique can be additionally applied in ROC to further accelerate large-scale GNN training.

Graph frameworks and graph partitioning. A number of distributed graph processing frameworks (Malewicz et al., 2010; Gonzalez et al., 2014; Jia et al., 2017) have been proposed to accelerate data-intensive graph applications. These systems generally adopt the Gather-Apply-Scatter (GAS) (Gonzalez et al., 2012) vertex-centric programming model. GAS can naturally express the data propagation in GNNs, but cannot support many neural network operations. For example, computing the attention scores (Veličković et al., 2018) between vertices not directly connected cannot be easily expressed in the GAS model.
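To make the per-layer computation of Eqs. (1)-(2) and the sampled variant of Eq. (3) concrete, here is a minimal NumPy sketch with a summation AGGREGATE and a linear-plus-ReLU UPDATE. This is one common GCN-like instantiation for illustration, not ROC's actual GPU kernels; the function name and signature are hypothetical.

```python
import numpy as np

def gnn_layer(H, neighbors, W, b, max_neighbors=None, rng=None):
    """One GNN layer: sum-AGGREGATE over each vertex's neighbors (Eq. 1),
    then a linear + ReLU UPDATE (Eq. 2). If max_neighbors is set, each
    neighbor list is down-sampled first, as in Eq. 3 (GraphSAGE-style).

    H         : (V, d_in) activations from the previous layer
    neighbors : list of neighbor-index lists, neighbors[v] = N(v)
    W, b      : weights of the linear UPDATE step
    """
    rng = rng or np.random.default_rng(0)
    agg = np.zeros_like(H)
    for v, nbrs in enumerate(neighbors):
        if max_neighbors is not None and len(nbrs) > max_neighbors:
            nbrs = rng.choice(nbrs, size=max_neighbors, replace=False)
        if len(nbrs):
            agg[v] = H[nbrs].sum(axis=0)       # AGGREGATE = summation
    return np.maximum(0.0, (H + agg) @ W + b)  # UPDATE = linear + ReLU
```

Stacking K such layers gives each vertex a receptive field of its K-hop neighborhood, which is why full-graph computation quickly exceeds a single GPU's memory.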
Table 1 summarizes the graph partitioning strategies used in existing deep learning and graph processing frameworks. Deep learning frameworks (Abadi et al., 2016; Ma et al., 2019) typically partition data (e.g., tensors) equally across GPUs. On the other hand, graph processing frameworks use more complicated strategies to achieve load balance. For example, GraphX (Gonzalez et al., 2014) and Gemini (Zhu et al., 2016) statically partition input graphs by minimizing a heuristic objective function, such as the number of edges spanning different partitions. These simple objective functions can achieve good performance for data-intensive graph processing, but they do not work well for compute-intensive GNNs due to the highly varying per-vertex computation loads. Dynamic repartitioning (Venkataraman et al., 2013; Jia et al., 2017) exploits the iterative nature of many graph applications and rebalances the workload in each iteration based on the measured performance of previous iterations. This approach converges to a balanced workload distribution for GNN training, but is much less effective for inference, which computes the GNN model only once for each new graph. ROC uses an online-linear-regression-based algorithm to achieve balanced partitioning for both GNN training and inference, through jointly learning a cost model to predict the execution time of the GNN model on arbitrary graphs.

3 ROC OVERVIEW

Figure 2. ROC system overview. DPMM represents the dynamic-programming-based memory manager.

Figure 2 shows an overview of ROC, which takes a GNN architecture and a graph as inputs, and distributes the GNN computations across multiple GPUs (potentially on different compute nodes) by partitioning the input graph into multiple subgraphs. Each GPU worker executes the GNN architecture on a subgraph, and communicates with CPU DRAM to obtain input tensors and save intermediate results. The communication is optimized by a per-GPU dynamic-programming-based memory manager (DPMM) to minimize data transfers between CPU and GPU memories.

ROC uses an online-linear-regression-based graph partitioner to address the unique load imbalance challenge of distributed GNN inference, where a trained GNN model is used to provide inference service on previously unseen graphs (Section 4). This problem exists today in real-world GNN inference services (Hamilton et al., 2017), and our partitioning technique improves the inference performance by up to 1.4× compared to existing graph partitioning strategies. The graph partitioner is trained jointly with the training phase of the GNN architecture, and is also used to partition inference workloads on new input graphs that are not in the training dataset.

After graph partitioning, all subgraphs are sent to different GPUs to perform GNN computations in parallel. Instead of requiring all the intermediate results related to each subgraph to fit in GPU device memory, ROC uses the much larger CPU DRAM on the host machines to hold all the data, and treats the GPU memories as caches. Such a design allows us to support much larger GNN architectures and input graphs. However, transferring tensors between a GPU and the host DRAM has a major impact on runtime performance. ROC introduces a dynamic programming algorithm to quickly find a memory management strategy that minimizes these data transfers (Section 5).

4 GRAPH PARTITIONER

The goal of the ROC graph partitioner is discovering balanced partitionings for GNN training and inference on arbitrary input graphs, which is especially challenging for distributed inference on new graphs where no existing performance measurements are available. We introduce an online-linear-regression-based graph partitioner that takes the runtime performance measurements of previously processed graphs as training samples for a cost model, which is then used to predict performance on arbitrary new graphs and enable efficient partitioning.

We formulate graph partitioning for GNNs as an online learning task. The performance measurements on partitioned graphs are training samples. Each training iteration produces new data points, and the graph partitioner computes a balanced graph partitioning based on all existing data points.
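As a sketch of this online-learning loop, each training iteration can append one (feature vector, measured runtime) sample per partitioned subgraph, and the linear cost model can be refit by least squares before the next repartitioning. The helper names below are illustrative, not ROC's actual API, and the feature vectors are assumed to have already been computed.

```python
import numpy as np

samples_X, samples_y = [], []   # accumulated across training iterations

def record_measurement(features, runtime):
    """Add one data point: the features of a subgraph and the measured
    execution time of a GNN layer on it."""
    samples_X.append(features)
    samples_y.append(runtime)

def fit_cost_model():
    """Refit the linear weights by least squares, minimizing the squared
    error between predicted and measured run times."""
    X = np.asarray(samples_X, dtype=float)
    y = np.asarray(samples_y, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_cost(w, features):
    """Predicted execution time of a layer on a (sub)graph."""
    return float(np.dot(w, features))
```

Because the model is linear and the feature dimension is small, refitting after every iteration is cheap relative to a GNN training step.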
4.1 Cost Model

The key component of the ROC graph partitioner is a cost model that predicts the execution time of computing a GNN layer on an arbitrary graph, which could be the whole or any subset of an input graph. Note that the cost model learns to predict the execution time of a GNN layer instead of an entire GNN architecture for two reasons. First, ROC exploits the composability of neural network architectures, and the learned cost model can be directly applied to a variety of GNN architectures. Second, this approach allows ROC to gather much more training data in each training iteration. For a GNN architecture with N layers and P partitions, ROC collects N × P training data points, while modeling the entire GNN architecture only provides P data points.

As collecting new training data points is expensive, requiring measuring GNN computations on GPU devices, we employ a simple linear regression model to minimize the number of trainable parameters. Our model assumes that the cost to perform a DNN operation on a vertex is linear in a collection of vertex features, such as the number of neighbors, and that the cost to run an arbitrary graph is the summation of the cost of all its vertices.

We formalize the cost for running a GNN layer l on an input graph G as follows:

  t(l, v) = \sum_i w_i(l) x_i(v)   (4)
  t(l, G) = \sum_{v \in G} t(l, v) = \sum_{v \in G} \sum_i w_i(l) x_i(v)   (5)
          = \sum_i w_i(l) \sum_{v \in G} x_i(v) = \sum_i w_i(l) x_i(G)   (6)

where v denotes a vertex in the input graph G, w_i(l) is a trainable parameter for layer l, x_i(v) is the i-th feature of v, and x_i(G) sums up the i-th feature of all vertices in G.

Our model minimizes the mean square error over all available data points:

  Loss(l) = \frac{1}{N} \sum_{i=1}^{N} \big( t(l, G_i) - y(l, G_i) \big)^2   (7)

where N is the total number of available data points for the GNN layer l, and y(l, G_i) is the performance measurement for the i-th data point.

Table 2. The vertex features used in the current cost model. The semantics of the features are described in Section 4.1. WS is the number of GPU threads in a warp, which is 32 for the V100 GPUs used in the experiments.

  Feature   Definition                                Description
  x_1       1                                         the vertex itself
  x_2       |N(v)|                                    number of neighbors
  x_3       |C(v)|                                    continuity of neighbors
  x_4       \sum_i \lceil c_i(v) / WS \rceil          # mem. accesses to load neighbors
  x_5       \sum_i \lceil c_i(v) × d_in / WS \rceil   # mem. accesses to load the activations of all neighbors

Table 2 lists the vertex features used in the cost model; x_1(v) and x_2(v) capture the computation workload associated with vertex v and its edges, respectively. The remaining features estimate the required memory accesses to GPU device memory. Recall that when multiple threads in a GPU warp issue memory references to consecutive memory addresses, the GPU automatically coalesces these references into a single memory access that is handled more efficiently. To describe the continuity of a vertex's neighbors, we partition all neighbors of v as C(v) = {c_1(v), ..., c_{|C|}(v)}, where each c_i(v) is a range of consecutively numbered vertices. For example, for vertex v_1 with neighbors {v_3, v_4, v_6, v_8}, we have c_1(v_1) = {v_3, v_4}, c_2(v_1) = {v_6}, and c_3(v_1) = {v_8}. The feature x_3(v) is the number of consecutive blocks in v's neighbors, which is 3 in the example. In addition, x_4(v) and x_5(v) estimate the number of GPU memory accesses to load all neighbors and their input activations.

The cost model can be easily extended to include new features to capture additional model- and hardware-specific information if needed.

4.2 Partitioning Algorithm

Using the learned cost model, the ROC graph partitioner computes a graph partitioning that achieves balanced workload distribution under the cost model.

ROC uses the graph partitioning strategy proposed by Lux (Jia et al., 2017) to maximize coalesced accesses to GPU device memory, which is critical to achieving optimized GPU performance. Each vertex in a graph is assigned a unique number between 0 and V−1, where V is the number of vertices in the graph. In ROC, each partition holds consecutively numbered vertices, which allows us to use N−1 numbers {p_0, p_1, ..., p_{N−1}} to partition the graph into N subgraphs, where the i-th subgraph contains all vertices ranging from p_{i−1} to p_i − 1 and their in-edges.

ROC preprocesses an input graph by computing the partial sums of each vertex feature, which allows ROC to estimate the runtime performance of a subgraph in O(1) time. In addition, ROC uses binary search to find a splitting point p_i in O(log V), and therefore computing a balanced partitioning only takes O(N log V) time, where N and V are the number of partitions and input vertices, respectively.
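The prefix-sum-and-binary-search scheme can be sketched as follows. This simplified version assumes the per-vertex costs have already been predicted by the cost model, and `partition_balanced` is an illustrative name; prefix sums make any range's total cost an O(1) lookup, and binary search places each split point.

```python
import bisect

def partition_balanced(costs, num_parts):
    """Split consecutively numbered vertices 0..V-1 into num_parts
    ranges of roughly equal predicted cost.

    costs[v] is the cost model's estimate for vertex v. A prefix-sum
    array gives the cost of any vertex range in O(1); binary search
    finds each split point p_i in O(log V), so the whole partitioning
    takes O(num_parts * log V) time."""
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    total = prefix[-1]
    splits = []
    for i in range(1, num_parts):
        target = total * i / num_parts
        splits.append(bisect.bisect_left(prefix, target))  # split point p_i
    bounds = [0] + splits + [len(costs)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_parts)]
```

For example, with per-vertex costs [4, 1, 1, 1, 1] and two partitions, the first partition holds only the expensive vertex 0, so both halves carry a predicted cost of 4.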
5 MEMORY MANAGER

As discussed in Section 3, ROC performs all GNN computations on GPUs to optimize runtime performance, but only requires all the GNN data to fit in the host CPU DRAM to support large GNN architectures and input graphs. The device memory of each GPU therefore only needs to cache a subset of intermediate tensors, whose corresponding data transfers between CPU and GPU memories can be saved to reduce communication cost. How to select this subset of tensors to minimize the data transfers within the limited GPU memory is a critical memory management problem.

Figure 3. The computation graph of a toy 1-layer GIN architecture (Xu et al., 2019). A box represents an operation, and a circle represents a tensor. Arrows indicate dependencies between tensors and operations. The gather operation performs neighborhood aggregation. The linear and the following ReLU are fused into a single operation as a common optimization in existing frameworks. h^0 and g denote the input features and neighbors of all vertices, respectively. w_1 and w_2 are the weights of the two linear layers.

The optimal strategy depends not only on the GPU device memory capacity and the sizes of the input graph and GNN tensors, but also on the topology of the GNN architecture, which determines the reuse distance for each tensor.

The page replacement algorithms for memory management in operating systems (Aho et al., 1971) assume pages are all the same size and that pages are accessed sequentially. Neither assumption holds for GNN computations, since tensors generally have different sizes, and an operator may access multiple tensors simultaneously.

ROC formulates GPU memory management as a cost minimization problem: given an input graph, a GNN architecture, and a GPU device, find the subset of tensors to cache in the GPU memory that minimizes data transfers between the CPU and GPU. ROC introduces a dynamic programming algorithm to quickly find a globally optimal solution.

The key insight of the dynamic programming algorithm is that, at each stage of the computation, we only need to consider caching tensors that will be reused by future operations. For a GNN architecture G, we define a state S to be the set of operations that have already been performed in G. A state is valid only if the operations it contains preserve all the data dependencies in G, i.e., for any operation in S, all its predecessor operations in G must also be in S. Such a definition allows the valid states to capture all possible execution orderings of the operators in G. For each state S, we define its active tensors A(S) to be the set of tensors that were produced by the operations in S and will be consumed as inputs by the operations outside of S. Intuitively, A(S) captures all the tensors we can cache in the GPU to eliminate future data transfers at the stage S.

Figure 3 shows the computation graph of a toy 1-layer Graph Isomorphism Network (Xu et al., 2019), whose computation can be formalized as follows:

  h_v^{(1)} = W_2 \times \mathrm{RELU}\big(W_1 \times \sum_{u \in N(v)} h_u^{(0)}\big)   (8)

For this GNN architecture, all the valid states and their active tensors are listed in Table 3.

Table 3. All the valid states and their active tensors for the GNN architecture in Figure 3.

  Valid State S        Active Tensors A(S)
  {①}                  {g, a}
  {①,②}                {g, a, b, w_1}
  {①,②,③}              {g, a, b, h^1, w_1, w_2}
  {①,②,③,④}            {g, a, b, w_1, w_2, ∇L(h^1)}
  {①,②,③,④,⑤}          {g, a, b, w_1, ∇L(b)}
  {①,②,③,④,⑤,⑥}        {g, a, ∇L(a)}
  {①,②,③,④,⑤,⑥,⑦}      {}

Since the valid states represent all the possible execution orderings of the GNN, we can use dynamic programming to compute the optimal memory management strategy associated with each execution state. Algorithm 1 shows the pseudocode. COST(S, T) computes the minimum data transfers required to compute all the operations in a state S, with T being the set of tensors cached in the GPU memory; T should be a subset of A(S). We reduce the task of computing COST(S, T) to smaller tasks by enumerating the last operation to perform in S (Line 11). The cost is the specific data transfers to perform this last operation (xfer in Line 15) added to the cost of the corresponding previous state (S′, T′). To improve performance, we leverage memoization to only evaluate COST(S, T) once for each (S, T) pair.

Algorithm 1 A recursive dynamic programming algorithm for computing minimum data transfers. IN(o_i) and OUT(o_i) return the input and output tensors of the operation o_i, respectively, and size(T) returns the memory space required to save all tensors in T.
 1: Input: An input graph g, a GNN architecture G, and the GPU device memory capacity cap.
 2: Output: Minimum data transfers required to compute G on g within capacity cap.
 3: ⊲ D is a database storing all computed COST functions.
 4:
 5: function COST(S, T)
 6:   if (S, T) ∈ D then
 7:     return D(S, T)
 8:   if S is ∅ then
 9:     return size(T)
10:   cost ← ∞
11:   for o_i ∈ S do
12:     if (S \ o_i) is a valid state then
13:       S′ ← S \ o_i
14:       T′ ← (T \ OUT(o_i)) ∩ A(S′)
15:       xfer ← size(IN(o_i) \ T′)
16:       if size(T ∪ IN(o_i) ∪ OUT(o_i)) ≤ cap then
17:         cost ← min{cost, COST(S′, T′) + xfer}
18:   D(S, T) ← cost
19:   return D(S, T)

Time and space complexity. Overall, the time and space complexity of Algorithm 1 are O(S²T) and O(ST), respectively, where S is the number of possible execution states for a GNN architecture, and T is the maximum number of available tensor sets for a state. We observed that S and T are at most 16 and 4096 for all GNN architectures in our experiments, making it practical to use the dynamic programming algorithm to minimize data transfer cost.
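The memoized recursion can be illustrated with a small self-contained sketch. This is a simplified re-implementation in the spirit of Algorithm 1 on a hypothetical four-operation graph (not the GIN graph of Figure 3), and, unlike the paper's pseudocode, it explicitly enumerates candidate cache contents at each previous state; all operation names, tensor sizes, and the capacity are made-up example values.

```python
from functools import lru_cache

# Hypothetical computation graph: op -> (inputs, outputs); sizes in
# arbitrary units. Tensor x is reused late (by op4), so whether it can
# stay resident on the GPU is the interesting decision.
OPS   = {"op1": ({"x"}, {"a"}), "op2": ({"a"}, {"b"}),
         "op3": ({"a", "b"}, {"c"}), "op4": ({"x", "c"}, {"d"})}
DEPS  = {"op1": set(), "op2": {"op1"}, "op3": {"op1", "op2"}, "op4": {"op3"}}
SIZES = {"x": 4, "a": 3, "b": 3, "c": 2, "d": 1}
INPUTS, CAP = {"x"}, 8   # graph inputs start in CPU DRAM; GPU capacity

def size(ts):
    return sum(SIZES[t] for t in ts)

def active(state):
    """A(S): tensors available after S that a future operation consumes."""
    avail = INPUTS | {t for o in state for t in OPS[o][1]}
    needed = {t for o in OPS if o not in state for t in OPS[o][0]}
    return frozenset(avail & needed)

def subsets(ts):
    ts = sorted(ts)
    for mask in range(1 << len(ts)):
        yield frozenset(t for i, t in enumerate(ts) if mask >> i & 1)

@lru_cache(maxsize=None)
def cost(state, cached):
    """Minimum CPU-GPU transfers to run every op in `state`, ending with
    `cached` (a subset of A(state)) resident in GPU memory."""
    if not state:
        return size(cached)                   # initial tensors loaded once
    best = float("inf")
    for o in state:                           # enumerate the last operation
        prev = state - {o}
        if not DEPS[o] <= prev:
            continue                          # o could not have run last
        ins, outs = OPS[o]
        for t_prev in subsets(active(prev)):  # cache contents before o ran
            if not ((cached - outs) <= t_prev):
                continue                      # surviving tensors must persist
            if size(t_prev | ins | outs) > CAP:
                continue                      # o's working set must fit
            xfer = size(ins - t_prev)         # inputs not already resident
            best = min(best, cost(prev, t_prev) + xfer)
    return best
```

On this example, op3's working set {a, b, c} already fills the 8-unit capacity, so x cannot remain cached across op3 and must be reloaded for op4, giving a minimum of 8 transfer units (two loads of x); the memoization over (state, cached) pairs mirrors the D database of Algorithm 1.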
GIN is\r\n \u2032\u0001 provably the most expressive GNN architecture for the\r\n 15: xfer \u2190 size IN(oi) \\ T\r\n \u0001 Weisfeiler-Lehmangraphisomorphismtest(Xuetal.,2019).\r\n 16: if size T \u222a IN(oi) \u222a OUT(oi)\u2032 \u2264\u2032cap then CommNet consists of multiple cooperating agents that\r\n 17: cost = min{cost,COST(S ,T )+xfer} learn to communicate amongst themselves before taking\r\n 18: D(S,T)\u2190cost actions (Sukhbaatar et al., 2016).\r\n 19: return D(S,T )\r\n Datasets. We use four real-world graph datasets in our\r\n programming algorithm to minimize data transfer cost. evaluation, listed in Table 4. Pubmed is a citation network\r\n dataset (Sen et al., 2008), containing sparse bag-of-words\r\n feature vectors for each document (i.e., vertex), and cita-\r\n 6 IMPLEMENTATION tion links between documents (i.e., edges). PPI contains a\r\n numberofprotein-protein interaction graphs, each of which\r\n ROCisimplementedontopofFlexFlow(Jiaetal., 2019), a represents a human tissue (Hamilton et al., 2017). Reddit\r\n distributed multi-GPU runtime for high-performance DNN is a dataset for online discussion forum, with each node\r\n training. We extended FlexFlow in the following aspects being a post, and each edge being a comment between\r\n to support ef\ufb01cient GNN computations. First, we have re- posts (Hamilton et al., 2017). Amazon is the product dataset\r\n placed the equal partitioning strategy in FlexFlow with a from Amazon (He & McAuley, 2016). Each node is a\r\n \ufb01ne-grained partitioning interface that supports splitting ten- product, and each edge represents also-viewed information\r\n sors at arbitrary points. This extension is critical to ef\ufb01cient between products. The task is to categorize a product using\r\n partitioning for GNN computations. 
Second, we have added its description and also-viewed relations.\r\n a graph propagation engine to support neighborhood aggre- All experiments were performed on a GPU cluster with 4\r\n gation operations in GNNs, such as the gather operation computenodes,eachofwhichcontainstwoIntel10-coreE5-\r\n in Figure 3. We have reused the highly optimized CUDA 2600 CPUs, 256GB DRAM,andfourNVIDIATeslaP100\r\n kernels in Lux (Jia et al., 2017) to perform graph propaga- GPUs. GPUsonthesamenodeareconnectedwithNVLink,\r\n tion on GPUs. This allows ROC to directly bene\ufb01t from all and nodes are connected with 100Gb/s EDR In\ufb01niband.\r\n kernel-level optimizations in Lux.\r\n For each training experiment, the ROC graph partitioner\r\n 7 EVALUATION learned a new cost model by only using performance mea-\r\n surements obtained during the single experiment. For each\r\n In this section, we aim to evaluate the following points: inference experiment, the graph partitioner used the learned\r\n cost model from the training phase on the same dataset.\r\n \u2022 Can ROC achieve comparable runtime performance Unless otherwise stated, all experiments use the same train-\r\n compared to state-of-the-art GNN frameworks on a ing/validation/test splits as prior work (Hamilton et al., 2017;\r\n single GPU? Kipf & Welling, 2016; He & McAuley, 2016). All train-\r\n \u2022 Can ROC improve the end-to-end performance of dis- ing throughput and inference latency were measured by\r\n tributed GNN training and inference? 
averaging 1,000 iterations.\r\n ImprovingtheAccuracy,Scalability, and Performance of Graph Neural Networks with ROC\r\n TensorFlow DGL PyG Roc NeuGraph Roc\r\n 2.0\r\n 8\r\n 1.75\r\n 250\r\n 1.50\r\n 1.5\r\n 6\r\n 200\r\n 1.25\r\n 1.0\r\n 4\r\n 150 1.00\r\n 0.75\r\n 100\r\n 0.5\r\n 2\r\n 0.50\r\n 50\r\n 0.25 0 0.0\r\n 1(1) 2(1) 4(1) 8(2) 16(4) 1(1) 2(1) 4(1) 8(2) 16(4)\r\n 0 0.00\r\n Reddit Amazon\r\n GCN GIN CommNet GCN GIN CommNet\r\n Training Throughput (epochs/s) Training Throughput (epochs/s)\r\n Number of GPU devices\r\n Pubmed Reddit\r\n Figure 4. End-to-end training throughput comparison between ex- Figure 5. Training throughput comparison between NeuGraph and\r\n isting GNN frameworks and ROC on a single P100 GPU (higher is ROC using different numbers of GPUs (higher is better). Num-\r\n better). bers in parenthesis are the number of compute nodes used in the\r\n experiments.\r\n 7.2 Single-GPUResults\r\n First, we compare the end-to-end training performance of Figure 5 shows the results. For experiments on a single\r\n ROCwithexisting GNNframeworksonasingleGPU.Due compute node, ROC outperforms NeuGraph by up to 4\u00d7.\r\n to the small device memory on a single GPU, we limited The speedup is mainly because of the graph partitioning\r\n these experiments to graphs that can \ufb01t in a single GPU. and memorymanagementoptimizations that are not avail-\r\n able in NeuGraph. First, NeuGraph uses the equal vertex\r\n Figure 4 shows the results among TensorFlow (Abadi et al., partitioning strategy that equally distributes the vertices\r\n 2016), DGL (DGL, 2018), PyG (Fey & Lenssen, 2019), across multiple GPUs. Section 7.6 shows that the linear\r\n and ROC. Weexpected that ROC would be slightly slower regression-based graph partitioner in ROC improves train-\r\n than the other frameworks on a single GPU, since it writes ing throughput by up to 1.4\u00d7 compared to the equal vertex\r\n the output tensors of each operator back to CPU DRAM partitioning strategy. 
Second, NeuGraph uses a stream pro-\r\n for distributed computation, while other frameworks keep cessing approach that partitions each GNN operation into\r\n all tensors in a single GPU, and do not involve such data multiple chunks, and sequentially streams each chunk along\r\n transfers. However, for these graphs, ROC reuses cached with its input data to GPUs. Therefore, it does not consider\r\n tensors on the GPU to minimize data transfers from DRAM the memory management optimization used in ROC, and\r\n to GPU, and overlaps the data transfers back to DRAM with Section 7.7 shows that the ROC memory manager improves\r\n subsequent GNN computations. training throughput by up to 2\u00d7.\r\n TensorFlow, DGL, and PyG were not able to run the Reddit The remaining performance improvement is likely due to\r\n dataset due to out-of-device-memory errors. ROC can still otheraspectsof ROC,suchastheuseofthehighlyoptimized\r\n train Reddit on a single GPU, by using DRAM to save some CUDAkernelsinLuxforfast graph propagation, and the\r\n of the intermediate tensors. performance of the underlying Legion runtime (Bauer et al.,\r\n 2012). However, we were not able to further investigate\r\n 7.3 Multi-GPUResults the performance difference due the absence of a publicly\r\n Second, we compare the end-to-end training performance available implementation of NeuGraph.\r\n of ROC with NeuGraph. NeuGraph supports GNN training 7.4 ComparisonwithGraphSampling\r\n across multiple GPUs on a single compute node.\r\n ANeuGraphimplementation is not yet available publicly, Wecompare the training performance of ROC with state-\r\n so we ran ROC using the same GPU version and software of-the-art graph sampling approaches on the Reddit dataset.\r\n library versions cited in Ma et al. (2019) and directly com- All frameworks use the same GCN model (Kipf & Welling,\r\n pares with the performance numbers reported in the paper. 2016). 
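For concreteness, a single layer of that GCN model (neighbor aggregation with a normalized adjacency, followed by a dense transform) can be sketched in a few lines of NumPy. This is an illustrative toy on a made-up 4-vertex graph, not ROC's implementation; the graph, feature sizes, and helper names are all invented:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalized adjacency with self-loops,
    D^(-1/2) (A + I) D^(-1/2), as in Kipf & Welling (2016)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_norm, H, W, relu=True):
    """One GCN layer: neighbor aggregation (A_norm @ H), then a dense transform."""
    Z = A_norm @ H @ W
    return np.maximum(Z, 0.0) if relu else Z

# Full-batch forward pass over a toy 4-vertex path graph: one pass touches
# the entire graph, so a full-batch trainer updates parameters once per epoch.
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
H0 = rng.normal(size=(4, 8))                        # input vertex features
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 3))
A_norm = normalize_adj(A)
logits = gcn_layer(A_norm, gcn_layer(A_norm, H0, W1), W2, relu=False)
print(logits.shape)  # (4, 3): one 3-class score vector per vertex
```

Because the forward pass multiplies by the full normalized adjacency, one training step touches every vertex; this is the full-batch regime, in contrast to sampling mini-batches of neighborhoods as GraphSAGE and FastGCN do.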
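Full-batch training on the entire graph is also where the memory manager of Algorithm 1 matters: COST(S, T) decides which intermediate tensors stay resident in GPU memory. A minimal Python sketch of that memoized recursion on a toy two-operator pipeline; the operator names, tensor sizes, and helpers (OPS, PRED, avail) are invented for illustration and are not part of ROC:

```python
from functools import lru_cache

# Toy operator DAG: op -> (input tensors, output tensors); tensor -> size in units.
OPS = {"agg": ({"x"}, {"h"}), "mlp": ({"h", "w"}, {"y"})}
PRED = {"agg": set(), "mlp": {"agg"}}   # data dependencies between operators
SIZE = {"x": 4, "w": 1, "h": 2, "y": 2}
CAP = 7                                 # hypothetical GPU memory capacity

def size(ts):
    return sum(SIZE[t] for t in ts)

def avail(S):
    """A(S): graph inputs plus the outputs of already-performed operations."""
    return {"x", "w"} | {t for o in S for t in OPS[o][1]}

def valid(S):
    return all(PRED[o] <= S for o in S)

@lru_cache(maxsize=None)                # memoize COST(S, T) per (S, T) pair
def cost(S, T):
    S, T = set(S), set(T)
    if not S:
        return size(T)                  # transfer in the initially cached tensors
    best = float("inf")
    for o in S:                         # enumerate the last operation of state S
        ins, outs = OPS[o]
        S2 = S - {o}
        if not valid(S2):
            continue
        T2 = (T - outs) & avail(S2)
        xfer = size(ins - T2)           # inputs not cached must be transferred
        if size(T2 | ins | outs) <= CAP:
            best = min(best, cost(frozenset(S2), frozenset(T2)) + xfer)
    return best

print(cost(frozenset(OPS), frozenset()))        # 7: "h" evicted between operators
print(cost(frozenset(OPS), frozenset({"h"})))   # 5: keeping "h" cached saves transfers
```

The second call keeps the intermediate tensor h resident between the two operators and therefore transfers less data, which is exactly the trade-off the dynamic programming search optimizes under the capacity constraint.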
ROC performs full-batch training on the entire graph as in Kipf & Welling (2016), while GraphSAGE and FastGCN use mini-batch sampling with a batch size of 512. Figure 6 shows the time-to-accuracy comparison on a single P100 GPU, where the x-axis shows the end-to-end training time for each epoch, and the y-axis shows the test accuracy of the current model at the end of each epoch. For GraphSAGE and FastGCN, each dot indicates one training epoch, while for ROC each dot represents five training epochs for simplicity.

Figure 6. Time-to-accuracy comparison between state-of-the-art sampling techniques and ROC on the Reddit dataset (Hamilton et al., 2017). All experiments used the same GCN model. ROC performed full-batch training on the entire graph, while GraphSAGE and FastGCN performed mini-batch sampling. Each dot indicates one training epoch for GraphSAGE and FastGCN, and five epochs for ROC.

Note that GraphSAGE and FastGCN can achieve relatively high accuracy within a few training epochs. For example, GraphSAGE achieves 93.4% test accuracy in two epochs. However, ROC requires around 20 epochs to achieve the same test accuracy, because ROC uses full-batch training (following Kipf & Welling (2016)) and only updates parameters once per epoch, while existing sampling approaches generally perform mini-batch training and have more frequent parameter updates. Even though ROC uses more epochs, it is still as fast or faster than GraphSAGE and FastGCN to any given level of accuracy.

7.5 Deeper and Larger GNN Architectures

ROC enables the exploration of larger and more sophisticated GNN architectures than those possible in existing frameworks. As a demonstration, we consider a class of deep GNN architectures formed by stacking multiple GCN layers (Kipf & Welling, 2016). We add residual connections (He et al., 2016) between subsequent GCN layers to facilitate training of deeper GNN architectures by allowing each layer to preserve information learned in previous layers.

Formally, each layer of our GNN is defined as follows:

    H^(k+1) = GCN(H^(k)) + H^(k)       if d(H^(k+1)) = d(H^(k))
    H^(k+1) = GCN(H^(k)) + W H^(k)     if d(H^(k+1)) != d(H^(k))

where GCN is the original GCN layer (Kipf & Welling, 2016), and d(.) is the number of activations in the input tensor. When H^(k) and H^(k+1) have the same number of activations, we directly insert a residual connection between the two layers. When H^(k) and H^(k+1) have different numbers of activations, we use a linear layer W to transform H^(k) to the desired shape. This design allows us to add residual connections for all GCN layers.

We increase the depth (i.e., number of GCN layers) and width (i.e., number of activations per layer) to obtain larger and deeper GNN architectures beyond the commonly used 2-layer GNNs. Figure 7 shows the accuracy achieved by our GNN architectures on the Reddit dataset. The figure shows that improved accuracy can be obtained by increasing the depth and width of a GNN architecture. As a result, our GNN architectures achieve up to 96.9% test accuracy on the Reddit dataset, outperforming state-of-the-art sampling techniques by 1.5%.

Figure 7. Test accuracy on the Reddit dataset using deeper and larger GNN architectures. The x-axis is the number of activations per layer (16 to 512); separate curves show architectures with 2, 3, and 4 GCN layers. The dotted lines show the best test accuracy achieved by GraphSAGE (95.4%), FastGCN (93.7%), and the original GCN architecture (94.7%), respectively.

7.6 Graph Partitioning

To evaluate the linear regression-based graph partitioner in ROC, we compare the performance of the graph partitioning achieved by ROC with (1) equal vertex partitioning and (2) equal edge partitioning; (1) is used in NeuGraph to parallelize GNN training, and (2) has been widely used in previous graph processing systems. Figure 8 shows the training throughput comparison on different sets of GPUs. Neither of these baseline strategies performs as well as the ROC linear regression-based partitioner.

Figure 8. Training throughput comparison among different graph partitioning strategies on the Reddit dataset (higher is better). Numbers in parentheses are the number of compute nodes used.

To evaluate the distributed inference performance on new graphs not used during training, we used the PPI dataset containing 24 protein graphs. Following prior work (Hamilton et al., 2017), we trained the GIN architecture on 20 graphs, and measured the inference latency on the remaining four graphs, using the graph partitioner learned during training. Figure 9 shows that the learned cost model enables the graph partitioner to discover efficient partitioning on new graphs for inference services, reducing the inference latency by up to 1.2x. For the PPI graphs, distributed inference across multiple compute nodes achieves worse performance than inference on a single node, which is due to the small sizes of the inference graphs.

Figure 9. End-to-end inference time (ms) for the test graphs in the PPI dataset (lower is better). The numbers were measured by averaging the inference time of the four test graphs.

7.7 Memory Management

We evaluate the performance of the ROC memory manager by comparing it with (1) the stream processing approach in NeuGraph that streams input data along with computation (i.e., no caching optimization) and (2) the least-recently-used (LRU) cache replacement policy.

Figure 10 shows the comparison results for training GCN on the Reddit dataset on a single GPU. The dynamic programming-based memory manager reduces the data transfers between GPU and DRAM by 1.4-5x and reduces the per-epoch training time by 1.2-2x compared with the baseline memory management strategies.

Figure 10. Performance comparison among different memory management strategies (lower is better): (a) per-epoch data transfers (GB) and (b) per-epoch run time (s). All numbers are measured by training GCN on the Reddit dataset on a single GPU.

8 CONCLUSION

ROC is a distributed multi-GPU framework for high-performance and large-scale GNN training and inference. ROC partitions an input graph onto multiple GPUs on multiple compute nodes using an online-linear-regression-based strategy to achieve load balance, and coordinates optimized data transfers between GPU devices and host CPU memories with a dynamic programming algorithm. ROC increases performance by up to 4x over existing GNN frameworks, and offers better scalability. The ability to process larger graphs and GNN architectures additionally enables model accuracy improvements. We achieve new state-of-the-art classification accuracy on the Reddit dataset by using significantly deeper and larger GNN architectures.

ACKNOWLEDGEMENT

This work was supported by NSF grant CCF-1409813 and the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and is based on research sponsored by DARPA under agreement number FA84750-14-2-0006. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Cisco and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

Deep Graph Library: towards efficient and scalable deep learning on graphs. https://www.dgl.ai/, 2018.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI, 2016.

Aho, A. V., Denning, P. J., and Ullman, J. D. Principles of optimal page replacement. Journal of the ACM, 18(1):80-93, 1971.

Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.

Caffe2. A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai, 2016.

Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations, 2018.

Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI '12, 2012.

Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, 2014.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.

He, R. and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW '16. International World Wide Web Conferences Steering Committee, 2016.

Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M., and Aiken, A. A distributed multi-GPU system for fast graph processing. Proc. VLDB Endow., 11(3), November 2017.

Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning, SysML '19, 2019.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Ma, L., Yang, Z., Miao, Y., Xue, J., Wu, M., Zhou, L., and Dai, Y. NeuGraph: Parallel deep neural network computation on large graphs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, 2019.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, 2010.

PyTorch. Tensors and dynamic neural networks in Python with strong GPU acceleration. https://pytorch.org, 2017.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3), 2008.

Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.

Venkataraman, S., Bodzsar, E., Roy, I., AuYoung, A., and Schreiber, R. S. Presto: Distributed machine learning and graph processing with sparse matrices. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, 2013.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

Yang, H. AliGraph: A comprehensive graph neural network platform. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 2019. doi: 10.1145/3292500.3340404.

Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, pp. 974-983, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5552-0. doi: 10.1145/3219819.3219890.

Zhu, X., Chen, W., Zheng, W., and Ma, X. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 2016.