Systems for Graph Neural Networks

In-GPU sampling and training, distributed multi-GPU systems

Representation learning is a fundamental task in machine learning: it learns the features of data items automatically, typically using a deep neural network (DNN), rather than relying on hand-engineered features, which often perform worse. Graph data requires specific representation learning algorithms such as DeepWalk, node2vec, and GraphSAGE. These algorithms first sample the input graph and then train a DNN on the samples. Training is commonly done on GPUs, but graph sampling on GPUs is challenging. Sampling is an embarrassingly parallel task, since each sample can be generated independently; however, the irregularity of graphs makes it hard to use GPU resources effectively. Existing systems for graph processing, graph mining, and representation learning do not parallelize sampling effectively, which hurts the end-to-end performance of representation learning.
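
The sample-then-train pattern can be illustrated with a minimal Python sketch of GraphSAGE-style k-hop neighbor sampling over an adjacency list. The helper name sample_neighborhood and the toy graph are our own illustration, not part of any of the systems discussed here.

    import random

    def sample_neighborhood(adj, seed, fanouts):
        """k-hop neighbor sampling: at each hop, keep at most `fanout` randomly
        chosen neighbors of every vertex in the current frontier."""
        frontier, sampled_edges = [seed], []
        for fanout in fanouts:
            next_frontier = []
            for v in frontier:
                picked = random.sample(adj[v], min(fanout, len(adj[v])))
                sampled_edges.extend((v, u) for u in picked)
                next_frontier.extend(picked)
            frontier = next_frontier
        return sampled_edges

    # Toy adjacency list; each sample (one call per seed vertex) is independent,
    # which is why sampling is embarrassingly parallel in principle.
    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    print(sample_neighborhood(adj, seed=0, fanouts=[2, 2]))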

We developed NextDoor, the first system specifically designed to perform graph sampling on GPUs. NextDoor introduces a high-level API based on a novel paradigm for parallel graph sampling called transit-parallelism. We implemented several graph sampling applications on top of NextDoor and showed that it runs them orders of magnitude faster than existing systems.
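
As a rough intuition for transit-parallelism, the CPU-side sketch below (our own simplification, not NextDoor's API or its GPU kernels) organizes a sampling step around the transit vertices being expanded: all expansions that share a transit also share its adjacency-list access, which on a GPU enables coalesced reads and caching of the neighbor list.

    import random
    from collections import defaultdict

    def transit_parallel_step(adj, expansions):
        """One sampling step organized transit-parallel: pending expansions are
        grouped by their transit vertex, so every expansion of the same transit
        reuses a single adjacency-list access (on a GPU, one thread block would
        handle one transit and cache its neighbor list in shared memory)."""
        by_transit = defaultdict(list)
        for sample_id, transit in expansions:
            by_transit[transit].append(sample_id)

        next_expansions = []
        for transit, sample_ids in by_transit.items():
            neighbors = adj[transit]                 # fetched once per transit vertex
            for sample_id in sample_ids:
                next_expansions.append((sample_id, random.choice(neighbors)))
        return next_expansions

    # Four independent random walks, all currently at vertex 0: sample-parallelism
    # would expand them separately; transit-parallelism groups them by transit.
    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
    print(transit_parallel_step(adj, [(i, 0) for i in range(4)]))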

Our follow-up work developed split parallelism, a novel hybrid parallelism strategy for mini-batch GNN training. The traditional data-parallel approach poses inherent limits to scalable GNN training because it introduces redundant data loading and computation. Rather than proposing incremental patches to that paradigm, as previous work has done, we propose a fundamental shift to a new training paradigm called split parallelism. The main idea is to split the computation of each mini-batch across multiple GPUs, which cooperatively perform each training iteration. Split parallelism scales the cache size as the number of GPUs grows and maximizes cache access locality, which yields much better performance even for smaller models. We also proposed probabilistic splitting algorithms to balance the load across splits and minimize communication cost. We implemented split parallelism in a system called GSplit.
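
At a high level, the splitting idea can be sketched as follows. This is our own simplification: the owner function below is a placeholder hash partition, whereas GSplit's probabilistic splitting algorithms are designed to balance the load and reduce cross-GPU communication.

    def split_minibatch(sampled_edges, num_gpus, owner):
        """Split-parallel assignment sketch: route each sampled edge to the GPU
        that owns (and caches the features of) its source vertex, so each GPU
        computes only its split of the mini-batch and the GPUs cooperate to
        complete the iteration."""
        splits = [[] for _ in range(num_gpus)]
        for src, dst in sampled_edges:
            splits[owner(src)].append((src, dst))
        return splits

    # `owner` is a hypothetical hash partition, used here only for illustration.
    edges = [(0, 1), (2, 3), (4, 1), (5, 3)]
    print(split_minibatch(edges, num_gpus=2, owner=lambda v: v % 2))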

Finally, we performed an extensive experimental comparison of two common methods for training GNNs: mini-batch training and full-graph training. Since these two methods require different training pipelines and system optimizations, two separate classes of GNN training systems have emerged, each tailored to one method. We provided a comprehensive empirical comparison of representative full-graph and mini-batch GNN training systems. We found that mini-batch training systems consistently converge faster than full-graph training systems across multiple datasets, GNN models, and system configurations. We also found that mini-batch training converges to accuracy values similar to, and often higher than, those of full-graph training, showing that mini-batch sampling is not necessarily detrimental to accuracy.
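
To make the structural difference between the two methods concrete, the following self-contained toy sketch contrasts one full-graph gradient step per epoch with mini-batch steps over sampled neighborhoods. The one-layer "GNN" (mean neighbor aggregation plus logistic regression) and all names in it are our own placeholders, not the pipelines of the systems compared in the paper.

    import math, random

    # Toy graph, 2-D node features, and binary labels.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    feats = {0: (1.0, 0.2), 1: (0.9, 0.1), 2: (0.2, 0.8), 3: (0.1, 0.9)}
    labels = {0: 0, 1: 0, 2: 1, 3: 1}

    def grad(w, nodes, neighborhood):
        """Average logistic-loss gradient over `nodes`, aggregating each node's
        features from the given (full or sampled) neighborhood."""
        g = [0.0, 0.0]
        for v in nodes:
            xs = [feats[u] for u in neighborhood[v]]
            h = [sum(x[i] for x in xs) / len(xs) for i in range(2)]   # mean aggregation
            p = 1.0 / (1.0 + math.exp(-(w[0] * h[0] + w[1] * h[1])))  # sigmoid
            e = p - labels[v]
            g = [g[0] + e * h[0], g[1] + e * h[1]]
        return [g[0] / len(nodes), g[1] / len(nodes)]

    # Full-graph training: one gradient step per epoch, over all nodes and
    # their full neighborhoods.
    w_full = [0.0, 0.0]
    for _ in range(50):
        g = grad(w_full, list(adj), adj)
        w_full = [w_full[0] - 0.5 * g[0], w_full[1] - 0.5 * g[1]]

    # Mini-batch training: one gradient step per batch of sampled seed nodes,
    # each with a fanout-limited sampled neighborhood.
    w_mini = [0.0, 0.0]
    for _ in range(50):
        seeds = random.sample(list(adj), 2)
        sampled = {v: random.sample(adj[v], min(2, len(adj[v]))) for v in seeds}
        g = grad(w_mini, seeds, sampled)
        w_mini = [w_mini[0] - 0.5 * g[0], w_mini[1] - 0.5 * g[1]]

    print("full-graph weights:", w_full)
    print("mini-batch weights:", w_mini)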

References

2025

  1. Graph Neural Network Training Systems: A Performance Comparison of Full-Graph and Mini-Batch
    Saurabh Bajaj, Hojae Son, Juelin Liu, Hui Guan, and Marco Serafini
    Proceedings of the VLDB Endowment, 2025
  2. GSplit: Scaling Graph Neural Network Training on Large Graphs via Probabilistic Splitting
    Sandeep Polisetty, Juelin Liu, Kobi Falus, Yi Ren Fung, Seung-Hwan Lim, Hui Guan, and Marco Serafini
    In Proceedings of MLSys, 2025

2021

  1. Accelerating Graph Sampling for Graph Machine Learning Using GPUs
    Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Serafini
    In Proceedings of the 16th ACM European Conference on Computer Systems (EuroSys), 2021
  2. Scalable Graph Neural Network Training: The Case for Sampling
    Marco Serafini and Hui Guan
    ACM SIGOPS Operating Systems Review (OSR), 2021