Distributed Graph Neural Network Training: A Survey
ACM Computing Surveys (IF 16.6), Pub Date: 2024-04-10, DOI: 10.1145/3648358
Yingxia Shao, Hongzheng Li, Xizhi Gu, Hongbo Yin, Yawen Li, Xupeng Miao, Wentao Zhang, Bin Cui, Lei Chen

Graph neural networks (GNNs) are a class of deep learning models that are trained on graphs and have been successfully applied in various domains. Despite their effectiveness, it is still challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing has become a promising solution for training large-scale GNNs, since it can provide abundant computing resources. However, the dependencies imposed by the graph structure make high-efficiency distributed GNN training difficult to achieve, as it suffers from massive communication and workload imbalance. In recent years, much effort has been devoted to distributed GNN training, and an array of training algorithms and systems have been proposed. Yet, there is a lack of a systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, the loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques in distributed GNN training that address these challenges. The taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We discuss the techniques in each category in detail. Finally, we summarize existing distributed GNN systems for multiple graphics processing units (multi-GPU), GPU clusters, and central processing unit (CPU) clusters, respectively, and discuss future directions for distributed GNN training.
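To make the "massive feature communication" challenge concrete, the following is a minimal, self-contained Python sketch (not taken from the survey; all names, sizes, and the hash-partition scheme are illustrative assumptions). It samples a mini-batch of seed nodes on one partition of a random graph and measures what fraction of the 1-hop neighbor features would have to be fetched from remote partitions:

    # Hypothetical sketch: estimate remote-feature traffic in mini-batch
    # GNN training over a naively hash-partitioned graph.
    import random

    NUM_NODES, NUM_EDGES, NUM_PARTS, BATCH = 10_000, 50_000, 4, 512
    random.seed(0)

    # Build a random directed graph as an adjacency map.
    adj = {}
    for _ in range(NUM_EDGES):
        u, v = random.randrange(NUM_NODES), random.randrange(NUM_NODES)
        adj.setdefault(u, []).append(v)

    def part(v: int) -> int:
        """Naive hash partition: assign node v's feature to a machine."""
        return v % NUM_PARTS

    def remote_feature_ratio(local_part: int) -> float:
        """Sample a batch of local seed nodes and return the fraction of
        1-hop neighbor features that reside on other partitions."""
        local_nodes = [v for v in range(NUM_NODES) if part(v) == local_part]
        seeds = random.sample(local_nodes, BATCH)
        neighbors = {n for s in seeds for n in adj.get(s, [])}
        remote = sum(1 for n in neighbors if part(n) != local_part)
        return remote / max(len(neighbors), 1)

    print(f"remote neighbor features: {remote_feature_ratio(0):.1%}")

Under a random hash partition, roughly (P-1)/P of the sampled neighbor features are remote (about 75% here with P = 4 partitions), so nearly every mini-batch triggers cross-machine feature transfers. This is precisely why locality-aware GNN data partition, batch generation, and communication protocols form major categories in the survey's taxonomy.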

Updated: 2024-04-10