xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning
Journal of Computer Science and Technology (IF 1.9), Pub Date: 2023-03-31, DOI: 10.1007/s11390-023-2894-6
Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, Xiaoyi Lu

Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance using industry-made benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.
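The gradient sharing the abstract refers to is typically realized as an allreduce collective. As background for the libraries surveyed, below is a minimal, framework-free sketch of the ring allreduce pattern that NCCL popularized for deep learning, simulated with plain Python lists standing in for per-worker buffers; the function name and sequential-loop simulation are our own illustration, not any xCCL's actual API.

```python
# Illustrative sketch (not any library's real API): the ring allreduce
# algorithm, simulated with Python lists standing in for per-worker buffers.

def ring_allreduce(grads):
    """Element-wise sum of equal-length gradient vectors across a ring of
    simulated workers: reduce-scatter, then allgather. Assumes the vector
    length is divisible by the worker count."""
    p = len(grads)                   # number of workers in the ring
    n = len(grads[0])                # gradient vector length
    assert n % p == 0, "length must divide evenly into p chunks"
    chunk = n // p
    bufs = [list(g) for g in grads]  # each worker's local buffer

    # Phase 1: reduce-scatter. In step s, worker r sends its running sum
    # for chunk (r - s) mod p to neighbor (r + 1) mod p, which adds it in.
    # After p - 1 steps, worker r holds the fully reduced chunk (r + 1) mod p.
    for step in range(p - 1):
        for r in range(p):
            c = (r - step) % p
            dst = (r + 1) % p
            lo = c * chunk
            for i in range(lo, lo + chunk):
                bufs[dst][i] += bufs[r][i]

    # Phase 2: allgather. In step s, worker r forwards the already-reduced
    # chunk (r + 1 - s) mod p around the ring; receivers overwrite.
    for step in range(p - 1):
        for r in range(p):
            c = (r + 1 - step) % p
            dst = (r + 1) % p
            lo = c * chunk
            for i in range(lo, lo + chunk):
                bufs[dst][i] = bufs[r][i]

    return bufs                      # every worker now holds the full sum
```

Each worker exchanges only n/p elements per step, so the total traffic per worker is 2(p-1)n/p elements, roughly 2n regardless of ring size; this bandwidth optimality is why ring allreduce is the workhorse for gradient averaging in the surveyed libraries.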




Updated: 2023-04-01