Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method
International Journal of Parallel Programming (IF 1.5), Pub Date: 2023-11-13, DOI: 10.1007/s10766-023-00759-4
Yingpeng Wen , Zhilin Qiu , Dongyu Zhang , Dan Huang , Nong Xiao , Liang Lin

In recent years, deep learning models have been successfully applied to large-scale data analysis, including image classification, video captioning, and natural language processing. Large-scale data analysis leverages parallel computing to accelerate model training, and data parallelism has become the dominant approach for training deep learning models due to its high throughput. Synchronous stochastic gradient descent is a well-recognized optimization method that guarantees model convergence, but the overhead of gradient synchronization grows linearly with the number of workers, wasting substantial time. Although some efficiency-first asynchronous methods have been proposed, they cannot guarantee convergence in large-scale distributed training. To address this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the synchronization of the new gradient proceeds concurrently, overlapping computation with communication. Naively, this would disturb the normal convergence of the model, so we propose a novel adaptive exponential smoothing predicted gradient algorithm for model optimization, which adaptively adjusts the confidence coefficient of the history gradient to preserve the normal convergence of the training process. Experiments show that our method speeds up training and achieves accuracy comparable to standard synchronous SGD. Moreover, our method exhibits better weak scalability than traditional synchronous SGD and previous related work. We apply our method to image recognition and video captioning applications on up to 12,288 cores on Tianhe II with strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability on 128 nodes, achieving 93.4% weak scaling efficiency on 64 nodes and 90.5% on 128 nodes.
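The core idea of the abstract — updating with the previous (stale) gradient while the fresh gradient's synchronization overlaps with computation, and compensating for the staleness with an exponentially smoothed "predicted" gradient — can be sketched on a toy scalar problem. This is a minimal illustration under stated assumptions: the function name `train_pseudo_sync`, the quadratic objective, and the fixed smoothing coefficient `beta` are all illustrative; the paper's algorithm adapts the confidence coefficient per step and runs across distributed workers.

```python
def train_pseudo_sync(steps=200, lr=0.1, beta=0.5):
    """Minimize f(w) = 0.5 * w^2 using a one-step-stale gradient.

    The stale gradient stands in for the gradient whose all-reduce
    is still in flight while the current step computes. An
    exponentially smoothed predicted gradient compensates for the
    staleness. beta is the confidence placed in the history gradient
    (fixed here for illustration; adaptive in the paper).
    """
    w = 5.0              # initial parameter
    g_stale = w          # gradient from the "previous" step (grad f = w)
    g_smooth = g_stale   # exponentially smoothed history gradient
    for _ in range(steps):
        # Predicted gradient: blend history with the stale gradient.
        g_pred = beta * g_smooth + (1 - beta) * g_stale
        # Update immediately with the prediction instead of waiting
        # for the new gradient's synchronization to finish.
        w -= lr * g_pred
        g_smooth = g_pred
        # The freshly synchronized gradient "arrives" for the next step.
        g_stale = w
    return w

print(abs(train_pseudo_sync()) < 1e-3)  # converges despite staleness
```

Despite every update using a gradient that is one step out of date, the smoothed prediction keeps the iterates converging toward the optimum, which is the behavior the adaptive confidence coefficient is designed to preserve at scale.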




Updated: 2023-11-14