
Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method

International Journal of Parallel Programming

Abstract

In recent years, deep learning models have been successfully applied to large-scale data analysis tasks such as image classification, video captioning, and natural language processing. These workloads rely on parallel computing to accelerate model training, and data parallelism has become the dominant approach because of its high throughput. Synchronous stochastic gradient descent (SGD) is the well-established optimization method for ensuring model convergence, but its gradient-synchronization overhead grows linearly with the number of workers, wasting a substantial amount of time. Efficiency-first asynchronous methods have been proposed, but they cannot guarantee convergence in large-scale distributed training. To address this problem, we propose an efficient pseudo-synchronous approach that updates the network with the previous gradient while the new gradient is being synchronized, thereby overlapping computation and communication. Because stale gradients can disturb convergence, we further propose a novel adaptive exponential-smoothing predicted-gradient algorithm that adaptively adjusts the confidence coefficient of the historical gradient to keep training convergent. Experiments show that our method speeds up training while achieving accuracy comparable to standard synchronous SGD, and that it offers better weak scalability than traditional synchronous SGD and previous related work. We apply our method to image recognition and video captioning on up to 12,288 cores of Tianhe II, demonstrating strong scalability. Evaluations show that, when configured appropriately, our method attains near-linear scalability on 128 nodes, reaching 93.4% weak-scaling efficiency on 64 nodes and 90.5% on 128 nodes.
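To make the idea described above concrete, the following is a minimal, single-process Python sketch of a pseudo-synchronous update with an exponentially smoothed predicted gradient: the weights are updated with a prediction built from previously synchronized gradients, so that in a real distributed run the all-reduce of the current gradient could overlap with the next forward/backward pass. All names (grad, pred, beta, in_flight), the toy least-squares model, and the adaptive rule for beta are assumptions for illustration; they are not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for a deep model.
A = rng.normal(size=(256, 32))
b = rng.normal(size=256)
w = np.zeros(32)

def grad(w, idx):
    # Mini-batch gradient of the squared loss on rows `idx`.
    Ai, bi = A[idx], b[idx]
    return 2.0 * Ai.T @ (Ai @ w - bi) / len(idx)

lr, beta = 0.05, 0.5          # beta: confidence coefficient of the history gradient
pred = np.zeros_like(w)       # exponentially smoothed "predicted" gradient
in_flight = None              # gradient whose synchronization is still pending

for step in range(300):
    idx = rng.choice(len(b), size=32, replace=False)
    g = grad(w, idx)          # local mini-batch gradient

    # Update immediately with the prediction instead of waiting for g's all-reduce.
    w -= lr * pred

    if in_flight is not None:
        # The previous gradient has now "arrived": adapt beta based on how well
        # the prediction agreed with it, then refresh the prediction.
        agree = np.dot(pred, in_flight) / (
            np.linalg.norm(pred) * np.linalg.norm(in_flight) + 1e-12)
        beta = float(np.clip(beta + 0.05 * agree, 0.1, 0.9))
        pred = beta * pred + (1.0 - beta) * in_flight
    in_flight = g

print("final loss:", np.mean((A @ w - b) ** 2))
```

In an actual data-parallel run, the line that stores in_flight would instead launch a non-blocking all-reduce of g, and the fold-in step at the start of the next iteration would wait on that handle, which is what lets gradient communication hide behind computation.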


Notes

  1. SGD in the following refers to SGD with momentum (see the update rule sketched below).
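For reference, the standard SGD-with-momentum update is shown below; the notation (velocity v_t, momentum coefficient \mu, learning rate \eta, stochastic gradient g_t) is chosen here for illustration and is not taken from the paper.

```latex
% SGD with momentum (notation assumed for illustration):
% v_t is the velocity, \mu the momentum coefficient, \eta the learning rate,
% and g_t = \nabla_w L(w_t) the stochastic gradient at step t.
v_{t+1} = \mu\, v_t + g_t, \qquad w_{t+1} = w_t - \eta\, v_{t+1}
```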


Acknowledgements

This research was supported by the Natural Science Foundation of China under Grant No. U1811464, in part by the Guangdong Natural Science Foundation under Grant No. 2018B030312002, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant No. 2016ZT06D211, and in part by the CCF-Baidu Open Fund under Grant No. 2021032.

Author information

Authors and Affiliations

Authors

Contributions

Investigation, software, writing of the original draft, and editing were performed by YW. Conceptualization and methodology were performed by YW and ZQ. Review and editing of the manuscript were performed by all authors.

Corresponding author

Correspondence to Nong Xiao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wen, Y., Qiu, Z., Zhang, D. et al. Accelerating Massively Distributed Deep Learning Through Efficient Pseudo-Synchronous Update Method. Int J Parallel Prog (2023). https://doi.org/10.1007/s10766-023-00759-4


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10766-023-00759-4

Keywords
