
On Stochastic Roundoff Errors in Gradient Descent with Low-Precision Computation

Journal of Optimization Theory and Applications

Abstract

When implementing the gradient descent method in low precision, employing stochastic rounding schemes helps prevent stagnation of convergence caused by the vanishing gradient effect. Unbiased stochastic rounding yields zero bias by preserving small updates with probabilities proportional to their relative magnitudes. This study provides a theoretical explanation for the stagnation of the gradient descent method in low-precision computation. Additionally, we propose two new stochastic rounding schemes that trade the zero-bias property for a larger probability of preserving small gradients. Our methods yield a constant rounding bias that, on average, lies in a descent direction. For convex problems, we prove that the proposed rounding methods typically have a beneficial effect on the convergence rate of gradient descent. We validate our theoretical analysis by comparing the performance of various rounding schemes when optimizing a multinomial logistic regression model and when training a simple neural network with an 8-bit floating-point format.
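
To make the rounding schemes concrete, here is a minimal sketch in Python (not the authors' MATLAB implementation, which is available upon request; see Notes). It contrasts round-to-nearest with unbiased stochastic rounding on a coarse uniform grid. The grid spacing eps, the quadratic objective, the learning rate, and the function names are illustrative assumptions; the uniform grid stands in for the 8-bit floating-point format used in the experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    def round_nearest(x, eps):
        """Deterministic round-to-nearest on a uniform grid of spacing eps."""
        return eps * np.round(x / eps)

    def round_stochastic(x, eps):
        """Unbiased stochastic rounding of a scalar x: round up with
        probability (x - lo) / eps, where lo is the grid point below x.
        Then E[round(x)] = x, so even an update smaller than eps has a
        nonzero chance of being preserved."""
        lo = eps * np.floor(x / eps)
        return lo + eps * (rng.random() < (x - lo) / eps)

    # Toy stagnation experiment: minimize f(w) = w^2 / 2 by gradient
    # descent, w <- w - lr * w, storing the iterate on the coarse grid.
    eps, lr = 2.0 ** -4, 0.05
    w_rn = w_sr = 1.0
    for _ in range(200):
        w_rn = round_nearest(w_rn - lr * w_rn, eps)
        w_sr = round_stochastic(w_sr - lr * w_sr, eps)

    # Round-to-nearest stalls once the step lr * w drops below eps / 2,
    # so w_rn stops far from the minimizer 0; stochastic rounding keeps
    # making progress in expectation and typically reaches 0.
    print(w_rn, w_sr)

With round-to-nearest, any update smaller than half the grid spacing is rounded away and the iterate freezes; unbiased stochastic rounding instead preserves such an update with probability proportional to its relative size, so the iterate keeps moving in expectation. This is the behavior that the paper's analysis formalizes for low-precision floating-point arithmetic.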


Notes

  1. The MATLAB code is available upon request.


Acknowledgements

We thank the reviewers for their constructive comments and the editor for handling this paper. This research was funded by the EU ECSEL Joint Undertaking under Grant agreement No. 826452 (project Arrowhead Tools).

Author information

Correspondence to Lu Xia.

Additional information

Communicated by Olivier Fercoq.


About this article


Cite this article

Xia, L., Massei, S., Hochstenbach, M.E. et al. On Stochastic Roundoff Errors in Gradient Descent with Low-Precision Computation. J Optim Theory Appl 200, 634–668 (2024). https://doi.org/10.1007/s10957-023-02345-7

