Abstract
We study the training of deep neural networks by gradient descent when floating-point arithmetic is used to compute the gradients. In this framework and under realistic assumptions, we demonstrate that it is highly unlikely to find ReLU neural networks that maintain, in the course of training with gradient descent, superlinearly many affine pieces with respect to their number of layers. In virtually all approximation-theoretic arguments that yield high-order polynomial rates of approximation, sequences of ReLU neural networks with exponentially many affine pieces in their number of layers are used. As a consequence, we conclude that approximating sequences of ReLU neural networks resulting from gradient descent in practice differ substantially from theoretically constructed sequences. The assumptions and the theoretical results are compared to a numerical study, which yields concurring results.
References
Arridge, S., Maass, P., Öktem, O., Schönlieb, C.-B.: Solving inverse problems using data-driven models. Acta Numer. 28, 1–174 (2019)
Bachmayr, M., Kazeev, V.: Stability and preconditioning of elliptic PDEs with low-rank multilevel structure. Found. Comput. Math. 20, 1175–1236 (2020)
Bhattacharya, K., Hosseini, B., Kovachki, N.B., Stuart, A.M.: Model reduction and neural networks for parametric PDEs. arXiv:2005.03180 (2020)
Boche, H., Fono, A., Kutyniok, G.: Limitations of deep learning for inverse problems on digital hardware. arXiv:2202.13490 (2022)
Bölcskei, H., Grohs, P., Kutyniok, G., Petersen, P.C.: Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 8–45 (2019)
Colbrook, M.J., Antun, V., Hansen, A.C.: The difficulty of computing stable and accurate neural networks: on the barriers of deep learning and Smale’s 18th problem. Proc. Natl. Acad. Sci. 119(12), e2107151119 (2022)
E, W., Yu, B.: The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6(1), 1–12 (2018)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
Grohs, P., Hornung, F., Jentzen, A., Von Wurstemberger, P.: A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. Memoirs of the American Mathematical Society (2020)
Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning, pp. 1737–1746. PMLR (2015)
Gühring, I., Kutyniok, G., Petersen, P.C.: Error bounds for approximations with deep ReLU neural networks in \(W^{s,p}\)-norms. arXiv:1902.07896 (2019)
Han, J., Jentzen, A., E, W.: Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. 115(34), 8505–8510 (2018)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv:1510.00149 (2015)
Hanin, B., Rolnick, D.: Complexity of linear regions in deep networks. In: International Conference on Machine Learning, pp. 2596–2604. PMLR, (2019)
He, J., Li, L., Xu, J., Zheng, C.: ReLU deep neural networks and linear finite elements. arXiv:1807.03973 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. (2015)
Higham, N.J.: Accuracy and stability of numerical algorithms. Society for Industrial and Applied Mathematics, second edition, (2002)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. J. Mach. Learn. Res. 18(1), 6869–6898 (2017)
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. (2018)
Jin, K.H., McCann, M.T., Froustey, E., Unser, M.: Deep convolutional neural network for inverse problems in imaging. IEEE Trans. Image. Process. 26(9), 4509–4522 (2017)
Kazeev, V., Oseledets, I., Rakhuba, M., Schwab, C.: QTT-finite-element approximation for multiscale problems I: model problems in one dimension. Adv. Comput. Math. 43(2), 411–442 (2017)
Kazeev, V., Oseledets, I., Rakhuba, M.V., Schwab, C.: Quantized tensor FEM for multiscale problems: diffusion problems in two and three dimensions. Multiscale Model. Simul. 20(3), 893–935 (2022)
Kazeev, V., Schwab, C.: Quantized tensor-structured finite elements for second-order elliptic PDEs in two dimensions. Numer. Math. 138, 133–190 (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Adv. Neural Inf. Process. Syst. 25, pp. 1097–1105. Curran Associates, Inc. (2012)
Kutyniok, G., Petersen, P., Raslan, M., Schneider, R.: A theoretical analysis of deep neural networks and parametric PDEs. Constr. Approx. 55(1), 73–125 (2022)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, H., Schwab, J., Antholzer, S., Haltmeier, M.: NETT: solving inverse problems with deep neural networks. Inverse Probl. 36(6), 065005 (2020)
Li, Z., Ma, Y., Vajiac, C., Zhang, Y.: Exploration of numerical precision in deep neural networks. arXiv:1805.01078 (2018)
Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E.: Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3(3), 218–229 (2021)
Marcati, C., Opschoor, J.A., Petersen, P.C., Schwab, C.: Exponential ReLU neural network approximation rates for point and edge singularities. arXiv:2010.12217 (2020)
Marcati, C., Rakhuba, M., Schwab, C.: Tensor rank bounds for point singularities in \(\mathbb{R} ^3\). Adv. Comput. Math. 48(3), 17–57 (2022)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovsky, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, (1983)
Ongie, G., Jalal, A., Metzler, C.A., Baraniuk, R.G., Dimakis, A.G., Willett, R.: Deep learning techniques for inverse problems in imaging. IEEE J. Sel. Areas Inf. Theory 1(1), 39–56 (2020)
Opschoor, J., Petersen, P., Schwab, C.: Deep ReLU networks and high-order finite element methods. Anal. Appl. 18, 12 (2019)
Petersen, P., Raslan, M., Voigtlaender, F.: Topological properties of the set of functions generated by neural networks of fixed size. Found. Comput. Math. 21(2), 375–444 (2021)
Petersen, P., Voigtlaender, F.: Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108, 296–330 (2018)
Pineda, A.F.L., Petersen, P.C.: Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems. arXiv:2206.00934 (2022)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019)
Schwab, C., Zech, J.: Deep learning in high dimension: neural network expression rates for generalized polynomial chaos expansions in UQ. Anal. Appl. 17(01), 19–55 (2019)
Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A.: Improved protein structure prediction using potentials from deep learning. Nature 577(7792), 706–710 (2020)
Shaham, U., Cloninger, A., Coifman, R.R.: Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal. 44(3), 537–557 (2018)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, (2014)
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
Sun, Y., Lao, D., Sundaramoorthi, G., Yezzi, A.: Surprising instabilities in training deep networks and a theoretical analysis. arXiv:2206.02001 (2022)
Telgarsky, M.: Representation benefits of deep feedforward networks. arXiv:1509.08101 (2015)
Yarotsky, D.: Error bounds for approximations with deep ReLU networks. Neural Netw. 94, 103–114 (2017)
Yarotsky, D.: Optimal approximation of continuous functions by very deep ReLU networks. In: Conference on Learning Theory, pp. 639–649. PMLR, (2018)
Zhou, S.-C., Wang, Y.-Z., Wen, H., He, Q.-Y., Zou, Y.-H.: Balanced quantization: an effective and efficient approach to quantized neural networks. J. Comput. Sci. Technol. 32(4), 667–682 (2017)
Funding
Open access funding provided by University of Vienna. VK was partially supported by the Austrian Science Fund (FWF) under project F65 “Taming Complexity in Partial Differential Systems”.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by: Siddhartha Mishra
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A.1 Example of an unstable neural network
Here, we introduce the notion of a finite precision realisation of a given neural network \(\Phi \) and give an example of how numerical instability can affect such NNs. We assume the floating-point addition and multiplication to be defined according to (1.2), where the minima are well defined since \(\mathbb {M}\) has no accumulation points. Further, we consider the floating-point realisation of matrix-vector multiplication that is given by
for any matrix \(A \in \mathbb {R}^{m\times n}\) and any vector \(b \in \mathbb {R}^n\) with \(m,n\in \mathbb {N}\), where \(j = 1, \dots , m\) and the summation is performed from left to right. Next, we define the finite precision realisation of a neural network.
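The rounded, left-to-right matrix-vector product described above can be emulated in software. The following Python sketch is our own illustration (not code from the article): half precision plays the role of \(\mathbb {M}\), and every elementary multiplication and addition is rounded before the next operation is performed.

```python
import numpy as np

def fl(x, dtype=np.float16):
    # Round to the nearest element of the floating-point set; here the set M is
    # modelled by IEEE half precision, an assumption made only for illustration.
    return dtype(x)

def fl_matvec(A, b, dtype=np.float16):
    """Finite-precision matrix-vector product: every multiplication and every
    addition is rounded, and each row is summed from left to right."""
    m, n = A.shape
    out = np.zeros(m, dtype=dtype)
    for j in range(m):
        acc = fl(A[j, 0] * b[0], dtype)                # rounded first product
        for i in range(1, n):
            prod = fl(A[j, i] * b[i], dtype)           # rounded product
            acc = fl(float(acc) + float(prod), dtype)  # rounded left-to-right addition
        out[j] = acc
    return out

A = np.array([[1.0, 1e4, -1e4]])
b = np.array([1.0, 1.0, 1.0])
print(fl_matvec(A, b))  # 0.0: the summand 1.0 is absorbed by the large intermediate sum
print(A @ b)            # 1.0: the exact (double-precision) result
```

The example row already exhibits the effect exploited below: once an intermediate sum is large, a comparatively small summand is rounded away, and a later cancellation cannot recover it.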
Definition A.1
Let \(\mathbb {M} \subset \mathbb {R}\) be a set without accumulation points. For a neural network \(\Phi \) with input dimension \(d\in \mathbb {N}\), \(L \in \mathbb {N}\) layers, output dimension \(N_L \in \mathbb {N}\) and an activation function \(\varrho _{\mathbb {M}}:\mathbb {M} \rightarrow \mathbb {M}\), we define the finite precision realisation by
where \(x^{(L)}\) is defined as
Remark A.2
Observe that in this section we only require the results of the NN operations to be elements of \(\mathbb {M}\), but not the weights. This greatly simplifies the mathematical argument while still allowing NNs to exhibit instabilities.
Having defined finite precision operations, we continue by presenting a neural network \(\Phi \) such that for every non-negative input x it holds that
This implies that the relative error of passing to the finite precision network is 1.
To illustrate numerical instability, we introduce the following neural network \(\Phi \): Let \(a \ge 1\), \(N, L \in \mathbb {N}\) and
and
Furthermore, all biases satisfy \(b_{\ell } = 0\) for \(\ell =1, \dots , L\). We call the neural network defined above \(\Phi _{a, N, L}\); an illustration is provided in Fig. 7.
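The mechanism behind this construction is absorption followed by cancellation: the inner channels of \(\Phi _{a, N, L}\) grow so fast that adding the input to them is rounded away, and a final subtraction then returns 0 instead of the input. The Python sketch below is a deliberately simplified two-channel caricature of this effect, not the network \(\Phi _{a, N, L}\) itself; single precision and the constant M are our own illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, np.float32(0.0))

def unstable_realisation(x, M=np.float32(1e9)):
    """Toy two-channel ReLU computation evaluated in single precision.
    In exact arithmetic it returns x for every x >= 0; in finite precision
    the input is absorbed into the large intermediate value M and the
    output collapses to 0, i.e. the relative error is 1."""
    x = np.float32(x)
    hidden = relu(np.float32(x + M))     # x is rounded away whenever x < M * eps / 2
    return relu(np.float32(hidden - M))  # exact result would be x, rounded result is 0

for x in [1.0, 7.5, 31.0]:
    print(x, "->", unstable_realisation(x))  # prints 0.0 in every case
```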
Proposition A.3
Let \(a = 2^r \ge 1\) for \(r \in \mathbb {Z}\) and let \(\mathbb {M}\) be as in (1.1), with \(e_{\min } \le 0, e_{\max } > 53+r\), \(p= 53\) and hence \(\epsilon = 2^{-53}\). Furthermore, let \(\varrho (x) = \max \{0,x\}\) and \(\varrho _{\mathbb {M}}(z):= \varrho (z)\) for all \(z \in \mathbb {M}\). Additionally, let \(L\in \mathbb {N}\) and \(N \in \mathbb {N}\) be such that \((L-3)\log _{2}(N-1) + (L-1)\log _{2}(a-2\epsilon ) \ge 54\). Then the neural network \(\Phi _{a, N,L}\) defined above satisfies
for all \(x \in \mathbb {M}\) such that \(x \ge 0\).
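For instance, taking \(a = 2\) (i.e. \(r = 1\)) and \(N = 3\), the condition reads approximately \((L-3) + (L-1) \ge 54\), which holds for every \(L \ge 30\); already a width-3 network of depth 30 thus returns 0 on every non-negative input in \(\mathbb {M}\). (This instantiation is ours and serves only as an illustration.)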
Proof
It follows from the definition of the NN \(\Phi _{a, N,L}\) that \(\textrm{R}(\Phi )(x) = x\). Thus, to prove the proposition we have to show that
for all non-negative x. Let, for \(x\ge 0\), \(x^{(0)}, \dots , x^{(L)}\) be as in (A.1). The construction of \(\Phi _{a, N,L}\) implies that \(x^{(L)} = 0\) if \((x^{(L-1)})_1 = (x^{(L-1)})_2\). Additionally, \((x^{(L-1)})_1 = (x^{(L-1)})_2\) follows if \((x^{(L-2)})_1 < a (x^{(L-2)})_2 \epsilon /2\).
As a consequence, we have to find a lower bound of \((x^{(L-2)})_2\). To begin with, we observe that for \(x \in \mathbb {M}\) the multiplication with a is exact and thus \((x^{(2)})_k = a x\) for all \(k = 2,\dots , N\). Also, we observe that for \(j = 2, \dots , L-3\) it holds for \(k\in \{2, \dots , N\}\) that
This inequality results from the assumption that the summation of each row of \(A_j\) incurs at most \(N-2\) rounding errors, each of size at most \(\epsilon \cdot \max _{i \in \{2, \ldots , N-1\}}(x^{(j-1)})_i\). This leads to
and
for all \(k \in \{2,\dots , N\}\). Due to the construction of our NN, we also have that \((x^{(L-2)})_1 = x\).
Thus, the inequality \((x^{(L-2)})_1 < a \, (x^{(L-2)})_2 \, \frac{\epsilon }{2}\) holds if
Note that
and hence, it suffices to show that the right-hand side of (A.3) is bounded below by \(\epsilon ^{-1} = 2^{53}\). Taking logarithms shows that this inequality follows from
which completes the proof.\(\square \)
A.2 Auxiliary results
A.2.1 Proof of Lemma 3.1
Proof
We denote the affine pieces of h on [0, c] by \((I_{p})_{p = 1}^P\). Here we choose the pieces as semi-open intervals \(I_{p} = (r_p, s_p]\) so that they are disjoint. Moreover, we choose \(x_p \in I_p\) for \(p = 1, \dots , P\). We claim that
To prove this, we consider the auxiliary function
Note that for \(x \in [0,1]\) either \(\#\{z \in [0, x] :h(z) = h(x) \} \le P\) or \(\#\{z \in [0, x] :h(z) = h(x) \} = \infty \), and hence \(\tilde{h}\) is well defined. We have by assumption that
If we equip \(A \times (\{1, \dots , P\} \cup \{\infty \})\) with the measure \(\lambda \times \nu \), where \(\lambda \) is the Lebesgue and \(\nu \) is the counting measure, then we conclude by the monotonicity of measures that
It is easy to see that
If \(h'(x_p) \ne 0\) on \(I_p\), then the right-hand side of (A.5) equals \(\lambda (h(I_p))\). If \(h'(x_p) = 0\), then the right-hand side of (A.5) is 0 and by the non-negativity of the Lebesgue measure the left-hand side is 0 as well. We conclude that
By the subadditivity of measures and (A.6), we have that
which yields (A.4). Finally, by standard properties of the Lebesgue measure under Lipschitz maps, we have that
where we used that \(I_p\) are disjoint. \(\square \)
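The level-crossing bound established above admits a quick numerical sanity check. The sketch below is our own illustration and not taken from the article: it uses a hypothetical piecewise affine test function \(h\) with Lipschitz constant 1, draws a level \(\theta \) uniformly from an interval of width \(\delta \), counts the solutions of \(h(x) = \theta \) on a fine grid, and compares the empirical mean with the length-over-width bound suggested by the estimate above.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(x):
    # Hypothetical piecewise affine test function on [0, c] with Lipschitz constant 1
    # (a sawtooth with 8 affine pieces); it is not the function from the article.
    return np.abs((x % 0.25) - 0.125)

c, delta = 1.0, 0.05                    # domain length and width of the level distribution
grid = np.linspace(0.0, c, 200_001)
vals = h(grid)

def num_solutions(theta):
    # Approximate #{x in [0, c] : h(x) = theta} by counting sign changes on the grid.
    s = np.sign(vals - theta)
    return int(np.count_nonzero(s[1:] * s[:-1] < 0))

thetas = 0.1 + rng.uniform(-delta / 2, delta / 2, size=500)  # uniform level around 0.1
empirical_mean = np.mean([num_solutions(t) for t in thetas])
print(empirical_mean, "<=", c / delta)  # about 8 crossings on average, versus the bound 20
```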
A.2.2 Proof of Proposition 3.3
Proof
Part 1: (Proof of (3.1)) Let \(\kappa \subset [0,1]^d\) be a straight line. Recalling equation (2.5), we observe that
Inserting \(\hat{u}_j^b=(I+\varepsilon _j\Theta _j^b)u_j^b\) from equation (2.4) leads to
Let us introduce for \(k \in \{1, \ldots , N_j\}\) the random variable
and observe that
since the number of breakpoints added after applying the activation function is at most the number of \(x \in \kappa \) satisfying \((\widehat{{\textbf {A}}}_{j}\mathbbm {1}_{\hat{\eta }_{j-1}\ge 0}\hat{\eta }_{j-1}(x)+{\textbf {b}}_j)_k+(\theta _{j})_k= 0\). By Definition 2.3, we have that \(\widehat{{\textbf {A}}}_{j}\), \(\hat{\eta }_{j-1}\), and \(\theta _{j}\) are independent random variables, which allows the following computation for \(q \in \mathbb {N}\)
By Assumption 2.7 we have that for all \(x \in \Omega \)
We pick \(v, w \in \mathbb {R}^d\) with \(\Vert v\Vert = 1\) such that \(\kappa = \{tv + w :t \in [0, \mathcal {L}(\kappa )] \}\) where \(\mathcal {L}(\kappa )\) is the length of \(\kappa \). Next, we define
Then, by the chain rule and (A.8), we have that \(\Vert h'\Vert _\infty \le 1\). Moreover,
Since by construction \((\theta _{j})_k\) is uniformly distributed on the interval \([-\lambda \varepsilon _j |(u_j^b)_k|/2, \lambda \varepsilon _j |(u_j^b)_k|/2]\), we conclude from Remark 3.2 that
which, applied to (A.7), yields (3.1).
Part 2: (Proof of (3.2)) We recall the following upper bound on the total number of pieces of the realisation of an arbitrary NN:
Theorem A.4
([46]) Let \(L \in \mathbb {N}\) and \(\sigma \) be piecewise affine with p pieces. Then, for every NN \(\Phi \) with \(d=1, N_L=1\) and \(N_1, \ldots , N_{L-1} \le N\), we have that \(\textrm{R}(\Phi )\) has at most \((pN)^{(L-1)}\) affine pieces.
Invoking Theorem A.4 results in \(\max _{k}\{\# \left( \hat{\omega }_{j,k} \setminus \omega _{j,k} \right) \} \le N^{j-1}\) for \(j = 1, \dots , L\). We observe that
Next, we invoke (3.1) and obtain that
where we used that \(e \le 3 \le N\). This completes the proof of (3.2).\(\square \)
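The piece-counting bound of Theorem A.4 can also be probed numerically. The following Python sketch is our own illustration with randomly generated weights (nothing in it comes from the article): it restricts a small ReLU network to a line segment, approximates the number of affine pieces by tracking the pattern of active ReLU units along the segment, and compares the count with \((2N)^{L-1}\), the bound with \(p = 2\) pieces for the ReLU activation.

```python
import numpy as np

def pieces_along_line(weights, biases, v, w, num_t=200_001, t_max=1.0):
    """Approximate the number of affine pieces of a ReLU network restricted to
    the segment {t*v + w : t in [0, t_max]}: within one affine piece the set of
    active ReLU units is constant, so we count changes of that pattern."""
    ts = np.linspace(0.0, t_max, num_t)
    x = ts[:, None] * v[None, :] + w[None, :]              # sampled points on the segment
    pattern = np.zeros((num_t, 0), dtype=bool)
    for A, b in zip(weights[:-1], biases[:-1]):            # the output layer adds no breakpoints
        z = x @ A.T + b
        pattern = np.hstack([pattern, z > 0])
        x = np.maximum(z, 0.0)
    changed = np.any(pattern[1:] != pattern[:-1], axis=1)  # consecutive samples in different pieces
    return int(changed.sum()) + 1

rng = np.random.default_rng(0)
d, N, L = 2, 5, 4                                          # input dimension, width, depth
dims = [d] + [N] * (L - 1) + [1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(L)]
biases = [rng.standard_normal(dims[i + 1]) for i in range(L)]
v, w = np.array([1.0, 0.5]) / np.hypot(1.0, 0.5), np.zeros(2)
print(pieces_along_line(weights, biases, v, w), "<=", (2 * N) ** (L - 1))
```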
A.3 Proof of Theorem 3.4
Proof
Part 1: We start by proving the result for \(j' = L\). Using the notation of Proposition 3.3, we start by assuming that
if \(\omega _{j,k}\) is chosen as \(\omega _{j,k} = \bigcup _{i \in \{1, \dots , N_{j-1}\}} \hat{\omega }_{j-1,i}\) for \(j > 1\) and \(\omega _{1,k} = \varnothing \).
We will prove (A.9) at the end of this part of the proof so as not to distract from the main argument. By (A.9) and the linearity of the expected value, we obtain,
By Assumption 2.6, if \((u_j^b)_k = 0\), then \(\textrm{R}((\widehat{\Phi }^\varepsilon )_j)_k = 0\) and therefore \(\# \left( \hat{\omega }_{j,k} \setminus \omega _{j,k}\right) = 0\). Hence, invoking Proposition 3.3 together with Assumption 2.6, we conclude that
We complete the proof of the first part by showing (A.9). Note that since
we have that
where we use that a piecewise affine function on a line has exactly one more affine piece than the smallest number of distinct points where the function is not affine.
We will show by induction over L that
which then yields the claim. For \(L = 2\), the claim follows directly by the choice \(\omega _{1,k} = \varnothing \).
Now let \(L > 2\), then
since \(\omega _{L-1,1} = \omega _{L-1,k}\) for all \(k \in N_{L-1}\) by definition. We have that
and hence
Using (A.13) and applying the induction hypothesis (A.11) to (A.12), we obtain that
which yields (A.11).
Part 2: Let \(j'>0\). We have that
where
For \(q, p \in \mathbb {N}\) and two functions \(f: \kappa \rightarrow \mathbb {R}^q\) and \(g:\mathbb {R}^q \rightarrow \mathbb {R}^p\) such that f has \(r \in \mathbb {N}\) affine pieces on \(\kappa \) and g has at most \(s\in \mathbb {N}\) affine pieces along every line, it holds that \(g \circ f :\kappa \rightarrow \mathbb {R}^p\) has at most rs affine pieces, since on the image of each affine piece of f no more than s affine pieces can be generated by g. Note that \(\varrho :\mathbb {R}^{N_{j'}} \rightarrow \mathbb {R}^{N_{j'}}\) has at most \(2N\) pieces along each line, since along a line every coordinate is affine and can therefore change sign at most once. Invoking again Theorem A.4 as well as Part 1 of this proof and the monotonicity of the expected value therefore yields that
which completes the proof.\(\square \)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Karner, C., Kazeev, V. & Petersen, P.C. Limitations of neural network training due to numerical instability of backpropagation. Adv Comput Math 50, 14 (2024). https://doi.org/10.1007/s10444-024-10106-x