Barcodes as Summary of Loss Function Topology

Doklady Mathematics

Abstract

We propose studying the loss surfaces of neural networks by the methods of topological data analysis. We suggest applying barcodes of Morse complexes to explore the topology of loss surfaces. An algorithm for calculating the barcodes of local minima of the loss function is described. We have conducted experiments calculating barcodes of local minima for benchmark functions and for the loss surfaces of small neural networks. Our experiments confirm our two principal observations for neural networks’ loss surfaces. First, the barcodes of local minima are located in a small lower part of the range of values of the loss function. Second, increasing the depth and width of a neural network lowers the barcodes of local minima. This has natural implications for the learning and generalization properties of neural networks.

REFERENCES

  1. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (2018), pp. 6389–6399.

  2. Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (2014), pp. 2933–2941.

  3. A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” JMLR Workshop Conf. Proc. 38, 192–204 (2015). https://doi.org/10.48550/arXiv.1412.0233

  4. R. Bott, “Lectures on Morse theory, old and new,” Bull. Am. Math. Soc. 7 (2), 331–358 (1982).

  5. S. Smale, “Differentiable dynamical systems,” Bull. Am. Math. Soc. 73 (6), 747–817 (1967).

  6. R. Thom, “Sur une partition en cellules associée à une fonction sur une variété,” C. R. Acad. Sci. 228 (12), 973–975 (1949).

  7. S. Barannikov, “The framed Morse complex and its invariants,” Adv. Sov. Math. 21, 93–116 (1994). https://doi.org/10.1090/advsov/021/03

  8. D. Le Peutrec, F. Nier, and C. Viterbo, “Precise Arrhenius law for p-forms: The Witten Laplacian and Morse–Barannikov complex,” Ann. H. Poincaré 14 (3), 567–610 (2013). https://doi.org/10.1007/s00023-012-0193-9

  9. F. Le Roux, S. Seyfaddini, and C. Viterbo, “Barcodes and area-preserving homeomorphisms” (2018). arXiv:1810.03139

  10. J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res. 13, 281–305 (2012).

  11. M. K. Chung, P. Bubenik, and P. T. Kim, “Persistence diagrams of cortical surface data,” Inf. Process. Med. Imaging 5636, 386–397 (2009).

  12. T. Sousbie, C. Pichon, and H. Kawahara, “The persistent cosmic web and its filamentary structure: II. Illustrations,” Mon. Not. R. Astron. Soc. 414 (1), 384–403 (2011). https://doi.org/10.1111/j.1365-2966.2011.18395.x

  13. C. S. Pun, K. Xia, and S. X. Lee, “Persistent-homology-based machine learning and its applications—a survey” (2018). arXiv:1811.00252

  14. C. Dellago, P. G. Bolhuis, and P. L. Geissler, Transition Path Sampling (Wiley, New York, 2003), pp. 1–78. https://doi.org/10.1002/0471231509.ch1

  15. A. R. Oganov and M. Valle, “How to quantify energy landscapes of solids,” J. Chem. Phys. 130 (10), 104504 (2009). https://doi.org/10.1063/1.3079326

  16. F. Chazal, L. Guibas, S. Oudot, and P. Skraba, “Scalar field analysis over point cloud data,” Discrete Comput. Geom. 46 (4), 743 (2011).

  17. D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,” Discrete Comput. Geom. 37 (1), 103–120 (2007).

  18. Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell. 42 (4), 824–836 (2020). https://doi.org/10.1109/TPAMI.2018.2889473

  19. M. Jamil and X.-S. Yang, “A literature survey of benchmark functions for global optimization problems,” Int. J. Math. Model. Numer. Optim. 4 (2), 150–194 (2013).

  20. A. Efrat, A. Itai, and M. J. Katz, “Geometry helps in bottleneck matching and related problems,” Algorithmica 31 (1), 1–28 (2001).

  21. K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems (2016), pp. 586–594.

  22. M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Mach. Intell. 14 (1), 76–86 (1992).

  23. J. Cao, Q. Wu, Y. Yan, L. Wang, and M. Tan, “On the flatness of loss surface for two-layered relu networks,” in Asian Conference on Machine Learning (2017), pp. 545–560.

  24. M. Yi, Q. Meng, W. Chen, Z.-m. Ma, and T.-Y. Liu, “Positively scale-invariant flatness of ReLU neural networks” (2019). https://doi.org/10.48550/arXiv.1903.02237

  25. P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,” J. Stat. Mech. 2019, 124018 (2019). https://doi.org/10.1088/1742-5468/ab39d9

  26. L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” in Proceedings of the 34th International Conference on Machine Learning, PMLR (2017), pp. 1019–1028.

  27. M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, “A state-of-the-art survey on deep learning theory and architectures,” Electronics 8 (3), 292 (2019).

Funding

This work was partially supported by the Russian Foundation for Basic Research, grant 21-51-12005 NNIO_a.

Author information

Corresponding author

Correspondence to S. A. Barannikov.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

APPENDIX

1.1 GRADIENT MORSE COMPLEX

The gradient Morse complex \(({{C}_{*}},{{\partial }_{*}})\) is defined as follows. For generic \(f\) the critical points \({{p}_{\alpha }}\), \(df{{|}_{{{T}_{{{p}_{\alpha }}}}}} = 0\), are isolated. Near each critical point \({{p}_{\alpha }}\), the function \(f\) can be written, up to the additive constant \(f({{p}_{\alpha }})\), as \(f = - \sum\nolimits_{l = 1}^j {{({{x}^{l}})}^{2}} + \sum\nolimits_{l = j + 1}^n {{({{x}^{l}})}^{2}}\) in some local coordinates. The index of the critical point is defined as the dimension of the set of downward pointing directions at that point, that is, the dimension of the negative subspace of the Hessian:

$${\text{index}}({{p}_{\alpha }}) = j.$$

Then define

$${{C}_{j}} = {{ \oplus }_{{{\text{index}}({{p}_{\alpha }}) = j}}}[{{p}_{\alpha }},{\text{or}}(T_{{{{p}_{\alpha }}}}^{ - })],$$

where \({\text{or}}(T_{{{{p}_{\alpha }}}}^{ - })\) is an orientation of the negative subspace \(T_{{{{p}_{\alpha }}}}^{ - }\) in the decomposition \({{T}_{{{{p}_{\alpha }}}}} = T_{{{{p}_{\alpha }}}}^{ - } \oplus T_{{{{p}_{\alpha }}}}^{ + }\) determined by the Hessian \({{\partial }^{2}}f\).
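
As a minimal illustration of the index (the function below is an assumed toy example, not one of the functions studied in the experiments), consider on \({{\mathbb{R}}^{2}}\)

$$f({{x}^{1}},{{x}^{2}}) = - {{({{x}^{1}})}^{2}} + {{({{x}^{2}})}^{2}}.$$

Its only critical point is the origin, where the Hessian is \({\text{diag}}( - 2,2)\). The negative subspace is the line spanned by \(\partial {\text{/}}\partial {{x}^{1}}\), so the index equals 1, and the origin, together with a choice of orientation of this line, contributes one generator to \({{C}_{1}}\).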

Let

$$\begin{gathered} \mathcal{M}({{p}_{\alpha }},{{p}_{\beta }}) = \{ \gamma :\mathbb{R} \to {{M}^{n}}\;|\;\dot {\gamma } = - ({\text{gra}}{{{\text{d}}}_{g}}f)(\gamma (t)), \\ \mathop {\lim }\limits_{t \to - \infty } \gamma (t) = {{p}_{\alpha }},\;\mathop {\lim }\limits_{t \to + \infty } \gamma (t) = {{p}_{\beta }}\} {\text{/}}\mathbb{R} \\ \end{gathered} $$

be the set of gradient trajectories connecting critical points \({{p}_{\alpha }}\) and \({{p}_{\beta }}\), where the natural action of \(\mathbb{R}\) is by the shift \(\gamma (t) \mapsto \gamma (t + \tau )\).

If \({\text{index}}({{p}_{\beta }}) = {\text{index}}({{p}_{\alpha }}) - 1\) then generically the set \(\mathcal{M}({{p}_{\alpha }},{{p}_{\beta }})\) is finite. Let

$$\# \mathcal{M}(\left[ {{{p}_{\alpha }},{\text{or}}} \right],\left[ {{{p}_{\beta }},{\text{or}}} \right]),$$

denote in this case the number of the trajectories, counted with signs taking into account a choice of orientation, between critical points \({{p}_{\alpha }}\) and \({{p}_{\beta }}\).

The linear operator \({{\partial }_{j}}\) is defined by

$${{\partial }_{j}}\left[ {{{p}_{\alpha }},{\text{or}}} \right] = \sum\limits_{{\text{index}}({{p}_{\beta }}) = j - 1} \left[ {{{p}_{\beta }},{\text{or}}} \right]\# \mathcal{M}(\left[ {{{p}_{\alpha }},{\text{or}}} \right],\left[ {{{p}_{\beta }},{\text{or}}} \right]).$$
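
A standard worked example may help fix the definition; it is an illustration chosen here, not a case treated in the paper. Take the height function \(f(x,y) = y\) on the unit circle \({{S}^{1}} \subset {{\mathbb{R}}^{2}}\). It has a minimum \({{p}_{{\min }}}\) of index 0 and a maximum \({{p}_{{\max }}}\) of index 1, so \({{C}_{0}}\) and \({{C}_{1}}\) are both one-dimensional. Exactly two gradient trajectories descend from \({{p}_{{\max }}}\) to \({{p}_{{\min }}}\), one along each half of the circle, and they are counted with opposite signs:

$${{\partial }_{1}}\left[ {{{p}_{{\max }}},{\text{or}}} \right] = \left[ {{{p}_{{\min }}},{\text{or}}} \right]( + 1 - 1) = 0.$$

The differential therefore vanishes, and the homology of the complex is one-dimensional in degrees 0 and 1, in agreement with the homology of the circle.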

The description of the critical points on a manifold \(\Theta \) with nonempty boundary \(\partial \Theta \) is modified slightly in the following way. A connected component of a sublevel set is also born at a local minimum of the restriction \({{\left. f \right|}_{{\partial \Theta }}}\) of \(f\) to the boundary, if \({\text{grad}}f\) points inside the manifold \(\Theta \) at that point. The merging of two connected components can also happen at a 1-saddle of \({{\left. f \right|}_{{\partial \Theta }}}\), if \({\text{grad}}f\) points inside \(\Theta \). When we speak about minima and 1-saddles, we also include such critical points of \({{\left. f \right|}_{{\partial \Theta }}}\). Similarly, the set of generators of index \(j\) chains in the Morse complex includes the index \(j\) critical points of \({{\left. f \right|}_{{\partial \Theta }}}\) at which \({\text{grad}}f\) points inside \(\Theta \). The differential is modified in the same way to take into account trajectories involving such critical points.
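
The birth and merging of sublevel-set components described above can be sketched in code. The following is a minimal sketch under stated assumptions, not the algorithm of the paper: the function is assumed to be sampled on a one-dimensional grid (a path graph), components are tracked with a union-find structure, and at each merge the component with the higher minimum is declared dead. The names `minima_barcode` and `values` are hypothetical. Grid endpoints play the role of the boundary: an endpoint that is lower than its only neighbour gives birth to a component, as in the boundary case above.

```python
def minima_barcode(values):
    """Barcode of local minima for a function sampled on a 1-D grid.

    Returns sorted (birth, death) pairs; the global minimum gets death = inf.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}  # union-find forest over the grid points added so far
    birth = {}   # component root -> grid index of the minimum that created it

    def find(i):
        # Find the root of i, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bars = []
    for i in order:  # sweep the sublevel sets from low to high values
        parent[i] = i
        birth[i] = i
        for j in (i - 1, i + 1):
            if j not in parent:
                continue
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            # The component whose minimum is higher dies at the value values[i].
            if values[birth[ri]] > values[birth[rj]]:
                young, old = ri, rj
            else:
                young, old = rj, ri
            bar = (values[birth[young]], values[i])
            if bar[0] < bar[1]:  # skip zero-length segments
                bars.append(bar)
            parent[young] = old
    root = find(order[0])
    bars.append((values[birth[root]], float("inf")))
    return sorted(bars)
```

For example, `minima_barcode([3, 1, 4, 0, 5, 2, 6])` pairs the minimum with value 1 with the merge at value 4, the minimum with value 2 with the merge at value 5, and leaves the global minimum 0 unpaired.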

1.2 PROOF OF THEOREM 3

Theorem. ([7], Section 2) Any \(\mathbb{R}\)-filtered chain complex \({{C}_{*}}\) over field \(k\) can be brought by a linear transformation preserving the \(\mathbb{R}\)-filtration to “canonical form”, a canonically defined direct sum of indecomposable \(\mathbb{R}\)-filtered complexes of two types:

1-dimensional \(\mathbb{R}\)-filtered complex with trivial differential: \(\partial \tilde {e}_{i}^{{(j)}} = 0\), \(\left\langle {\tilde {e}_{i}^{{(j)}}} \right\rangle = {{F}_{{ \leqslant r}}}\), \(r \in \mathbb{R}\);

2-dimensional \(\mathbb{R}\)-filtered complex with trivial homology: \(\partial \tilde {e}_{{{{i}_{2}}}}^{{(j + 1)}} = \tilde {e}_{{{{i}_{1}}}}^{{(j)}}\), \(\left\langle {\tilde {e}_{{{{i}_{1}}}}^{{(j)}}} \right\rangle = {{F}_{{ \leqslant {{s}_{1}}}}}\), \(\left\langle {\tilde {e}_{{{{i}_{1}}}}^{{(j)}},\tilde {e}_{{{{i}_{2}}}}^{{(j + 1)}}} \right\rangle = {{F}_{{ \leqslant {{s}_{2}}}}}\), \({{s}_{1}},{\kern 1pt} {{s}_{2}} \in \mathbb{R}\).

The resulting canonical form is unique.

Proof. ([7], Section 2). Let \(\{ e_{i}^{{(n)}}\} \) be a basis in the vector spaces \({{C}_{n}}\) compatible with the filtration, so that each subspace \({{F}_{r}}{{C}_{n}}\) is the span \(\left\langle {e_{1}^{{(n)}}, \ldots ,e_{{{{i}_{r}}}}^{{(n)}}} \right\rangle \). Notice that the filtration defines the natural order on the set of basis elements.

Let \(\partial e_{l}^{{(n)}}\) have the required form for \(n = j\) and \(l \leqslant i\), or \(n < j\) and all l; i.e., either \(\partial e_{l}^{{(n)}} = 0\) or \(\partial e_{l}^{{(n)}} = e_{{m(l)}}^{{(n - 1)}}\), where \(m(l) \ne m(l')\) for \(l \ne l'\).

Let

$$\partial e_{{i + 1}}^{{(j)}} = \sum\limits_k e_{k}^{{(j - 1)}}{{\alpha }_{k}}.$$

Move all the terms with \(e_{k}^{{(j - 1)}} = \partial e_{q}^{{(j)}}\), \(q \leqslant i\), from the right to the left side, writing \(k(q)\) for the index \(k\) with \(e_{k}^{{(j - 1)}} = \partial e_{q}^{{(j)}}\). We get

$$\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right) = \sum\limits_k e_{k}^{{(j - 1)}}{{\beta }_{k}}.$$

If \({{\beta }_{k}} = 0\) for all k, then define

$$\tilde {e}_{{i + 1}}^{{(j)}} = e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}},$$

so that

$$\partial \tilde {e}_{{i + 1}}^{{(j)}} = 0,$$

and \(\partial e_{l}^{{(n)}}\) has the required form for \(l \leqslant i + 1\) and \(n = j\), and for \(n < j\) and all l.

Otherwise let \({{k}_{0}}\) be the maximal k with \({{\beta }_{k}} \ne 0\). Then

$$\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right) = e_{{{{k}_{0}}}}^{{(j - 1)}}{{\beta }_{{{{k}_{0}}}}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}},$$

with \({{\beta }_{{{{k}_{0}}}}} \ne 0\). Define

$$\tilde {e}_{{i + 1}}^{{(j)}} = \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right){\text{/}}{{\beta }_{{{{k}_{0}}}}},$$
$$\tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}} = e_{{{{k}_{0}}}}^{{(j - 1)}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}}{\text{/}}{{\beta }_{{{{k}_{0}}}}}.$$

Then

$$\partial \tilde {e}_{{i + 1}}^{{(j)}} = \tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}}$$

and for \(n = j\) and \(l \leqslant i + 1\), or \(n < j\) and all \(l\), \(\partial e_{l}^{{(n)}}\) has the required form. Once the complex has been reduced to “canonical form” on the subcomplex \({{ \oplus }_{{n \leqslant j}}}{{C}_{n}}\), reduce \(\partial e_{1}^{{(j + 1)}}\) in the same way, and so on.
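
The inductive step above is, in effect, a column reduction of the boundary matrix. The sketch below is an illustration under stated assumptions, not the authors' implementation: coefficients are taken in \(\mathbb{Z}{\text{/}}2\), so that the division by \({{\beta }_{{{{k}_{0}}}}}\) is trivial; generators of all degrees are numbered by a single index in the order of the filtration; and the names `reduce_filtered_complex` and `boundary` are hypothetical. It records only which generators are paired, not the new basis \(\tilde {e}\).

```python
def reduce_filtered_complex(boundary):
    """Pair generators of a filtered chain complex over Z/2.

    boundary[i] is the set of indices (all smaller than i) appearing in the
    boundary of generator i; generators are numbered in filtration order.
    Returns (pairs, essential): pairs (s1, s2) correspond to the 2-dimensional
    summands, essential indices to the 1-dimensional summands.
    """
    columns = [set(col) for col in boundary]
    owner = {}  # maximal index k0 of a reduced column -> column that owns it
    pairs = []
    for i, col in enumerate(columns):
        # Subtract earlier reduced columns while the maximal term k0 of the
        # current boundary is already the boundary of an earlier generator.
        while col and max(col) in owner:
            col ^= columns[owner[max(col)]]  # in-place addition modulo 2
        if col:
            k0 = max(col)  # "the maximal k with beta_k != 0"
            owner[k0] = i
            pairs.append((k0, i))
    paired = {index for pair in pairs for index in pair}
    essential = [i for i in range(len(boundary)) if i not in paired]
    return pairs, essential
```

As a usage example, take the boundary of a triangle with vertices 0, 1, 2 and edges 3 = (0,1), 4 = (1,2), 5 = (0,2), entered in this order: `reduce_filtered_complex([set(), set(), set(), {0, 1}, {1, 2}, {0, 2}])` pairs vertex 1 with edge 3 and vertex 2 with edge 4, while vertex 0 and edge 5 remain unpaired. With filtration values attached to the generators, each pair gives a finite segment of the barcode and each unpaired generator gives a half-infinite one.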

Uniqueness of the canonical form follows essentially from the uniqueness at each previous step. Let \(\left\{ {a_{i}^{{(j)}}} \right\}\), \(\left\{ {b_{i}^{{(j)}} = \sum\nolimits_{k \leqslant i} a_{k}^{{(j)}}{{\alpha }_{k}}} \right\}\) be two bases of \({{C}_{*}}\) for two different canonical forms. Assume that the canonical forms agree for all indices \(p < j\) and all \(n\), and for \(p = j\) and \(n \leqslant i\). Let \(\partial a_{{i + 1}}^{{(j)}} = a_{m}^{{(j - 1)}}\) and \(\partial b_{{i + 1}}^{{(j)}} = b_{l}^{{(j - 1)}}\) with \(m > l\), so that \(a_{m}^{{(j - 1)}}\) is not in the filtration subspace corresponding to \(b_{l}^{{(j - 1)}}\).

It follows that

$$\partial \left( {\sum\limits_{k \leqslant i + 1} a_{k}^{{(j)}}{{\alpha }_{k}}} \right) = \sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}},$$

where \({{\alpha }_{{i + 1}}} \ne 0\), \({{\beta }_{l}} \ne 0\). Therefore

$$\partial a_{{i + 1}}^{{(j)}} = \sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}}{\text{/}}{{\alpha }_{{i + 1}}} - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}{\text{/}}{{\alpha }_{{i + 1}}}.$$

On the other hand \(\partial a_{{i + 1}}^{{(j)}} = a_{m}^{{(j - 1)}}\), with \(m > l\), and \(\partial a_{k}^{{(j)}}\) for \(k \leqslant i\) are either zero or some basis elements \(a_{n}^{{(j - 1)}}\) different from \(a_{m}^{{(j - 1)}}\). This gives a contradiction.

Similarly if \(\partial b_{{i + 1}}^{{(j)}} = 0\), then

$$\partial a_{{i + 1}}^{{(j)}} = - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}{\text{/}}{{\alpha }_{{i + 1}}},$$

which again gives a contradiction by the same arguments. This proves the uniqueness of the canonical form.

Remark 2. The barcode of an \(\mathbb{R}\)-filtered chain complex consists of segments representing the indecomposable \(\mathbb{R}\)-filtered chain complexes; see Definition 2 in Section 2. There is a standard lexicographic order on the set of such segments. The direct sum in the theorem statement is the standard “direct sum over a set” of vector spaces.

Cite this article

Barannikov, S.A., Korotin, A.A., Oganesyan, D.A. et al. Barcodes as Summary of Loss Function Topology. Dokl. Math. 108 (Suppl 2), S333–S347 (2023). https://doi.org/10.1134/S1064562423701570

  • DOI: https://doi.org/10.1134/S1064562423701570
