Barcodes as Summary of Loss Function Topology

Barannikov, S. A.; Korotin, A. A.; Oganesyan, D. A.; Emtsev, D. I.; Burnaev, E. V.

doi:10.1134/S1064562423701570

Published: 25 March 2024

Doklady Mathematics, Volume 108, pages S333–S347 (2023)

S. A. Barannikov^1,3,
A. A. Korotin^1,2,
D. A. Oganesyan^1,
D. I. Emtsev^1,4 &
E. V. Burnaev^1,2

Abstract

We propose to study the loss surfaces of neural networks by methods of topological data analysis. We suggest applying barcodes of Morse complexes to explore the topology of loss surfaces. An algorithm for calculating the barcodes of local minima of a loss function is described. We have conducted experiments calculating barcodes of local minima for benchmark functions and for loss surfaces of small neural networks. Our experiments confirm two principal observations about neural networks’ loss surfaces. First, the barcodes of local minima are located in a small lower part of the range of values of the neural network’s loss function. Second, increasing the neural network’s depth and width lowers the barcodes of local minima. This has natural implications for the neural network’s learning and for its generalization properties.

REFERENCES

1. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (2018), pp. 6389–6399.
2. Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (2014), pp. 2933–2941.
3. A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” JMLR Workshop Conf. Proc. 38, 192–204 (2015). https://doi.org/10.48550/arXiv.1412.0233
4. R. Bott, “Lectures on Morse theory, old and new,” Bull. Am. Math. Soc. 7 (2), 331–358 (1982).
5. S. Smale, “Differentiable dynamical systems,” Bull. Am. Math. Soc. 73 (6), 747–817 (1967).
6. R. Thom, “Sur une partition en cellules associée à une fonction sur une variété,” C. R. Acad. Sci. 228 (12), 973–975 (1949).
7. S. Barannikov, “Framed Morse complexes and its invariants,” Adv. Sov. Math. 21, 93–116 (1994). https://doi.org/10.1090/advsov/021/03
8. D. Le Peutrec, F. Nier, and C. Viterbo, “Precise Arrhenius law for p-forms: The Witten Laplacian and Morse–Barannikov complex,” Ann. H. Poincaré 14 (3), 567–610 (2013). https://doi.org/10.1007/s00023-012-0193-9
9. F. Le Roux, S. Seyfaddini, and C. Viterbo, “Barcodes and area-preserving homeomorphisms” (2018). arXiv:1810.03139
10. J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res. 13, 281–305 (2012).
11. M. K. Chung, P. Bubenik, and P. T. Kim, “Persistence diagrams of cortical surface data,” Inf. Process. Med. Imaging 5636, 386–397 (2009).
12. T. Sousbie, C. Pichon, and H. Kawahara, “The persistent cosmic web and its filamentary structure: II. Illustrations,” Mon. Not. R. Astron. Soc. 414 (1), 384–403 (2011). https://doi.org/10.1111/j.1365-2966.2011.18395.x
13. C. S. Pun, K. Xia, and S. X. Lee, “Persistent-homology-based machine learning and its applications – a survey” (2018). arXiv:1811.00252
14. C. Dellago, P. G. Bolhuis, and P. L. Geissler, Transition Path Sampling (Wiley, New York, 2003), pp. 1–78. https://doi.org/10.1002/0471231509.ch1
15. A. R. Oganov and M. Valle, “How to quantify energy landscapes of solids,” J. Chem. Phys. 130 (10), 104504 (2009). https://doi.org/10.1063/1.3079326
16. F. Chazal, L. Guibas, S. Oudot, and P. Skraba, “Scalar field analysis over point cloud data,” Discrete Comput. Geom. 46 (4), 743 (2011).
17. D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,” Discrete Comput. Geom. 37 (1), 103–120 (2007).
18. Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell. 42 (4), 824–836 (2020). https://doi.org/10.1109/TPAMI.2018.2889473
19. M. Jamil and X.-S. Yang, “A literature survey of benchmark functions for global optimization problems,” Int. J. Math. Model. Numer. Optim. 4 (2), 150–194 (2013).
20. A. Efrat, A. Itai, and M. J. Katz, “Geometry helps in bottleneck matching and related problems,” Algorithmica 31 (1), 1–28 (2001).
21. K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems (2016), pp. 586–594.
22. M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Mach. Intell. 14 (1), 76–86 (1992).
23. J. Cao, Q. Wu, Y. Yan, L. Wang, and M. Tan, “On the flatness of loss surface for two-layered relu networks,” in Asian Conference on Machine Learning (2017), pp. 545–560.
24. M. Yi, Q. Meng, W. Chen, Z.-m. Ma, and T.-Y. Liu, “Positively scale-invariant flatness of ReLU neural networks” (2019). https://doi.org/10.48550/arXiv.1903.02237
25. P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,” J. Stat. Mech. 2019, 124018 (2019). https://doi.org/10.1088/1742-5468/ab39d9
26. L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” in Proceedings of the 34th International Conference on Machine Learning, PMLR (2017), pp. 1019–1028.
27. M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, “A state-of-the-art survey on deep learning theory and architectures,” Electronics 8 (3), 292 (2019).

Funding

This work was partially supported by the Russian Foundation for Basic Research, grant 21-51-12005 NNIO_a.

Author information

Authors and Affiliations

Skolkovo Institute of Science and Technology, Moscow, Russia
S. A. Barannikov, A. A. Korotin, D. A. Oganesyan, D. I. Emtsev & E. V. Burnaev
Artificial Intelligence Research Institute, Moscow, Russia
A. A. Korotin & E. V. Burnaev
CNRS, IMJ, Paris City University, Paris, France
S. A. Barannikov
ETH, Zurich, Switzerland
D. I. Emtsev

Corresponding author

Correspondence to S. A. Barannikov.

Ethics declarations

The authors of this work declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

APPENDIX

1.1 GRADIENT MORSE COMPLEX

The gradient Morse complex $(C_*, \partial_*)$ is defined as follows. For generic $f$ the critical points $p_\alpha$, $df|_{T_{p_\alpha}} = 0$, are isolated. Near each critical point $p_\alpha$, up to the additive constant $f(p_\alpha)$, $f$ can be written as $f = -\sum\nolimits_{l = 1}^{j} (x^l)^2 + \sum\nolimits_{l = j + 1}^{n} (x^l)^2$ in some local coordinates. The index of the critical point is defined as the dimension of the set of downward-pointing directions at that point, that is, the dimension of the negative subspace of the Hessian:

$$\text{index}(p_\alpha) = j.$$

Then define

$$C_j = \bigoplus_{\text{index}(p_\alpha) = j} [p_\alpha, \text{or}(T_{p_\alpha}^{-})],$$

where or is an orientation of the negative subspace $T_{p_\alpha}^{-}$ in the decomposition $T_{p_\alpha} = T_{p_\alpha}^{-} \oplus T_{p_\alpha}^{+}$ determined by the Hessian $\partial^2 f$.
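The index defined above can be checked numerically once a Hessian matrix at a critical point is available. The following is a minimal sketch, not part of the original article: the function name `morse_index` is hypothetical, and it assumes that symmetric elimination without pivoting meets no zero pivot. Under that assumption, Sylvester's law of inertia implies that the number of negative pivots equals the number of negative Hessian eigenvalues, that is, the index.

```python
from fractions import Fraction

def morse_index(hessian):
    """Count negative pivots of symmetric elimination on a Hessian matrix.

    Assumption (hypothetical helper, not from the article): no zero pivot
    occurs, so the pivot signs match the eigenvalue signs.
    """
    a = [[Fraction(x) for x in row] for row in hessian]
    n = len(a)
    negative = 0
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0:
            raise ValueError("zero pivot: reorder coordinates or use an eigenvalue solver")
        if pivot < 0:
            negative += 1
        # Eliminate the entries below the pivot; the remaining block stays symmetric.
        for r in range(k + 1, n):
            factor = a[r][k] / pivot
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return negative

# f(x, y) = x**2 - y**2 has a saddle at the origin with Hessian diag(2, -2).
print(morse_index([[2, 0], [0, -2]]))   # 1
# f(x, y) = x**2 + y**2 has a minimum at the origin with Hessian diag(2, 2).
print(morse_index([[2, 0], [0, 2]]))    # 0
```

For large or ill-conditioned Hessians a floating-point eigenvalue routine with a tolerance would be the more practical choice; exact rational arithmetic is used here only to keep the example free of rounding issues.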

Let

$$\begin{gathered} \mathcal{M}(p_\alpha, p_\beta) = \{ \gamma : \mathbb{R} \to M^n \mid \dot{\gamma} = -(\text{grad}_g f)(\gamma(t)), \\ \lim_{t \to -\infty} \gamma(t) = p_\alpha, \ \lim_{t \to +\infty} \gamma(t) = p_\beta \} / \mathbb{R}, \\ \end{gathered}$$

be the set of gradient trajectories connecting critical points ${{p}_{\alpha }}$ and ${{p}_{\beta }}$, where the natural action of $\mathbb{R}$ is by the shift $\gamma (t) \mapsto \gamma (t + \tau )$.

If ${\text{index}}({{p}_{\beta }}) = {\text{index}}({{p}_{\alpha }}) - 1$ then generically the set $\mathcal{M}({{p}_{\alpha }},{{p}_{\beta }})$ is finite. Let

$$\# \mathcal{M}(\left[ {{{p}_{\alpha }},{\text{or}}} \right],\left[ {{{p}_{\beta }},{\text{or}}} \right]),$$

denote in this case the number of the trajectories, counted with signs taking into account a choice of orientation, between critical points ${{p}_{\alpha }}$ and ${{p}_{\beta }}$.

The linear operator ${{\partial }_{j}}$ is defined by

$$\partial_j [p_\alpha, \text{or}] = \sum_{\text{index}(p_\beta) = j - 1} [p_\beta, \text{or}] \, \# \mathcal{M}([p_\alpha, \text{or}], [p_\beta, \text{or}]).$$
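
As a standard illustration of this definition, which is not part of the original text, consider the height function $f$ on the circle $S^1$. It has one minimum $p$ (index 0) and one maximum $q$ (index 1), and exactly two gradient trajectories descend from $q$ to $p$, one along each arc. For a fixed orientation at $q$ they are counted with opposite signs, so

$$\partial_1 [q, \text{or}] = [p, \text{or}]\,(1 - 1) = 0, \qquad H_0 = C_0 = k, \quad H_1 = C_1 = k,$$

which agrees with the homology of the circle. The labels $p$ and $q$ and the coefficient field $k$ are notation assumed for this example only.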

The description of the critical points on a manifold $\Theta$ with nonempty boundary $\partial \Theta$ is modified slightly, as follows. A connected component of a sublevel set is also born at a local minimum of the restriction $f|_{\partial \Theta}$ of $f$ to the boundary, if $\text{grad} f$ points into the manifold $\Theta$ there. The merging of two connected components can also happen at a 1-saddle of $f|_{\partial \Theta}$, if $\text{grad} f$ points into $\Theta$. When we speak about minima and 1-saddles, we also include such critical points of $f|_{\partial \Theta}$. Similarly, the set of generators of index $j$ chains in the Morse complex includes the index $j$ critical points of $f|_{\partial \Theta}$ at which $\text{grad} f$ points into $\Theta$. The differential is modified similarly to take into account trajectories involving such critical points.
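
The birth and merging of sublevel-set components described above can be illustrated on a function sampled along an interval. The sketch below is an assumption-laden toy, not the authors' algorithm: the function name `minima_barcode` and the sample values are hypothetical, the domain is a path graph, and components are tracked with a union-find structure. A component is born at each local minimum, including an endpoint whose only neighbor is higher, and when two components merge the one with the higher minimum dies.

```python
import math

def minima_barcode(values):
    """0-dimensional sublevel-set barcode of a function sampled on a path graph."""
    n = len(values)
    parent = [None] * n   # None marks samples not yet in the sublevel set

    def find(v):
        # Each root is the lowest sample of its component (see the merge below).
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    bars = []
    for v in sorted(range(n), key=lambda i: values[i]):
        parent[v] = v
        for u in (v - 1, v + 1):
            if 0 <= u < n and parent[u] is not None:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                # The component with the higher minimum dies at the current level.
                elder, younger = (ru, rv) if values[ru] <= values[rv] else (rv, ru)
                if values[younger] < values[v]:
                    bars.append((values[younger], values[v]))
                parent[younger] = elder
    bars.append((min(values), math.inf))   # the global minimum never dies
    return sorted(bars)

print(minima_barcode([2.0, 0.5, 3.0, 0.0, 1.5, 1.0, 4.0]))
# [(0.0, inf), (0.5, 3.0), (1.0, 1.5)]
print(minima_barcode([1.0, 2.0, 0.0]))
# [(0.0, inf), (1.0, 2.0)]
```

In the second call both minima sit at the endpoints of the interval, which is the one-dimensional analogue of a minimum of the restriction of the function to the boundary.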

1.2 PROOF OF THEOREM 3

Theorem. ([7], Section 2) Any $\mathbb{R}$-filtered chain complex $C_*$ over a field $k$ can be brought by a linear transformation preserving the $\mathbb{R}$-filtration to “canonical form”, a canonically defined direct sum of indecomposable $\mathbb{R}$-filtered complexes of two types:

• 1-dimensional $\mathbb{R}$-filtered complex with trivial differential, $\partial \tilde{e}_{i}^{(j)} = 0$, $\langle \tilde{e}_{i}^{(j)} \rangle = F_{\leqslant r}$, $r \in \mathbb{R}$,

• 2-dimensional $\mathbb{R}$-filtered complex with trivial homology, $\partial \tilde{e}_{i_2}^{(j + 1)} = \tilde{e}_{i_1}^{(j)}$, $\langle \tilde{e}_{i_1}^{(j)} \rangle = F_{\leqslant s_1}$, $\langle \tilde{e}_{i_1}^{(j)}, \tilde{e}_{i_2}^{(j + 1)} \rangle = F_{\leqslant s_2}$, $s_1, s_2 \in \mathbb{R}$.

The resulting canonical form is unique.

Proof. ([7], Section 2). Let $\{ e_{i}^{(n)} \}$ be a basis in the vector spaces $C_n$ compatible with the filtration, so that each subspace $F_{\leqslant r} C_n$ is the span $\langle e_{1}^{(n)}, \ldots, e_{i_r}^{(n)} \rangle$. Notice that the filtration defines a natural order on the set of basis elements.

Suppose that $\partial e_{l}^{(n)}$ has the required form for $n = j$ and $l \leqslant i$, or $n < j$ and all $l$; i.e., either $\partial e_{l}^{(n)} = 0$ or $\partial e_{l}^{(n)} = e_{m(l)}^{(n - 1)}$, where $m(l) \ne m(l')$ for $l \ne l'$.

Let

$$\partial e_{{i + 1}}^{{(j)}} = \sum\limits_k e_{k}^{{(j - 1)}}{{\alpha }_{k}}.$$

Let us move all the terms with $e_{k}^{(j - 1)} = \partial e_{q}^{(j)}$, $q \leqslant i$, from the right-hand side to the left-hand side. We get

$$\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right) = \sum\limits_k e_{k}^{{(j - 1)}}{{\beta }_{k}}.$$

If $\beta_k = 0$ for all $k$, then define

$$\tilde {e}_{{i + 1}}^{{(j)}} = e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}},$$

so that

$$\partial \tilde {e}_{{i + 1}}^{{(j)}} = 0,$$

and $\partial e_{l}^{(n)}$ has the required form for $n = j$ and $l \leqslant i + 1$, and for $n < j$ and all $l$.

Otherwise let $k_0$ be the maximal $k$ with $\beta_k \ne 0$. Then

$$\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right) = e_{{{{k}_{0}}}}^{{(j - 1)}}{{\beta }_{{{{k}_{0}}}}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}},$$

${{\beta }_{{{{k}_{0}}}}} \ne 0.$ Define

$$\tilde {e}_{{i + 1}}^{{(j)}} = \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{k(q)}}}} \right){\text{/}}{{\beta }_{{{{k}_{0}}}}},$$

$$\tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}} = e_{{{{k}_{0}}}}^{{(j - 1)}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}}{\text{/}}{{\beta }_{{{{k}_{0}}}}}.$$

Then

$$\partial \tilde {e}_{{i + 1}}^{{(j)}} = \tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}}$$

and $\partial e_{l}^{(n)}$ has the required form for $n = j$ and $l \leqslant i + 1$, or $n < j$ and all $l$. If the complex has been reduced to “canonical form” on the subcomplex $\oplus_{n \leqslant j} C_n$, then reduce $\partial e_{1}^{(j + 1)}$ similarly, and so on.

Uniqueness of the canonical form follows essentially from the uniqueness at each previous step. Let $\{ a_{i}^{(j)} \}$ and $\{ b_{i}^{(j)} = \sum\nolimits_{k \leqslant i} a_{k}^{(j)} \alpha_k \}$ be two bases of $C_*$ for two different canonical forms. Assume that the canonical forms agree for all indices $p < j$ and all $n$, and for $p = j$ and $n \leqslant i$. Suppose that $\partial a_{i + 1}^{(j)} = a_{m}^{(j - 1)}$ and $\partial b_{i + 1}^{(j)} = b_{l}^{(j - 1)}$ with $m > l$, so that $a_{m}^{(j - 1)}$ is not in the filtration subspace corresponding to $b_{l}^{(j - 1)}$.

It follows that

$$\partial \left( {\sum\limits_{k \leqslant i + 1} a_{k}^{{(j)}}{{\alpha }_{k}}} \right) = \sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}},$$

where ${{\alpha }_{{i + 1}}} \ne 0$, ${{\beta }_{l}} \ne 0$. Therefore

$$\partial a_{{i + 1}}^{{(j)}} = \sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}}{\text{/}}{{\alpha }_{{i + 1}}} - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}{\text{/}}{{\alpha }_{{i + 1}}}.$$

On the other hand $\partial a_{{i + 1}}^{{(j)}} = a_{m}^{{(j - 1)}}$, with $m > l$, and $\partial a_{k}^{{(j)}}$ for $k \leqslant i$ are either zero or some basis elements $a_{n}^{{(j - 1)}}$ different from $a_{m}^{{(j - 1)}}$. This gives a contradiction.

Similarly if $\partial b_{{i + 1}}^{{(j)}} = 0$, then

$$\partial a_{{i + 1}}^{{(j)}} = - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}{\text{/}}{{\alpha }_{{i + 1}}},$$

which again gives a contradiction by the same arguments. This proves the uniqueness of the canonical form.

Remark 2. The barcode of $\mathbb{R}$-filtered chain complex consists of segments representing the indecomposable $\mathbb{R}$-filtered chain complexes, see Definition 2 in Section 2. There is the standard lexicographic order on a set of such segments. The direct sum from the theorem statement is the standard “direct sum over a set” vector space.
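
The pairing of basis elements produced by the reduction in the proof can be explored with a short script. The sketch below is not a transcription of the change of basis in the proof: it uses the standard left-to-right column reduction from persistent homology, here over the rationals, which is expected to yield the same pairs of basis elements and the same unpaired elements. The function name `reduce_boundary`, the sparse column encoding, and the example complex (a hollow triangle with filtration values 0 to 5) are assumptions made for illustration.

```python
from fractions import Fraction

def reduce_boundary(columns):
    """Pair basis elements by left-to-right column reduction over the rationals.

    columns[j] maps row index -> coefficient of the boundary of element j,
    with all elements listed in filtration order.
    """
    reduced = [{r: Fraction(c) for r, c in col.items() if c != 0} for col in columns]
    pivot_owner = {}   # largest row index of a reduced column -> that column
    pairs = []
    for j, col in enumerate(reduced):
        while col:
            low = max(col)
            if low not in pivot_owner:
                break
            # Cancel the leading term with an earlier column that owns the same pivot.
            other = reduced[pivot_owner[low]]
            factor = col[low] / other[low]
            for r, c in other.items():
                new = col.get(r, 0) - factor * c
                if new == 0:
                    col.pop(r, None)
                else:
                    col[r] = new
        if col:
            low = max(col)
            pivot_owner[low] = j
            pairs.append((low, j))   # element `low` is killed by element j
    paired = {i for pair in pairs for i in pair}
    unpaired = [i for i in range(len(columns)) if i not in paired]
    return pairs, unpaired

# Hollow triangle: vertices a, b, c (indices 0-2), edges ab, bc, ac (indices 3-5).
boundary = [{}, {}, {}, {1: 1, 0: -1}, {2: 1, 1: -1}, {2: 1, 0: -1}]
filtration = [0, 1, 2, 3, 4, 5]
pairs, unpaired = reduce_boundary(boundary)
print([(filtration[b], filtration[d]) for b, d in pairs])   # [(1, 3), (2, 4)]
print(unpaired)                                             # [0, 5]
```

Each pair corresponds to a 2-dimensional indecomposable complex (a finite segment of the barcode), and each unpaired element to a 1-dimensional complex with trivial differential (an infinite segment).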

About this article

Cite this article

Barannikov, S.A., Korotin, A.A., Oganesyan, D.A. et al. Barcodes as Summary of Loss Function Topology. Dokl. Math. 108 (Suppl 2), S333–S347 (2023). https://doi.org/10.1134/S1064562423701570

Received: 02 September 2023
Revised: 08 September 2023
Accepted: 18 October 2023
Published: 25 March 2024
Issue Date: December 2023
DOI: https://doi.org/10.1134/S1064562423701570
