Abstract
We propose to study the loss surfaces of neural networks by methods of topological data analysis. We suggest applying barcodes of Morse complexes to explore the topology of loss surfaces. We describe an algorithm for computing the barcodes of local minima of a loss function. We have conducted experiments computing barcodes of local minima for benchmark functions and for loss surfaces of small neural networks. Our experiments confirm two principal observations about the loss surfaces of neural networks. First, the barcodes of local minima lie in a small lower part of the range of values of the loss function. Second, increasing the depth and width of a neural network lowers the barcodes of local minima. This has natural implications for the training of neural networks and for their generalization properties.
REFERENCES
H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems (2018), pp. 6389–6399.
Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Proceedings of the 27th International Conference on Neural Information Processing Systems (2014), pp. 2933–2941.
A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” JMLR Workshop Conf. Proc. 38, 192–204 (2015). https://doi.org/10.48550/arXiv.1412.0233
R. Bott, “Lectures on Morse theory, old and new,” Bull. Am. Math. Soc. 7 (2), 331–358 (1982).
S. Smale, “Differentiable dynamical systems,” Bull. Am. Math. Soc. 73 (6), 747–817 (1967).
R. Thom, “Sur une partition en cellules associée à une fonction sur une variété,” C. R. Acad. Sci. 228 (12), 973–975 (1949).
S. Barannikov, “Framed Morse complexes and its invariants,” Adv. Sov. Math. 21, 93–116 (1994). https://doi.org/10.1090/advsov/021/03
D. Le Peutrec, F. Nier, and C. Viterbo, “Precise Arrhenius law for p-forms: The Witten Laplacian and Morse–Barannikov complex,” Ann. H. Poincaré 14 (3), 567–610 (2013). https://doi.org/10.1007/s00023-012-0193-9
F. Le Roux, S. Seyfaddini, and C. Viterbo, “Barcodes and area-preserving homeomorphisms” (2018). arXiv:1810.03139
J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res. 13, 281–305 (2012).
M. K. Chung, P. Bubenik, and P. T. Kim, “Persistence diagrams of cortical surface data,” Inf. Process. Med. Imaging 5636, 386–397 (2009).
T. Sousbie, C. Pichon, and H. Kawahara, “The persistent cosmic web and its filamentary structure: II. Illustrations,” Mon. Not. R. Astron. Soc. 414 (1), 384–403 (2011). https://doi.org/10.1111/j.1365-2966.2011.18395.x
C. S. Pun, K. Xia, and S. X. Lee, “Persistent-homology-based machine learning and its applications—a survey” (2018). arXiv:1811.00252
C. Dellago, P. G. Bolhuis, and P. L. Geissler, Transition Path Sampling (Wiley, New York, 2003), pp. 1–78. https://doi.org/10.1002/0471231509.ch1
A. R. Oganov and M. Valle, “How to quantify energy landscapes of solids,” J. Chem. Phys. 130 (10), 104504 (2009). https://doi.org/10.1063/1.3079326
F. Chazal, L. Guibas, S. Oudot, and P. Skraba, “Scalar field analysis over point cloud data,” Discrete Comput. Geom. 46 (4), 743 (2011).
D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,” Discrete Comput. Geom. 37 (1), 103–120 (2007).
Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell. 42 (4), 824–836 (2020). https://doi.org/10.1109/TPAMI.2018.2889473
M. Jamil and X.-S. Yang, “A literature survey of benchmark functions for global optimization problems,” Int. J. Math. Model. Numer. Optim. 4 (2), 150–194 (2013).
A. Efrat, A. Itai, and M. J. Katz, “Geometry helps in bottleneck matching and related problems,” Algorithmica 31 (1), 1–28 (2001).
K. Kawaguchi, “Deep learning without poor local minima,” in Advances in Neural Information Processing Systems (2016), pp. 586–594.
M. Gori and A. Tesi, “On the problem of local minima in backpropagation,” IEEE Trans. Pattern Anal. Mach. Intell. 14 (1), 76–86 (1992).
J. Cao, Q. Wu, Y. Yan, L. Wang, and M. Tan, “On the flatness of loss surface for two-layered relu networks,” in Asian Conference on Machine Learning (2017), pp. 545–560.
M. Yi, Q. Meng, W. Chen, Z.-m. Ma, and T.-Y. Liu, “Positively scale-invariant flatness of ReLU neural networks” (2019). https://doi.org/10.48550/arXiv.1903.02237
P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, “Entropy-SGD: Biasing gradient descent into wide valleys,” J. Stat. Mech. 2019, 124018 (2019). https://doi.org/10.1088/1742-5468/ab39d9
L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” in Proceedings of the 34th International Conference on Machine Learning, PMLR (2017), pp. 1019–1028.
M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, “A state-of-the-art survey on deep learning theory and architectures,” Electronics 8 (3), 292 (2019).
Funding
This work was partially supported by the Russian Foundation for Basic Research, grant 21-51-12005 NNIO_a.
Ethics declarations
The authors of this work declare that they have no conflicts of interest.
APPENDIX
1.1 GRADIENT MORSE COMPLEX
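The gradient Morse complex \(({{C}_{*}},{{\partial }_{*}})\) is defined as follows. For generic \(f\) the critical points \({{p}_{\alpha }}\), \(df{{|}_{{{T}_{{{p}_{\alpha }}}}}} = 0\), are isolated. Near each critical point \({{p}_{\alpha }}\), \(f\) can be written as \(f = f({{p}_{\alpha }}) - \sum\nolimits_{l = 1}^{j} {{({{x}^{l}})}^{2}} + \sum\nolimits_{l = j + 1}^{n} {{({{x}^{l}})}^{2}}\) in some local coordinates. The index of the critical point is defined as the dimension of the set of downward pointing directions at that point, or of the negative subspace of the Hessian:
\[{\text{index}}({{p}_{\alpha }}) = j.\]
Then define
\[{{C}_{j}} = \bigoplus\limits_{{\text{index}}({{p}_{\alpha }}) = j} \left[ {{{p}_{\alpha }},{\text{or}}(T_{{{p}_{\alpha }}}^{ - })} \right],\]
where or is an orientation on the negative subspace \(T_{{{p}_{\alpha }}}^{ - }\) of the Hessian \({{\partial }^{2}}f\), \({{T}_{{{p}_{\alpha }}}} = T_{{{p}_{\alpha }}}^{ - } \oplus T_{{{p}_{\alpha }}}^{ + }\).
Let
\[\mathcal{M}({{p}_{\alpha }},{{p}_{\beta }}) = \left\{ {\gamma :\mathbb{R} \to \Theta \;\left| \;{\dot {\gamma }(t) = - ({\text{grad}}f)(\gamma (t)),\;\mathop {\lim }\limits_{t \to - \infty } \gamma (t) = {{p}_{\alpha }},\;\mathop {\lim }\limits_{t \to + \infty } \gamma (t) = {{p}_{\beta }}} \right.} \right\}{\text{/}}\mathbb{R}\]
be the set of gradient trajectories connecting critical points \({{p}_{\alpha }}\) and \({{p}_{\beta }}\), where the natural action of \(\mathbb{R}\) is by the shift \(\gamma (t) \mapsto \gamma (t + \tau )\).
If \({\text{index}}({{p}_{\beta }}) = {\text{index}}({{p}_{\alpha }}) - 1\) then generically the set \(\mathcal{M}({{p}_{\alpha }},{{p}_{\beta }})\) is finite. Let
\[\# \mathcal{M}({{p}_{\alpha }},{{p}_{\beta }})\]
denote in this case the number of trajectories between the critical points \({{p}_{\alpha }}\) and \({{p}_{\beta }}\), counted with signs that take into account a choice of orientation.
The linear operator \({{\partial }_{j}}\) is defined by
\[{{\partial }_{j}}\left[ {{{p}_{\alpha }},{\text{or}}} \right] = \sum\limits_{{\text{index}}({{p}_{\beta }}) = j - 1} \left[ {{{p}_{\beta }},{\text{or}}} \right]\# \mathcal{M}({{p}_{\alpha }},{{p}_{\beta }}).\]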
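As a minimal illustration of the differential (an example that is not taken from the source), consider the height function on the circle \({{S}^{1}}\) with a single minimum \(p\) and a single maximum \(q\). The chain groups are spanned by \(p\) in degree 0 and by \(q\) in degree 1. Two gradient trajectories descend from \(q\) to \(p\), one along each arc, and for any choice of orientation they are counted with opposite signs, so the signed count vanishes:
\[{{\partial }_{1}}[q] = (1 - 1)[p] = 0,\quad {{H}_{0}} = k,\quad {{H}_{1}} = k,\]
in agreement with the homology of the circle.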
The description of the critical points on a manifold \(\Theta \) with nonempty boundary \(\partial \Theta \) is modified slightly in the following way. A connected component of the sublevel set is also born at a local minimum of the restriction \({{\left. f \right|}_{{\partial \Theta }}}\) of f to the boundary, if \({\text{grad}}f\) points inside the manifold \(\Theta \). The merging of two connected components can also happen at a 1-saddle of \({{\left. f \right|}_{{\partial \Theta }}}\), if \({\text{grad}}f\) points inside \(\Theta \). When we speak about minima and 1-saddles, we also include such critical points of \({{\left. f \right|}_{{\partial \Theta }}}\). Similarly, the set of generators of index \(j\) chains in the Morse complex includes the index \(j\) critical points of \({{\left. f \right|}_{{\partial \Theta }}}\) with \({\text{grad}}f\) pointing inside \(\Theta \). The differential is modified accordingly to take into account trajectories involving such critical points.
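The birth and merging of connected components of sublevel sets can be sketched in code. The following is a minimal illustration, assuming the function has been discretized on a graph (a list of values and an adjacency list, both hypothetical inputs); it is a standard union-find sweep and is not claimed to be the algorithm used in the paper. The name `minima_barcode` is introduced here for illustration only.

```python
import math

def minima_barcode(values, neighbors):
    """Barcode of minima of a function sampled on a graph (assumed setup).

    values[i] is f at vertex i; neighbors[i] lists the vertices adjacent to i.
    Returns sorted (birth, death) pairs; the global minimum gets death = inf.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}   # union-find forest over the vertices swept so far
    birth = {}    # root -> lowest value in its component
    bars = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for v in order:  # sweep the sublevel sets upward
        parent[v] = v
        birth[v] = values[v]
        for u in neighbors[v]:
            if u not in parent:
                continue  # u lies above the current level
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Elder rule: the component with the higher minimum dies at v.
            young, old = (ru, rv) if birth[ru] > birth[rv] else (rv, ru)
            if birth[young] < values[v]:  # skip zero-length bars
                bars.append((birth[young], values[v]))
            parent[young] = old
    roots = {find(v) for v in parent}
    bars.extend((birth[r], math.inf) for r in roots)
    return sorted(bars)

# Path graph with three local minima at values 1, 0 and 2.
values = [3, 1, 4, 0, 5, 2, 6]
path = [[j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        for i in range(len(values))]
print(minima_barcode(values, path))  # [(0, inf), (1, 4), (2, 5)]
```

Each finite bar pairs a local minimum with the vertex playing the role of the 1-saddle at which its component merges into a component with a lower minimum.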
1.2 PROOF OF THEOREM 3
Theorem. ([7], Section 2) Any \(\mathbb{R}\)-filtered chain complex \({{C}_{*}}\) over a field \(k\) can be brought by a linear transformation preserving the \(\mathbb{R}\)-filtration to “canonical form”, a canonically defined direct sum of indecomposable \(\mathbb{R}\)-filtered complexes of two types:
• 1-dimensional \(\mathbb{R}\)-filtered complex with trivial differential, \(\partial \tilde {e}_{i}^{{(j)}} = 0\), \(\left\langle {\tilde {e}_{i}^{{(j)}}} \right\rangle = {{F}_{{ \leqslant r}}}\), \(r \in \mathbb{R}\);
• 2-dimensional \(\mathbb{R}\)-filtered complex with trivial homology, \(\partial \tilde {e}_{{{{i}_{2}}}}^{{(j + 1)}} = \tilde {e}_{{{{i}_{1}}}}^{{(j)}}\), \(\left\langle {\tilde {e}_{{{{i}_{1}}}}^{{(j)}}} \right\rangle = {{F}_{{ \leqslant {{s}_{1}}}}}\), \(\left\langle {\tilde {e}_{{{{i}_{1}}}}^{{(j)}},\tilde {e}_{{{{i}_{2}}}}^{{(j + 1)}}} \right\rangle = {{F}_{{ \leqslant {{s}_{2}}}}}\), \({{s}_{1}},{{s}_{2}} \in \mathbb{R}\).
The resulting canonical form is unique.
Proof. ([7], Section 2). Let \(\{ e_{i}^{{(n)}}\} \) be a basis in the vector spaces \({{C}_{n}}\) compatible with the filtration, so that each subspace \({{F}_{r}}{{C}_{n}}\) is the span \(\left\langle {e_{1}^{{(n)}}, \ldots ,e_{{{{i}_{r}}}}^{{(n)}}} \right\rangle \). Notice that the filtration defines the natural order on the set of basis elements.
We argue by induction. Assume that \(\partial e_{l}^{{(n)}}\) has the required form for \(n = j\) and \(l \leqslant i\), and for \(n < j\) and all l; i.e., either \(\partial e_{l}^{{(n)}} = 0\) or \(\partial e_{l}^{{(n)}} = e_{{m(l)}}^{{(n - 1)}}\), where \(m(l) \ne m(l')\) for \(l \ne l'\).
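Let
\[\partial e_{{i + 1}}^{{(j)}} = \sum\limits_k e_{k}^{{(j - 1)}}{{\alpha }_{k}}.\]
Let’s move all the terms with \(e_{k}^{{(j - 1)}} = \partial e_{q}^{{(j)}}\), \(q \leqslant i\), from the right to the left side. Writing \(k = m(q)\) for such terms, we get
\[\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{m(q)}}}} \right) = \sum\limits_k e_{k}^{{(j - 1)}}{{\beta }_{k}},\]
where the sum on the left runs over those \(q \leqslant i\) with \(\partial e_{q}^{{(j)}} \ne 0\), and \({{\beta }_{k}} = {{\alpha }_{k}}\) for the remaining terms and \({{\beta }_{k}} = 0\) otherwise.
If \({{\beta }_{k}} = 0\) for all k, then define
\[\tilde {e}_{{i + 1}}^{{(j)}} = e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{m(q)}}},\]
so that
\[\partial \tilde {e}_{{i + 1}}^{{(j)}} = 0,\]
and \(\partial e_{l}^{{(n)}}\) has the required form for \(l \leqslant i + 1\) and \(n = j\), and for \(n < j\) and all l.
Otherwise let \({{k}_{0}}\) be the maximal k with \({{\beta }_{k}} \ne 0\). Then
\[\partial \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{m(q)}}}} \right) = e_{{{{k}_{0}}}}^{{(j - 1)}}{{\beta }_{{{{k}_{0}}}}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}},\quad {{\beta }_{{{{k}_{0}}}}} \ne 0.\]
Define
\[\tilde {e}_{{i + 1}}^{{(j)}} = \left( {e_{{i + 1}}^{{(j)}} - \sum\limits_{q \leqslant i} e_{q}^{{(j)}}{{\alpha }_{{m(q)}}}} \right){\text{/}}{{\beta }_{{{{k}_{0}}}}},\quad \tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}} = e_{{{{k}_{0}}}}^{{(j - 1)}} + \sum\limits_{k < {{k}_{0}}} e_{k}^{{(j - 1)}}{{\beta }_{k}}{\text{/}}{{\beta }_{{{{k}_{0}}}}}.\]
Then
\[\partial \tilde {e}_{{i + 1}}^{{(j)}} = \tilde {e}_{{{{k}_{0}}}}^{{(j - 1)}},\]
and for n = j and \(l \leqslant i + 1\), or \(n < j\) and all l, \(\partial e_{l}^{{(n)}}\) has the required form. If the complex has been reduced to “canonical form” on the subcomplex \({{ \oplus }_{{n \leqslant j}}}{{C}_{n}}\), then reduce similarly \(\partial e_{1}^{{(j + 1)}}\) and so on.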
Uniqueness of the canonical form follows essentially from the uniqueness at each previous step. Let \(\left\{ {a_{i}^{{(j)}}} \right\}\), \(\left\{ {b_{i}^{{(j)}} = \sum\nolimits_{k \leqslant i} a_{k}^{{(j)}}{{\alpha }_{k}}} \right\}\) be two bases of \({{C}_{*}}\) for two different canonical forms. Assume that the canonical forms agree for all indices \(p < j\) and all \(n\), and for \(p = j\) and \(n \leqslant i\). Let \(\partial a_{{i + 1}}^{{(j)}} = a_{m}^{{(j - 1)}}\) and \(\partial b_{{i + 1}}^{{(j)}} = b_{l}^{{(j - 1)}}\) with \(m > l\), so that \(a_{m}^{{(j - 1)}}\) is not in the filtration subspace corresponding to \(b_{l}^{{(j - 1)}}\).
It follows that
\[\partial \left( {\sum\limits_{k \leqslant i + 1} a_{k}^{{(j)}}{{\alpha }_{k}}} \right) = \sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}},\]
where \({{\alpha }_{{i + 1}}} \ne 0\), \({{\beta }_{l}} \ne 0\). Therefore
\[\partial a_{{i + 1}}^{{(j)}} = \left( {\sum\limits_{n \leqslant l} a_{n}^{{(j - 1)}}{{\beta }_{n}} - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}} \right){\text{/}}{{\alpha }_{{i + 1}}}.\]
On the other hand \(\partial a_{{i + 1}}^{{(j)}} = a_{m}^{{(j - 1)}}\), with \(m > l\), and \(\partial a_{k}^{{(j)}}\) for \(k \leqslant i\) are either zero or some basis elements \(a_{n}^{{(j - 1)}}\) different from \(a_{m}^{{(j - 1)}}\). This gives a contradiction.
Similarly if \(\partial b_{{i + 1}}^{{(j)}} = 0\), then
\[\partial a_{{i + 1}}^{{(j)}} = - \sum\limits_{k \leqslant i} \partial a_{k}^{{(j)}}{{\alpha }_{k}}{\text{/}}{{\alpha }_{{i + 1}}},\]
which again gives a contradiction by the same arguments. This proves the uniqueness of the canonical form.
Remark 2. The barcode of an \(\mathbb{R}\)-filtered chain complex consists of segments representing the indecomposable \(\mathbb{R}\)-filtered chain complexes, see Definition 2 in Section 2. There is a standard lexicographic order on the set of such segments. The direct sum from the theorem statement is the standard “direct sum over a set” vector space.
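The reduction in the proof can be illustrated in code. The sketch below makes two simplifying assumptions that are not in the source: coefficients are taken in \(\mathbb{Z}/2\) rather than in a general field \(k\), and the standard persistence column reduction is used, which changes only the basis of the columns instead of following the proof step by step. It should produce the same pairing of generators as the canonical form. The name `canonical_pairs` and the input encoding are hypothetical.

```python
def canonical_pairs(boundary):
    """Pair the generators of a filtered chain complex over Z/2 (assumed).

    boundary[i] is the set of indices of the generators that appear in the
    boundary of generator i. Generators are indexed in filtration order, so
    every index in boundary[i] is smaller than i.
    Returns (pairs, unpaired): each pair (b, d) is a 2-dimensional summand
    in which generator d kills generator b; each unpaired generator spans a
    1-dimensional summand with trivial differential.
    """
    reduced = []  # reduced[i]: column i after reduction, as a set
    owner = {}    # largest index of a reduced column -> that column's index
    for i, col in enumerate(boundary):
        col = set(col)  # copy, so the input is not modified
        # Cancel the largest term while an earlier column has the same one.
        while col and max(col) in owner:
            col ^= reduced[owner[max(col)]]  # addition over Z/2
        reduced.append(col)
        if col:
            owner[max(col)] = i
    pairs = sorted(owner.items())
    paired = set(owner) | set(owner.values())
    unpaired = [i for i in range(len(boundary)) if i not in paired]
    return pairs, unpaired

# Filled triangle: vertices 0, 1, 2; edges 3 = (0,1), 4 = (1,2), 5 = (0,2);
# the 2-cell 6 is bounded by the three edges.
triangle = [set(), set(), set(), {0, 1}, {1, 2}, {0, 2}, {3, 4, 5}]
print(canonical_pairs(triangle))  # ([(1, 3), (2, 4), (5, 6)], [0])
```

Replacing each index by its filtration value turns a pair (b, d) into a finite segment of the barcode and an unpaired generator into an infinite one.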
Cite this article
Barannikov, S.A., Korotin, A.A., Oganesyan, D.A. et al. Barcodes as Summary of Loss Function Topology. Dokl. Math. 108 (Suppl 2), S333–S347 (2023). https://doi.org/10.1134/S1064562423701570
DOI: https://doi.org/10.1134/S1064562423701570