A topological data analysis based classifier

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

Topological Data Analysis (TDA) is an emerging field that aims to discover a dataset’s underlying topological information. TDA tools have commonly been used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes a different TDA pipeline that classifies balanced and imbalanced multi-class datasets without additional ML methods and without a data-resampling preprocessing stage. The proposed TDA-based classifier (TDABC) builds a filtered simplicial complex on the dataset that represents high-order data relationships. Following the assumption that a meaningful sub-complex exists in the filtration that approximates the data topology, we apply Persistent Homology (PH) to guide the selection of that sub-complex based on the detected topological features. We use each unlabeled point’s link and star operators to provide multi-dimensional neighborhoods of varying size through which labels are propagated from labeled to unlabeled points. The labeling function depends on the entire history of the filtration and is encoded in the persistence diagrams at various dimensions. We select eight datasets with different dimensions, degrees of class overlap, and imbalanced numbers of samples per class to validate our method. The TDABC outperforms all baseline methods on multi-class imbalanced data with high imbalance ratios and on data with overlapping classes. On average, the proposed method also outperforms K Nearest Neighbors (KNN) and weighted KNN, and is competitive with Support Vector Machine and Random Forest baseline classifiers on balanced datasets.

Notes

  1. By intervals intercepted at level value \(\varepsilon _i\), we mean the homology of \(f^{-1}([0, \varepsilon _i])\), i.e., the j-cycles alive at time \(\varepsilon _i\). See Fig. 5 for how the selected purple and yellow intervals “intercept” the barcodes.

  2. See Appendix 3 for details about data structures for computing \(U_x\) efficiently.

  3. See Sect. 5.2 for details.

References

  • Adams H, Emerson T, Kirby M et al (2017) Persistence images: a stable vector representation of persistent homology. J Mach Learn Res 18(8):1–35

  • Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the 8th international conference on database theory. Springer, Berlin, Heidelberg, ICDT ’01, pp 420–434

  • Ali D, Asaad A, Jimenez MJ et al (2022) A survey of vectorization methods in topological data analysis. https://doi.org/10.48550/ARXIV.2212.09703

  • Anai H, Chazal F, Glisse M et al (2020) Dtm-based filtrations. In: Baas NA, Carlsson GE, Quick G et al (eds) Topological data analysis. Springer, Cham, pp 33–66

  • Arafat NA, Basu D, Bressan S (2019) Topological data analysis with \(\epsilon\)-net induced lazy witness complex. In: Hartmann S, Küng J, Chakravarthy S et al (eds) Database and expert systems applications. Springer, Cham, pp 376–392

  • Asniar MNU, Surendro K (2022) Smote-lof for noise identification in imbalanced data classification. J King Saud Univ Comput Inf Sci 34(6, Part B):3413–3423. https://doi.org/10.1016/j.jksuci.2021.01.014

  • Atienza N, Gonzalez-Díaz R, Soriano-Trigueros M (2020) On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recogn 107:107509. https://doi.org/10.1016/j.patcog.2020.107509

  • Attali D, Lieutier A, Salinas D (2011) Efficient data structure for representing and simplifying simplicial complexes in high dimensions. In: Proceedings of the twenty-seventh annual symposium on computational geometry. Association for Computing Machinery, New York, SoCG ’11, pp 501–509. https://doi.org/10.1145/1998196.1998277

  • Baudry JP, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput. https://doi.org/10.1007/s11222-011-9236-1

  • Bauer U (2021) Ripser: efficient computation of Vietoris–Rips persistence barcodes. J Appl Comput Topol. https://doi.org/10.1007/s41468-021-00071-5

  • Bishnoi S, Hooda BK (2020) A survey of distance measures for mixed variables. Int J Chem Stud 8:338–343. https://doi.org/10.22271/chemi.2020.v8.i4f.10087

  • Boissonnat J, Karthik CS (2018) An efficient representation for filtrations of simplicial complexes. ACM Trans Algorithms 14(4):44:1-44:21

  • Boissonnat J, Maria C (2014) The simplex tree: an efficient data structure for general simplicial complexes. Algorithmica 70(3):406–427

  • Boissonnat JD, Pritam S (2020) Edge collapse and persistence of flag complexes. In: Cabello S, Chen DZ (eds) 36th International symposium on computational geometry (SoCG 2020), Leibniz international proceedings in informatics (LIPIcs), vol 164. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, pp 19:1–19:15. https://doi.org/10.4230/LIPIcs.SoCG.2020.19

  • Boissonnat J, Karthik CS, Tavenas S (2017) Building efficient and compact data structures for simplicial complexes. Algorithmica 79(2):530–567

  • Broder AZ, Kirsch A, Kumar R et al (2010) The hiring problem and lake Wobegon strategies. SIAM J Comput 39(4):1233–1255. https://doi.org/10.1137/07070629X

  • Bubenik P, Dłotko P (2017) A persistence landscapes toolbox for topological statistics. J Symb Comput 78:91–114. https://doi.org/10.1016/j.jsc.2016.03.009

  • Caillerie C, Michel B (2011) Model selection for simplicial approximation. Found Comput Math 11(6):707–731

  • Carlsson G, Gabrielsson RB (2020) Topological approaches to deep learning. In: Topological data analysis. Springer, pp 119–146

  • Carrière M, Cuturi M, Oudot S (2017) Sliced wasserstein kernel for persistence diagrams. In: Proceedings of the 34th international conference on machine learning, vol 70. JMLR.org, ICML’17, pp 664–673

  • Carriere M, Chazal F, Ike Y, et al (2020) Perslay: a neural network layer for persistence diagrams and new graph topological signatures. In: Chiappa S, Calandra R (eds) Proceedings of the twenty third international conference on artificial intelligence and statistics, proceedings of machine learning research, vol 108. PMLR, pp 2786–2796

  • Chawla N, Bowyer K, Hall L et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) 16:321–357. https://doi.org/10.1613/jair.953

  • Chen Y (2015) The distance-decay function of geographical gravity model: Power law or exponential law? Chaos, Solitons Fractals 77:174–189. https://doi.org/10.1016/j.chaos.2015.05.022

  • Chung YM, Lawson A (2022) Persistence curves: A canonical framework for summarizing persistence diagrams. Adv Comput Math 48(1):6. https://doi.org/10.1007/s10444-021-09893-4

  • Curry J, Mukherjee S, Turner K (2018) How many directions determine a shape and other sufficiency results for two topological transforms. arXiv: Algebraic Topology

  • de Silva V, Morozov D, Vejdemo-Johansson M (2011) Persistent cohomology and circular coordinates. Discrete Comput Geom 45(4):737–759

  • de Silva V, Carlsson G (2004) Topological estimation using witness complexes. In: Gross M, Pfister H, Alexa M, et al (eds) SPBG’04 symposium on point-based graphics 2004. The Eurographics Association. https://doi.org/10.2312/SPBG/SPBG04/157-166

  • Dey TK, Fan F, Wang Y (2014) Computing topological persistence for simplicial maps. SOCG’14, Association for Computing Machinery, New York

  • Deza MM, Deza E (2013) Generalizations of metric spaces. Springer, Berlin, Heidelberg, pp 67–78. https://doi.org/10.1007/978-3-642-30958-8_3

  • Dietterich TG (2000) Ensemble methods in machine learning. Multiple classifier systems. Springer, Berlin, Heidelberg, pp 1–15

  • Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern 9(10):617–621. https://doi.org/10.1109/TSMC.1979.4310090

  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Edelsbrunner H, Harer J (2010) Computational Topology—an Introduction. American Mathematical Society, Michigan. https://doi.org/10.1007/978-3-540-33259-6_7

  • Edelsbrunner H, Letscher D, Zomorodian A (2002) Topological persistence and simplification. Discrete Comput Geom 28(4):511–533. https://doi.org/10.1007/s00454-002-2885-2

  • Fernández A, García S, Galar M et al (2018) Foundations on imbalanced classification. Springer, Cham, pp 19–46. https://doi.org/10.1007/978-3-319-98074-4_2

  • Francois D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans on Knowl and Data Eng 19(7):873–886. https://doi.org/10.1109/TKDE.2007.1037

  • Freeman PR (1983) The secretary problem and its extensions: a review. Int Stat Rev Revue Internationale de Statistique 51(2):189–206

  • Gabrielsson RB, Nelson BJ, Dwaraknath A, et al (2020) A topology layer for machine learning. In: PMLR, pp 1553–1563

  • Garside K, Henderson R, Makarenko I et al (2019) Topological data analysis of high resolution diabetic retinopathy images. PLoS ONE 14(5):e0217413. https://doi.org/10.1371/journal.pone.0217413

  • Ghrist R (2008) Barcodes: the persistent topology of data. Bull (New Series) Am Math Soc 45:61–75

  • Goyal A, Rathore L, Kumar S (2021) A survey on solution of imbalanced data classification problem using smote and extreme learning machine. In: Sharma H, Gupta MK, Tomar GS et al (eds) Communication and intelligent systems. Springer, Singapore, pp 31–44

  • Harris CR, Millman KJ, van der Walt SJ et al (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2

  • Hatcher A (2002) Algebraic Topology. Cambridge University Press, Cambridge

  • Hensel F, Moor M, Rieck B (2021) A survey of topological machine learning methods. Front Artif Intell 4:123. https://doi.org/10.3389/frai.2021.681108

  • Hofer C, Kwitt R, Niethammer M, et al (2017) Deep learning with topological signatures. In: Proceedings of the 31st international conference on neural information processing systems. Curran Associates Inc., Red Hook, NIPS’17, pp 1633–1643

  • Hunter JD (2007) Matplotlib: a 2d graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55

  • Ibrahim H, Anwar SA (2021) Classification of imbalanced data using support vector machine and rough set theory: a review. J Phys Conf Ser 1878(1):012054. https://doi.org/10.1088/1742-6596/1878/1/012054

  • Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, New York

  • Ji Z, Wang CL (2022) Efficient exact k-nearest neighbor graph construction for billion-scale datasets using gpus with tensor cores. In: Proceedings of the 36th ACM international conference on supercomputing. Association for Computing Machinery, New York, ICS ’22. https://doi.org/10.1145/3524059.3532368

  • Jiang G, Wang W (2017) Error estimation based on variance analysis of k-fold cross-validation. Pattern Recogn 69:94–106

  • Kindelan R, Frías J, Cerda M, et al (2021) Classification based on topological data analysis. arXiv:2102.03709

  • Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: van Someren M, Widmer G (eds) Machine learning: ECML-97. Springer, Berlin, Heidelberg, pp 146–153

  • Lam L, Suen CY (1997) Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans Syst Man Cybern Part A 27:553–568

  • Luo H, Patania A, Kim J et al (2021) Generalized penalty for circular coordinate representation. Found Data Sci 3(4):729–767

  • Majumdar S, Laha AK (2020) Clustering and classification of time series using topological data analysis with applications to finance. Expert Syst Appl 162:113868. https://doi.org/10.1016/j.eswa.2020.113868

  • Maria C, Boissonnat J, Glisse M et al (2014) The gudhi library: simplicial complexes and persistent homology. In: Hong H, Yap C (eds) Mathematical software-ICMS 2014. Springer, Berlin, Heidelberg

  • McInnes L, Healy J, Melville J (2020) Umap: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426

  • Mitchell TM (1997) Machine learning, international edition. McGraw-Hill Series in Computer Science, McGraw-Hill

  • Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511814075

  • Navarro G (2002) Searching in metric spaces by spatial approximation. VLDB J

  • Pedregosa F, et al (2012) Scikit-learn: machine learning in python. J Mach Learn Res 12

  • Pérez JB, Hauke S, Lupo U, et al (2021) giotto-ph: a python library for high-performance computation of persistent homology of Vietoris–Rips filtrations. arXiv:2107.05412

  • Rabadan R, Blumberg AJ (2019) Topological data analysis for genomics and evolution: topology in biology. Cambridge University Press, Cambridge. https://doi.org/10.1017/9781316671665

  • Ren S, Wu C, Wu J (2021) Computational tools in weighted persistent homology. Chin Ann Math Ser B 42(2):237–258. https://doi.org/10.1007/s11401-021-0255-8

  • Rouvreau V (2022) Cython interface. In: GUDHI user and reference manual, 3.6.0 edn. GUDHI Editorial Board. https://gudhi.inria.fr/python/3.6.0/

  • Saadat-Yazdi A, Andreeva R, Sarkar R (2021) Topological detection of Alzheimer’s disease using Betti curves. In: Reyes M, Henriques Abreu P, Cardoso J et al (eds) Interpretability of machine intelligence in medical image computing, and topological data analysis and its applications for medical data. Springer, Cham, pp 119–128

  • Samet H (2006) Foundations of multidimensional and metric data structures. Morgan Kaufman, San Francisco

  • Seversky LM, Davis S, Berger M (2016) On time-series topological data analysis: new data and opportunities. In: CVPRW, pp 1014–1022. https://doi.org/10.1109/CVPRW.2016.131

  • Shepard D (1968) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference. Association for Computing Machinery, New York, ACM ’68, pp 517–524. https://doi.org/10.1145/800186.810616

  • The HDF Group (1997–2022) Hierarchical data format, version 5. https://www.hdfgroup.org/HDF5/

  • Umeda Y (2017) Time series classification via topological data analysis. Trans Jpn Soc Artif Intell 32:D–G72_1. https://doi.org/10.1527/tjsai.D-G72

  • Venkataraman V, Ramamurthy K, Turaga P (2016) Persistent homology of attractors for action recognition. In: 2016 IEEE international conference on image processing, ICIP 2016-proceedings. IEEE Computer Society, pp 4150–4154. https://doi.org/10.1109/ICIP.2016.7533141

  • Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70. https://doi.org/10.1016/j.ins.2019.08.062

  • Wagner H, Dłotko P (2014) Towards topological analysis of high-dimensional feature spaces. Comput Vis Image Underst 121:21–26. https://doi.org/10.1016/j.cviu.2014.01.005

  • Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Int Res 6(1):1–34

  • Yershov DS, LaValle SM (2011) Simplicial dijkstra and a* algorithms for optimal feedback planning. In: 2011 IEEE/RSJ international conference on intelligent robots and systems, pp 3862–3867. https://doi.org/10.1109/IROS.2011.6095032

  • Zhang S, Xiao M, Wang H (2020) Gpu-accelerated computation of vietoris-rips persistence barcodes. arXiv:2003.07989

  • Zhang X, Li Y, Kotagiri R et al (2017) Krnn: k rare-class nearest neighbour classification. Pattern Recognit 62:33–44. https://doi.org/10.1016/j.patcog.2016.08.023

Acknowledgements

This research work was supported by the National Agency for Research and Development of Chile (ANID), with grants ANID 2018/BECA DOCTORADO NACIONAL-21181978, FONDECYT 1211484, 1221696, ICN09 015, and PIA ACT192015. A CONACYT (Mexico) postdoctoral fellowship also supported this work. The first author would like to thank Professor José Carlos Gómez-Larrañaga from CIMAT, Mexico, for insightful discussions.

Author information

Corresponding author

Correspondence to Rolando Kindelan.

Ethics declarations

Conflict of interest

The authors are unaware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Implementation details

This section provides details about the implementation of our TDA-based classifier (TDABC). The TDABC was implemented on top of the Gudhi library (Maria et al. 2014), the Giotto library (Pérez et al. 2021), and Ripser (Bauer 2021; Zhang et al. 2020) for the computational topology components, such as simplicial complexes and Persistent Homology (PH). We use scikit-learn (Pedregosa et al. 2012) for the machine learning algorithms, including the baseline classifiers and the PCA and t-SNE dimensionality reduction methods; NumPy (Harris et al. 2020) for multi-dimensional array manipulation; UMAP-learn for UMAP-based dimensionality reduction (McInnes et al. 2020); HDF5 (The HDF Group 1997-2022) to handle large data in primary and secondary memory; and Matplotlib (Hunter 2007) for visualization. The source code of our proposed TDABC is available at https://github.com/rolan2kn/TDABC-4-ADAC. The following sections cover different aspects of the TDABC implementation.

1.1 Build simplicial complexes

1.1.1 Maximal edge length

A p-cycle can be born at any time and live unaltered up to the maximal edge length, at which point it dies or is divided into two topological features. By controlling the maximal edge length, we control the length of the topological features and the size of the simplicial complex, which is combinatorial in the number of points and the simplex dimension. Thus, we recommend using the mean distance of the distance matrix as the maximal edge length. Consequently, noise points only affect cycles with a diameter of twice the mean distance, making the filtration more robust.
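As an illustration, the following sketch builds a Vietoris–Rips filtration whose maximal edge length is the mean pairwise distance. It assumes Gudhi's RipsComplex and SciPy's pdist; the function and parameter names (rips_with_mean_edge_length, max_dim) are ours, not the paper's.

```python
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

def rips_with_mean_edge_length(points, max_dim=2):
    """Sketch: Vietoris-Rips filtration truncated at the mean pairwise distance,
    the maximal-edge-length heuristic recommended above."""
    dm = squareform(pdist(points))                        # full distance matrix
    max_edge = dm[np.triu_indices_from(dm, k=1)].mean()   # mean pairwise distance
    rips = gudhi.RipsComplex(distance_matrix=dm, max_edge_length=max_edge)
    return rips.create_simplex_tree(max_dimension=max_dim)
```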

1.1.2 Edge collapse

Edge collapse in Gudhi must be performed on the 1-skeleton of the simplicial complex, which is then expanded to build all higher-dimensional simplices up to a maximal dimension \(q \ll |P|\). Algorithm 4 computes a simplex tree using the edge-collapse method (Boissonnat and Pritam 2020). A collapsing coefficient is defined to depend on the maximal dimension q; however, the procedure could be enhanced by repeatedly calling the \(collapse\_edges\) method until the simplex tree no longer changes. We recommend applying a collapsing factor (obtained experimentally) computed as a function of the point cloud’s ambient dimension and the simplicial complex’s maximal dimension. Edge collapse in Gudhi and Giotto supports only flag complexes such as the Vietoris–Rips complex. Our method can be used with any simplicial complex with minimal variations, but each type of complex has its own intricacies when optimizing time and space consumption.

Algorithm 4: Building a simplicial complex with a distance matrix and edge collapse (listing omitted)
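A minimal sketch of this pipeline, assuming Gudhi's SimplexTree.collapse_edges and expansion methods; the experimentally obtained collapsing factor is reduced here to a plain parameter, n_collapse_rounds, which is our name rather than the paper's.

```python
import gudhi

def collapsed_rips_tree(distance_matrix, max_edge_length, max_dim, n_collapse_rounds=1):
    """Sketch: build the 1-skeleton only, collapse edges while preserving the
    persistent homology of the flag complex, then expand up to max_dim."""
    rips = gudhi.RipsComplex(distance_matrix=distance_matrix,
                             max_edge_length=max_edge_length)
    st = rips.create_simplex_tree(max_dimension=1)        # 1-skeleton only
    st.collapse_edges(nb_iterations=n_collapse_rounds)    # edge collapse (Boissonnat and Pritam 2020)
    st.expansion(max_dim)                                 # rebuild higher simplices up to max_dim
    return st
```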

1.2 Computing link and filtration values

The association function \(\Psi _{i}\) from Definition 5 depends on the \(Lk_{\mathcal {K}}\) operation. However, as of version 3.6.0, the Python interface of the Gudhi library (Rouvreau 2022) does not implement the simplex link operation. Nevertheless, it can be derived from the star and coface operators according to Definition 2.

In Gudhi, each q-simplex \(\sigma \in \mathcal {K}\) is stored with its filtration value \(f(\sigma )\). Thus, \(star(\mathcal {S}_\mathcal {K}, \sigma )\) is a function on \(\mathcal {S}_\mathcal {K}\) that returns the set of 2-tuples

$$\begin{aligned} \{(\mu , f(\mu )) \mid \mu \in St_\mathcal {K}(\sigma ) \}. \end{aligned}$$

This data structure makes it easy to recover the filtration values required to implement Eq. 10, Algorithm 2, and Algorithm 3.
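For illustration, a minimal sketch of deriving the link from the star, assuming Gudhi's SimplexTree.get_star and SimplexTree.filtration; the helper name link_with_filtrations is ours.

```python
def link_with_filtrations(st, sigma):
    """Sketch: link of `sigma` derived from its star (Definition 2). For every
    coface mu of sigma, the face mu \\ sigma is disjoint from sigma and belongs
    to the link; its own filtration value is then queried from the simplex tree.
    Here `st` is a gudhi.SimplexTree and `sigma` a list of vertex ids."""
    sigma = set(sigma)
    link = {}
    for mu, _ in st.get_star(list(sigma)):    # (simplex, filtration) pairs
        tau = tuple(sorted(set(mu) - sigma))
        if tau:                               # skip sigma itself (empty difference)
            link[tau] = st.filtration(list(tau))
    return link

# Example: lk = link_with_filtrations(st, [0])  # link of the vertex {0}
```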

1.3 Finding neighborhoods of external points

Algorithm 2 has a lot of room for optimization. The search for the points closest to x can be drastically accelerated by applying computational geometry algorithms and spatial/metric data structures. For example, a pivot point could be selected from P and a Vantage Point Tree (VP-tree), Ball tree, or M-tree, among others, could be built to index all elements by their distance (or proximity) to the pivot (Samet 2006). Other metric-space search structures, such as the Spatial Approximation Tree (sa-tree) (Navarro 2002), could be applied by using the same complex as support instead of a Delaunay triangulation; this could be more efficient than other space-partitioning approaches for neighborhood queries on datasets with more than 20 dimensions. These data structures help to find points likely to share a simplex with x, reducing the time complexity to \(O(|P| \log { |P|})\) in the worst case.
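As a sketch of one such index, the snippet below uses a Ball tree from scikit-learn (already a dependency of our implementation) to retrieve the points inside a (2ε)-ball around an external point, mirroring the neighborhood used by Algorithm 2. The function name and parameters are ours, for illustration only.

```python
import numpy as np
from sklearn.neighbors import BallTree

def candidate_neighbors(P, x, eps):
    """Sketch: indices of the points of P inside the (2*eps)-ball around the
    external point x, i.e. points likely to share a simplex with x.
    In practice the tree would be built once and reused for all queries."""
    tree = BallTree(P)                                    # metric index over P
    return tree.query_radius(np.asarray(x).reshape(1, -1), r=2 * eps)[0]
```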

1.4 Permutation invariance

Let \(h(\cdot , \cdot )\) be a proximity function, and let \(X_l\) (labeled) and \(X_u\) (unlabeled) be point sets, both subsets of P. Our TDABC is invariant to permutations of \(X_l\) and of \(X_u\), building the same type of filtered simplicial complex every time. Let \(\mathcal {K}({M_P})\) be a filtered simplicial complex constructed using a distance matrix \(M_P\) over a point set P. For any permutation \(\pi (P)\) we obtain a distance matrix \(M_{\pi (P)}\) and a filtered simplicial complex \(\mathcal {K}({M_{\pi (P)}})\). The matrices \(M_P\) and \(M_{\pi (P)}\) are equivalent: since they were constructed with the same \(h(\cdot , \cdot )\) over the same point set P, one can be turned into the other by elementary row and column transpositions. Hence, \(\mathcal {K}({M_{\pi (P)}})\) and \(\mathcal {K}({M_{P}})\) are the same complex up to a relabeling of vertices, because we can always use a simplicial map that applies the inverse permutation \({\pi ^{-1}(P)}\) to each element of \(\pi (P)\) to obtain the corresponding element in P; this is equivalent to performing the row and column transpositions that rename the simplices of \(\mathcal {K}({M_{\pi (P)}})\) to obtain \(\mathcal {K}({M_{P}})\). The case of permutations of the unlabeled points \(X_u\) is straightforward, since unlabeled points do not contribute to labeling other unlabeled points. Therefore, no matter which permutation is applied, TDABC assigns the same label to the same point.
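The argument can also be checked empirically. The following sketch (our code, assuming Gudhi's RipsComplex) verifies that the persistence diagrams obtained from P and from a random permutation of P coincide.

```python
import numpy as np
import gudhi

def diagrams_match_under_permutation(points, max_edge, max_dim=2, seed=0):
    """Sketch: the persistence diagram is unchanged when the rows of the point
    cloud (hence the rows/columns of its distance matrix) are permuted."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points)
    pi = rng.permutation(len(pts))

    def diagram(p):
        rips = gudhi.RipsComplex(points=p, max_edge_length=max_edge)
        st = rips.create_simplex_tree(max_dimension=max_dim)
        return sorted(st.persistence())

    return diagram(pts) == diagram(pts[pi])
```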

Although the results of the TDABC remain consistent across different permutations, there is room for improvement in terms of time execution. Certain permutations result in faster computations compared to others, contributing to the observed difference in performance. One of the reasons for this discrepancy is the locality property, where elements with small distances between them are clustered together in the element collection. In such cases, finding similar elements requires less time as they are located in close proximity. On the other hand, when similar elements are scattered in arbitrary positions, their retrieval becomes more time-consuming, leading to the wastage of computational resources. To address this issue, we propose using space-filling curves and related data structures (refer to Sect. 3), such as Z-order and Hilbert-order. These techniques, similar to those employed in Gudhi (Maria et al. 2014), have been shown to accelerate the construction of complexes and streamline simplicial queries.
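A minimal sketch of the Z-order idea (our illustration, not Gudhi's internal implementation): quantize coordinates to a grid, interleave their bits into a Morton code, and reorder the points so that nearby points are stored close together.

```python
import numpy as np

def morton_order(points, bits=10):
    """Sketch: permutation that sorts points along a Z-order (Morton) curve."""
    pts = np.asarray(points, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    grid = ((pts - mins) / span * (2**bits - 1)).astype(np.uint64)  # quantize per axis

    def interleave(coords):
        code = 0
        for b in range(bits):                    # interleave bit b of every axis
            for d, c in enumerate(coords):
                code |= ((int(c) >> b) & 1) << (b * len(coords) + d)
        return code

    return np.argsort([interleave(row) for row in grid])

# P = P[morton_order(P)]  # reorder the point cloud before building the complex
```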

1.5 Topological information

We complement Sect. 4.3 by presenting more information regarding the selected filtration value on the Swissroll and Sphere Datasets. See Figs. 10, 11 and Table 4.

1.5.1 Swissroll topological information

See Fig. 10.

Fig. 10: Results of applying the selection function on the Swissroll Dataset: confusion matrices (first column), barcodes with the chosen interval (second column), and the chosen sub-complex \(\mathcal {K}_i\) (third column); each 0-simplex is colored according to its label, with unlabeled points in black

1.5.2 Sphere topological information

See Fig. 11.

Fig. 11: Results of applying the selection function on the Sphere Dataset: confusion matrices (first column), barcodes with the chosen interval (second column), and the chosen sub-complex \(\mathcal {K}_i\) (third column); each 0-simplex is colored according to its label, with unlabeled points in black

1.6 Extended results

We also conducted experiments increasing the number of unlabeled points, running TDABC with three cross-validation configurations whose fold sizes are NORMAL (10%), EXTREME (60%), and HYPER EXTREME (90%). Table 4 reports the F1 metric on the Swissroll and Sphere Datasets.

Table 4 Results varying the number of unlabeled points

1.7 Complexity analysis

We use the simplex tree from Gudhi as the data structure in our paper because of its capability to represent any type of simplicial complex. It is worth mentioning that Gudhi offers various other data structures, specifically designed for maximal simplices (simplices without cofaces) (Boissonnat et al. 2017; Boissonnat and Karthik 2018). However, for our purposes, we focus on the simplex tree due to its versatility and its ability to handle a broad range of complex types.

Fig. 12: A simplicial complex on ten vertices and its simplex tree. The deepest node represents the tetrahedron of the complex. Every label position at a given depth is linked in a list, as illustrated for label 5. Picture and caption taken from Boissonnat and Maria (2014)

The simplex tree data structure was presented by Clément Maria and Jean-Daniel Boissonnat (Boissonnat and Maria 2014). Let \(\mathcal {S}_\mathcal {K}\) be a simplex tree representation of a simplicial complex \(\mathcal {K}\), and let \(\sigma \in \mathcal {K}\) be a q-simplex. The operations \(insert(\mathcal {S}_\mathcal {K}, \sigma )\) and \(search(\mathcal {S}_\mathcal {K}, \sigma )\) have time complexity \(O(|\sigma | \cdot \log |P|)\) when red-black trees are used to represent sibling nodes (nodes sharing the same parent node, see Fig. 12). When hashing is used to represent sibling nodes, the time complexity is reduced to \(O(|\sigma |)\). Inserting a q-simplex \(\sigma\) together with all its faces has complexity \(O(2^{|\sigma |} \cdot |\sigma | \cdot \log |P|)\). See Boissonnat and Maria (2014) for a more detailed explanation.
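For reference, a small sketch of these operations through Gudhi's Python interface (the simplex and filtration values are ours, chosen only for illustration):

```python
import gudhi

st = gudhi.SimplexTree()
st.insert([0, 1, 2], filtration=0.5)   # inserts the triangle {0,1,2} and all its faces
print(st.find([0, 1]))                 # True: the edge {0,1} is now present
print(st.filtration([0, 1, 2]))        # 0.5: stored filtration value of the triangle
```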

Let \(\tau = \{\tau _0, \tau _1, \ldots , \tau _j\}\) be a j-simplex with \(\tau \in \mathcal {K}\). Computing \(St_\mathcal {K}(\tau )\) in a simplex tree \(\mathcal {S}_\mathcal {K}\) is performed by the operation \(Locate\_cofaces(\mathcal {S}_\mathcal {K}, \tau )\) (Boissonnat and Maria 2014). To locate all cofaces of \(\tau\) in \(\mathcal {S}_\mathcal {K}\), one finds all occurrences of \(\tau _j\) in nodes whose depth is greater than j and navigates upwards in \(\mathcal {S}_\mathcal {K}\) looking for the remaining elements of \(\tau\); the paths in which \(\tau\) is completely found contain the cofaces of \(\tau\). Traversing a path in a simplex tree has a worst-case time complexity of \(O(q+1)\) with \(q = dim(\mathcal {K})\). Let \(O(\mathcal {T}_{\tau _j}^{>j})\) be the time complexity of locating all nodes at depth greater than j that contain \(\tau _j\). Accordingly, the worst-case time complexity of \(St_\mathcal {K}(\tau )\) is \(O((q+1)\cdot \mathcal {T}_{\tau _j}^{>j})\). In Gudhi, every path from the root to a leaf defines a maximal simplex.
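In Gudhi's Python interface, the star and coface queries discussed here are exposed as follows (a small sketch continuing the example above; the edge {0,1} is just an illustrative argument):

```python
star = st.get_star([0, 1])                         # cofaces of the edge {0,1} with filtration values
triangles = st.get_cofaces([0, 1], codimension=1)  # cofaces exactly one dimension higher
```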

A few algorithms are not detailed in this section, such as Algorithm 1 for labeling a test point set; they can be implemented following the explanations above. The computation of PH is done using the method provided by Gudhi, and Sect. 3.3 defines the selection of the persistence interval once PH has been obtained. Algorithm 4 builds a simplicial complex from a distance matrix using edge collapse. Computing the distance matrix takes \(O(d\cdot |P|^2)\) time with the brute-force method, but it can be reduced to at most \(O(|P|)\) on a massively parallel platform (Ji and Wang 2022). The edge-collapse method runs in time \(O(n\cdot n_c \cdot k^2)\), where n and \(n_c\) are the numbers of edges of the input and output graphs and k is the maximal degree of a vertex.
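In Gudhi, computing PH amounts to a single call on the simplex tree; a short sketch continuing the examples above:

```python
st.compute_persistence()                                     # PH of the filtered complex
intervals_dim1 = st.persistence_intervals_in_dimension(1)    # 1-dimensional persistence intervals
```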

Algorithm 1 includes computing PH, which has a worst-case time complexity of \(O(|\mathcal {K}|^3)\) but behaves near-linearly in practice. The selection function is linear in the number of persistence intervals, \(O(|D|)\). The inverse level-set function \(f^{-1}(\cdot )\) has a worst-case time complexity of \(O((q + 1) \cdot \log |P|)\), the same time required to find a simplex in the simplex tree, since we need to locate the simplex to query its filtration value. Algorithm 2 has a time complexity of \(O(|P| \cdot \log { |P|} \cdot |U| \cdot (q+1)\cdot \mathcal {T}_{\tau _j}^{>j})\); it is an output-sensitive algorithm that depends on the number of points inside the \((2\varepsilon )\)-ball and, for each point, on the complexity of the star operation. There is much room for optimization by applying dynamic programming techniques, since multiple star queries on the same dense regions have many non-disjoint solutions. Algorithm 3 finds the label contributions used to label an unlabeled point x by building an implicit minimal spanning tree on the connected component containing \(Lk_{\mathcal {K}_i}(\{x\})\). By the time the tree is finished, we have visited O(M) nodes, performing a link operation per node, where M is at most the number of simplices in the connected component; if the entire complex is connected, \(M = |\mathcal {K}|\). Enqueue and dequeue operations on the priority queue Q have a time complexity of \(O(\log {M})\). Therefore, the time complexity is \(O(q \cdot \mathcal {T}_{\tau _j}^{>j} \cdot M \cdot \log { M})\). Each node in this implicit tree has a space complexity of \(O((q+2) \cdot w)\) bits: \((q+1)\) for the maximal q-simplex cardinality plus one for the priority. Since we have O(M) nodes, the total space complexity is \(O((q+2) \cdot w \cdot M)\) bits.
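To make the traversal pattern concrete, the sketch below shows a priority-queue propagation of label contributions over simplices reachable from an unlabeled vertex, visited in order of filtration value. This is our illustrative reconstruction under assumed names (propagate_labels_sketch, a generic filtration-based weight), not the authors' exact Algorithm 3.

```python
import heapq
from collections import defaultdict

def propagate_labels_sketch(st, x_vertex, labels, weight=lambda f: 1.0 / (1.0 + f)):
    """Illustrative sketch (not the paper's exact Algorithm 3): visit simplices
    reachable from the star of {x_vertex} in order of filtration value, letting
    labeled vertices contribute a filtration-weighted vote. `st` is a
    gudhi.SimplexTree and `labels` a dict {vertex_id: class_label}."""
    heap, seen = [], set()
    votes = defaultdict(float)
    for sigma, filt in st.get_star([x_vertex]):             # seed with the closed star of {x}
        heapq.heappush(heap, (filt, tuple(sigma)))
    while heap:                                              # O(M log M) queue operations overall
        filt, sigma = heapq.heappop(heap)
        if sigma in seen:
            continue
        seen.add(sigma)
        for v in sigma:
            if v != x_vertex and v in labels:                # labeled vertex contributes a vote
                votes[labels[v]] += weight(filt)
            for tau, f_tau in st.get_star([v]):              # expand through shared vertices
                if tuple(tau) not in seen:
                    heapq.heappush(heap, (f_tau, tuple(tau)))
    return max(votes, key=votes.get) if votes else None
```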

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kindelan, R., Frías, J., Cerda, M. et al. A topological data analysis based classifier. Adv Data Anal Classif (2023). https://doi.org/10.1007/s11634-023-00548-4
