A topological data analysis based classifier

  • Regular Article
  • Published in Advances in Data Analysis and Classification

Abstract

Topological Data Analysis (TDA) is an emerging field that aims to discover a dataset’s underlying topological information. TDA tools have commonly been used to create filters and topological descriptors to improve Machine Learning (ML) methods. This paper proposes a different TDA pipeline that classifies balanced and imbalanced multi-class datasets without additional ML methods and without a data-resampling preprocessing stage. The proposed TDA-based classifier (TDABC) builds a filtered simplicial complex on the dataset that represents high-order data relationships. Following the assumption that a meaningful sub-complex exists in the filtration that approximates the data topology, we apply Persistent Homology (PH) to guide the selection of that sub-complex based on the detected topological features. We use each unlabeled point’s link and star operators to provide multi-dimensional neighborhoods of varying size through which labels are propagated from labeled to unlabeled points. The labeling function depends on the entire history of the filtration and is encoded in the persistence diagrams at various dimensions. We select eight datasets with different dimensions, degrees of class overlap, and imbalanced numbers of samples per class to validate our method. The TDABC outperforms all baseline methods on multi-class imbalanced data with high imbalance ratios and on data with overlapping classes. On average, the proposed method also outperforms K Nearest Neighbors (KNN) and weighted KNN, and is competitive with Support Vector Machine and Random Forest baseline classifiers on balanced datasets.

Notes

  1. By intervals intercepted at level value \(\varepsilon _i\), we mean the homology of \(f^{-1}([0, \varepsilon _i])\), i.e., the j-cycles alive at time \(\varepsilon _i\). See Fig. 5 for how the selected purple and yellow intervals “intercept” the barcodes.

  2. See Appendix 3 for details about data structures for computing \(U_x\) efficiently.

  3. See Sect. 5.2 for details.

References

  • Adams H, Emerson T, Kirby M et al (2017) Persistence images: a stable vector representation of persistent homology. J Mach Learn Res 18(8):1–35

  • Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the 8th international conference on database theory. Springer, Berlin, Heidelberg, ICDT ’01, pp 420–434

  • Ali D, Asaad A, Jimenez MJ et al (2022) A survey of vectorization methods in topological data analysis. https://doi.org/10.48550/ARXIV.2212.09703

  • Anai H, Chazal F, Glisse M et al (2020) Dtm-based filtrations. In: Baas NA, Carlsson GE, Quick G et al (eds) Topological data analysis. Springer, Cham, pp 33–66

  • Arafat NA, Basu D, Bressan S (2019) Topological data analysis with \(\epsilon\)-net induced lazy witness complex. In: Hartmann S, Küng J, Chakravarthy S et al (eds) Database and expert systems applications. Springer, Cham, pp 376–392

  • Asniar MNU, Surendro K (2022) Smote-lof for noise identification in imbalanced data classification. J King Saud Univ Comput Inf Sci 34(6, Part B):3413–3423. https://doi.org/10.1016/j.jksuci.2021.01.014

  • Atienza N, Gonzalez-Díaz R, Soriano-Trigueros M (2020) On the stability of persistent entropy and new summary functions for topological data analysis. Pattern Recogn 107:107509. https://doi.org/10.1016/j.patcog.2020.107509

  • Attali D, Lieutier A, Salinas D (2011) Efficient data structure for representing and simplifying simplicial complexes in high dimensions. In: Proceedings of the twenty-seventh annual symposium on computational geometry. Association for Computing Machinery, New York, SoCG ’11, pp 501–509. https://doi.org/10.1145/1998196.1998277

  • Baudry JP, Maugis C, Michel B (2012) Slope heuristics: overview and implementation. Stat Comput. https://doi.org/10.1007/s11222-011-9236-1

  • Bauer U (2021) Ripser: efficient computation of Vietoris–Rips persistence barcodes. J Appl Comput Topol. https://doi.org/10.1007/s41468-021-00071-5

  • Bishnoi S, Hooda BK (2020) A survey of distance measures for mixed variables. Int J Chem Stud 8:338–343. https://doi.org/10.22271/chemi.2020.v8.i4f.10087

  • Boissonnat J, Karthik CS (2018) An efficient representation for filtrations of simplicial complexes. ACM Trans Algorithms 14(4):44:1-44:21

  • Boissonnat J, Maria C (2014) The simplex tree: an efficient data structure for general simplicial complexes. Algorithmica 70(3):406–427

  • Boissonnat JD, Pritam S (2020) Edge collapse and persistence of flag complexes. In: Cabello S, Chen DZ (eds) 36th International symposium on computational geometry (SoCG 2020), Leibniz international proceedings in informatics (LIPIcs), vol 164. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, pp 19:1–19:15. https://doi.org/10.4230/LIPIcs.SoCG.2020.19

  • Boissonnat J, Karthik CS, Tavenas S (2017) Building efficient and compact data structures for simplicial complexes. Algorithmica 79(2):530–567

  • Broder AZ, Kirsch A, Kumar R et al (2010) The hiring problem and lake Wobegon strategies. SIAM J Comput 39(4):1233–1255. https://doi.org/10.1137/07070629X

  • Bubenik P, Dłotko P (2017) A persistence landscapes toolbox for topological statistics. J Symb Comput 78:91–114. https://doi.org/10.1016/j.jsc.2016.03.009

  • Caillerie C, Michel B (2011) Model selection for simplicial approximation. Found Comput Math 11(6):707–731

  • Carlsson G, Gabrielsson RB (2020) Topological approaches to deep learning. In: Topological data analysis. Springer, pp 119–146

  • Carrière M, Cuturi M, Oudot S (2017) Sliced wasserstein kernel for persistence diagrams. In: Proceedings of the 34th international conference on machine learning, vol 70. JMLR.org, ICML’17, pp 664–673

  • Carriere M, Chazal F, Ike Y, et al (2020) Perslay: a neural network layer for persistence diagrams and new graph topological signatures. In: Chiappa S, Calandra R (eds) Proceedings of the twenty third international conference on artificial intelligence and statistics, proceedings of machine learning research, vol 108. PMLR, pp 2786–2796

  • Chawla N, Bowyer K, Hall L et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res (JAIR) 16:321–357. https://doi.org/10.1613/jair.953

  • Chen Y (2015) The distance-decay function of geographical gravity model: Power law or exponential law? Chaos, Solitons Fractals 77:174–189. https://doi.org/10.1016/j.chaos.2015.05.022

  • Chung YM, Lawson A (2022) Persistence curves: A canonical framework for summarizing persistence diagrams. Adv Comput Math 48(1):6. https://doi.org/10.1007/s10444-021-09893-4

  • Curry J, Mukherjee S, Turner K (2018) How many directions determine a shape and other sufficiency results for two topological transforms. arXiv: Algebraic Topology

  • de Silva V, Morozov D, Vejdemo-Johansson M (2011) Persistent cohomology and circular coordinates. Discrete Comput Geom 45(4):737–759

  • de Silva V, Carlsson G (2004) Topological estimation using witness complexes. In: Gross M, Pfister H, Alexa M, et al (eds) SPBG’04 symposium on point-based graphics 2004. The Eurographics Association. https://doi.org/10.2312/SPBG/SPBG04/157-166

  • Dey TK, Fan F, Wang Y (2014) Computing topological persistence for simplicial maps. SOCG’14, Association for Computing Machinery, New York

  • Deza MM, Deza E (2013) Generalizations of metric spaces. Springer, Berlin, Heidelberg, pp 67–78. https://doi.org/10.1007/978-3-642-30958-8_3

  • Dietterich TG (2000) Ensemble methods in machine learning. Multiple classifier systems. Springer, Berlin, Heidelberg, pp 1–15

  • Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern 9(10):617–621. https://doi.org/10.1109/TSMC.1979.4310090

  • Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Edelsbrunner H, Harer J (2010) Computational Topology—an Introduction. American Mathematical Society, Michigan. https://doi.org/10.1007/978-3-540-33259-6_7

  • Edelsbrunner H, Letscher D, Zomorodian A (2002) Topological persistence and simplification. Discrete Comput Geom 28(4):511–533. https://doi.org/10.1007/s00454-002-2885-2

  • Fernández A, García S, Galar M et al (2018) Foundations on imbalanced classification. Springer, Cham, pp 19–46. https://doi.org/10.1007/978-3-319-98074-4_2

  • Francois D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans on Knowl and Data Eng 19(7):873–886. https://doi.org/10.1109/TKDE.2007.1037

  • Freeman PR (1983) The secretary problem and its extensions: a review. Int Stat Rev Revue Internationale de Statistique 51(2):189–206

  • Gabrielsson RB, Nelson BJ, Dwaraknath A, et al (2020) A topology layer for machine learning. In: PMLR, pp 1553–1563

  • Garside K, Henderson R, Makarenko I et al (2019) Topological data analysis of high resolution diabetic retinopathy images. PLoS ONE 14(5):e0217413. https://doi.org/10.1371/journal.pone.0217413

  • Ghrist R (2008) Barcodes: the persistent topology of data. Bull (New Series) Am Math Soc 45:61–75

  • Goyal A, Rathore L, Kumar S (2021) A survey on solution of imbalanced data classification problem using smote and extreme learning machine. In: Sharma H, Gupta MK, Tomar GS et al (eds) Communication and intelligent systems. Springer, Singapore, pp 31–44

  • Harris CR, Millman KJ, van der Walt SJ et al (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2

  • Hatcher A (2002) Algebraic Topology. Cambridge University Press, Cambridge

  • Hensel F, Moor M, Rieck B (2021) A survey of topological machine learning methods. Front Artif Intell 4:123. https://doi.org/10.3389/frai.2021.681108

  • Hofer C, Kwitt R, Niethammer M, et al (2017) Deep learning with topological signatures. In: Proceedings of the 31st international conference on neural information processing systems. Curran Associates Inc., Red Hook, NIPS’17, pp 1633–1643

  • Hunter JD (2007) Matplotlib: a 2d graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55

  • Ibrahim H, Anwar SA (2021) Classification of imbalanced data using support vector machine and rough set theory: a review. J Phys Conf Ser 1878(1):012054. https://doi.org/10.1088/1742-6596/1878/1/012054

  • Japkowicz N, Shah M (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press, New York

  • Ji Z, Wang CL (2022) Efficient exact k-nearest neighbor graph construction for billion-scale datasets using gpus with tensor cores. In: Proceedings of the 36th ACM international conference on supercomputing. Association for Computing Machinery, New York, ICS ’22. https://doi.org/10.1145/3524059.3532368

  • Jiang G, Wang W (2017) Error estimation based on variance analysis of k-fold cross-validation. Pattern Recogn 69:94–106

  • Kindelan R, Frías J, Cerda M, et al (2021) Classification based on topological data analysis. arXiv:2102.03709

  • Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: van Someren M, Widmer G (eds) Machine learning: ECML-97. Springer, Berlin, Heidelberg, pp 146–153

  • Lam L, Suen CY (1997) Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans Syst Man Cybern Part A 27:553–568

  • Luo H, Patania A, Kim J et al (2021) Generalized penalty for circular coordinate representation. Found Data Sci 3(4):729–767

  • Majumdar S, Laha AK (2020) Clustering and classification of time series using topological data analysis with applications to finance. Expert Syst Appl 162:113868. https://doi.org/10.1016/j.eswa.2020.113868

  • Maria C, Boissonnat J, Glisse M et al (2014) The gudhi library: simplicial complexes and persistent homology. In: Hong H, Yap C (eds) Mathematical software-ICMS 2014. Springer, Berlin, Heidelberg

  • McInnes L, Healy J, Melville J (2020) Umap: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426

  • Mitchell TM (1997) Machine learning, international edition. McGraw-Hill Series in Computer Science, McGraw-Hill

  • Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511814075

  • Navarro G (2002) Searching in metric spaces by spatial approximation. VLDB J

  • Pedregosa F, et al (2012) Scikit-learn: machine learning in python. J Mach Learn Res 12

  • Pérez JB, Hauke S, Lupo U, et al (2021) giotto-ph: a python library for high-performance computation of persistent homology of Vietoris–Rips filtrations. arXiv:2107.05412

  • Rabadan R, Blumberg AJ (2019) Topological data analysis for genomics and evolution: topology in biology. Cambridge University Press, Cambridge. https://doi.org/10.1017/9781316671665

  • Ren S, Wu C, Wu J (2021) Computational tools in weighted persistent homology. Chin Ann Math Ser B 42(2):237–258. https://doi.org/10.1007/s11401-021-0255-8

  • Rouvreau V (2022) Cython interface. In: GUDHI user and reference manual, 3.6.0 edn. GUDHI Editorial Board. https://gudhi.inria.fr/python/3.6.0/

  • Saadat-Yazdi A, Andreeva R, Sarkar R (2021) Topological detection of Alzheimer’s disease using Betti curves. In: Reyes M, Henriques Abreu P, Cardoso J et al (eds) Interpretability of machine intelligence in medical image computing, and topological data analysis and its applications for medical data. Springer, Cham, pp 119–128

  • Samet H (2006) Foundations of multidimensional and metric data structures. Morgan Kaufman, San Francisco

  • Seversky LM, Davis S, Berger M (2016) On time-series topological data analysis: new data and opportunities. In: CVPRW, pp 1014–1022. https://doi.org/10.1109/CVPRW.2016.131

  • Shepard D (1968) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference. Association for Computing Machinery, New York, ACM ’68, pp 517–524. https://doi.org/10.1145/800186.810616

  • The HDF Group (1997–2022) Hierarchical data format, version 5. https://www.hdfgroup.org/HDF5/

  • Umeda Y (2017) Time series classification via topological data analysis. Trans Jpn Soc Artif Intell 32:D–G72_1. https://doi.org/10.1527/tjsai.D-G72

  • Venkataraman V, Ramamurthy K, Turaga P (2016) Persistent homology of attractors for action recognition. In: 2016 IEEE international conference on image processing, ICIP 2016-proceedings. IEEE Computer Society, pp 4150–4154. https://doi.org/10.1109/ICIP.2016.7533141

  • Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70. https://doi.org/10.1016/j.ins.2019.08.062

  • Wagner H, Dłotko P (2014) Towards topological analysis of high-dimensional feature spaces. Comput Vis Image Underst 121:21–26. https://doi.org/10.1016/j.cviu.2014.01.005

  • Wilson DR, Martinez TR (1997) Improved heterogeneous distance functions. J Artif Int Res 6(1):1–34

  • Yershov DS, LaValle SM (2011) Simplicial dijkstra and a* algorithms for optimal feedback planning. In: 2011 IEEE/RSJ international conference on intelligent robots and systems, pp 3862–3867. https://doi.org/10.1109/IROS.2011.6095032

  • Zhang S, Xiao M, Wang H (2020) Gpu-accelerated computation of vietoris-rips persistence barcodes. arXiv:2003.07989

  • Zhang X, Li Y, Kotagiri R et al (2017) Krnn: k rare-class nearest neighbour classification. Pattern Recognit 62:33–44. https://doi.org/10.1016/j.patcog.2016.08.023

Acknowledgements

This research work was supported by the National Agency for Research and Development of Chile (ANID), with grants ANID 2018/BECA DOCTORADO NACIONAL-21181978, FONDECYT 1211484, 1221696, ICN09 015, and PIA ACT192015. A CONACYT (Mexico) postdoctoral fellowship also supported this work. The first author would like to thank Professor José Carlos Gómez-Larrañaga from CIMAT, Mexico, for insightful discussions.

Author information

Corresponding author

Correspondence to Rolando Kindelan.

Ethics declarations

Conflict of interest

The authors are unaware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1: Implementation details

This section provides details about the implementation of our TDA-based classifier (TDABC). The TDABC was implemented on top of the Gudhi library (Maria et al. 2014), the Giotto library (Pérez et al. 2021), and Ripser (Bauer 2021; Zhang et al. 2020) for the computational topology components, such as simplicial complexes and Persistent Homology (PH). We use scikit-learn (Pedregosa et al. 2012) for the machine learning algorithms, including the baseline classifiers and the PCA and t-SNE dimensionality reduction methods; NumPy (Harris et al. 2020) for multi-dimensional array manipulation; UMAP-learn for UMAP-based dimensionality reduction (McInnes et al. 2020); HDF5 (The HDF Group 1997-2022) to handle large data in primary and secondary memory; and Matplotlib (Hunter 2007) for visualization. The source code of our proposed TDABC is available at https://github.com/rolan2kn/TDABC-4-ADAC. The following sections cover different aspects of the TDABC implementation.

1.1 Build simplicial complexes

1.1.1 Maximal edge length

A p-cycle can be born at any time and live unaltered up to the maximal edge length, at which point it dies or is divided into two topological features. By controlling the maximal edge length, we control the length of the topological features and the size of the simplicial complex, which is combinatorial in the number of points and the simplex dimension. Thus, we recommend using the mean distance of the distance matrix as the maximal edge length. Consequently, noise points only affect cycles with a diameter of twice the mean distance, making the filtration more robust.
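As an illustration, the following sketch builds a Vietoris–Rips filtration whose maximal edge length is the mean pairwise distance. It assumes Gudhi's RipsComplex and SciPy's pdist; the function and parameter names (rips_with_mean_edge_length, max_dim) are ours, not the paper's.

```python
import numpy as np
import gudhi
from scipy.spatial.distance import pdist, squareform

def rips_with_mean_edge_length(points, max_dim=2):
    """Sketch: Vietoris-Rips filtration truncated at the mean pairwise distance,
    the maximal-edge-length heuristic recommended above."""
    dm = squareform(pdist(points))                        # full distance matrix
    max_edge = dm[np.triu_indices_from(dm, k=1)].mean()   # mean pairwise distance
    rips = gudhi.RipsComplex(distance_matrix=dm, max_edge_length=max_edge)
    return rips.create_simplex_tree(max_dimension=max_dim)
```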

1.1.2 Edge collapse

Edge collapse in Gudhi must be performed on the 1-skeleton of the simplicial complex, which is then expanded to build all higher-dimensional simplices up to a maximal dimension \(q \ll |P|\). Algorithm 4 computes a simplex tree using the edge-collapse method (Boissonnat and Pritam 2020). A collapsing coefficient is defined to depend on the maximal dimension q; however, the procedure could be enhanced by repeatedly calling the \(collapse\_edges\) method until the simplex tree no longer changes. We recommend applying a collapsing factor (obtained experimentally) computed as a function of the point cloud’s ambient dimension and the simplicial complex’s maximal dimension. Edge collapse in Gudhi and Giotto supports only flag complexes such as the Vietoris–Rips complex. Our method can be used with any simplicial complex with minimal variations, but each type of complex has its own intricacies when optimizing time and space consumption.

Algorithm 4: Building a simplicial complex with a distance matrix and edge collapse (listing omitted)
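A minimal sketch of this pipeline, assuming Gudhi's SimplexTree.collapse_edges and expansion methods; the experimentally obtained collapsing factor is reduced here to a plain parameter, n_collapse_rounds, which is our name rather than the paper's.

```python
import gudhi

def collapsed_rips_tree(distance_matrix, max_edge_length, max_dim, n_collapse_rounds=1):
    """Sketch: build the 1-skeleton only, collapse edges while preserving the
    persistent homology of the flag complex, then expand up to max_dim."""
    rips = gudhi.RipsComplex(distance_matrix=distance_matrix,
                             max_edge_length=max_edge_length)
    st = rips.create_simplex_tree(max_dimension=1)        # 1-skeleton only
    st.collapse_edges(nb_iterations=n_collapse_rounds)    # edge collapse (Boissonnat and Pritam 2020)
    st.expansion(max_dim)                                 # rebuild higher simplices up to max_dim
    return st
```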

1.2 Computing link and filtration values

The association function \(\Psi _{i}\) from Definition 5 depends on the \(Lk_{\mathcal {K}}\) operation. However, as of version 3.6.0, the Python interface of the Gudhi library (Rouvreau 2022) does not implement the simplex link operation. Nevertheless, it can be derived from the star and coface operators according to Definition 2.

In Gudhi, each q-simplex \(\sigma \in \mathcal {K}\) is stored with its filtration value \(f(\sigma )\). Thus, \(star(\mathcal {S}_\mathcal {K}, \sigma )\) is a function on \(\mathcal {S}_\mathcal {K}\) that returns the set of 2-tuples

$$\begin{aligned} \{(\mu , f(\mu )) \mid \mu \in St_\mathcal {K}(\sigma ) \}. \end{aligned}$$

This data structure makes it easy to recover the filtration values required to implement Eq. 10, Algorithm 2, and Algorithm 3.
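For illustration, a minimal sketch of deriving the link from the star, assuming Gudhi's SimplexTree.get_star and SimplexTree.filtration; the helper name link_with_filtrations is ours.

```python
def link_with_filtrations(st, sigma):
    """Sketch: link of `sigma` derived from its star (Definition 2). For every
    coface mu of sigma, the face mu \\ sigma is disjoint from sigma and belongs
    to the link; its own filtration value is then queried from the simplex tree.
    Here `st` is a gudhi.SimplexTree and `sigma` a list of vertex ids."""
    sigma = set(sigma)
    link = {}
    for mu, _ in st.get_star(list(sigma)):    # (simplex, filtration) pairs
        tau = tuple(sorted(set(mu) - sigma))
        if tau:                               # skip sigma itself (empty difference)
            link[tau] = st.filtration(list(tau))
    return link

# Example: lk = link_with_filtrations(st, [0])  # link of the vertex {0}
```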

1.3 Finding neighborhoods of external points

Algorithm 2 has a lot of room for optimization. The search for the points closest to x can be drastically accelerated by applying computational geometry algorithms and spatial/metric data structures. For example, a pivot point could be selected from P and a Vantage Point Tree (VP-tree), Ball tree, or M-tree, among others, could be built to index all elements by their distance (or proximity) to the pivot (Samet 2006). Other metric-space search structures, such as the Spatial Approximation Tree (sa-tree) (Navarro 2002), could be applied by using the same complex as support instead of a Delaunay triangulation; this could be more efficient than other space-partitioning approaches for neighborhood queries on datasets with more than 20 dimensions. These data structures help to find points likely to share a simplex with x, reducing the time complexity to \(O(|P| \log { |P|})\) in the worst case.
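As a sketch of one such index, the snippet below uses a Ball tree from scikit-learn (already a dependency of our implementation) to retrieve the points inside a (2ε)-ball around an external point, mirroring the neighborhood used by Algorithm 2. The function name and parameters are ours, for illustration only.

```python
import numpy as np
from sklearn.neighbors import BallTree

def candidate_neighbors(P, x, eps):
    """Sketch: indices of the points of P inside the (2*eps)-ball around the
    external point x, i.e. points likely to share a simplex with x.
    In practice the tree would be built once and reused for all queries."""
    tree = BallTree(P)                                    # metric index over P
    return tree.query_radius(np.asarray(x).reshape(1, -1), r=2 * eps)[0]
```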

1.4 Permutation invariance

Let \(h(\cdot , \cdot )\) be a proximity function, and let \(X_l\) (labeled) and \(X_u\) (unlabeled) be point sets, both subsets of P. Our TDABC is invariant to permutations of \(X_l\) and of \(X_u\), building the same type of filtered simplicial complex every time. Let \(\mathcal {K}({M_P})\) be a filtered simplicial complex constructed using a distance matrix \(M_P\) over a point set P. For any permutation \(\pi (P)\) we obtain a distance matrix \(M_{\pi (P)}\) and a filtered simplicial complex \(\mathcal {K}({M_{\pi (P)}})\). The matrices \(M_P\) and \(M_{\pi (P)}\) are equivalent: since they were constructed with the same \(h(\cdot , \cdot )\) over the same point set P, one can be turned into the other by elementary row and column transpositions. Hence, \(\mathcal {K}({M_{\pi (P)}})\) and \(\mathcal {K}({M_{P}})\) are the same complex up to a relabeling of vertices, because we can always use a simplicial map that applies the inverse permutation \({\pi ^{-1}(P)}\) to each element of \(\pi (P)\) to obtain the corresponding element in P; this is equivalent to performing the row and column transpositions that rename the simplices of \(\mathcal {K}({M_{\pi (P)}})\) to obtain \(\mathcal {K}({M_{P}})\). The case of permutations of the unlabeled points \(X_u\) is straightforward, since unlabeled points do not contribute to labeling other unlabeled points. Therefore, no matter which permutation is applied, TDABC assigns the same label to the same point.
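The argument can also be checked empirically. The following sketch (our code, assuming Gudhi's RipsComplex) verifies that the persistence diagrams obtained from P and from a random permutation of P coincide.

```python
import numpy as np
import gudhi

def diagrams_match_under_permutation(points, max_edge, max_dim=2, seed=0):
    """Sketch: the persistence diagram is unchanged when the rows of the point
    cloud (hence the rows/columns of its distance matrix) are permuted."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points)
    pi = rng.permutation(len(pts))

    def diagram(p):
        rips = gudhi.RipsComplex(points=p, max_edge_length=max_edge)
        st = rips.create_simplex_tree(max_dimension=max_dim)
        return sorted(st.persistence())

    return diagram(pts) == diagram(pts[pi])
```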

Although the results of the TDABC remain consistent across different permutations, there is room for improvement in terms of time execution. Certain permutations result in faster computations compared to others, contributing to the observed difference in performance. One of the reasons for this discrepancy is the locality property, where elements with small distances between them are clustered together in the element collection. In such cases, finding similar elements requires less time as they are located in close proximity. On the other hand, when similar elements are scattered in arbitrary positions, their retrieval becomes more time-consuming, leading to the wastage of computational resources. To address this issue, we propose using space-filling curves and related data structures (refer to Sect. 3), such as Z-order and Hilbert-order. These techniques, similar to those employed in Gudhi (Maria et al. 2014), have been shown to accelerate the construction of complexes and streamline simplicial queries.
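A minimal sketch of the Z-order idea (our illustration, not Gudhi's internal implementation): quantize coordinates to a grid, interleave their bits into a Morton code, and reorder the points so that nearby points are stored close together.

```python
import numpy as np

def morton_order(points, bits=10):
    """Sketch: permutation that sorts points along a Z-order (Morton) curve."""
    pts = np.asarray(points, dtype=float)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    grid = ((pts - mins) / span * (2**bits - 1)).astype(np.uint64)  # quantize per axis

    def interleave(coords):
        code = 0
        for b in range(bits):                    # interleave bit b of every axis
            for d, c in enumerate(coords):
                code |= ((int(c) >> b) & 1) << (b * len(coords) + d)
        return code

    return np.argsort([interleave(row) for row in grid])

# P = P[morton_order(P)]  # reorder the point cloud before building the complex
```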

1.5 Topological information

We complement Sect. 4.3 by presenting more information regarding the selected filtration value on the Swissroll and Sphere Datasets. See Figs. 10, 11 and Table 4.

1.5.1 Swissroll topological information

See Fig. 10.

Fig. 10: Results of applying the selection function on the Swissroll Dataset: confusion matrices (first column), barcodes with the chosen interval (second column), and the chosen sub-complex \(\mathcal {K}_i\) (third column); each 0-simplex is colored according to its label, with unlabeled points in black

1.5.2 Sphere topological information

See Fig. 11.

Fig. 11: Results of applying the selection function on the Sphere Dataset: confusion matrices (first column), barcodes with the chosen interval (second column), and the chosen sub-complex \(\mathcal {K}_i\) (third column); each 0-simplex is colored according to its label, with unlabeled points in black

1.6 Extended results

We also conducted experiments increasing the number of unlabeled points, running TDABC with three cross-validation configurations whose fold sizes are NORMAL (10%), EXTREME (60%), and HYPER EXTREME (90%). Table 4 reports the F1 metric on the Swissroll and Sphere Datasets.

Table 4 Results varying the number of unlabeled points

1.7 Complexity analysis

We use the simplex tree from Gudhi as the data structure in our paper because of its capability to represent any type of simplicial complex. It is worth mentioning that Gudhi offers various other data structures, specifically designed for maximal simplices (simplices without cofaces) (Boissonnat et al. 2017; Boissonnat and Karthik 2018). However, for our purposes, we focus on the simplex tree due to its versatility and its ability to handle a broad range of complex types.

Fig. 12: A simplicial complex on ten vertices and its simplex tree. The deepest node represents the tetrahedron of the complex. Every label position at a given depth is linked in a list, as illustrated for label 5. Picture and caption taken from Boissonnat and Maria (2014)

The simplex tree data structure was presented by Clément Maria and Jean-Daniel Boissonnat (Boissonnat and Maria 2014). Let \(\mathcal {S}_\mathcal {K}\) be a simplex tree representation of a simplicial complex \(\mathcal {K}\), and let \(\sigma \in \mathcal {K}\) be a q-simplex. The operations \(insert(\mathcal {S}_\mathcal {K}, \sigma )\) and \(search(\mathcal {S}_\mathcal {K}, \sigma )\) have time complexity \(O(|\sigma | \cdot \log |P|)\) when red-black trees are used to represent sibling nodes (nodes sharing the same parent node, see Fig. 12). When hashing is used to represent sibling nodes, the time complexity is reduced to \(O(|\sigma |)\). Inserting a q-simplex \(\sigma\) together with all its faces has complexity \(O(2^{|\sigma |} \cdot |\sigma | \cdot \log |P|)\). See Boissonnat and Maria (2014) for a more detailed explanation.
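For reference, a small sketch of these operations through Gudhi's Python interface (the simplex and filtration values are ours, chosen only for illustration):

```python
import gudhi

st = gudhi.SimplexTree()
st.insert([0, 1, 2], filtration=0.5)   # inserts the triangle {0,1,2} and all its faces
print(st.find([0, 1]))                 # True: the edge {0,1} is now present
print(st.filtration([0, 1, 2]))        # 0.5: stored filtration value of the triangle
```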

Let \(\tau = \{\tau _0, \tau _1, \ldots , \tau _j\}\) be a j-simplex with \(\tau \in \mathcal {K}\). Computing \(St_\mathcal {K}(\tau )\) in a simplex tree \(\mathcal {S}_\mathcal {K}\) is performed by the operation \(Locate\_cofaces(\mathcal {S}_\mathcal {K}, \tau )\) (Boissonnat and Maria 2014). To locate all cofaces of \(\tau\) in \(\mathcal {S}_\mathcal {K}\), one finds all occurrences of \(\tau _j\) in nodes whose depth is greater than j and navigates upwards in \(\mathcal {S}_\mathcal {K}\) looking for the remaining elements of \(\tau\); the paths in which \(\tau\) is completely found contain the cofaces of \(\tau\). Traversing a path in a simplex tree has a worst-case time complexity of \(O(q+1)\) with \(q = dim(\mathcal {K})\). Let \(O(\mathcal {T}_{\tau _j}^{>j})\) be the time complexity of locating all nodes at depth greater than j that contain \(\tau _j\). Accordingly, the worst-case time complexity of \(St_\mathcal {K}(\tau )\) is \(O((q+1)\cdot \mathcal {T}_{\tau _j}^{>j})\). In Gudhi, every path from the root to a leaf defines a maximal simplex.
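In Gudhi's Python interface, the star and coface queries discussed here are exposed as follows (a small sketch continuing the example above; the edge {0,1} is just an illustrative argument):

```python
star = st.get_star([0, 1])                         # cofaces of the edge {0,1} with filtration values
triangles = st.get_cofaces([0, 1], codimension=1)  # cofaces exactly one dimension higher
```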

A few algorithms are not detailed in this section, such as Algorithm 1 for labeling a test point set; they can be implemented following the explanations above. The computation of PH is done using the method provided by Gudhi, and Sect. 3.3 defines the selection of the persistence interval once PH has been obtained. Algorithm 4 builds a simplicial complex from a distance matrix using edge collapse. Computing the distance matrix takes \(O(d\cdot |P|^2)\) time with the brute-force method, but it can be reduced to at most \(O(|P|)\) on a massively parallel platform (Ji and Wang 2022). The edge-collapse method runs in time \(O(n\cdot n_c \cdot k^2)\), where n and \(n_c\) are the numbers of edges of the input and output graphs and k is the maximal degree of a vertex.
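In Gudhi, computing PH amounts to a single call on the simplex tree; a short sketch continuing the examples above:

```python
st.compute_persistence()                                     # PH of the filtered complex
intervals_dim1 = st.persistence_intervals_in_dimension(1)    # 1-dimensional persistence intervals
```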

Algorithm 1 includes computing PH, which has a worst-case time complexity of \(O(|\mathcal {K}|^3)\) but behaves near-linearly in practice. The selection function is linear in the number of persistence intervals, \(O(|D|)\). The inverse level-set function \(f^{-1}(\cdot )\) has a worst-case time complexity of \(O((q + 1) \cdot \log |P|)\), the same time required to find a simplex in the simplex tree, since we need to locate the simplex to query its filtration value. Algorithm 2 has a time complexity of \(O(|P| \cdot \log { |P|} \cdot |U| \cdot (q+1)\cdot \mathcal {T}_{\tau _j}^{>j})\); it is an output-sensitive algorithm that depends on the number of points inside the \((2\varepsilon )\)-ball and, for each point, on the complexity of the star operation. There is much room for optimization by applying dynamic programming techniques, since multiple star queries on the same dense regions have many non-disjoint solutions. Algorithm 3 finds the label contributions used to label an unlabeled point x by building an implicit minimal spanning tree on the connected component containing \(Lk_{\mathcal {K}_i}(\{x\})\). By the time the tree is finished, we have visited O(M) nodes, performing a link operation per node, where M is at most the number of simplices in the connected component; if the entire complex is connected, \(M = |\mathcal {K}|\). Enqueue and dequeue operations on the priority queue Q have a time complexity of \(O(\log {M})\). Therefore, the time complexity is \(O(q \cdot \mathcal {T}_{\tau _j}^{>j} \cdot M \cdot \log { M})\). Each node in this implicit tree has a space complexity of \(O((q+2) \cdot w)\) bits: \((q+1)\) for the maximal q-simplex cardinality plus one for the priority. Since we have O(M) nodes, the total space complexity is \(O((q+2) \cdot w \cdot M)\) bits.
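To make the traversal pattern concrete, the sketch below shows a priority-queue propagation of label contributions over simplices reachable from an unlabeled vertex, visited in order of filtration value. This is our illustrative reconstruction under assumed names (propagate_labels_sketch, a generic filtration-based weight), not the authors' exact Algorithm 3.

```python
import heapq
from collections import defaultdict

def propagate_labels_sketch(st, x_vertex, labels, weight=lambda f: 1.0 / (1.0 + f)):
    """Illustrative sketch (not the paper's exact Algorithm 3): visit simplices
    reachable from the star of {x_vertex} in order of filtration value, letting
    labeled vertices contribute a filtration-weighted vote. `st` is a
    gudhi.SimplexTree and `labels` a dict {vertex_id: class_label}."""
    heap, seen = [], set()
    votes = defaultdict(float)
    for sigma, filt in st.get_star([x_vertex]):             # seed with the closed star of {x}
        heapq.heappush(heap, (filt, tuple(sigma)))
    while heap:                                              # O(M log M) queue operations overall
        filt, sigma = heapq.heappop(heap)
        if sigma in seen:
            continue
        seen.add(sigma)
        for v in sigma:
            if v != x_vertex and v in labels:                # labeled vertex contributes a vote
                votes[labels[v]] += weight(filt)
            for tau, f_tau in st.get_star([v]):              # expand through shared vertices
                if tuple(tau) not in seen:
                    heapq.heappush(heap, (f_tau, tuple(tau)))
    return max(votes, key=votes.get) if votes else None
```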

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kindelan, R., Frías, J., Cerda, M. et al. A topological data analysis based classifier. Adv Data Anal Classif (2023). https://doi.org/10.1007/s11634-023-00548-4
