note

Supervised Clustering of Persian Handwritten Images Using Regularization and Dimension Reduction Methods

Published: 27 February 2024

Abstract

Clustering, a fundamental exploratory data technique, is used both to discover patterns and structures in complex datasets and to group variables in high-dimensional data analysis. Dimension reduction through clustering helps identify important variables and reduce data dimensions without losing significant information. High-dimensional image datasets, such as Persian handwritten images, contain numerous pixels, which makes statistical inference difficult. This high dimensionality poses challenges for analysis and processing and calls for specialized techniques such as clustering to extract information. Incorporating response variable information enhances the clustering analysis and turns it into a supervised method. This article evaluates a supervised clustering approach using Ridge and Lasso penalties, comparing them on a real dataset while identifying important variables. We demonstrate that, despite selecting only a small number of variables as important, the Lasso penalty performs relatively well in predicting the labels of new observations for this multi-class dataset.
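
The contrast between the two penalties can be illustrated with a minimal sketch. It is not the article's procedure: it assumes an orthonormal design, where the Lasso solution is the soft-thresholded least-squares coefficient and the Ridge solution is a uniform rescaling of it. The coefficient values, the penalty level `lam`, and the function names are hypothetical and chosen only for illustration.

```python
def soft_threshold(b, lam):
    """Lasso solution for one coefficient under an orthonormal design
    (objective 0.5*||y - Xb||^2 + lam*||b||_1): shrink by lam, clip at zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Ridge solution for one coefficient under an orthonormal design
    (objective 0.5*||y - Xb||^2 + 0.5*lam*||b||^2): rescale by 1/(1 + lam)."""
    return b / (1.0 + lam)


# Hypothetical least-squares coefficients, e.g. one per pixel group.
ols = [3.0, -1.5, 0.4, -0.2, 0.05]
lam = 0.5  # hypothetical penalty level

lasso = [soft_threshold(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]

# Lasso keeps only the variables whose coefficient survives thresholding.
selected = [j for j, b in enumerate(lasso) if b != 0.0]

print("lasso:", lasso)
print("ridge:", ridge)
print("variables kept by lasso:", selected)
```

In this toy setting the Lasso sets the three smallest coefficients exactly to zero, while Ridge shrinks every coefficient but keeps all of them. This is why the Lasso can be read as selecting a small set of important variables.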

    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 5
      June 2024
      699 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/3613659

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 February 2024
      • Online AM: 20 December 2023
      • Accepted: 15 December 2023
      • Revised: 9 December 2023
      • Received: 1 July 2023
