
Supervised Clustering of Persian Handwritten Images Using Regularization and Dimension Reduction Methods

Authors:
Sajedeh Moradnia
Department of Statistics, Tarbiat Modares University, Tehran, Iran

Mousa Golalizadeh
Department of Statistics, Tarbiat Modares University, Tehran, Iran

ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 5, Article No. 118, pp. 1–19
https://doi.org/10.1145/3638060

Published: 27 February 2024

Abstract

Clustering is a fundamental exploratory technique used both to discover patterns and structures in complex datasets and to group variables in high-dimensional data analysis. Dimension reduction through clustering helps identify important variables and reduces the dimension of the data without losing significant information. High-dimensional image datasets, such as Persian handwritten images, contain numerous pixels, which makes statistical inference difficult. This high dimensionality poses challenges for analysis and processing and calls for specialized techniques, such as clustering, to extract information. Incorporating information from the response variable enhances the clustering analysis and turns it into a supervised method. This article evaluates a supervised clustering approach with Ridge and Lasso penalties, comparing the two on a real dataset while identifying important variables. We demonstrate that, although it selects only a small number of variables as important, the Lasso penalty performs relatively well in predicting the labels of new observations for this multi-class dataset.
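
As a rough illustration of why a Lasso penalty can keep only a few variables while a Ridge penalty keeps them all, the sketch below applies both penalties to single coefficients under an orthonormal design. This is a textbook simplification and not the authors' procedure; the function names, the example coefficients, and the penalty value `lam` are all hypothetical.

```python
def lasso_shrink(b, lam):
    """Soft-threshold b: the minimizer over x of 0.5*(b - x)**2 + lam*abs(x)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Proportional shrinkage: the minimizer over x of 0.5*(b - x)**2 + 0.5*lam*x**2."""
    return b / (1.0 + lam)


# Hypothetical unpenalized coefficients for five pixels (illustrative values only).
coefs = [2.0, -1.5, 0.3, -0.2, 0.05]
lam = 0.5  # hypothetical penalty strength

lasso = [lasso_shrink(b, lam) for b in coefs]
ridge = [ridge_shrink(b, lam) for b in coefs]

# Indices of the variables whose Lasso coefficient is not exactly zero.
selected = [j for j, b in enumerate(lasso) if b != 0.0]

print("Lasso:", lasso)
print("Ridge:", ridge)
print("Variables kept by Lasso:", selected)
```

In this toy setting the Lasso sets the three small coefficients exactly to zero, whereas Ridge only scales every coefficient by 1/(1 + lam) and so discards none of them.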


Index Terms

1. Mathematics of computing
  1. Probability and statistics
    1. Statistical paradigms
      1. Cluster analysis

Published in
ACM Transactions on Knowledge Discovery from Data Volume 18, Issue 5
June 2024
699 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3613659
Editor: Jian Pei, Duke University, USA
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 February 2024
- Online AM: 20 December 2023
- Accepted: 15 December 2023
- Revised: 9 December 2023
- Received: 1 July 2023
Author Tags
Supervised clustering
high-dimensional data
dimension reduction
regularization methods
Lasso
Persian handwritten images
Qualifiers
- note