Abstract
Clustering, a fundamental exploratory data analysis technique, is used not only to discover patterns and structures in complex datasets but also to group variables in high-dimensional data analysis. Dimension reduction through clustering helps identify important variables and reduce the dimension of the data without losing significant information. High-dimensional image datasets, such as Persian handwritten images, contain a large number of pixels, which makes statistical inference difficult. This high dimensionality poses challenges for analysis and processing and calls for specialized techniques such as clustering to extract information. Incorporating response variable information enhances the clustering analysis and turns it into a supervised method. This article evaluates a supervised clustering approach with Ridge and Lasso penalties and compares the two penalties on a real dataset while identifying important variables. We demonstrate that, despite selecting only a small number of variables as important, the Lasso penalty performs relatively well in predicting the labels of new observations for this multi-class dataset.
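The contrast between the two penalties can be illustrated with a minimal sketch. It is not the article's procedure: it assumes an orthonormal design (X^T X = I), in which case minimizing (1/2)||y - Xb||^2 + lam * ||b||_1 soft-thresholds each least-squares coefficient, while minimizing (1/2)||y - Xb||^2 + (lam / 2) * ||b||^2 rescales each one by 1 / (1 + lam). The coefficient values, the value of `lam`, and the function names are hypothetical and chosen only for illustration.

```python
# Sketch under an assumed orthonormal design (X^T X = I); not the article's method.
# All numbers and names below are hypothetical.

def soft_threshold(z, lam):
    """Lasso solution for one coefficient: shrink by lam, set to zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Ridge solution for one coefficient: proportional shrinkage, never exactly zero."""
    return z / (1.0 + lam)


# Hypothetical least-squares coefficients for five variables (e.g. five pixels).
ols = [3.0, -0.4, 0.2, -2.5, 0.05]
lam = 0.5

lasso = [soft_threshold(z, lam) for z in ols]
ridge = [ridge_shrink(z, lam) for z in ols]

# Variables the Lasso keeps (nonzero coefficients).
selected = [j for j, b in enumerate(lasso) if b != 0.0]

print("lasso:", lasso)
print("ridge:", ridge)
print("selected by lasso:", selected)
```

In this toy setting the Lasso keeps only the two variables with large coefficients, whereas Ridge retains all five with reduced magnitude, which is the qualitative behavior behind the abstract's remark that the Lasso selects a small number of variables.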
Index Terms
- Supervised Clustering of Persian Handwritten Images Using Regularization and Dimension Reduction Methods