Supervised feature selection using principal component analysis

Rahmat, Fariq; Zulkafli, Zed; Ishak, Asnor Juraiza; Abdul Rahman, Ribhan Zafira; Stercke, Simon De; Buytaert, Wouter; Tahir, Wardah; Ab Rahman, Jamalludin; Ibrahim, Salwa; Ismail, Muhamad

doi:10.1007/s10115-023-01993-5

Regular paper
Published: 08 November 2023

Knowledge and Information Systems, volume 66, pages 1955–1995 (2024)

Fariq Rahmat¹,
Zed Zulkafli²,
Asnor Juraiza Ishak¹,
Ribhan Zafira Abdul Rahman¹,
Simon De Stercke³,
Wouter Buytaert³,
Wardah Tahir⁴,
Jamalludin Ab Rahman⁵,
Salwa Ibrahim⁶ &
Muhamad Ismail⁶

Abstract

Principal component analysis (PCA) is widely used in computational fields such as computer science, pattern recognition, and machine learning because it can effectively reduce the dimensionality of high-dimensional data. In particular, it is a popular transformation method for feature extraction. In this study, we explore PCA's ability to perform feature selection in regression applications. We introduce a new PCA-based approach, called Targeted PCA, to analyze a multivariate dataset that includes the dependent variable: it identifies the principal component with a high representation of the dependent variable and then examines that component to capture and rank the contributions of the non-dependent variables. The study also compares the selected features with those obtained from Least Absolute Shrinkage and Selection Operator (LASSO) regression. Finally, the selected features were tested in two regression models: multiple linear regression (MLR) and an artificial neural network (ANN). Results are presented for three datasets from the socioeconomic, environmental, and computer image processing domains. For two of the three datasets, the features selected by PCA and by LASSO regression were more than 50% similar. In the regression predictions, the PCA-selected features made little difference to MLR prediction accuracy compared with the LASSO-selected features. However, the ANN regression demonstrated faster convergence and a greater reduction in error.
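The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, since the full method is not available in this preview. It assumes that "representation" of the dependent variable is measured by its squared loading (cos²) on each component of a correlation-matrix PCA, and that features are ranked by their squared loadings on the chosen component. The function name `targeted_pca_rank` and the synthetic data are hypothetical.

```python
import numpy as np

def targeted_pca_rank(X, y):
    """Rank the columns of X using the principal component that best
    represents y. Assumed criteria (not confirmed by the source):
    squared loadings (cos^2) for both the component choice and the ranking."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    Z = np.hstack([X, y])                      # dependent variable is the last column

    corr = np.corrcoef(Z, rowvar=False)        # PCA on the correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative values

    # Loadings: correlation between each variable (row) and each component (column)
    loadings = eigvecs * np.sqrt(eigvals)

    # Component on which y is best represented (largest cos^2 of y)
    pc = int(np.argmax(loadings[-1, :] ** 2))

    # Score each feature by its squared loading on that component
    scores = loadings[:-1, pc] ** 2
    order = np.argsort(scores)[::-1]           # most to least important
    return order, scores, pc

# Synthetic example: y depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = 3 * X_demo[:, 0] + 1.5 * X_demo[:, 1] + 0.1 * rng.normal(size=2000)
order_demo, scores_demo, pc_demo = targeted_pca_rank(X_demo, y_demo)
print("feature ranking:", order_demo)
```

The component index `pc` refers to the ascending eigenvalue order returned by `numpy.linalg.eigh`, not the conventional PC1, PC2 numbering. A ranking produced this way could then be compared with the features retained by a LASSO fit, as the abstract describes.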

Data availability

Datasets 1 and 2 can be retrieved from the open-source UCI Machine Learning Repository collection. Meanwhile, Dataset 3 is subject to the following licenses/restrictions: The datasets are owned by multiple government agencies and have sharing restrictions. Requests to access these datasets and codes should be directed to fariqrahmat94@gmail.com.

Acknowledgements

This work was supported by a grant (No. NEWTON/1/2018/WAB05/UPM/1) from the UK Natural Environment Council and the Ministry of Education Malaysia under the Understanding of the Impacts of Hydrometeorological Hazards program; we thank both for their collaboration in research planning and development. We also acknowledge the Negeri Sembilan Town and Rural Planning Department and the International Institute for Applied System Analysis for providing environmental data free of charge. We would like to thank the Director General of Health Malaysia for his permission to publish this article.

Author information

Fariq Rahmat, Zed Zulkafli, Asnor Juraiza Ishak, Ribhan Zafira Abdul Rahman, Simon De Stercke, Wouter Buytaert, Wardah Tahir, Jamalludin Ab Rahman, Salwa Ibrahim, and Muhamad Ismail contributed equally to this work.

Authors and Affiliations

Department of Electrical and Electronic Engineering, Universiti Putra Malaysia, 43400, Serdang, Selangor, Malaysia
Fariq Rahmat, Asnor Juraiza Ishak & Ribhan Zafira Abdul Rahman
Department of Civil Engineering, Universiti Putra Malaysia, 43400, Serdang, Selangor, Malaysia
Zed Zulkafli
Department of Civil and Environmental Engineering, Imperial College London, South Kensington, London, SW7 2BX, UK
Simon De Stercke & Wouter Buytaert
School of Civil Engineering, College of Engineering Universiti Teknologi Mara, 40450, Shah Alam, Selangor, Malaysia
Wardah Tahir
Department of Community Medicine, Kulliyyah of Medicine, International Islamic University Malaysia, 25200, Kuantan, Pahang, Malaysia
Jamalludin Ab Rahman
Negeri Sembilan State Health Department, Ministry of Health Malaysia, 70300, Seremban, Negeri Sembilan, Malaysia
Salwa Ibrahim & Muhamad Ismail

Contributions

FR, ZZ, and AJ conceived the research. FR processed all the data, performed the analyses, and interpreted the results. FR and ZZ wrote the manuscript with contributions from all authors.

Corresponding author

Correspondence to Zed Zulkafli.

Ethics declarations

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Ethical approval

Ethical approval for this study was obtained from the Medical Research and Ethics Committee (MREC), Ministry of Health Malaysia (NMRR-19-4115-47702).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Table of Dataset 1

See Table 8.

Table 8 Rank of all features in Dataset 1 based on Targeted PCA and LASSO regression

Appendix B Table of Dataset 2

See Table 9.

Table 9 Rank of all features in Dataset 2 based on Targeted PCA and LASSO regression

Appendix C Table of Dataset 3

See Table 10.

Table 10 Rank of all features in Dataset 3 based on Targeted PCA and LASSO regression

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rahmat, F., Zulkafli, Z., Ishak, A.J. et al. Supervised feature selection using principal component analysis. Knowl Inf Syst 66, 1955–1995 (2024). https://doi.org/10.1007/s10115-023-01993-5

Received: 28 June 2022
Revised: 26 July 2023
Accepted: 15 September 2023
Published: 08 November 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s10115-023-01993-5
