Supervised feature selection using principal component analysis

Regular paper · Knowledge and Information Systems

Abstract

Principal component analysis (PCA) is widely used in computational science branches such as computer science, pattern recognition, and machine learning because it can effectively reduce the dimensionality of high-dimensional data. In particular, it is a popular transformation method for feature extraction. In this study, we explore PCA’s ability to perform feature selection in regression applications. We introduce a new PCA-based approach, called Targeted PCA, that analyzes a multivariate dataset which includes the dependent variable: it identifies the principal component with a high representation of the dependent variable and then examines that component to capture and rank the contributions of the non-dependent variables. The study also compares the features selected by this approach with those selected by Least Absolute Shrinkage and Selection Operator (LASSO) regression. Finally, the selected features were tested in two regression models: multiple linear regression (MLR) and an artificial neural network (ANN). Results are presented for three datasets from socioeconomic, environmental, and computer image processing applications. We found that for two of the three datasets, more than 50% of the features selected by the PCA and LASSO regression methods coincide. In the regression predictions, the PCA-selected features produced little difference in MLR prediction accuracy compared with the LASSO-selected features; in the ANN regression, however, they demonstrated faster convergence and a greater reduction in error.
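
To make the abstract's description concrete, the following is a minimal Python sketch of the Targeted PCA idea, not the authors' implementation: it assumes a NumPy/scikit-learn stack, and the function names (targeted_pca_rank, lasso_rank), the LASSO alpha value, and the synthetic data are illustrative assumptions. It ranks predictors by their weight on the principal component that most strongly represents the dependent variable, then compares the top-ranked set with a LASSO-coefficient ranking, in the spirit of the feature-overlap comparison reported in the abstract.

```python
# Illustrative sketch of Targeted PCA feature ranking (not the authors' code).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

def targeted_pca_rank(X, y):
    """Rank predictor columns by their weight on the principal component
    that carries the largest (absolute) weight for the dependent variable."""
    # PCA is computed on the full multivariate dataset, i.e. predictors plus target.
    Z = StandardScaler().fit_transform(np.column_stack([X, y]))
    pca = PCA().fit(Z)
    weights = pca.components_.T                  # rows: variables, cols: components
    target_pc = np.argmax(np.abs(weights[-1]))   # component best representing y
    contrib = np.abs(weights[:-1, target_pc])    # predictors' contributions to that PC
    return np.argsort(contrib)[::-1]             # feature indices, strongest first

def lasso_rank(X, y, alpha=0.05):
    """Rank predictors by the magnitude of their LASSO coefficients."""
    Xs = StandardScaler().fit_transform(X)
    ys = (y - y.mean()) / y.std()
    coef = Lasso(alpha=alpha).fit(Xs, ys).coef_
    return np.argsort(np.abs(coef))[::-1]

# Toy comparison on synthetic data (a stand-in for the paper's datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.5 * rng.normal(size=300)

k = 3
pca_top = set(targeted_pca_rank(X, y)[:k])
lasso_top = set(lasso_rank(X, y)[:k])
print("overlap of top-k features:", len(pca_top & lasso_top) / k)
```

The top-k overlap printed at the end is only one simple way to quantify similarity between the two selected feature sets; the paper's own similarity measure may differ.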

Data availability

Datasets 1 and 2 can be retrieved from the open-source UCI Machine Learning Repository. Dataset 3 is owned by multiple government agencies and is subject to sharing restrictions; requests to access these datasets and the accompanying code should be directed to fariqrahmat94@gmail.com.

Acknowledgements

This work was supported by a grant (No. NEWTON/1/2018/WAB05/UPM/1) from the UK Natural Environment Council and the Ministry of Education Malaysia under the Understanding of the Impacts of Hydrometeorological Hazards program, and we thank both funders for their collaboration in research planning and development. We also acknowledge the Negeri Sembilan Town and Rural Planning Department and the International Institute for Applied Systems Analysis for providing environmental data free of charge. We would like to thank the Director General of Health Malaysia for his permission to publish this article.

Author information

Contributions

FR, ZZ, and AJ conceived the research. FR processed all the data, performed the analyses, and interpreted the results. FR and ZZ wrote the manuscript with contributions from all authors.

Corresponding author

Correspondence to Zed Zulkafli.

Ethics declarations

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Ethical approval

Ethical approval for this study was obtained from the Medical Research and Ethics Committee (MREC), Ministry of Health Malaysia (NMRR-19-4115-47702).

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Table of Dataset 1

See Table 8.

Table 8 The rank of all features in Dataset 1 based on Targeted PCA and LASSO regression

Appendix B Table of Dataset 2

See Table 9.

Table 9 The rank of all features in Dataset 2 based on Targeted PCA and LASSO regression

Appendix C Table of Dataset 3

See Table 10.

Table 10 The rank of all features in Dataset 3 based on Targeted PCA and LASSO regression

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rahmat, F., Zulkafli, Z., Ishak, A.J. et al. Supervised feature selection using principal component analysis. Knowl Inf Syst 66, 1955–1995 (2024). https://doi.org/10.1007/s10115-023-01993-5
