scFED: Clustering Identifying Cell Types of scRNA-Seq Data Based on Feature Engineering Denoising

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Recently developed single-cell RNA-seq (scRNA-seq) technology allows researchers to investigate disease development at single-cell resolution. Clustering is one of the most essential strategies for analyzing scRNA-seq data, and choosing a high-quality feature set can significantly improve single-cell clustering and classification. However, for technical reasons, computationally burdensome and highly expressed genes cannot provide a stable and predictive feature set. In this study, we introduce scFED, a feature-engineered gene selection framework. scFED identifies prospective feature sets to eliminate noise fluctuation and fuses them with existing knowledge from the tissue-specific cellular taxonomy reference database (CellMatch) to avoid the influence of subjective factors. It then applies a reconstruction approach for noise reduction and amplification of crucial information. We apply scFED to four real single-cell datasets and compare it with other techniques. The results show that scFED improves clustering, reduces the dimensionality of scRNA-seq data, improves cell type identification when combined with clustering algorithms, and outperforms the other methods. scFED therefore offers advantages for gene selection in scRNA-seq data.
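The three stages named in the abstract (candidate gene selection, fusion with prior marker knowledge, and reconstruction for denoising) can be illustrated with a minimal sketch. The abstract does not state the selection criterion or the reconstruction method, so this example substitutes variance ranking and a truncated-SVD reconstruction as stand-ins; it is not the scFED algorithm. The function names, the parameters `n_top` and `rank`, and the marker set are all hypothetical.

```python
import numpy as np


def select_features(counts, gene_names, marker_genes, n_top=50):
    """Return sorted column indices of candidate genes.

    Assumption: candidates are the n_top genes with the highest variance of
    log1p counts, merged with genes from a prior marker list (a stand-in for
    knowledge taken from a database such as CellMatch).
    """
    log_counts = np.log1p(counts)
    variances = log_counts.var(axis=0)
    top_idx = np.argsort(variances)[::-1][:n_top]
    marker_idx = np.array(
        [i for i, g in enumerate(gene_names) if g in marker_genes], dtype=int
    )
    # Union of data-driven and knowledge-driven genes, without duplicates.
    return np.union1d(top_idx, marker_idx).astype(int)


def denoise_low_rank(counts, rank=5):
    """Reconstruct log1p counts from the leading singular components.

    Assumption: truncated SVD is used here only as a generic example of
    reconstruction-based noise reduction.
    """
    log_counts = np.log1p(counts)
    mean = log_counts.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(log_counts - mean, full_matrices=False)
    r = min(rank, len(s))
    return (u[:, :r] * s[:r]) @ vt[:r] + mean


# Synthetic example: 30 cells by 40 genes of Poisson counts.
rng = np.random.default_rng(0)
demo_counts = rng.poisson(2.0, size=(30, 40)).astype(float)
demo_genes = [f"g{i}" for i in range(40)]

kept = select_features(demo_counts, demo_genes, {"g0", "g39"}, n_top=10)
denoised = denoise_low_rank(demo_counts[:, kept], rank=3)
print(denoised.shape)
```

The denoised matrix could then be passed to any clustering algorithm; how scFED itself performs each stage is described in the full article.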

Data availability

The datasets used in this study were downloaded from databases provided by the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI).

Funding

This work was supported by the National Natural Science Foundation of China (61902216, 61972226, 62172254 and 62172253).

Author information

Corresponding author

Correspondence to Feng Li.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Li, F., Shang, J. et al. scFED: Clustering Identifying Cell Types of scRNA-Seq Data Based on Feature Engineering Denoising. Interdiscip Sci Comput Life Sci 15, 590–601 (2023). https://doi.org/10.1007/s12539-023-00574-y

  • DOI: https://doi.org/10.1007/s12539-023-00574-y
