Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets

Bo Zhang; Jianghua He; Jinxiang Hu; Prabhakar Chalise; Devin C. Koestler

doi:10.1515/sagmb-2022-0031

Published by De Gruyter July 25, 2023

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is due in part to inherent limitations of the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (namely Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC, and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso for the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
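The abstract reports TPR and TNR without stating how they are computed. The following is a minimal sketch of the standard definitions as they would apply to feature selection in a simulation, assuming the set of truly associated features is known. The function name `selection_rates` and its arguments are hypothetical and are not taken from the article or from the CSMR software.

```python
def selection_rates(true_support, selected, n_features):
    """Return (TPR, TNR) for one feature-selection result.

    Assumption: features are indexed 0..n_features-1 and the true
    support (features with nonzero effect) is known, as in a simulation.
    """
    true_support = set(true_support)
    selected = set(selected)
    true_null = set(range(n_features)) - true_support

    true_positives = len(true_support & selected)   # relevant and selected
    true_negatives = len(true_null - selected)      # irrelevant and not selected

    # Guard against empty denominators (no relevant or no irrelevant features).
    tpr = true_positives / len(true_support) if true_support else float("nan")
    tnr = true_negatives / len(true_null) if true_null else float("nan")
    return tpr, tnr


# Hypothetical example: 10 features, 4 truly relevant, 4 selected (3 correct).
tpr, tnr = selection_rates(true_support=[0, 1, 2, 3],
                           selected=[0, 1, 2, 7],
                           n_features=10)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

In a regression-based clustering setting, such rates would presumably be computed separately for each mixture component and then summarized across simulation replicates. That aggregation step is an assumption here, not a detail given in the abstract.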

Keywords: disease heterogeneity; mixture modeling; supervised learning

Corresponding author: Devin C. Koestler, Department of Biostatistics & Data Science, University of Kansas Medical Center, 3901 Rainbow Blvd., Robinson Hall, 5028K, Kansas City, KS 66160, USA, E-mail: dkoestler@kumc.edu

Prabhakar Chalise and Devin C. Koestler contributed equally to this work.

Funding source: National Cancer Institute (NCI) Cancer Center Support Grant

Award Identifier / Grant number: P30 CA168524

Funding source: the Kansas Institute for Precision Medicine COBRE, supported by the National Institute of General Medical Science award

Award Identifier / Grant number: P20 GM130423

Funding source: the Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, supported by the National Institute of General Medical Science award

Award Identifier / Grant number: P20 GM103418

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/sagmb-2022-0031).

Received: 2022-06-29

Accepted: 2023-05-31

Published Online: 2023-07-25
