Abstract
Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is due in part to inherent limitations of the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods, namely Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso, within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC, and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
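To make the feature-selection metrics named in the abstract concrete, the following is a minimal sketch of how TPR and TNR could be computed when the truly relevant features are known, as in a simulation study. It is an illustration only and not the authors' code; the function name `selection_rates`, its arguments, and the toy indices are hypothetical.

```python
def selection_rates(true_support, selected, n_features):
    """Return (TPR, TNR) for one feature-selection result.

    true_support: indices of features whose true coefficient is nonzero
    selected:     indices of features retained by the selection method
    n_features:   total number of candidate features
    (All names are hypothetical; this is an illustrative sketch.)
    """
    true_support, selected = set(true_support), set(selected)
    tp = len(true_support & selected)   # relevant features that were kept
    fn = len(true_support - selected)   # relevant features that were missed
    fp = len(selected - true_support)   # irrelevant features that were kept
    tn = n_features - tp - fn - fp      # irrelevant features that were dropped
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Toy example: 10 candidate features, 4 truly relevant, 4 selected.
tpr, tnr = selection_rates(true_support=[0, 1, 2, 3],
                           selected=[0, 1, 2, 7],
                           n_features=10)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")
```

In a mixture-regression setting such rates would presumably be computed per component and then summarized across components and simulation replicates; that aggregation is an assumption here, not something the abstract specifies.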
Funding source: National Cancer Institute (NCI) Cancer Center Support Grant
Award Identifier / Grant number: P30 CA168524
Funding source: the Kansas Institute for Precision Medicine COBRE, supported by the National Institute of General Medical Sciences award
Award Identifier / Grant number: P20 GM130423
Funding source: the Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, supported by the National Institute of General Medical Sciences award
Award Identifier / Grant number: P20 GM103418
Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: None declared.
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
Supplementary Material
This article contains supplementary material (https://doi.org/10.1515/sagmb-2022-0031).
© 2023 Walter de Gruyter GmbH, Berlin/Boston