Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein Language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years processing the protein sequences. In this paper, we propose a hybrid CNN + BiGRU – Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. The proposed hybrid model improves the Fmax value over the state-of-the-art model SDN2GO for the cellular component prediction task by 1.9 %, for the molecular function prediction task by 3.8 % and for the biological process prediction task by 0.6 % for human dataset and for yeast dataset the cellular component prediction task by 2.4 %, for the molecular function prediction task by 5.2 % and for the biological process prediction task by 1.2 %.
-
Research ethics: Not applicable.
-
Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
-
Competing interests: The authors state no conflict of interest.
-
Research funding: None declared.
-
Data availability: The raw data can be obtained on request from the corresponding author.
References
Asgari, E. and Mofrad, M.R.K. (2015). ProtVec: a continuous distributed representation of biological sequences for proteomics and genomics. PLoS One 10: e0141287. https://doi.org/10.1371/journal.pone.0141287.Search in Google Scholar PubMed PubMed Central
Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.. (2000). Gene ontology: tool for the unification of biology. The gene ontology consortium. Nature genetics 25: 25–29. https://doi.org/10.1038/75556.Search in Google Scholar PubMed PubMed Central
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.Search in Google Scholar
Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O’Donovan, C., and Apweiler, R. (2009). The Goa database in 2009-an integrated gene ontology annotation resource. Nucleic Acids Res. 37: D396–D403. https://doi.org/10.1093/nar/gkn803.Search in Google Scholar PubMed PubMed Central
Cai, Y., Wang, J., and Deng, L. (2020). SDN2GO: an integrated deep learning model for protein function prediction. Front. Bioeng. Biotechnol. 8: 391, https://doi.org/10.3389/fbioe.2020.00391.Search in Google Scholar PubMed PubMed Central
Cao, R., Freitas, C., Chan, L., Sun, M., Jiang, H., and Chen, Z. (2017). ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules 22: 1732. https://doi.org/10.3390/molecules22101732.Search in Google Scholar PubMed PubMed Central
Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016). Neural sentiment classification with user and product attention. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 1650–1659.10.18653/v1/D16-1171Search in Google Scholar
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. Association for Computational Linguistics.10.3115/v1/D14-1179Search in Google Scholar
Choi, K., Lee, Y., Kim, C., Yoon, M. (2021). An effective GCN-based hierarchical multilabel classification for protein function prediction. arXiv:2112.02810.Search in Google Scholar
Clark, W.T. and Radivojac, P. (2011a). Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinf. 79: 2086–2096. https://doi.org/10.1002/prot.23029.Search in Google Scholar PubMed
Clark, W.T. and Radivojac, P. (2011b). Analysis of protein function and its prediction from amino acid sequence. Proteins Struct. Funct. Bioinf. 79: 2086–2096. https://doi.org/10.1002/prot.23029.Search in Google Scholar
Consortium, U. (2015). Uniprot: a hub for protein information. Nucleic Acids Res. 43: D204–D212. https://doi.org/10.1093/nar/gku989.Search in Google Scholar PubMed PubMed Central
Dutta, P. and Saha, S. (2017). Fusion of expression values and protein interaction information using multi-objective optimization for improving gene clustering. Comput. Biol. Med. 89: 31–43. https://doi.org/10.1016/j.compbiomed.2017.07.015.Search in Google Scholar PubMed
Dutta, P. and Saha, S. (2020). Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 6396–6407.10.18653/v1/2020.acl-main.570Search in Google Scholar
Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., et al.. (2021). ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 14: 1.10.1101/2020.07.12.199554Search in Google Scholar
Elsayed, N., Maida, A.S., and Bayoumi, M. (2019). Deep gated recurrent and convolutional network hybrid model for univariate time series classification. Int. J. Adv. Comput. Sci. Appl. 10: 654–664. https://doi.org/10.14569/ijacsa.2019.0100582.Search in Google Scholar
Forslund, K. and Sonnhammer, E.L. (2008). Predicting protein function from domain content. Bioinformatics 24: 1681–1687. https://doi.org/10.1093/bioinformatics/btn447.Search in Google Scholar
Giri, S.J., Dutta, P.Student Member, Halan, P., and Saha, S. (2020). MultiPredGO: deep multi-modal protein function prediction by amalgamating protein structure, sequence, and interaction information. IEEE J. Biomed. Health Inform 25: 1832–1838.10.1109/JBHI.2020.3022806Search in Google Scholar PubMed
Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nechaev, D., Matthes, F., and Rost, B. (2019). Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform. 20: 723. https://doi.org/10.1186/s12859-019-3220-8.Search in Google Scholar PubMed PubMed Central
Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., et al.. (2009). Interpro: the integrative protein signature database. Nucleic Acids Res. 37: D211–D215. https://doi.org/10.1093/nar/gkn785.Search in Google Scholar PubMed PubMed Central
Jiang, Y., Oron, T.R., Clark, W.T., Bankapur, A.R., D’Andrea, D., Lepore, R., Funk, C.S., Kahanda, I., Verspoor, K.M., Ben-Hur, A., et al.. (2016). An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17: 184. https://doi.org/10.1186/s13059-016-1037-6.Search in Google Scholar PubMed PubMed Central
Jinbao, T., Weiwei, K., Qiaoxin, T., and Zhaoqian, W. (2021). Text classification method based on LSTM-attention and CNN hybrid model. Comput. Eng. Appl. 57: 154–162.10.1145/3488933.3488970Search in Google Scholar
Kabir, A. and Shehu, A. (2022). Transformer neural networks attending to both sequence and structure for protein prediction tasks, arXiv:2206.11057.Search in Google Scholar
Kabir, A. and Shehu, A. (2022). GOProFormer: a multi-modal transformer method for gene ontology protein function prediction.10.1101/2022.10.20.513033Search in Google Scholar
Kingma, D.P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.Search in Google Scholar
Kuang, S., Li, J., Branco, A., Luo, W.-H., and Xiong, D. (2018). Attention focusing for neural machine translation by bridging source and target embeddings. In: Proceedings of the 56th annual meeting of the association for computational linguistics, Vol. 1, Long Papers, pp. 1767–1776.10.18653/v1/P18-1164Search in Google Scholar
Kulmanov, M. and Hoehndorf, R. (2020). Deepgoplus: improved protein function prediction from sequence. Bioinformatics 36: 422–429. https://doi.org/10.1093/bioinformatics/btz595.Search in Google Scholar PubMed PubMed Central
Kulmanov, M., Khan, M.A., and Hoehndorf, R. (2018). DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 34: 660–668. https://doi.org/10.1093/bioinformatics/btx624.Search in Google Scholar PubMed PubMed Central
Le, N.Q.K., Yapp, E.K.Y., and Yeh, H.Y. (2019). ET-GRU: using multi-layer gated recurrent units to identify electron transport proteins. BMC Bioinf. 20: 377. https://doi.org/10.1186/s12859-019-2972-5.Search in Google Scholar PubMed PubMed Central
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput. 1: 541–551. https://doi.org/10.1162/neco.1989.1.4.541.Search in Google Scholar
Li, Y., Wang, X., and Xu, P. (2018). Chinese text classification model based on deep learning. Future Internet 10: 113, https://doi.org/10.3390/fi10110113.Search in Google Scholar
Li, J., Wang, L., Zhang, X., Liu, B., and Wang, Y. (2020). Gonet: a deep network to annotate proteins via recurrent convolution networks. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp. 29–34.10.1109/BIBM49941.2020.9313235Search in Google Scholar
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al.. (2022). Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv 379: 1123–1130, https://doi.org/10.1101/2022.07.20.500902.Search in Google Scholar
Marquet, C., Heinzinger, M., Olenyi, T., Dallago, C., Erckert, K., Bernhofer, M., Nechaev, D., and Rost, B. (2022). Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141: 1629–1647.10.1007/s00439-021-02411-ySearch in Google Scholar PubMed PubMed Central
Nambiar, A., Heflin, M., Liu, S., Maslov, S., Hopkins, M., and Ritz, A. (2020). Transforming the language of life: transformer neural networks for protein prediction tasks. In: Proceedings of the international conference on bioinformatics, computational biology, and health informatics (BCB). ACM, pp. 1–8.10.1101/2020.06.15.153643Search in Google Scholar
Pearson, W.R. (2013). An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinformatics 42: 3–1. https://doi.org/10.1002/0471250953.bi0301s42.Search in Google Scholar PubMed PubMed Central
Piovesan, D., Giollo, M., Leonardi, E., Ferrari, C., and Tosatto, S.C. (2015). INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res. 43: W134–W140. https://doi.org/10.1093/nar/gkv523.Search in Google Scholar PubMed PubMed Central
Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE ACM Trans. Comput. Biol. Bioinf. 17: 1648–1659, https://doi.org/10.1109/tcbb.2019.2911609.Search in Google Scholar PubMed
Ranjan, A., Fernandez-Baca, D., Tripathi, S., and Deepak, A. (2021a). An ensemble Tf-Idf based approach to protein function prediction via sequence segmentation. IEEE ACM Trans. Comput. Biol. Bioinf. 19: 2685–2696. https://doi.org/10.1109/TCBB.2021.3093060.Search in Google Scholar PubMed
Ranjan, A., Tiwari, A., and Deepak, A. (2021b). A sub-sequence based approach to protein function prediction via multi-attention based multi-aspect network. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 94–105. https://doi.org/10.1109/TCBB.2021.3130923.Search in Google Scholar PubMed
Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Tripathi, S., and Deepak, A. (2022). MCWS-transformers: towards an efficient modeling of protein sequences via multi context-window based scaled self-attention. IEEE ACM Trans. Comput. Biol. Bioinf. 20: 1188–1199, https://doi.org/10.1109/TCBB.2022.3173789.Search in Google Scholar PubMed
Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., et al.. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118: e2016239118. https://doi.org/10.1073/pnas.2016239118.Search in Google Scholar PubMed PubMed Central
Roy, A., Yang, J., and Zhang, Y. (2012). COFACTOR: an accurate comparative algorithm for structure-based protein function annotation. Nucleic Acids Res. 40: W471–W477, https://doi.org/10.1093/nar/gks372.Search in Google Scholar PubMed PubMed Central
Sharan, R., Ulitsky, I., and Shamir, R. (2007). Network-based prediction of protein function. Mol. Syst. Biol. 3: 88–100, https://doi.org/10.1038/msb4100129.Search in Google Scholar PubMed PubMed Central
Stark, H., Dallago, C., Heinzinger, M., and Rost, B. (2021). Light attention predicts protein location from the language of life. Bioinform. Adv. 1: vbab035. https://doi.org/10.1093/bioadv/vbab035.Search in Google Scholar PubMed PubMed Central
Strodthoff, N., Wagner, P., Wenzel, M., and Samek, W. (2020). UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36: 2401–2409. https://doi.org/10.1093/bioinformatics/btaa003.Search in Google Scholar PubMed PubMed Central
Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., HuertaCepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., et al.. (2015). String v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43: D447–D452. https://doi.org/10.1093/nar/gku1003.Search in Google Scholar PubMed PubMed Central
Wang, H., Yan, L., Huang, H., and Ding, C. (2016). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503–513. https://doi.org/10.1109/tcbb.2016.2591529.Search in Google Scholar PubMed
Wang, H., Yan, L., Huang, H., and Ding, C. (2017). From protein sequence to protein function via multi-label linear discriminant analysis. IEEE ACM Trans. Comput. Biol. Bioinf. 14: 503–513, https://doi.org/10.1109/tcbb.2016.2591529.Search in Google Scholar
Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The itasser suite: protein structure and function prediction. Nat. Methods 12: 7. https://doi.org/10.1038/nmeth.3213.Search in Google Scholar PubMed PubMed Central
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., and Hovy, E.H. (2016). Hierarchical attention networks for document classification. In: Proc. HLT-NAACL, pp. 1480–1489.10.18653/v1/N16-1174Search in Google Scholar
Yang, L., Wei, P., Zhong, C., Li, X., and Tang, Y. Y. (2020). Protein structure prediction based on BN-GRU method. Int. J. Wavelets Multiresolut. Inf. Process. 18: 2050045, https://doi.org/10.1142/s0219691320500459.Search in Google Scholar
You, R., Yao, S., Xiong, Y., Huang, X., Sun, F., Mamitsuka, H., and Zhu, S. (2019). Netgo: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 47: W379–W387. https://doi.org/10.1093/nar/gkz388.Search in Google Scholar PubMed PubMed Central
Zhang, Y., Yuan, H., Wang, J., and Zhang, X. (2017). Using a CNN-LSTM model for sentiment Intensity prediction [C]. In: Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis. Association for Computational Linguistics, pp. 200–204.10.18653/v1/W17-5227Search in Google Scholar
Zhang, C., Zheng, W., Freddolino, P.L., and Zhang, Y. (2018). Metago: predicting gene ontology of non-homologous proteins through low-resolution protein structure prediction and protein-protein network mapping. J. Mol. Biol. 430: 2256–2265. https://doi.org/10.1016/j.jmb.2018.03.004.Search in Google Scholar PubMed PubMed Central
© 2023 Walter de Gruyter GmbH, Berlin/Boston