A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction

Lavkush Sharma; Akshay Deepak; Ashish Ranjan; Gopalakrishnan Krishnasamy

doi:10.1515/sagmb-2022-0057

Published by De Gruyter September 4, 2023

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Proteins are the building blocks of all living things, and understanding the molecular mechanisms of life requires determining protein function. CNNs are good at capturing short-term relationships, whereas GRUs and LSTMs can capture long-term dependencies; a hybrid approach that combines the complementary strengths of these deep-learning models motivates our work. Protein language models, which use attention networks to extract meaningful information and build representations of proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention model with protein language model embeddings that combines the output of the CNN with the output of the BiGRU-Attention component to predict protein functions. We evaluated the proposed hybrid model on human and yeast datasets. Compared with the state-of-the-art model SDN2GO, it improves the Fmax value on the human dataset by 1.9 % for cellular component prediction, 3.8 % for molecular function prediction, and 0.6 % for biological process prediction. On the yeast dataset, the corresponding improvements are 2.4 %, 5.2 %, and 1.2 %.
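The abstract reports its gains in terms of Fmax but this preview does not define the metric. As a minimal sketch, assuming the protein-centric Fmax commonly used in CAFA-style function prediction benchmarks (the paper may use a different variant), the computation could look like this. The function name `fmax`, the threshold grid, and the toy term identifiers are illustrative assumptions, not taken from the paper.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (assumed CAFA-style definition).

    y_true  : list of sets, the true terms for each protein
              (each protein is assumed to have at least one true term)
    y_score : list of dicts, term -> predicted score in [0, 1]
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 0.99
        thresholds = [i / 100 for i in range(1, 100)]

    best = 0.0
    for t in thresholds:
        precisions = []   # only proteins with at least one prediction at t
        recalls = []      # all proteins
        for true_terms, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & true_terms)
            if predicted:
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(true_terms))

        if not precisions:
            continue  # no protein has any prediction at this threshold

        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with made-up term identifiers
y_true = [{"T1", "T2"}, {"T3"}]
y_score = [
    {"T1": 0.9, "T2": 0.4, "T3": 0.2},
    {"T3": 0.6, "T1": 0.7},
]
print(round(fmax(y_true, y_score), 3))  # 0.857
```

In this toy case the best threshold keeps both true terms for the first protein and one true plus one false term for the second, giving an average precision of 0.75 and an average recall of 1.0.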

Keywords: attention technique; CNN; gated recurrent unit; protein language models; protein sequence

Corresponding author: Lavkush Sharma, Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India, E-mail: lavkushs.phd20.cs@nitp.ac.in

Research ethics: Not applicable.
Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.
Competing interests: The authors state no conflict of interest.
Research funding: None declared.
Data availability: The raw data can be obtained on request from the corresponding author.

Received: 2022-11-27

Accepted: 2023-04-20

Published Online: 2023-09-04
