Significance of Sequence Features in Classification of Protein–Protein Interactions Using Machine Learning

The Protein Journal

Abstract

Protein–protein interactions are crucial for the entry of viruses into the cell. Understanding the mechanism of these interactions is essential for studying human-virus association, viral infections, and antiviral responses, and for developing new biologics and drug candidates. Experimental methods for analyzing human-virus protein–protein interactions are time-consuming and labor-intensive, so machine learning models are being developed to predict interactions from protein sequence data and to determine large-scale interactomes between species. The present work highlights the importance of sequence features in classifying interacting and non-interacting proteins from protein sequence data. High-dimensional amino acid sequence features, such as Amino Acid Composition (AAC), Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), and Pseudo-Amino Acid Composition (PAAC), were extracted. Following feature extraction, three datasets were created: Dataset 1 contains all of the extracted features, while Datasets 2 and 3 contain the most relevant features obtained through dimensionality reduction. To analyze the importance of high-dimensional features and their participation in protein–protein interactions, a random forest classifier was trained on each of the three datasets. Without dimensionality reduction, the model exhibited exceptional accuracy, suggesting that dimensionality reduction fails to capture the complexity of interactions and the underlying relationships between human and viral proteins. Retaining the high-dimensional features makes it possible to capture the characteristics of protein–protein interactions that resemble host–pathogen associations, leading to biologically meaningful models. Our proposed approach is a more realistic and comprehensive classification model, leading to deeper insights and better applications in virology and drug development.
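
As an illustration of the two simplest descriptors named in the abstract, the following is a minimal sketch of AAC and DPC using their standard definitions (residue frequency over sequence length, and ordered residue-pair frequency over the number of adjacent pairs). The function names aac, dpc, and pair_features are hypothetical, and concatenating the human and viral vectors into one pair-level vector is an assumption made for illustration; the abstract does not state how the authors encoded a protein pair.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino Acid Composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide Composition: fraction of each ordered residue pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    # Slide a window of width 2 so that overlapping dipeptides are all counted.
    counts = {}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


def pair_features(human_seq, virus_seq):
    """Hypothetical pair encoding (an assumption): concatenate both proteins' vectors."""
    return aac(human_seq) + dpc(human_seq) + aac(virus_seq) + dpc(virus_seq)
```

Under this assumed encoding, each human-virus pair becomes an 840-dimensional vector (2 × (20 + 400)), which gives a sense of how quickly the feature space grows once GAAC, PAAC, and further descriptors are added. Non-standard residue codes such as X are ignored by both functions, so the AAC values of a sequence containing them sum to less than 1.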

Data Availability

The authors declare that all the data used in the current study are available on request to the corresponding author.

Acknowledgements

The authors would like to thank the University of Kerala for providing a research fellowship to Sini S Raj. We are also grateful to the members of the Machine Intelligence Research Lab, Department of Computer Science, University of Kerala, for their extended support.

Funding

No funding was received for this study.

Author information

Contributions

Conception and design: SSR and SSVC. Material preparation, data collection, and analysis were performed by SSR. The first draft of the manuscript was written by SSR and critically reviewed by SSVC. SSR and SSVC read and approved the final manuscript.

Corresponding author

Correspondence to Sini S. Raj.

Ethics declarations

Conflicts of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Raj, S.S., Chandra, S.S.V. Significance of Sequence Features in Classification of Protein–Protein Interactions Using Machine Learning. Protein J 43, 72–83 (2024). https://doi.org/10.1007/s10930-023-10168-8

  • DOI: https://doi.org/10.1007/s10930-023-10168-8
