Abstract
Protein–protein interactions are crucial for the entry of viruses into the cell. Understanding the mechanism of these interactions is essential for studying human–virus association, viral infections, and antiviral responses, and for developing new biologics and drug candidates. Experimental methods for analyzing human–virus protein–protein interactions are time-consuming and labor-intensive, so machine learning models are being developed to predict interactions from protein sequence data and to determine large-scale interactomes between species. The present work highlights the importance of sequence features in classifying interacting and non-interacting proteins from protein sequence data. High-dimensional amino acid sequence features, such as Amino Acid Composition (AAC), Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), and Pseudo-Amino Acid Composition (PAAC), are extracted. Following feature extraction, three datasets were created: Dataset 1 contains all of the extracted features, while Datasets 2 and 3 contain the most relevant features obtained through dimensionality reduction. To analyze the importance of high-dimensional features and their participation in protein–protein interactions, a random forest classifier is trained on the three datasets. With dimensionality reduction, the model exhibited exceptional accuracy, indicating that dimensionality reduction fails to capture the complexity of interactions and the underlying relationships between human and viral proteins. Retaining high-dimensional features makes it possible to capture all the characteristics of protein–protein interactions that resemble host–pathogen associations, leading to the development of biologically meaningful models. Our proposed approach yields a more realistic and comprehensive classification model, leading to deeper insights and better applications in virology and drug development.
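To make two of the descriptors named above concrete, the following is a minimal sketch of AAC and DPC computed from a raw sequence, using only the Python standard library. The function names (`aac`, `dpc`, `pair_features`) are hypothetical. Concatenating the human and viral vectors into one pair-level vector is an assumption made for illustration, not a description of the authors' pipeline. GAAC and PAAC are omitted because they need grouping schemes and physicochemical parameters that the abstract does not specify.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
VALID = set(AMINO_ACIDS)
# All 400 ordered residue pairs, in a fixed order so vectors are comparable.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq: str) -> list[float]:
    """Amino Acid Composition: fraction of each standard residue (20 values)."""
    residues = [r for r in seq.upper() if r in VALID]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dpc(seq: str) -> list[float]:
    """Dipeptide Composition: fraction of each adjacent residue pair (400 values)."""
    s = seq.upper()
    # Keep only adjacent pairs in which both residues are standard.
    pairs = [s[i:i + 2] for i in range(len(s) - 1)
             if s[i] in VALID and s[i + 1] in VALID]
    if not pairs:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DIPEPTIDES]


def pair_features(human_seq: str, virus_seq: str) -> list[float]:
    """Assumed pair encoding: AAC and DPC of both proteins, concatenated."""
    return aac(human_seq) + dpc(human_seq) + aac(virus_seq) + dpc(virus_seq)
```

Under this assumed encoding, each human–virus pair becomes a vector of 2 × (20 + 400) = 840 values, which gives a sense of how quickly the feature space grows even before GAAC and PAAC are added.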
Data Availability
The authors declare that all the data used in the current study are available from the corresponding author on request.
Acknowledgements
The authors would like to thank the University of Kerala for providing a research fellowship to Sini S Raj. We are also grateful to the members of the Machine Intelligence Research Lab, Department of Computer Science, University of Kerala, for their extended support.
Funding
No funding was received for this study.
Author information
Contributions
Conception and design: SSR and SSVC. Material preparation, data collection, and analysis were performed by SSR. The first draft of the manuscript was written by SSR and critically reviewed by SSVC. SSR and SSVC read and approved the final manuscript.
Ethics declarations
Conflicts of interest
All authors declare that they have no conflicts of interest.
About this article
Cite this article
Raj, S.S., Chandra, S.S.V. Significance of Sequence Features in Classification of Protein–Protein Interactions Using Machine Learning. Protein J 43, 72–83 (2024). https://doi.org/10.1007/s10930-023-10168-8
DOI: https://doi.org/10.1007/s10930-023-10168-8