Significance of Sequence Features in Classification of Protein–Protein Interactions Using Machine Learning

Raj, Sini S.; Chandra, S. S. Vinod

doi:10.1007/s10930-023-10168-8

Published: 19 December 2023

Volume 43, pages 72–83, (2024)

The Protein Journal

Abstract

Protein–protein interactions are crucial for the entry of viruses into the cell. Understanding the mechanism of these interactions is essential for studying human-virus association, viral infections, and antiviral responses, and for developing new biologics and drug candidates. Experimental methods to analyze human-virus protein–protein interactions based on protein sequence data are time-consuming and labor-intensive, so machine learning models are being developed to predict interactions and determine large-scale interactomes between species. The present work highlights the importance of sequence features in classifying interacting and non-interacting proteins from protein sequence data. Higher-dimensional amino acid sequence features, such as Amino Acid Composition (AAC), Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), and Pseudo-Amino Acid Composition (PAAC), are extracted. Following feature extraction, three datasets are created: Dataset 1 contains all of the extracted features, while Datasets 2 and 3 contain the most relevant features obtained through dimensionality reduction. To analyze the importance of high-dimensional features and their participation in protein–protein interactions, a random forest classifier is trained on the three datasets. With dimensionality reduction, the model exhibited exceptional accuracy, indicating that dimensionality reduction fails to capture the complexity of interactions and the underlying relationships between human and viral proteins. By retaining high-dimensional features, it is possible to capture all the characteristics of protein–protein interactions that resemble host–pathogen associations, leading to the development of biologically meaningful models. Our proposed approach is a more realistic and comprehensive classification model, leading to deeper insights and better applications in virology and drug development.
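As an illustration of two of the sequence features named in the abstract, the following minimal Python sketch computes AAC (20 values) and DPC (400 values) for a single sequence using their commonly used definitions as residue and dipeptide frequencies. The function names, the handling of non-standard residues, and the concatenation of human and viral vectors in `pair_features` are assumptions made for this example; the abstract does not state how the authors encode a protein pair.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac(seq: str) -> list[float]:
    """Amino Acid Composition: fraction of the sequence made up by each residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Non-standard letters (e.g. X) are not counted but still add to the length.
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq: str) -> list[float]:
    """Dipeptide Composition: fraction of adjacent residue pairs of each type."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard letters
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def pair_features(human_seq: str, virus_seq: str) -> list[float]:
    """Hypothetical pair encoding: concatenate AAC and DPC of both proteins."""
    return aac(human_seq) + dpc(human_seq) + aac(virus_seq) + dpc(virus_seq)
```

Under this assumed encoding, each human-virus pair becomes an 840-dimensional vector (2 × (20 + 400)), which already hints at how quickly the feature space grows once further descriptors such as GAAC and PAAC are added.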

Data Availability

The authors declare that all the data used in the current study are available on request to the corresponding author.

Acknowledgements

The authors would like to thank the University of Kerala for providing a research fellowship to Sini S Raj. We are also grateful to the members of the Machine Intelligence Research Lab, Department of Computer Science, University of Kerala for their extended support.

Funding

No funding was received for this study.

Author information

Authors and Affiliations

Machine Intelligence Research Lab, Department of Computer Science, University of Kerala, Thiruvananthapuram, Kerala, India
Sini S. Raj & S. S. Vinod Chandra

Contributions

Conception and design: SSR and SSVC. Material preparation, data collection, and analysis were performed by SSR. The first draft of the manuscript was written by SSR and critically reviewed by SSVC. SSR and SSVC read and approved the final manuscript.

Corresponding author

Correspondence to Sini S. Raj.

Ethics declarations

Conflicts of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Raj, S.S., Chandra, S.S.V. Significance of Sequence Features in Classification of Protein–Protein Interactions Using Machine Learning. Protein J 43, 72–83 (2024). https://doi.org/10.1007/s10930-023-10168-8

Accepted: 30 October 2023
Published: 19 December 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s10930-023-10168-8
