Use of 2D FFT and DTW in Protein Sequence Comparison

The Protein Journal

Abstract

Protein sequence comparison remains a challenging task because of its computational complexity: proteins are built from 20 amino acids, whereas genome sequences use only four nucleotides. Further, protein sequences from different species have different lengths, which poses additional challenges for developing comparison methods, especially alignment-free ones. In this work, an efficient technique for comparing protein sequences is developed using a graphical representation. First, the 20 amino acids are classified into four groups based on polar class, narrowing the representational range from 20 to 4. Then a unit vector technique based on a two-quadrant Cartesian system is proposed to provide a new two-dimensional graphical representation of the protein sequence. Two approaches are then proposed to cope with the varying lengths of protein sequences from various species: one uses Dynamic Time Warping (DTW), and the other uses a two-dimensional Fast Fourier Transform (2D FFT). The effectiveness of these two techniques is analyzed using two evaluation criteria: quantitative measures based on symmetric distance (SD), and computational speed. Both methods are applied to five data sets: 9 ND4, 9 ND5, 9 ND6, 12 Baculovirus, and 24 TF proteins. The FFT-based method produces the same results as DTW but in less computational time. The results of the proposed method agree with the known biological reference, and the method produces better clustering than existing ones.
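The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not give the exact amino acid grouping, the unit vectors, or the symmetric distance formula. The four polarity classes below are a common textbook split, the angles (30, 60, 120, and 150 degrees, all in quadrants I and II) are arbitrary choices, and the names `encode`, `dtw_distance`, and `fft_distance` are hypothetical. The FFT variant zero-pads the shorter curve to a common length and compares magnitude spectra, which is only one plausible way to handle unequal lengths.

```python
import numpy as np

# ASSUMPTION: a common four-class polarity grouping; the article's own
# grouping is not stated in the abstract.
POLAR_CLASSES = {
    "nonpolar": "AVLIPFWM",
    "polar": "GSTCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}

# ASSUMPTION: one unit vector per class, all in the upper half-plane
# (quadrants I and II). The angles are illustrative only.
ANGLES_DEG = {"nonpolar": 30.0, "polar": 60.0, "acidic": 120.0, "basic": 150.0}

UNIT_VECTORS = {}
for cls, residues in POLAR_CLASSES.items():
    theta = np.deg2rad(ANGLES_DEG[cls])
    for aa in residues:
        UNIT_VECTORS[aa] = np.array([np.cos(theta), np.sin(theta)])


def encode(seq):
    """Map a protein string to a 2D curve: the running sum of unit vectors.

    Raises KeyError for symbols outside the 20 standard amino acids.
    """
    steps = np.array([UNIT_VECTORS[aa] for aa in seq.upper()])
    return np.cumsum(steps, axis=0)


def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping with Euclidean local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return float(cost[n, m])


def fft_distance(a, b):
    """Zero-pad both curves to a common length, then compare the
    magnitudes of their 2D FFTs with a Euclidean norm."""
    n = max(len(a), len(b))
    spec_a = np.abs(np.fft.fft2(a, s=(n, 2), axes=(0, 1)))
    spec_b = np.abs(np.fft.fft2(b, s=(n, 2), axes=(0, 1)))
    return float(np.linalg.norm(spec_a - spec_b))


if __name__ == "__main__":
    # Toy sequences of different lengths, not taken from the article's data sets.
    x = encode("MKTAYIAKQR")
    y = encode("MKTAYIAKQRQISF")
    print("DTW distance:", dtw_distance(x, y))
    print("FFT distance:", fft_distance(x, y))
```

Both functions accept curves of unequal length, which is the property the abstract highlights. The DTW loop costs time proportional to the product of the two lengths, while the FFT route costs roughly n log n for the padded length n, which is consistent with the reported speed advantage of the FFT-based method.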

Funding

We wish to confirm that there has been no financial support for this work that could have influenced its outcome.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by JP, SG, BM. The first draft of the manuscript was written by DKB and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jayanta Pal.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Consent to Participate

Not applicable.

Consent to Publish

Not applicable.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 1271 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pal, J., Ghosh, S., Maji, B. et al. Use of 2D FFT and DTW in Protein Sequence Comparison. Protein J 43, 1–11 (2024). https://doi.org/10.1007/s10930-023-10160-2

  • DOI: https://doi.org/10.1007/s10930-023-10160-2
