Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients

Aziz, Shahid; Shahnawazuddin, S.

doi:10.1007/s00034-023-02592-z

Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients

Published: 10 February 2024

Volume 43, pages 3020–3041, (2024)
Cite this article

Circuits, Systems, and Signal Processing Aims and scope Submit manuscript

90 Accesses
1 Altmetric
Explore all metrics

Abstract

The task of developing an automatic speaker verification (ASV) system for children’s speech is extremely challenging due to the dearth of domain-specific data. The challenges are further exacerbated in the case of short utterances of speech, a relatively unexplored domain in the case of children’s ASV. Voice-based biometric systems require an adequate amount of speech data for enrollment and verification; otherwise, the performance considerably degrades. It is for this reason that the trade-off between convenience and security is gruelling to maintain in practical scenarios. In this paper, we have focused on data paucity and preservation of the higher-frequency contents in order to enhance the performance of a short utterance-based children’s speaker verification system. To deal with data scarcity, an out-of-domain data augmentation approach has been proposed. Since the out-of-domain data used are from adult speakers which are acoustically very different from children’s speech, we have made use of techniques like prosody modification, formant modification, and voice conversion in order to render it acoustically similar to children’s speech prior to augmentation. This helps in not only increasing the amount of training data but also in effectively capturing the missing target attributes which helps in boosting the verification. Further to that, we have resorted to concatenation of the classical Mel-frequency cepstral coefficients (MFCC) features with the Gamma-tone frequency cepstral coefficient (GTF-CC) or with the Inverse Gamma-tone frequency cepstral coefficient (IGTF-CC) features. The feature concatenation of MFCC and IGTF-CC is employed with the sole intention of effectively modeling the human auditory system along with the preservation of higher-frequency contents in the children’s speech data. This feature concatenation approach, when combined with data augmentation, helps in further improvement in the verification performance. The experimental results testify our claims, wherein we have achieved an overall relative reduction of \(38.5\%\) for equal error rate.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach

Role of Data Augmentation and Effective Conservation of High-Frequency Contents in the Context Children’s Speaker Verification System

Article 05 February 2024

Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions

Article 05 November 2023

Data Availability

The data sets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

A. Batliner, M. Blomberg, S. D’Arcy, D. Elenius, D. Giuliani, M. Gerosa, C. Hacker, M. Russell, M. Wong, The PF_STAR children’s speech corpus. Proceedings INTERSPEECH, pp. 2761–2764 (2005)
E.P. Damskägg, V. Välimäki, Audio time stretching using fuzzy classification of spectral bins. Appl. Sci. 7(12), 1293 (2017)
Article Google Scholar
S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980). https://doi.org/10.1109/TASSP.1980.1163420
Article Google Scholar
D. Dimitriadis, P. Maragos, A. Potamianos, On the effects of filterbank design and energy computation on robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 19(6), 1504–1516 (2010)
Article Google Scholar
M. Eskenazi, J. Mostow, D. Graff, The CMU Kids Corpus LDC97S63. https://catalog.ldc.upenn.edu/LDC97S63 (1997)
M. Fedila, M. Bengherabi, A. Amrouche, Gammatone filterbank and symbiotic combination of amplitude and phase-based spectra for robust speaker verification under noisy conditions and compression artifacts. Multimed. Tools Appl. 77(13), 16721–16739 (2018)
Article Google Scholar
M. Gerosa, D. Giuliani, S. Narayanan, A. Potamianos, A review of ASR technologies for children’s speech. Proceeding Workshop on Child, Computer and Interaction, pp. 7:1–7:8 (2009)
B. Gold, N. Morgan, D. Ellis, Speech and audio signal processing: processing and perception of speech and music (Wiley, 2011)
B. Gold, N. Morgan, D. Ellis, D. O’Shaughnessy, Speech and audio signal processing: processing and perception of speech and music, second edition. J. Acoust. Soc. Am. 132, 1861–2 (2012). https://doi.org/10.1121/1.4742973
T. Kaneko, H. Kameoka, Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293 (2017)
H.K. Kathania, S.R. Kadiri, P. Alku, M. Kurimo, Study of formant modification for children asr. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2020), pp. 7429–7433
H.K. Kathania, S. Shahnawazuddin, W. Ahmad, N. Adiga, Role of linear, mel and inverse-mel filterbanks in automatic recognition of speech from high-pitched speakers. Circuits Syst. Signal Process. 38(10), 4667–4682 (2019)
Article Google Scholar
V. Kumar, A. Kumar, S. Shahnawazuddin, Creating robust children’s ASR system in zero-resource condition through out-of-domain data augmentation. Circuits Syst. Signal Process. 41(4), 2205–2220 (2022)
Article Google Scholar
S. Lee, A. Potamianos, S.S. Narayanan, Acoustics of children’s speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Am. 105(3), 1455–1468 (1999)
Article Google Scholar
R.D. Patterson, I. Nimmo-Smith, J. Holdsworth, P. Rice, An efficient auditory filterbank based on the gammatone function. A meeting of the IOC Speech Group on Auditory Modelling at RSRE, vol. 2 (1987)
V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts. Proceedings INTERSPEECH (2015)
A. Poddar, M. Sahidullah, G. Saha, Speaker verification with short utterances: a review of challenges, trends and opportunities. IET Biom. 7(2), 91–101 (2018)
Article Google Scholar
A. Poddar, M. Sahidullah, G. Saha, Quality measures for speaker verification with short utterances. Digit. Signal Process. 88, 66–79 (2019). https://doi.org/10.1016/j.dsp.2019.01.023
Article Google Scholar
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi Speech recognition toolkit. Proceedings ASRU (2011)
D. Povey, X. Zhang, S. Khudanpur, Parallel training of deep neural networks with natural gradient and parameter averaging. Proceedings ICLR (2015)
S.R.M. Prasanna, D. Govind, K.S. Rao, B. Yegnanarayana, Fast prosody modification using instants of significant excitation. Proceedings International Conference on Speech Prosody (2010)
P. Rajan, T. Kinnunen, C. Hanilci, J. Pohjalainen, P. Alku, Using group delay functions from all-pole models for speaker recognition. INTERSPEECH, pp. 2489–2493 (2013)
T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals, WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition. Proceedings ICASSP 1, 81–84 (1995)
Google Scholar
M. Russell, S. D’Arcy, Challenges for computer recognition of children’s speech. Proceedings Speech and Language Technologies in Education (SLaTE) (2007)
S. Safavi, M. Russell, P. Jančovič, Automatic speaker, age-group and gender identification from children’s speech. Comput. Speech Lang. 50, 141–156 (2018)
Article Google Scholar
S. Shahnawazuddin, N. Adiga, H.K. Kathania, B.T. Sai, Creating speaker independent ASR system through prosody modification based data augmentation. Pattern Recogn. Lett. 131, 213–218 (2020). https://doi.org/10.1016/j.patrec.2019.12.019
Article Google Scholar
S. Shahnawazuddin, N. Adiga, B.T. Sai, W. Ahmad, H.K. Kathania, Developing speaker independent ASR system using limited data through prosody modification based on fuzzy classification of spectral bins. Digit. Signal Process. 93, 34–42 (2019)
Article Google Scholar
S. Shahnawazuddin, W. Ahmad, N. Adiga, A. Kumar, In-domain and out-of-domain data augmentation to improve children’s speaker verification system in limited data scenario. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7554–7558 (2020)
K. Shobaki, J.P. Hosom, R. Cole, Cslu: Kids’ speech version 1.1. Linguistic Data Consortium (2007)
D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification. Proceedings INTERSPEECH, pp. 999–1003 (2017)
D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-Vectors: Robust DNN Embeddings for Speaker Recognition. Proceedings ICASSP, pp. 5329–5333 (2018)
G. Yeung, A. Alwan, On the difficulties of automatic speech recognition for kindergarten-aged children. Interspeech 2018 (2018)

Download references

Funding

Funding information is not applicable / no funding was received.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology Patna, Patna, India
Shahid Aziz & S. Shahnawazuddin

Authors

Shahid Aziz
View author publications
You can also search for this author in PubMed Google Scholar
S. Shahnawazuddin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shahid Aziz.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

The work presented in the uploaded manuscript is an original one, and the manuscript is not currently under consideration for publication elsewhere.

Consent for Publication

It is hereby confirmed that the manuscript has been read and approved for submission by all the named authors. It is therefore requested to consider the submitted manuscript for publication in the esteemed journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Aziz, S., Shahnawazuddin, S. Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients. Circuits Syst Signal Process 43, 3020–3041 (2024). https://doi.org/10.1007/s00034-023-02592-z

Download citation

Received: 13 April 2023
Revised: 21 December 2023
Accepted: 21 December 2023
Published: 10 February 2024
Issue Date: May 2024
DOI: https://doi.org/10.1007/s00034-023-02592-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients

Abstract

Access this article

Similar content being viewed by others

Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach

Role of Data Augmentation and Effective Conservation of High-Frequency Contents in the Context Children’s Speaker Verification System

Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions

Data Availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical Approval

Consent for Publication

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients

Abstract

Access this article

Similar content being viewed by others

Enhancing Children’s Short Utterance Based ASV Using Data Augmentation Techniques and Feature Concatenation Approach

Role of Data Augmentation and Effective Conservation of High-Frequency Contents in the Context Children’s Speaker Verification System

Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions

Data Availability

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical Approval

Consent for Publication

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation