Deep neural networks for automatic speaker recognition do not learn supra-segmental temporal features
Pattern Recognition Letters (IF 5.1), Pub Date: 2024-03-26, DOI: 10.1016/j.patrec.2024.03.016
Daniel Neururer, Volker Dellwo, Thilo Stadelmann

While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force the respective networks to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.

Updated: 2024-03-26