How phonemes contribute to deep speaker models?
arXiv - CS - Sound, Pub Date: 2024-02-05, DOI: arxiv-2402.02730
Pengqi Li, Tianhao Wang, Lantian Li, Askar Hamdulla, Dong Wang

Which phonemes convey more speaker traits is a long-standing question, and various perception experiments have been conducted with human subjects. For speaker recognition, studies with conventional statistical models reached conclusions that are more or less consistent with the perception results. However, which phonemes matter more to modern deep neural models remains unexplored, owing to the opaqueness of their decision process. This paper presents a novel study of phoneme attribution for two types of deep speaker models, based on TDNN and CNN respectively, from the perspective of model explanation. Specifically, we conducted the study with two post-explanation methods: LayerCAM and Time Align Occlusion (TAO). Experimental results showed that: (1) At the population level, vowels are more important than consonants, confirming the human perception studies. However, fricatives are among the least important phonemes, which contrasts with previous studies. (2) At the speaker level, a large between-speaker variation in phoneme importance is observed, indicating that whether a phoneme is important is largely speaker-dependent.
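To make the idea of occlusion-based attribution concrete, the following is a minimal sketch of generic segment occlusion: mask the frames of each phoneme segment and record how much a speaker score drops. It is not the paper's TAO procedure, whose details are not given in the abstract. The names `occlusion_importance`, `toy_score`, the zero fill value, and the toy segments are all assumptions made for illustration, and `toy_score` merely stands in for a real speaker model.

```python
import numpy as np


def occlusion_importance(features, segments, score_fn, fill_value=0.0):
    """Mean score drop per phoneme label when its frames are masked.

    features: (T, D) array of frame-level features.
    segments: list of (label, start, end) frame ranges, end exclusive.
    score_fn: callable mapping a (T, D) array to a scalar speaker score
              (hypothetical stand-in for a deep speaker model).
    """
    base = score_fn(features)
    drops = {}
    for label, start, end in segments:
        occluded = features.copy()
        occluded[start:end] = fill_value  # mask only this segment
        drops.setdefault(label, []).append(base - score_fn(occluded))
    # Average over all occurrences of the same phoneme label.
    return {label: float(np.mean(vals)) for label, vals in drops.items()}


def toy_score(x):
    """Toy scorer: sums the first feature dimension over time."""
    return float(x[:, 0].sum())


# Toy utterance: 10 frames, 2 feature dimensions.
features = np.zeros((10, 2))
features[0:4, 0] = 1.0  # only the first "a" segment carries evidence
segments = [("a", 0, 4), ("s", 4, 7), ("a", 7, 10)]

importance = occlusion_importance(features, segments, toy_score)
print(importance)
```

In this toy setup, masking the first "a" segment removes all of the score, while masking "s" changes nothing, so "a" receives a higher averaged importance than "s". With a real model, the choice of fill value and the way segments are aligned to frames would affect the result considerably.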

Updated: 2024-02-06
