A text-dependent speaker verification application framework based on Chinese numerical string corpus
arXiv - CS - Sound Pub Date: 2023-12-04, DOI: arxiv-2312.01645
Litong Zheng, Feng Hong, Weijie Xu

Research indicates that text-dependent speaker verification (TD-SV) often outperforms text-independent verification (TI-SV) on short speech. However, collecting large-scale fixed-text speech data is challenging, and as speech length increases, factors such as sentence rhythm and pauses affect TD-SV's sensitivity to the text sequence. Based on these observations, we hypothesize that strategies such as finer-grained pooling over time and decoupled representations of speaker embedding and text embedding are better suited to TD-SV. We introduce an end-to-end TD-SV system built on a dataset of longer Chinese numerical-string texts. It comprises a text embedding network, a speaker embedding network, and back-end fusion. First, we recorded a dataset of long Chinese numerical texts, named SHAL, which is publicly available on the Open-SLR website. We addressed data scarcity by augmenting the dataset with Tacotron2 and HiFi-GAN. Next, we introduced a dual representation of speech with a text embedding and a speaker embedding. In the text embedding network, we employed an enhanced Transformer and introduced a triple loss comprising a text classification loss, a CTC loss, and a decoder loss. For the speaker embedding network, we enhanced sliding-window attentive statistics pooling (SWASP) and combined it with attentive statistics pooling (ASP) to form a multi-scale pooling method. Finally, we fused the text embedding and the speaker embedding. Our pooling methods achieved equal error rate (EER) improvements of 49.2% on Hi-Mia and 75.0% on SHAL.
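The abstract names the three terms of the text-branch triple loss (text classification, CTC, decoder) but not how they are combined. Below is a minimal PyTorch sketch assuming a simple weighted sum built from standard loss modules; the weights, the class name TripleTextLoss, and all tensor shapes are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class TripleTextLoss(nn.Module):
    """Hypothetical weighted sum of the three text-branch losses the paper names:
    text classification (cross-entropy), CTC, and decoder (per-token cross-entropy)."""
    def __init__(self, w_cls=1.0, w_ctc=0.5, w_dec=1.0, blank_id=0):
        super().__init__()
        self.w_cls, self.w_ctc, self.w_dec = w_cls, w_ctc, w_dec  # assumed weights
        self.cls_loss = nn.CrossEntropyLoss()
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.dec_loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, cls_logits, cls_target,
                ctc_log_probs, ctc_target, input_lens, target_lens,
                dec_logits, dec_target):
        # cls_logits: (batch, num_texts); ctc_log_probs: (time, batch, vocab)
        # dec_logits: (batch, seq, vocab); dec_target: (batch, seq)
        l_cls = self.cls_loss(cls_logits, cls_target)
        l_ctc = self.ctc_loss(ctc_log_probs, ctc_target, input_lens, target_lens)
        # CrossEntropyLoss expects (batch, vocab, seq) for sequence targets.
        l_dec = self.dec_loss(dec_logits.transpose(1, 2), dec_target)
        return self.w_cls * l_cls + self.w_ctc * l_ctc + self.w_dec * l_dec
```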
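The multi-scale pooling is described only as an enhanced SWASP combined with ASP. The sketch below shows one plausible reading in PyTorch: utterance-level attentive statistics pooling alongside a sliding-window variant whose per-window statistics are averaged, with the two concatenated. The window and stride sizes, the averaging step, the concatenation-based fusion, and the class names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling (ASP): attention-weighted mean and std over time."""
    def __init__(self, feat_dim, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, feat_dim, kernel_size=1),
        )

    def forward(self, x):                        # x: (batch, feat_dim, time)
        alpha = torch.softmax(self.attention(x), dim=2)
        mean = torch.sum(alpha * x, dim=2)
        var = torch.sum(alpha * x ** 2, dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)     # (batch, 2 * feat_dim)

class MultiScalePool(nn.Module):
    """Hypothetical SWASP+ASP combination: global ASP concatenated with
    averaged window-level ASP statistics (assumes time >= window)."""
    def __init__(self, feat_dim, window=50, stride=25):
        super().__init__()
        self.window, self.stride = window, stride  # assumed sizes
        self.utt_pool = AttentiveStatsPool(feat_dim)
        self.win_pool = AttentiveStatsPool(feat_dim)

    def forward(self, x):                        # x: (batch, feat_dim, time)
        global_stats = self.utt_pool(x)
        # Slice the sequence into overlapping windows and pool each one.
        windows = x.unfold(dimension=2, size=self.window, step=self.stride)
        b, d, n, w = windows.shape               # (batch, feat_dim, num_win, window)
        local = self.win_pool(windows.permute(0, 2, 1, 3).reshape(b * n, d, w))
        local_stats = local.reshape(b, n, -1).mean(dim=1)
        return torch.cat([global_stats, local_stats], dim=1)

# Example: 200-frame utterances with 256-dim frame features.
pool = MultiScalePool(feat_dim=256)
emb = pool(torch.randn(8, 256, 200))             # -> (8, 1024)
```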

Updated: 2023-12-05