Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

arXiv - CS - Sound, Pub Date: 2023-12-06, DOI: arxiv-2312.03620
Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. ResNet models treat the time and frequency dimensions equally: they follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram and conduct a systematic study of the impact of temporal and frequency resolutions on performance. We further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following this principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on the VoxCeleb, SITW, and CNCeleb datasets, with average EER/minDCF reductions of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
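
To make the idea of a stride space concrete, the following is a minimal sketch of how per-stage strides determine the final frequency and temporal resolution of a 2D feature map. It is not the paper's trellis diagram, and the configurations are not the Golden Gemini operating points. The input size (80 frequency bins, 200 frames), the three strided stages, the restriction to strides of 1 or 2, and the function names are all assumptions chosen for illustration.

```python
from itertools import product


def output_resolution(freq_bins, frames, strides):
    """Return (freq, time) size after a list of (freq_stride, time_stride) stages.

    Assumption: each strided stage uses "same"-style padding, so a stride s
    maps a length n to ceil(n / s).
    """
    f, t = freq_bins, frames
    for freq_stride, time_stride in strides:
        f = -(-f // freq_stride)  # ceiling division
        t = -(-t // time_stride)
    return f, t


def enumerate_stride_space(num_stages=3):
    """List every assignment of stride 1 or 2 to each axis at each stage."""
    per_stage = list(product((1, 2), repeat=2))
    return list(product(per_stage, repeat=num_stages))


# Image-style default: every strided stage halves both axes.
symmetric = [(2, 2), (2, 2), (2, 2)]
# Hypothetical asymmetric choice: same frequency downsampling,
# but only one temporal halving.
asymmetric = [(2, 1), (2, 2), (2, 1)]

print(output_resolution(80, 200, symmetric))   # (10, 25)
print(output_resolution(80, 200, asymmetric))  # (10, 100)
print(len(enumerate_stride_space()))           # 64
```

Under these assumptions the two configurations end with the same frequency resolution but a fourfold difference in temporal resolution. This is the kind of asymmetry that a symmetric, image-style default cannot express.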

Updated: 2023-12-07
