Improving Design of Input Condition Invariant Speech Enhancement
arXiv - CS - Sound, Pub Date: 2024-01-25, DOI: arxiv-2401.14271
Wangyou Zhang, Jee-weon Jung, Shinji Watanabe, Yanmin Qian

Building a single universal speech enhancement (SE) system that can handle arbitrary input is an in-demand but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed and showed promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper, we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while the degradation in real conditions is substantially mitigated. For this purpose, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, demonstrating superior performance with fewer parameters and lower computational costs compared to the existing method. With all proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance in real conditions. The recipe with full model details is released at https://github.com/espnet/espnet.


