Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler, arXiv - CS - Sound

arXiv - CS - Sound. Pub Date: 2023-12-05, DOI: arxiv-2312.02683
Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Diffusion models are a new class of generative models that have recently been successfully applied to speech enhancement. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated using a single database for training and a different one for testing, which makes the results highly dependent on the particular databases. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule and the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not previously been applied to speech enhancement. We show that the proposed system benefits substantially from training on multiple databases, and that it outperforms state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves better performance at a lower computational cost than a sampler commonly used for speech enhancement.
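The abstract names a Heun-based sampler but gives no details. As a hedged illustration only (not the authors' implementation), the sketch below integrates the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma with Heun's second-order method over a noise schedule of the form proposed by Karras et al. (2022) for image generation. The schedule constants (`sigma_min=0.002`, `sigma_max=80`, `rho=7`) are that paper's image-domain defaults and are assumed here. In real speech enhancement, D would be a trained network conditioned on the noisy mixture; `toy_denoiser`, `SIGMA_DATA` and `relative_error` are hypothetical helpers for Gaussian toy data, chosen so the ODE has a closed-form solution to check against.

```python
import numpy as np

SIGMA_DATA = 0.5  # toy assumption: clean signal samples ~ N(0, SIGMA_DATA^2)


def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Karras-style noise levels (assumed defaults), with a final 0 appended."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    min_inv, max_inv = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    sigmas = (max_inv + ramp * (min_inv - max_inv)) ** rho
    return np.append(sigmas, 0.0)


def heun_sample(denoiser, x, sigmas):
    """Integrate dx/dsigma = (x - D(x, sigma)) / sigma from sigmas[0] to 0.

    Returns the final sample and the number of denoiser evaluations (NFE).
    """
    nfe = 0
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoiser(x, s)) / s
        nfe += 1
        x_euler = x + (s_next - s) * d  # Euler predictor
        if s_next > 0:
            d_next = (x_euler - denoiser(x_euler, s_next)) / s_next
            nfe += 1
            x = x + (s_next - s) * 0.5 * (d + d_next)  # trapezoidal corrector
        else:
            x = x_euler  # last step to sigma = 0: plain Euler (1/sigma undefined)
    return x, nfe


def toy_denoiser(x, sigma):
    """Exact MMSE denoiser for the Gaussian toy data (stands in for a network)."""
    return x * SIGMA_DATA**2 / (SIGMA_DATA**2 + sigma**2)


def relative_error(n_steps, seed=0):
    """Max relative error of the Heun sample vs. the exact toy ODE solution."""
    sigmas = karras_sigmas(n_steps)
    rng = np.random.default_rng(seed)
    # Start from the noisy marginal N(0, SIGMA_DATA^2 + sigma_max^2).
    x_start = rng.standard_normal(4) * np.sqrt(SIGMA_DATA**2 + sigmas[0] ** 2)
    x_end, nfe = heun_sample(toy_denoiser, x_start, sigmas)
    # For this toy, x(sigma) scales with sqrt(SIGMA_DATA^2 + sigma^2).
    exact = x_start * SIGMA_DATA / np.sqrt(SIGMA_DATA**2 + sigmas[0] ** 2)
    return np.max(np.abs(x_end / exact - 1.0)), nfe


if __name__ == "__main__":
    for n in (18, 64):
        err, nfe = relative_error(n)
        print(f"steps={n} NFE={nfe} relative error={err:.4f}")
```

Each step costs two denoiser evaluations except the last one to sigma = 0, which falls back to Euler, so `n_steps` steps cost 2 * n_steps - 1 evaluations. Because Heun's method is second order, its error shrinks quickly as steps are added, which is one reason a Heun-type sampler can reach a given quality with fewer network calls than a first-order scheme.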

Updated: 2023-12-06