Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis,arXiv - CS - Sound


arXiv - CS - Sound. Pub Date: 2023-12-06. DOI: arxiv-2312.03491
Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu

In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because their data-to-noise diffusion process is pre-defined, their prior distribution is restricted to a noisy representation, which provides little information about the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to replace the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information about the target. Specifically, we use the latent representation obtained from the text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study design spaces such as noise schedules, as well as to develop stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency: it significantly outperforms our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis, and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
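To make the "data-to-data process" concrete, the sketch below samples from a plain Brownian bridge pinned at two given endpoints. This is only an illustration under stated assumptions: the abstract does not give the bridge's formula, so the constant noise scale `g`, the function name `brownian_bridge_sample`, and the random arrays standing in for the mel-spectrogram and the text-derived latent are all hypothetical, and the noise schedules studied in the paper may differ.

```python
import numpy as np


def brownian_bridge_sample(x_data, x_prior, t, g=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x_data (t=0) and x_prior (t=1).

    Assumes a constant noise scale g (an illustrative choice, not taken from
    the paper). The marginal is Gaussian with
        mean = (1 - t) * x_data + t * x_prior
        std  = g * sqrt(t * (1 - t))
    so the noise vanishes at both endpoints.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x_data + t * x_prior
    std = g * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x_data.shape)


rng = np.random.default_rng(0)
# Hypothetical stand-ins: 80 mel bins x 100 frames.
mel = rng.standard_normal((80, 100))                  # "ground-truth mel-spectrogram"
latent = mel + 0.5 * rng.standard_normal((80, 100))   # "text-derived prior"

x_start = brownian_bridge_sample(mel, latent, 0.0, rng=rng)  # equals mel
x_mid = brownian_bridge_sample(mel, latent, 0.5, rng=rng)    # noisy interpolation
x_end = brownian_bridge_sample(mel, latent, 1.0, rng=rng)    # equals latent
```

Unlike a data-to-noise diffusion, both ends of this process are clean signals: at `t=1` the sample is exactly the prior rather than Gaussian noise, which is the structural difference the abstract emphasizes.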


Updated: 2023-12-07
