Natural language guidance of high-fidelity text-to-speech with synthetic annotations,arXiv - CS - Sound


Natural language guidance of high-fidelity text-to-speech with synthetic annotations
arXiv - CS - Sound, Pub Date: 2024-02-02, DOI: arxiv-2402.01912
Dan Lyth, Simon King

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.


Updated: 2024-02-06
