SR-TTS: a rhyme-based end-to-end speech synthesis system, Frontiers in Neurorobotics


Frontiers in Neurorobotics (IF 3.1), Pub Date: 2024-02-27, DOI: 10.3389/fnbot.2024.1322312
Yihao Yao, Tao Liang, Rui Feng, Keke Shi, Junxiao Yu, Wei Wang, Jianqing Li

Deep learning has significantly advanced text-to-speech (TTS) systems. These neural network-based systems have improved speech synthesis quality and are increasingly vital in applications such as human-computer interaction. However, conventional TTS models still face challenges: synthesized speech often lacks naturalness and expressiveness. Additionally, slow inference, which reflects low efficiency, contributes to reduced voice quality. This paper introduces SynthRhythm-TTS (SR-TTS), an optimized Transformer-based structure designed to enhance synthesized speech. SR-TTS not only improves phonological quality and naturalness but also accelerates speech generation, thereby increasing inference efficiency. SR-TTS consists of an encoder, a rhythm coordinator, and a decoder. In particular, a pre-duration predictor within the rhythm coordinator and a self-attention-based feature predictor work together to enhance the naturalness and articulatory accuracy of speech. In addition, causal convolution is introduced to enhance the consistency of the time series. The cross-linguistic capability of SR-TTS is validated by training it on both English and Chinese corpora. Human evaluation shows that SR-TTS outperforms existing techniques in speech quality and naturalness of expression. This technology is particularly suitable for applications that require high-quality natural speech, such as intelligent assistants, speech-synthesized podcasts, and human-computer interaction.
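As an illustration of the causal convolution mentioned in the abstract, the following is a minimal sketch in plain Python. It is not the SR-TTS implementation: the function name `causal_conv1d`, the two-tap averaging kernel, and the zero left-padding are all assumptions chosen for the example. It shows only the defining property, namely that the output at time step t depends on the current and earlier inputs and never on later ones.

```python
def causal_conv1d(signal, kernel):
    """Causal 1-D convolution (hypothetical helper, not from the paper).

    kernel[0] weights the current sample, kernel[1] the previous one, and so on,
    so output[t] is computed only from signal[0..t].
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Samples before t = 0 are treated as zero (implicit left padding).
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, [0.5, 0.5])

# Changing only the last (future) sample must leave earlier outputs unchanged.
x_future = [1.0, 2.0, 3.0, 100.0]
y_future = causal_conv1d(x_future, [0.5, 0.5])
```

Here `y[:3]` and `y_future[:3]` are identical even though the final input differs, which is the sense in which a causal layer keeps each time step consistent with its past.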


Updated: 2024-02-27
