M[formula omitted]TTS: Multi-modal text-to-speech of multi-scale style control for dubbing


Pattern Recognition Letters (IF 5.1), Pub Date: 2024-02-10, DOI: 10.1016/j.patrec.2024.02.005
Yan Liu, Li-Fang Wei, Xinyuan Qian, Tian-Hao Zhang, Song-Lu Chen, Xu-Cheng Yin

Dubbing is the process in which professional voice actors record character voices for films and games. It is more expressive and immersive than conventional text-to-speech (TTS) and requires the audio to be synchronized and stylistically consistent with the video. Previous dubbing methods use video to provide either a global style vector or a local prosody embedding, which limits the expressiveness of the predicted waveform. To generate more expressive audio with precise visual temporal alignment, we propose a multi-modal, multi-scale expressive speech synthesis method, multi-modal multi-scale TTS (MTTS), which takes an auxiliary video input to provide style embeddings. Specifically, MTTS adopts a memory network to bridge the heterogeneous modalities and to resolve the training-inference style mismatch of conventional multi-modal TTS. To enhance the expressiveness of the synthesized audio, a multi-scale style modeling scheme recovers style characteristics at different scales. In addition, MTTS can convert the style of the speech by choosing different reference videos. We conduct extensive experiments on the public GRID corpus, where MTTS generates high-quality, video-aligned speech and outperforms comparable methods in both subjective and objective evaluations.
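
The abstract does not say what form the memory network takes. As a rough illustration only, the sketch below assumes a generic key-value memory read: per-frame video features query a set of key slots, and the attention weights retrieve audio-style vectors from paired value slots, so that style can be recalled from video alone at inference time. All names (`memory_read`, `keys`, `values`), the dimensions, and the random data are hypothetical and are not taken from the paper; the frame-level and mean-pooled outputs merely stand in for "local" and "global" style scales.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(video_feats, keys, values):
    """Hypothetical key-value memory read (an assumption, not the paper's design).

    video_feats: (T, d_k) per-frame video features used as queries
    keys:        (S, d_k) key slots, addressed by video features
    values:      (S, d_v) value slots, holding audio-style vectors
    Returns per-frame style embeddings (T, d_v) and attention weights (T, S).
    """
    scores = video_feats @ keys.T / np.sqrt(keys.shape[1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

# Toy data standing in for learned features and memory slots.
rng = np.random.default_rng(0)
T, S, d_k, d_v = 5, 8, 16, 4
video_feats = rng.normal(size=(T, d_k))
keys = rng.normal(size=(S, d_k))
values = rng.normal(size=(S, d_v))

local_style, weights = memory_read(video_feats, keys, values)  # frame-level scale
global_style = local_style.mean(axis=0)                        # utterance-level scale
```

In a trained system the slots would be learned jointly with the synthesizer. Here they are random, so only the shapes and the normalization of the attention weights are meaningful.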


Updated: 2024-02-10
