Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism
arXiv - CS - Sound Pub Date : 2023-12-11 , DOI: arxiv-2312.06613
Georgios Milis, Panagiotis P. Filntisis, Anastasios Roussos, Petros Maragos

Recent advances in deep learning for sequential data have given rise to fast and powerful models that produce realistic videos of talking humans. The state of the art in talking face generation focuses mainly on lip-syncing and is conditioned on audio clips. However, the ability to synthesize talking humans from text transcriptions rather than audio is particularly beneficial for many applications and is expected to receive growing attention following the recent breakthroughs in large language models. To this end, most methods implement a cascaded two-stage architecture, a text-to-speech module followed by an audio-driven talking face generator, but this ignores the highly complex interplay between the audio and visual streams that occurs during speaking. In this paper, we propose the first, to the best of our knowledge, text-driven audiovisual speech synthesizer that uses Transformers and does not follow a cascaded approach. Our method, which we call NEUral Text to ARticulate Talk (NEUTART), is a talking face generator that uses a joint audiovisual feature space, as well as speech-informed 3D facial reconstructions and a lip-reading loss for visual supervision. The proposed model produces photorealistic talking face videos with human-like articulation and well-synchronized audiovisual streams. Our experiments on audiovisual datasets as well as in-the-wild videos demonstrate state-of-the-art generation quality in terms of both objective metrics and human evaluation.
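To make the non-cascaded design concrete, the following is a minimal PyTorch sketch, not the authors' code: a single Transformer maps phoneme IDs into a joint audiovisual feature space that is decoded into both an acoustic stream (mel-spectrogram frames) and a visual stream (3D face parameters), trained with reconstruction losses plus an optional lip-reading term. All module names, dimensions, and the face-parameter size are assumptions, and the sketch omits duration modeling and the photorealistic renderer that a full system would require.

# Minimal sketch (assumptions throughout) of a non-cascaded text-to-audiovisual model:
# one shared Transformer trunk, two light decoder heads over the joint feature space.
import torch
import torch.nn as nn

class JointAVSynthesizer(nn.Module):
    def __init__(self, n_phonemes=70, d_model=256, n_mels=80, n_face_params=56):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Both streams are predicted from the same representation, so audio and
        # video remain coupled instead of being generated in separate stages.
        self.mel_head = nn.Linear(d_model, n_mels)          # acoustic stream
        self.face_head = nn.Linear(d_model, n_face_params)  # 3D expression/jaw stream

    def forward(self, phoneme_ids):                      # (batch, seq)
        h = self.encoder(self.phoneme_emb(phoneme_ids))  # joint AV feature space
        return self.mel_head(h), self.face_head(h)

def training_loss(mel_pred, mel_gt, face_pred, face_gt, lip_reader=None):
    """Reconstruction losses plus an optional lip-reading loss for visual supervision."""
    loss = (nn.functional.l1_loss(mel_pred, mel_gt)
            + nn.functional.l1_loss(face_pred, face_gt))
    if lip_reader is not None:
        # Hypothetical: a frozen pretrained lip-reading network scores how legible
        # the predicted articulation is; gradients flow only into the synthesizer.
        loss = loss + lip_reader(face_pred, face_gt)
    return loss

if __name__ == "__main__":
    model = JointAVSynthesizer()
    phonemes = torch.randint(0, 70, (2, 32))
    mel, face = model(phonemes)
    print(mel.shape, face.shape)  # torch.Size([2, 32, 80]) torch.Size([2, 32, 56])

The design point the sketch illustrates is that a shared trunk forces the acoustic and visual predictions to agree at the feature level, whereas a cascaded text-to-speech plus audio-to-face pipeline can only couple them through the intermediate waveform.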

Updated: 2023-12-15