TellMeTalk: Multimodal-driven talking face video generation
Computers & Electrical Engineering (IF 4.3), Pub Date: 2024-01-20, DOI: 10.1016/j.compeleceng.2023.109049
Pengfei Li, Huihuang Zhao, Qingyun Liu, Peng Tang, Lin Zhang

In this paper, we present TellMeTalk, an innovative approach for generating expressive talking face videos from multimodal inputs. Our approach is robust across identities, languages, expressions, and head movements. It overcomes four key limitations of existing talking face video generation methods: (1) reliance on single-modal learning from audio or text, which forgoes the complementary nature of multimodal inputs; (2) reliance on conventional convolutional generators, which restricts the capture of spatial features; (3) the absence of natural head movements and expressions; and (4) artifacts, prominent boundaries caused by image overlapping, and unclear mouth regions. To address these challenges, we propose a face motion network that imbues character images with facial expressions and head movements. We also take text and reference audio as input to generate personalized audio. Furthermore, we introduce a generator equipped with a cross-attention module and Fast Fourier Convolution blocks to model spatial dependencies. Finally, a face restoration module is designed to reduce artifacts and prominent boundaries. Extensive experiments demonstrate that our method produces high-quality, expressive talking face videos. Compared to state-of-the-art approaches, it achieves superior video quality and more precise lip synchronization. The source code is available at https://github.com/lifemo/TellMeTalk.
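
The generator's two named building blocks, cross-attention and Fast Fourier Convolution (FFC), are standard techniques, so a minimal sketch can illustrate how they might fit together. The code below is an assumption based on the common FFC formulation (a spectral convolution applied to a 2-D FFT of the feature map) and standard multi-head cross-attention, not the authors' implementation; class names such as FFCBlock and CrossAttentionFusion are hypothetical, and the released code at the GitHub link above is authoritative.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Global branch of an FFC block (assumed design): convolve in the
    Fourier domain so one layer's receptive field spans the whole map."""
    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the 1x1 conv sees 2*channels inputs and outputs.
        self.conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels * 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")        # complex, (b, c, h, w//2+1)
        freq = torch.cat([freq.real, freq.imag], dim=1)
        freq = self.conv(freq)
        real, imag = freq.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

class FFCBlock(nn.Module):
    """Simplified FFC: a local 3x3 conv branch plus the global spectral branch."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        self.spectral = SpectralTransform(channels)
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.local(x), self.spectral(x)], dim=1))

class CrossAttentionFusion(nn.Module):
    """Cross-attention (assumed role): visual features query audio features,
    so lip-region pixels can condition on the driving speech."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio):
        # visual: (B, C, H, W) feature map; audio: (B, T, C) embedding sequence.
        b, c, h, w = visual.shape
        q = visual.flatten(2).transpose(1, 2)           # (B, H*W, C)
        out, _ = self.attn(self.norm(q), audio, audio)  # queries from vision
        return (q + out).transpose(1, 2).reshape(b, c, h, w)
```

A usage example with made-up shapes:

```python
fusion = CrossAttentionFusion(dim=64)
ffc = FFCBlock(64)
vis = torch.randn(2, 64, 32, 32)   # face feature map
aud = torch.randn(2, 50, 64)       # 50 audio frames, 64-dim embeddings
out = ffc(fusion(vis, aud))        # -> (2, 64, 32, 32)
```

Because the spectral branch operates on the full 2-D spectrum, a single FFC layer has an image-wide receptive field, which is the usual motivation for using it to model spatial dependencies that plain convolutions capture only through deep stacks.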



Updated: 2024-01-21