Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy

arXiv - CS - Sound, Pub Date: 2024-03-24, DOI: arxiv-2403.16078
Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng

Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies in robotics and many audio-visual applications. One of the challenges of AV-TSE is how to make effective use of audio-visual synchronization information. AV-HuBERT is a pre-trained model that is useful for lip-reading, but it has not yet been adopted for AV-TSE. In this paper, we explore how to integrate a pre-trained AV-HuBERT into our AV-TSE system, with the expectation that doing so improves performance. To benefit from inter- and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. Experimental results on the VoxCeleb2 dataset show that our proposed model outperforms the baselines in terms of both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, a comparative study confirms that the proposed Mask-And-Recover strategy is significantly effective.
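The abstract does not say what MAR masks or how recovery is scored, so the following is only a generic illustration of the mask-and-recover idea, not the authors' method. It assumes, hypothetically, that the input is a sequence of feature frames, that one contiguous span of frames is zeroed out, and that recovery is scored by mean squared error on the masked frames only. The names `mask_span` and `recovery_loss` are invented for this sketch.

```python
import random


def mask_span(frames, span_len, rng):
    """Zero out one random contiguous span of frames (hypothetical masking scheme).

    frames: list of per-frame feature vectors (lists of floats).
    Returns a masked copy and a per-frame list of booleans marking masked frames.
    """
    num_frames = len(frames)
    if not 0 < span_len <= num_frames:
        raise ValueError("span_len must be between 1 and the number of frames")
    start = rng.randrange(num_frames - span_len + 1)
    is_masked = [start <= t < start + span_len for t in range(num_frames)]
    masked = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, is_masked)]
    return masked, is_masked


def recovery_loss(predicted, target, is_masked):
    """Mean squared error computed over the masked frames only (assumed objective)."""
    sq_errors = [
        (p - t) ** 2
        for pred_frame, target_frame, m in zip(predicted, target, is_masked)
        if m
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(sq_errors) / len(sq_errors)


# Toy usage: a perfect "recovery" scores zero loss on the masked span.
rng = random.Random(0)
frames = [[float(t) + 1.0, float(t) + 2.0] for t in range(10)]
masked, is_masked = mask_span(frames, 3, rng)
perfect_loss = recovery_loss(frames, frames, is_masked)  # 0.0
```

In a real system the `predicted` frames would come from the extraction network given the masked input and the other modality. Here the target itself stands in for the prediction so that the example stays self-contained.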


Updated: 2024-03-26
