Token-disentangling Mutual Transformer for multimodal emotion recognition
Engineering Applications of Artificial Intelligence (IF 8) Pub Date: 2024-04-06, DOI: 10.1016/j.engappai.2024.108348
Guanghao Yin, Yuanyuan Liu, Tengfei Liu, Haoyu Zhang, Fang Fang, Chang Tang, Liangxiao Jiang

Multimodal emotion recognition is a challenging task: it requires identifying human emotions from multiple modalities such as video, text, and audio. Existing methods focus mainly on fusing information from multimodal data but ignore the interaction of the modality-specific heterogeneity features that contribute differently to emotion, leading to sub-optimal results. To tackle this challenge, we propose a novel Token-disentangling Mutual Transformer (TMT) for robust multimodal emotion recognition, which effectively disentangles, and models the interaction between, inter-modality emotion-consistency features and intra-modality emotion-heterogeneity features. Specifically, TMT consists of two main modules: multimodal emotion Token disentanglement and a Token mutual Transformer. In multimodal emotion Token disentanglement, we introduce a Token separation encoder with an elaborated Token-disentanglement regularization, which effectively disentangles the inter-modality emotion-consistency feature Token from each intra-modality emotion-heterogeneity feature Token; consequently, the emotion-related consistency and heterogeneity information can be modeled independently and comprehensively. Furthermore, we devise the Token mutual Transformer with two cross-modal encoders that interact and fuse the disentangled feature Tokens via bi-directional query learning, delivering more comprehensive and complementary multimodal emotion representations. We evaluate our model on three popular three-modality emotion datasets, namely CMU-MOSI, CMU-MOSEI, and CH-SIMS; the experimental results confirm that our model achieves state-of-the-art recognition performance. Evaluation code and models are released at .
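Since the abstract describes the architecture only at a high level, the short PyTorch sketch below is a hedged illustration of the two named modules: a Token separation encoder with a disentanglement regularization, and a Token mutual Transformer whose two cross-modal encoders perform bi-directional query learning. Every class name, dimension, and loss term in it is an assumption made for illustration, not the authors' released implementation.

# Illustrative reconstruction of the two TMT modules named in the abstract.
# The concrete regularization terms (consistency alignment across modalities
# plus consistency/heterogeneity orthogonality) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSeparationEncoder(nn.Module):
    # Splits a modality's feature Token into an inter-modality emotion
    # consistency Token and an intra-modality emotion heterogeneity Token.
    def __init__(self, dim: int):
        super().__init__()
        self.consistency_proj = nn.Linear(dim, dim)    # shared emotion cues
        self.heterogeneity_proj = nn.Linear(dim, dim)  # modality-specific cues

    def forward(self, token):
        return self.consistency_proj(token), self.heterogeneity_proj(token)

def disentanglement_loss(consistency, heterogeneity):
    # One plausible Token-disentanglement regularization: pull the consistency
    # Tokens of different modalities together, and push each modality's
    # consistency and heterogeneity Tokens toward orthogonality.
    align = sum(F.mse_loss(consistency[i], consistency[j])
                for i in range(len(consistency))
                for j in range(i + 1, len(consistency)))
    ortho = sum((F.normalize(c, dim=-1) * F.normalize(h, dim=-1)).sum(-1).abs().mean()
                for c, h in zip(consistency, heterogeneity))
    return align + ortho

class TokenMutualTransformer(nn.Module):
    # Two cross-modal encoders with bi-directional query learning: consistency
    # Tokens query heterogeneity Tokens, and heterogeneity Tokens query back.
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.c_to_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.h_to_c = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, c_tokens, h_tokens):
        fused_c, _ = self.c_to_h(c_tokens, h_tokens, h_tokens)  # direction 1
        fused_h, _ = self.h_to_c(h_tokens, c_tokens, c_tokens)  # direction 2
        return torch.cat([fused_c.mean(1), fused_h.mean(1)], dim=-1)

# Toy usage with three modalities (e.g. video, text, audio), batch size 2:
dim = 64
enc = TokenSeparationEncoder(dim)                      # shared here for brevity
pairs = [enc(torch.randn(2, 8, dim)) for _ in range(3)]
c_list, h_list = zip(*pairs)
reg = disentanglement_loss(c_list, h_list)
fused = TokenMutualTransformer(dim)(torch.cat(c_list, 1), torch.cat(h_list, 1))
print(reg.item(), fused.shape)                         # fused: (2, 2 * dim)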
