Multi-geometry embedded transformer for facial expression recognition in videos
Expert Systems with Applications (IF 8.5), Pub Date: 2024-03-20, DOI: 10.1016/j.eswa.2024.123635
Dongliang Chen, Guihua Wen, Huihui Li, Pei Yang, Chuyun Chen, Bao Wang

Dynamic facial expressions in videos convey more realistic emotional states, yet recognizing emotions from in-the-wild facial expression videos remains challenging due to variable poses, partial occlusion, and diverse lighting conditions. Although current methods have designed transformer-based models to learn spatial–temporal features, they cannot exploit useful local geometric structures from both spatial and temporal views to capture subtle emotional features in videos with varied poses and facial occlusion. To this end, we propose a novel multi-geometry embedded transformer (MGET), which adapts multi-geometry knowledge to transformers and mines spatial–temporal geometric information as a complement for learning effective emotional features. Specifically, from a new perspective, we first design multi-geometry distance learning (MGDL) to capture emotion-related geometric structure knowledge in Euclidean and hyperbolic spaces; in particular, by exploiting the advantages of hyperbolic geometry, it detects subtler emotional changes among local spatial and temporal features. Secondly, we combine MGDL with the transformer to design spatial–temporal MGETs, which capture important spatial and temporal multi-geometry features, embed them into the corresponding original features, and then perform cross-region and cross-frame interaction on these multi-level features. Finally, MGET achieves superior performance on the DFEW, FERV39k, and AFEW datasets, with unweighted average recall (UAR) / weighted average recall (WAR) of 58.65%/69.91%, 41.91%/50.76%, and 53.23%/55.40%, respectively, improving by 2.55%/0.66%, 3.69%/2.63%, and 3.66%/1.14% over the M3DFEL, Logo-Former, and EST methods.
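The abstract contrasts distances measured in Euclidean and hyperbolic space for comparing local features. The paper itself does not give the exact formulation here, but the standard Poincaré-ball distance is a reasonable illustration of why hyperbolic geometry can separate near-identical embeddings more aggressively: the distance grows rapidly as points approach the boundary of the unit ball. The sketch below is illustrative only (the feature vectors and function names are hypothetical, not from MGET):

```python
import numpy as np

def euclidean_distance(u, v):
    """Ordinary L2 distance between two feature vectors."""
    return np.linalg.norm(u - v)

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance on the Poincaré ball.

    Both inputs must lie strictly inside the unit ball (||x|| < 1);
    eps guards against division by zero near the boundary.
    """
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / (denom + eps))

# Toy local features (e.g., two patch embeddings projected into the unit ball).
u = np.array([0.10, 0.20, 0.05])
v = np.array([0.15, 0.25, 0.00])

print(euclidean_distance(u, v))
print(poincare_distance(u, v))
```

For points near the origin the two distances nearly agree, but as embeddings move toward the ball's boundary the hyperbolic distance between them blows up, which is the property the abstract credits with exposing "more subtle emotional changes" among local features.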
