EgoCap and EgoFormer: First-person image captioning with context fusion
Pattern Recognition Letters (IF 5.1) · Pub Date: 2024-03-20 · DOI: 10.1016/j.patrec.2024.03.012
Zhuangzhuang Dai, Vu Tran, Andrew Markham, Niki Trigoni, M. Arif Rahman, L.N.S. Wijayasingha, John Stankovic, Chen Li

First-person captioning is significant because it provides veracious descriptions of egocentric scenes from a unique perspective. There is also a need to caption scenes in an egocentric narrative, a.k.a. life-logging, for patients, travellers, and emergency responders. Ego-captioning is non-trivial because (1) ego-images can be noisy due to motion and camera angles; (2) describing a scene in a first-person narrative involves drastically different semantics; and (3) inferences have to be drawn beyond visual appearance, because the cameraperson is often outside the field of view. We note that humans make good sense of casual footage thanks to contextual awareness: judging when and where an event unfolds and whom the cameraperson is interacting with. This inspires the infusion of such "contexts" for situation-aware captioning. We create EgoCap, which contains 2.1K ego-images, over 10K ego-captions, and 6.3K contextual labels, to close the gap left by the lack of ego-captioning datasets. We propose EgoFormer, a dual-encoder transformer-based network which fuses contextual and visual features. The context encoder is pre-trained on ImageNet before fine-tuning on context classification tasks. Similar to visual attention, we exploit stacked multi-head attention layers in the captioning decoder to reinforce attention to the context features. EgoFormer achieves state-of-the-art performance on EgoCap, with a CIDEr score of 125.52. The EgoCap dataset and EgoFormer are publicly available.
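Below is a minimal PyTorch sketch of the dual-encoder context-fusion idea described in the abstract. The class names, dimensions, layer counts, and the particular ordering of visual and context attention within the decoder are illustrative assumptions, not the authors' exact EgoFormer architecture.

```python
import torch
import torch.nn as nn


class ContextFusionDecoderLayer(nn.Module):
    """Decoder layer with stacked multi-head attention: self-attention over the
    caption tokens, then cross-attention to visual features, then to context
    features (the ordering here is an assumption). Causal masking is omitted
    for brevity."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, tgt, visual_feats, context_feats):
        x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
        x = self.norms[1](x + self.visual_attn(x, visual_feats, visual_feats)[0])
        x = self.norms[2](x + self.context_attn(x, context_feats, context_feats)[0])
        return self.norms[3](x + self.ffn(x))


class DualEncoderCaptioner(nn.Module):
    """Toy dual-encoder captioner: pre-extracted visual features and context
    features (e.g. who/when/where embeddings from a separately trained context
    encoder) feed a transformer decoder that attends to both streams."""

    def __init__(self, vocab_size=10000, d_model=512, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            [ContextFusionDecoderLayer(d_model) for _ in range(n_layers)])
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, visual_feats, context_feats):
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x, visual_feats, context_feats)
        return self.out(x)  # per-token vocabulary logits


if __name__ == "__main__":
    model = DualEncoderCaptioner()
    tokens = torch.randint(0, 10000, (2, 12))   # partial caption tokens
    visual = torch.randn(2, 49, 512)            # e.g. CNN/ViT patch features
    context = torch.randn(2, 3, 512)            # e.g. who/when/where embeddings
    print(model(tokens, visual, context).shape)  # torch.Size([2, 12, 10000])
```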
