Learning topic emotion and logical semantic for video paragraph captioning
Displays ( IF 4.3 ) Pub Date : 2024-04-04 , DOI: 10.1016/j.displa.2024.102706
Qinyu Li , Hanli Wang , Xiaokai Yi

Video paragraph captioning aims to generate multiple descriptive sentences for a video, striving to match human writing in accuracy, logicality, and richness. However, current research focuses on the accuracy and temporal order of events, ignoring emotion and other critical logical relations embedded in human language, such as causal and adversative relations. This omission hinders smooth transitions across generated event descriptions and limits the vividness of expression, leaving a gap from the standard of human language. To resolve this problem, a framework that integrates logic and emotion representation learning is proposed to narrow the gap. Concretely, a large-scale inter-event relation corpus is constructed based on the EMVPC dataset. This corpus, named EMVPC-EvtRel (short for “EMVPC-Event Relations”), covers six logical relations widely used in human writing, 127 explicit inter-sentence connectives, and over 20,000 pairs of event segments with newly annotated logical relations. A logical semantic representation learning method is developed to recognize the dependencies between visual events, thereby enriching the representation of video content and improving the logicality of the generated paragraphs. Moreover, a fine-grained emotion recognition module is designed to uncover emotion features embedded in videos. Finally, experimental results on the EMVPC dataset demonstrate the superiority of the proposed method over existing state-of-the-art approaches.
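To make the corpus description concrete, the sketch below shows what one annotated record in an inter-event relation corpus like EMVPC-EvtRel might look like. This is a minimal illustration, not the released data format: the field names, the example connective, and the exact set of six relation labels are all assumptions for demonstration purposes.

```python
from dataclasses import dataclass

# Hypothetical label set: the abstract states six widely-used logical
# relations but does not enumerate them; this particular set is assumed.
RELATIONS = {"causal", "adversative", "temporal",
             "conditional", "progressive", "coordinating"}

@dataclass
class EventPair:
    """One annotated pair of event segments (illustrative schema only)."""
    video_id: str
    segment_a: tuple[float, float]  # (start_sec, end_sec) of the first event
    segment_b: tuple[float, float]  # (start_sec, end_sec) of the second event
    connective: str                 # explicit inter-sentence connective, e.g. "because"
    relation: str                   # one of the six annotated logical relations

    def __post_init__(self) -> None:
        # Reject labels outside the fixed relation inventory.
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")

# Example record: a second event caused by the first.
pair = EventPair("vid0001", (0.0, 4.2), (4.2, 9.8), "because", "causal")
```

A record like this pairs two temporally localized events with the connective that links their descriptions, which is the kind of supervision a logical semantic representation learner could consume.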

Updated: 2024-04-04