Video Frame-wise Explanation Driven Contrastive Learning for Procedural Text Generation, Computer Vision and Image Understanding

Computer Vision and Image Understanding (IF 4.5), Pub Date: 2024-02-08, DOI: 10.1016/j.cviu.2024.103954
Zhihao Wang, Lin Li, Zhongwei Xie, Chuanbo Liu

Procedural text generation from visual observation of instructional videos, such as assembly, biochemical experiments, and cooking, is an essential task for scene understanding and real-world applications. It differs from general captioning in two ways: materials are combined in a flow across instructional steps, and the materials change state through action-driven manipulations. However, existing works do not adequately address both issues. To this end, this paper proposes a procedural text generation framework with a Video Frame-wise eXplanation driven Contrastive Learning (VFXCL) module and an Action Fused Material Representation Learning (AFMRL) module, which generates a procedural text from the frame sequence of each step of an instructional video. VFXCL uses an explanation method to determine each frame's importance within a step's frame sequence and derives positive and negative sequences for self-supervised contrastive learning, with the aim of enhancing step representation learning to capture inter-step differences. AFMRL leverages identified actions and materials to update material states after manipulations, which contributes to step representation learning via intra-step action-fused material state tracking. Together, the two modules extract the information the decoder needs to generate procedural text accurately. The experimental results show the effectiveness of the proposed framework, which outperforms state-of-the-art video procedural text generation models.
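The abstract does not say how the positive and negative sequences are built from frame importance, nor which contrastive loss is used. The following is only an illustrative sketch of the general idea, under stated assumptions: per-frame importance scores are already available from some explanation method, the positive sequence keeps the most important frames, the negative sequence keeps the rest, sequences are mean-pooled, and an InfoNCE-style loss is applied. The function names, `keep_ratio`, and `temperature` are hypothetical and do not come from the paper.

```python
import math

def split_by_importance(frames, scores, keep_ratio=0.5):
    """Split a step's frames into (positive, negative) subsequences.

    Assumption (not from the paper): the positive sequence keeps the
    highest-scoring frames and the negative sequence keeps the remaining
    ones. Temporal order is preserved in both.
    """
    if len(frames) != len(scores):
        raise ValueError("frames and scores must have the same length")
    if len(frames) < 2:
        raise ValueError("need at least two frames to form both sequences")
    k = math.ceil(len(frames) * keep_ratio)
    k = min(max(k, 1), len(frames) - 1)  # keep both sequences non-empty
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    positive = [f for i, f in enumerate(frames) if i in top]
    negative = [f for i, f in enumerate(frames) if i not in top]
    return positive, negative

def mean_pool(sequence):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, several negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy usage: four 2-d frame features with made-up importance scores.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.9, 0.8, 0.1, 0.2]
pos_seq, neg_seq = split_by_importance(frames, scores)
loss = info_nce([1.0, 0.0], mean_pool(pos_seq), [mean_pool(neg_seq)])
```

In this toy setup the loss is small when the anchor representation lies close to the pooled positive sequence and large when it lies close to the pooled negative sequence, which is the behavior a contrastive objective is meant to encourage.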

Updated: 2024-02-08