In Defense of Clip-Based Video Relation Detection
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2024-03-27, DOI: 10.1109/tip.2024.3379935
Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
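To make the bottom-up, clip-based paradigm described above concrete, the sketch below illustrates its merge step: relations are first classified on short clip tubelet pairs, and temporally adjacent clip-level triplets that agree on subject, predicate, and object are then chained into long video-level relations. This is a minimal, hypothetical illustration, not the paper's implementation; the ClipRelation structure, the merge_clip_relations helper, and the greedy merging rule are assumptions made for exposition.

from dataclasses import dataclass
from typing import List

@dataclass
class ClipRelation:
    # Hypothetical clip-level prediction: a relation triplet grounded on
    # two tubelets within one short clip. Track ids are assumed to be
    # consistent across clips (in practice this requires tubelet association).
    subject_id: int      # track id of the subject tubelet
    predicate: str       # e.g. "ride", "chase"
    object_id: int       # track id of the object tubelet
    start: int           # first frame of the clip
    end: int             # last frame of the clip (inclusive)

def merge_clip_relations(clip_rels: List[ClipRelation],
                         max_gap: int = 0) -> List[ClipRelation]:
    # Greedily chain clip-level triplets with the same
    # (subject, predicate, object) whose temporal spans touch or overlap,
    # producing video-level relation segments.
    merged: List[ClipRelation] = []
    key = lambda r: (r.subject_id, r.object_id, r.predicate, r.start)
    for r in sorted(clip_rels, key=key):
        last = merged[-1] if merged else None
        if (last is not None
                and (last.subject_id, last.predicate, last.object_id)
                    == (r.subject_id, r.predicate, r.object_id)
                and r.start <= last.end + 1 + max_gap):
            last.end = max(last.end, r.end)   # extend the current segment
        else:
            merged.append(ClipRelation(**vars(r)))  # start a new segment
    return merged

# Example: two adjacent 30-frame clips each predicting person-ride-bicycle
# collapse into a single 60-frame video-level relation.
clips = [ClipRelation(0, "ride", 1, 0, 29), ClipRelation(0, "ride", 1, 30, 59)]
print(merge_clip_relations(clips))

Real systems additionally associate tubelets across clip boundaries and aggregate per-clip relation scores; the sketch sidesteps both by assuming track ids are already globally consistent.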
