Adaptive Multi-Source Predictor for Zero-Shot Video Object Segmentation
International Journal of Computer Vision (IF 19.5) Pub Date: 2024-03-07, DOI: 10.1007/s11263-024-02024-8
Xiaoqi Zhao, Shijie Chang, Youwei Pang, Jiaxing Yang, Lihe Zhang, Huchuan Lu

Both static and moving objects commonly appear in real-life videos. Most video object segmentation methods focus only on extracting and exploiting motion cues to perceive moving objects. When faced with frames containing static objects, such moving-object predictors may produce failed results due to uncertain motion information, such as low-quality optical flow maps. Moreover, different sources such as RGB, depth, optical flow, and static saliency can each provide useful information about the objects, yet existing approaches consider only RGB, or RGB together with optical flow. In this paper, we propose a novel adaptive multi-source predictor for zero-shot video object segmentation (ZVOS). In the static object predictor, the RGB source is converted into depth and static saliency sources simultaneously. In the moving object predictor, we propose a multi-source fusion structure. First, the spatial importance of each source is highlighted with the help of the interoceptive spatial attention module (ISAM). Second, a motion-enhanced module (MEM) is designed to generate pure foreground motion attention, improving the representation of static and moving features in the decoder. Furthermore, we design a feature purification module (FPM) to filter out inter-source incompatible features. With the ISAM, MEM, and FPM, the multi-source features are effectively fused. In addition, we put forward an adaptive predictor fusion network (APF) that evaluates the quality of the optical flow map and fuses the predictions from the static object predictor and the moving object predictor, preventing over-reliance on failed results caused by low-quality optical flow maps. Experiments show that the proposed model outperforms state-of-the-art methods on three challenging ZVOS benchmarks. Moreover, the static object predictor precisely predicts a high-quality depth map and static saliency map at the same time.
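The adaptive fusion idea behind the APF can be illustrated with a minimal NumPy sketch. Note this is a simplification under stated assumptions: the function name `adaptive_fusion` and the scalar `flow_quality` input are illustrative only; in the paper, the APF is a learned network that estimates the quality score from the optical flow map itself rather than receiving it as an argument.

```python
import numpy as np

def adaptive_fusion(static_pred, moving_pred, flow_quality):
    """Blend the two predictors' masks by an estimated flow-quality score.

    static_pred / moving_pred: per-pixel foreground probabilities.
    flow_quality: scalar in [0, 1]; low values indicate an unreliable
    optical flow map, so the fused result leans on the static predictor.
    """
    q = float(np.clip(flow_quality, 0.0, 1.0))  # gate in [0, 1]
    return q * moving_pred + (1.0 - q) * static_pred

# With unreliable flow (q = 0.1), the fusion is dominated by the static predictor,
# avoiding over-reliance on a failed moving-object prediction.
static_pred = np.array([[0.9, 0.1], [0.8, 0.2]])
moving_pred = np.array([[0.2, 0.7], [0.1, 0.6]])
fused = adaptive_fusion(static_pred, moving_pred, 0.1)
```

This gating view makes the failure mode explicit: when the flow map is degraded, the quality score drives the convex combination toward the static branch instead of averaging the two predictions blindly.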




Updated: 2024-03-08