UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning
International Journal of Computer Vision (IF 19.5), Pub Date: 2024-02-22, DOI: 10.1007/s11263-024-01999-8
Xue-Feng Zhu, Tianyang Xu, Zongtao Liu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler

The emergence of large-scale, high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality alone, resulting in a severe imbalance in the amount of annotated data available for other modalities. While several multi-modal datasets have been released, the majority are confined to only two modalities and serve a single computer vision task. To redress this data deficiency for multi-modal learning and applications, this work presents a new dataset named UniMod1K. UniMod1K covers three data modalities: vision, depth, and language. For the vision and depth modalities, the dataset contains 1050 RGB-D sequences comprising approximately 2.5 million frames in total. For the language modality, it includes 1050 sentences, each describing the target object in the corresponding video. To demonstrate the advantages of training on a larger multi-modal dataset such as UniMod1K, and to stimulate the research it enables, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish performance baselines, we propose novel baseline methods for RGB-D object tracking, vision-language tracking, and vision-depth-language tracking, and conduct comprehensive experiments for each task. The results highlight the potential of the UniMod1K dataset to improve the performance of multi-modal approaches. The dataset and code are available at https://github.com/xuefeng-zhu5/UniMod1K.
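As the abstract describes, each UniMod1K sequence pairs an aligned RGB-D video with a single natural-language sentence about the target object. The sketch below illustrates how such a sequence might be read; the directory names (`color/`, `depth/`, `language.txt`) and file extensions are assumptions made purely for illustration, since the actual layout is defined by the official repository linked above.

```python
import os
from typing import List, Tuple


def load_unimod1k_sequence(seq_dir: str) -> Tuple[List[str], List[str], str]:
    """Load one sequence under a hypothetical UniMod1K-style layout:

        seq_dir/color/*.jpg    -- RGB frames
        seq_dir/depth/*.png    -- aligned depth frames
        seq_dir/language.txt   -- one sentence describing the target object

    The real layout is documented in the official repository; adjust the
    paths and extensions accordingly.
    """
    rgb_dir = os.path.join(seq_dir, "color")
    depth_dir = os.path.join(seq_dir, "depth")

    # Sort frames by filename so the RGB and depth streams stay in step.
    rgb_frames = sorted(
        os.path.join(rgb_dir, f) for f in os.listdir(rgb_dir) if f.endswith(".jpg")
    )
    depth_frames = sorted(
        os.path.join(depth_dir, f) for f in os.listdir(depth_dir) if f.endswith(".png")
    )
    assert len(rgb_frames) == len(depth_frames), "RGB and depth streams should be frame-aligned"

    # One sentence per video describing the target object.
    with open(os.path.join(seq_dir, "language.txt"), encoding="utf-8") as fh:
        description = fh.read().strip()

    return rgb_frames, depth_frames, description
```

A caller could then iterate over `rgb_frames[i]` and `depth_frames[i]` frame by frame while conditioning a tracker on `description`, mirroring the vision-depth-language setting studied in the paper.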




Updated: 2024-02-22