UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning
International Journal of Computer Vision (IF 19.5), Pub Date: 2024-02-22, DOI: 10.1007/s11263-024-01999-8
Xue-Feng Zhu, Tianyang Xu, Zongtao Liu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler

The emergence of large-scale, high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality alone, resulting in a severe imbalance in the amount of annotated data available for other modalities. While several multi-modal datasets have been released, the majority are confined to only two modalities and serve a single computer vision task. To redress this data deficiency for multi-modal learning and applications, this work presents a new dataset named UniMod1K. UniMod1K covers three data modalities: vision, depth, and language. For the vision and depth modalities, the dataset contains 1050 RGB-D sequences comprising approximately 2.5 million frames in total. For the language modality, it includes 1050 sentences, each describing the target object in the corresponding video. To demonstrate the advantages of training on a larger multi-modal dataset such as UniMod1K, and to stimulate the research it enables, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish performance baselines, we propose novel baseline methods for RGB-D object tracking, vision-language tracking, and vision-depth-language tracking, and conduct comprehensive experiments for each task. The results highlight the potential of the UniMod1K dataset to improve the performance of multi-modal approaches. The dataset and code are available at https://github.com/xuefeng-zhu5/UniMod1K.
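As the abstract describes, each UniMod1K sequence pairs an aligned RGB-D video with a single natural-language sentence about the target object. The sketch below illustrates how such a sequence might be read; the directory names (`color/`, `depth/`, `language.txt`) and file extensions are assumptions made purely for illustration, since the actual layout is defined by the official repository linked above.

```python
import os
from typing import List, Tuple


def load_unimod1k_sequence(seq_dir: str) -> Tuple[List[str], List[str], str]:
    """Load one sequence under a hypothetical UniMod1K-style layout:

        seq_dir/color/*.jpg    -- RGB frames
        seq_dir/depth/*.png    -- aligned depth frames
        seq_dir/language.txt   -- one sentence describing the target object

    The real layout is documented in the official repository; adjust the
    paths and extensions accordingly.
    """
    rgb_dir = os.path.join(seq_dir, "color")
    depth_dir = os.path.join(seq_dir, "depth")

    # Sort frames by filename so the RGB and depth streams stay in step.
    rgb_frames = sorted(
        os.path.join(rgb_dir, f) for f in os.listdir(rgb_dir) if f.endswith(".jpg")
    )
    depth_frames = sorted(
        os.path.join(depth_dir, f) for f in os.listdir(depth_dir) if f.endswith(".png")
    )
    assert len(rgb_frames) == len(depth_frames), "RGB and depth streams should be frame-aligned"

    # One sentence per video describing the target object.
    with open(os.path.join(seq_dir, "language.txt"), encoding="utf-8") as fh:
        description = fh.read().strip()

    return rgb_frames, depth_frames, description
```

A caller could then iterate over `rgb_frames[i]` and `depth_frames[i]` frame by frame while conditioning a tracker on `description`, mirroring the vision-depth-language setting studied in the paper.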




Updated: 2024-02-22