Simple contrastive learning in a self-supervised manner for robust visual question answering
Computer Vision and Image Understanding ( IF 4.5 ) Pub Date : 2024-02-27 , DOI: 10.1016/j.cviu.2024.103976
Shuwen Yang , Luwei Xiao , Xingjiao Wu , Junjie Xu , Linlin Wang , Liang He

Recent observations have revealed that Visual Question Answering (VQA) models are susceptible to learning spurious correlations formed by dataset biases, i.e., language priors, instead of the intended solution. For instance, given a question and a related image, some VQA systems tend to produce the answer that occurs most frequently in the dataset while disregarding the image content. This tendency makes them brittle in real-world settings and harms the robustness of VQA models. We experimentally found that conventional VQA methods often confuse negative samples that share the same question but have different images, which gives rise to linguistic bias. In this paper, we propose a simple contrastive learning scheme, SCLSM, to mitigate these issues in a self-supervised manner. We construct several special negative samples and introduce a debiasing-aware contrastive learning approach that helps the model learn more discriminative multimodal features, thereby improving its debiasing ability. SCLSM is compatible with numerous VQA baselines. Experimental results on the widely used public datasets VQA-CP v2 and VQA v2 validate the effectiveness of our proposed model.
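The abstract does not specify the exact loss used by SCLSM, but the idea of contrasting a question-image pair against negatives that share the question while differing in the image can be illustrated with a standard InfoNCE-style objective. The sketch below is a minimal, hypothetical illustration (function name, embedding shapes, and temperature are assumptions, not the authors' implementation):

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor embedding.

    anchor, positive: 1-D embeddings (e.g. fused question-image features).
    negatives: 2-D array, one row per negative embedding, e.g. the same
    question paired with different images, as described in the abstract.
    Returns -log( exp(s+) / (exp(s+) + sum_k exp(s_k-)) ), where s is
    temperature-scaled cosine similarity.
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    pos_sim = cos(anchor, positive) / temperature
    neg_sims = np.array([cos(anchor, n) for n in negatives]) / temperature

    # Numerically stable log-sum-exp over the positive and all negatives.
    logits = np.concatenate([[pos_sim], neg_sims])
    m = logits.max()
    log_denom = m + np.log(np.exp(logits - m).sum())
    return log_denom - pos_sim
```

Minimizing this loss pulls the anchor toward its matching image while pushing it away from the same-question/different-image negatives, which is one plausible way to discourage a model from answering based on the question alone.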

Updated: 2024-02-27