Knowledge-Augmented Visual Question Answering With Natural Language Explanation
IEEE Transactions on Image Processing (IF 10.6). Pub Date: 2024-03-28. DOI: 10.1109/tip.2024.3379900
Jiayuan Xie, Yi Cai, Jiali Chen, Ruohang Xu, Jiexin Wang, Qing Li

Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires models not only to generate accurate answers but also to provide explanations that justify the underlying decision-making process. This task is accomplished by generating natural language sentences conditioned on the given question-image pair. However, existing methods often struggle to ensure consistency between answers and explanations because they disregard the crucial interactions between the two. Moreover, existing methods overlook the potential benefits of incorporating external knowledge, which hinders their ability to bridge the semantic gap between questions and images and leads to less accurate explanations. In this paper, we present a novel approach, the Knowledge-based Iterative Consensus VQA-NLE (KICNLE) model, to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, refining the answer and explanation over multiple iterations in each generation. In each iteration, the current answer is used to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module is introduced to provide potentially valid candidate knowledge that guides the generation process, effectively bridging the gap between questions and images and enabling the production of high-quality answer-explanation pairs. Extensive experiments conducted on three different datasets demonstrate the superiority of our proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.
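The iterative consensus procedure described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the control flow only: the function names and the stub logic are assumptions for exposition, not the authors' implementation (see the linked repository for the real model). In a real system each stub would be a learned model conditioned on the question-image pair.

```python
# Hypothetical sketch of the KICNLE-style loop: answer -> explanation -> new answer,
# guided by retrieved candidate knowledge. All functions here are illustrative stubs.

def retrieve_knowledge(question, image):
    # Stub: a real knowledge retrieval module would query an external
    # knowledge source conditioned on the question-image pair.
    return ["candidate fact relevant to the image"]

def generate_explanation(question, image, answer, knowledge):
    # Stub: a real generator conditions on the current answer and the
    # retrieved knowledge to produce a natural-language justification.
    return f"because of '{knowledge[0]}', the answer is '{answer}'"

def generate_answer(question, image, explanation):
    # Stub: a real generator revises the answer guided by the explanation;
    # with no explanation yet, it produces an initial answer.
    return "refined answer" if explanation else "initial answer"

def iterative_consensus(question, image, num_iterations=3):
    knowledge = retrieve_knowledge(question, image)
    answer = generate_answer(question, image, None)   # initial answer
    explanation = None
    for _ in range(num_iterations):
        # Each iteration: current answer yields an explanation,
        # which in turn guides a new answer.
        explanation = generate_explanation(question, image, answer, knowledge)
        answer = generate_answer(question, image, explanation)
    return answer, explanation

answer, explanation = iterative_consensus("What sport is shown?", image=None)
```

The key design point is the mutual conditioning: because the explanation is regenerated from the latest answer and the answer is regenerated from the latest explanation, the pair is pushed toward consistency over iterations.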
