Abstract
Knowledge-based visual question answering requires not only attending to the visual content of an image but also drawing on relevant external knowledge to support reasoning about the question and answer. Because knowledge retrieval cannot rely on visual information alone, the semantics of the question must not be overlooked. To better combine visual and knowledge information, this paper first proposes a question-based semantic retrieval strategy that compensates for knowledge missed by image-based retrieval. Second, image captions are added to help the model achieve better scene understanding. Finally, multimodal knowledge is represented and accumulated as triplets. Experimental results on the OK-VQA dataset show that the proposed method improves on the two baseline methods by 4.24% and 1.90%, respectively, demonstrating its effectiveness.
Data Availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62162010) and the Guizhou Science and Technology Support Program, Qiankehe Support [2022] General 267.
Author information
Contributions
Zhenqiang Su: Conceptualization, Methodology, Validation, Writing - Original Draft. Gang Gou: Funding Acquisition, Supervision, Writing - Review & Editing.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Su, Z., Gou, G. Knowledge enhancement and scene understanding for knowledge-based visual question answering. Knowl Inf Syst 66, 2193–2208 (2024). https://doi.org/10.1007/s10115-023-02028-9