Abstract
Knowledge-based visual question answering requires not only attending to the visual content of an image but also drawing on relevant external knowledge to support reasoning about the question and answer. Because knowledge retrieval cannot rely on visual information alone, the semantics of the question must not be overlooked. To better combine visual and knowledge information, this paper first proposes a question-based semantic retrieval strategy that compensates for knowledge missed by image-based retrieval. Second, image captions are added to help the model achieve better scene understanding. Finally, multimodal knowledge is represented and accumulated as triplets. Experimental results on the OK-VQA dataset show that the proposed method improves on the two baseline methods by 4.24% and 1.90%, respectively, demonstrating its effectiveness.
Data Availability
The datasets analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62162010) and the Guizhou Science and Technology Support Program, Qiankehe Support [2022] General 267.
Author information
Contributions
Zhenqiang Su: Conceptualization, Methodology, Validation, Writing - Original Draft. Gang Gou: Funding Acquisition, Supervision, Writing - Review & Editing.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Su, Z., Gou, G. Knowledge enhancement and scene understanding for knowledge-based visual question answering. Knowl Inf Syst 66, 2193–2208 (2024). https://doi.org/10.1007/s10115-023-02028-9