
Knowledge enhancement and scene understanding for knowledge-based visual question answering

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Knowledge-based visual question answering requires not only attending to the visual content of images but also drawing on relevant external knowledge to reason about questions and answers. The semantics of the question should not be overlooked, since knowledge retrieval cannot rely on visual information alone. To better combine visual and knowledge information, this paper first proposes a question-based semantic retrieval strategy that compensates for knowledge missed by image-based retrieval. Second, image captions are incorporated to help the model achieve better scene understanding. Finally, multimodal knowledge is represented and accumulated as triplets. Experimental results on the OK-VQA dataset show that the proposed method improves on the two baseline methods by 4.24% and 1.90%, respectively, demonstrating its effectiveness.
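To make the retrieval idea concrete, the sketch below illustrates one plausible reading of the question-based semantic retrieval step: embed the question and candidate knowledge triplets in a shared sentence-embedding space and rank facts by cosine similarity, so that question-relevant knowledge supplements what image-based retrieval alone would surface. This is a minimal sketch, not the authors' implementation; the encoder choice (Sentence-BERT via the sentence-transformers library), the toy triplets, and the example question are illustrative assumptions.

```python
# Minimal sketch of question-based semantic knowledge retrieval.
# NOT the paper's implementation: the encoder, the toy knowledge base,
# and the example question are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

# Any sentence encoder would do; all-MiniLM-L6-v2 is a common lightweight choice.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical external knowledge, stored as (subject, relation, object) triplets.
knowledge_triplets = [
    ("umbrella", "UsedFor", "protection from rain"),
    ("frisbee", "IsA", "flying disc thrown between players"),
    ("fire hydrant", "UsedFor", "supplying water to fight fires"),
]
fact_sentences = [f"{s} {r} {o}" for s, r, o in knowledge_triplets]

question = "What is the object the dog is catching used for?"

# Embed the question and all facts, then rank facts by cosine similarity
# to the question embedding.
q_emb = encoder.encode(question, convert_to_tensor=True)
f_emb = encoder.encode(fact_sentences, convert_to_tensor=True)
scores = util.cos_sim(q_emb, f_emb)[0]

top = scores.argmax().item()
print(f"Retrieved fact: {fact_sentences[top]} (score={scores[top].item():.3f})")
```

In a full pipeline, the top-ranked facts would be fused with the visual features and the generated image caption before answer prediction; here only the question-side retrieval is shown.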


Data Availability

The datasets analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62162010) and the Guizhou Science and Technology Support Program (Qiankehe Support [2022] General 267).

Author information

Authors and Affiliations

Authors

Contributions

Zhenqiang Su: Conceptualization, Methodology, Validation, Writing - Original Draft. Gang Gou: Funding Acquisition, Supervision, Writing - Review & Editing.

Corresponding author

Correspondence to Gang Gou.

Ethics declarations

Conflicts of interest

We declare that we have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Su, Z., Gou, G. Knowledge enhancement and scene understanding for knowledge-based visual question answering. Knowl Inf Syst 66, 2193–2208 (2024). https://doi.org/10.1007/s10115-023-02028-9

