Video emotional description with fact reinforcement and emotion awaking

  • Original Research

Journal of Ambient Intelligence and Humanized Computing

Abstract

Video description aims to translate the visual content of a video into appropriate natural language. Most current works focus only on describing factual content and pay insufficient attention to the emotions in the video, so the generated sentences often lack flexibility and vividness. In this work, a model based on fact enhancement and emotion awakening is proposed to describe the video, making the sentences more attractive and colorful. First, the strategy of deep incremental learning is employed to build a multi-layer sequential network, and a multi-stage training method is used to sufficiently optimize the model. Second, the modules of fact inspiration, fact reinforcement and emotion awakening are constructed layer by layer to discover more facts and embed emotions naturally. The three modules are trained cumulatively to sufficiently mine the factual and emotional information. Two public datasets, EmVidCap-S and EmVidCap, are employed to evaluate the proposed model. The experimental results show that the proposed model outperforms not only the baseline models but also other popular methods.
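To make the training strategy described above concrete, the following is a minimal, hypothetical PyTorch sketch of deep incremental learning with cumulative multi-stage training: three stacked modules (standing in for fact inspiration, fact reinforcement and emotion awakening) are added one per stage, and each stage trains the stack built so far. The module internals, the shared word-prediction head, the loss and the toy data are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cumulative multi-stage training:
# one module is added per stage, and each stage optimizes the stack so far,
# so factual layers are trained before the emotion layer is awakened.
import torch
import torch.nn as nn

class StageModule(nn.Module):
    """One layer of the sequential network (e.g. fact inspiration)."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

dim, vocab = 128, 1000
# fact inspiration, fact reinforcement, emotion awakening (placeholders)
stages = nn.ModuleList([StageModule(dim) for _ in range(3)])
head = nn.Linear(dim, vocab)          # shared word-prediction head
criterion = nn.CrossEntropyLoss()

# Toy video features and target caption tokens (stand-ins for real data).
feats = torch.randn(4, 20, dim)       # (batch, frames, feature dim)
targets = torch.randint(0, vocab, (4, 20))

for stage in range(1, len(stages) + 1):          # cumulative training stages
    active = stages[:stage]
    params = list(active.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for _ in range(5):                           # a few epochs per stage
        h = feats
        for module in active:                    # pass through the stack so far
            h = module(h)
        logits = head(h)
        loss = criterion(logits.reshape(-1, vocab), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The sketch only illustrates the cumulative layering; in the paper's actual model the individual modules, supervision signals and initialization across stages are more elaborate.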

Data availability

All experimental data of this study are included in the paper and can be consulted through its charts and tables. The EmVidCap datasets (including EmVidCap-S and EmVidCap-L) can be obtained by contacting the authors. The MSVD dataset can be obtained at http://www.cs.utexas.edu/users/ml/clamp/videoDescription/Youtubeclips.tar.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Nos. 62362041, 62062041 and 62362003), the Scientific Research Foundation of the Education Bureau of Jiangxi Province (No. GJJ211009), the Jiangxi Provincial Natural Science Foundation (Nos. 20212BAB202020 and 20232BAB202017), and the Ph.D. Research Initiation Project of Jinggangshan University (No. JZB1923).

Author information

Corresponding authors

Correspondence to Pengjie Tang or Ai Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest with other people or organizations.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, P., Rao, H., Zhang, A. et al. Video emotional description with fact reinforcement and emotion awaking. J Ambient Intell Human Comput (2024). https://doi.org/10.1007/s12652-024-04779-x
