Video Summarization Using Knowledge Distillation-Based Attentive Network

Cognitive Computation

Abstract

The vast volume of video produced daily calls for efficient ways to retain key information for review and storage, which has driven the popularity of video summarization techniques. Deep learning has shown its advantages in video summarization, especially convolutional neural networks, which are effective at extracting features. However, the depth of such networks and their limited range of temporal dependence make them challenging to deploy and reduce the accuracy of identifying important video frames. To tackle these issues, we present a knowledge distillation-based attentive network (KDAN) for supervised video summarization. Inspired by teaching and learning processes in biology, the proposed method separates the fully convolutional network from the attention mechanism and uses the fully convolutional network as a teacher network to guide the learning of a student network built on an attention mechanism. The resulting lightweight network draws on the knowledge learned by both networks, thus addressing the problems of an explosion in the number of parameters and slow training. We conducted experiments on two widely used public benchmarks, SumMe and TVSum. In the canonical setting, DANtea achieves F-scores of 53.09 and 60.30, and DAN achieves F-scores of 51.26 and 61.55, on SumMe and TVSum, respectively. These results demonstrate the effectiveness and superiority of the proposed network over existing state-of-the-art methods.
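The teacher-student idea in the abstract can be illustrated with a generic distillation objective. The sketch below is not the loss used in the paper, which the abstract does not specify; it assumes per-frame importance scores, a mean squared error for both terms, and a hypothetical mixing weight `alpha`. The function names `mse` and `distillation_loss` are illustrative only.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty score sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def distillation_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    labels: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Blend a supervised term with a teacher-imitation term.

    student, teacher: per-frame importance scores predicted by each network.
    labels: ground-truth per-frame importance scores.
    alpha: assumed mixing weight (hypothetical, not taken from the paper);
           1.0 ignores the teacher, 0.0 ignores the labels.
    """
    supervised = mse(student, labels)   # fit the annotated importance scores
    imitation = mse(student, teacher)   # stay close to the teacher's outputs
    return alpha * supervised + (1.0 - alpha) * imitation


if __name__ == "__main__":
    # Toy scores for a two-frame video.
    loss = distillation_loss(
        student=[0.2, 0.8], teacher=[0.4, 0.6], labels=[0.0, 1.0]
    )
    print(f"distillation loss: {loss:.4f}")
```

In a setup like this, only the student is kept at inference time, which is what would make the deployed model lightweight; the teacher's outputs serve purely as an extra training signal.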

Data Availability

The experiments in this study were conducted on publicly available datasets, which can be obtained from https://github.com/KaiyangZhou/pytorch-vsumm-reinforce

Author information

Corresponding author

Correspondence to Hui Yu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Qin, J., Yu, H., Liang, W. et al. Video Summarization Using Knowledge Distillation-Based Attentive Network. Cogn Comput (2024). https://doi.org/10.1007/s12559-023-10243-3

  • DOI: https://doi.org/10.1007/s12559-023-10243-3
