Abstract
The vast volume of video produced daily calls for efficient ways to surface key information for review and storage, which has driven the popularity of video summarization techniques. Deep learning has shown clear advantages in video summarization, especially convolutional neural networks, which are effective at extracting features. However, deep network layers and a limited range of temporal dependence make such networks difficult to deploy and reduce the accuracy of identifying important video frames. To tackle these issues, we present a knowledge distillation-based attentive network (KDAN) for supervised video summarization. Inspired by teaching and learning processes in biology, the proposed method separates the fully convolutional network from the attention mechanism and uses the fully convolutional network as a teacher to guide the learning of a student network built on an attention mechanism. The resulting lightweight network draws on the knowledge learned by both networks, which addresses the explosion in the number of parameters and slow training. We conducted experiments on two widely used public benchmarks, SumMe and TVSum. In the Canonical setting, DANtea achieves F-scores of 53.09 and 60.30, and DAN achieves F-scores of 51.26 and 61.55, on the SumMe and TVSum datasets, respectively. These results demonstrate the effectiveness of the proposed network and its superiority over existing state-of-the-art methods.
Data Availability
The experiments in this study were conducted using publicly available datasets. The datasets are available at https://github.com/KaiyangZhou/pytorch-vsumm-reinforce
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Competing Interests
The authors declare no competing interests.
About this article
Cite this article
Qin, J., Yu, H., Liang, W. et al. Video Summarization Using Knowledge Distillation-Based Attentive Network. Cogn Comput (2024). https://doi.org/10.1007/s12559-023-10243-3
DOI: https://doi.org/10.1007/s12559-023-10243-3