Abstract
With the advent of egocentric cameras, new challenges arise that traditional computer vision methods cannot adequately handle. Moreover, egocentric cameras often offer multiple modalities, which need to be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN that encodes RGB and depth information and produces classification probabilities; a novel bilinear score pooling block that generates a score matrix; a sparse low-rank matrix recovery block that reduces the redundant features common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. Rather than fusing traditional CNN features, we propose to fuse the classification probabilities of the RGB and depth modalities through an effective yet simple sparse low-rank bilinear score pooling that produces a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. The proposed method outperforms state-of-the-art methods, achieving accuracies of 78.55% and 96.87% on THU-READ in the cross-subject and cross-group settings, respectively, as well as 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.
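To make the bilinear score pooling idea concrete, the sketch below shows one plausible reading of the fusion step: each modality's per-class probabilities are combined via an outer product, yielding a class-by-class score matrix that captures cross-modal co-activations. This is a minimal illustration, not the authors' exact implementation; the toy logits, the `softmax` helper, and the four-class setup are assumptions for demonstration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def bilinear_score_pooling(rgb_probs, depth_probs):
    """Outer product of the two modalities' class probabilities.

    Produces a C x C score matrix (C = number of classes) whose entry
    (i, j) is the joint evidence for class i from RGB and class j from
    depth; strong agreement concentrates mass on the diagonal.
    """
    return np.outer(rgb_probs, depth_probs)

# Hypothetical classification probabilities from RGB and depth CNNs
# for a 4-class toy problem.
rgb = softmax(np.array([2.0, 0.5, 0.1, -1.0]))
depth = softmax(np.array([1.5, 0.8, 0.0, -0.5]))

score_matrix = bilinear_score_pooling(rgb, depth)  # shape (4, 4)
```

Because both inputs are probability vectors summing to one, the fused score matrix also sums to one; in the full pipeline, such a matrix would then pass through the sparse low-rank recovery block before frame-level classification.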
Index Terms
- Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition