Abstract
With the advent of egocentric cameras, new challenges arise that traditional computer vision methods cannot adequately handle. Moreover, egocentric cameras often offer multiple modalities, which need to be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN that encodes RGB and depth information and produces classification probabilities; a novel bilinear score pooling block that generates a score matrix; a sparse low-rank matrix recovery block that reduces the redundant features common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. Rather than fusing traditional CNN features, we propose to fuse the classification probabilities of the RGB and depth modalities through an effective yet simple sparse low-rank bilinear score pooling that produces a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. The proposed method outperforms state-of-the-art methods, achieving accuracies of 78.55% and 96.87% on THU-READ in the cross-subject and cross-group settings, respectively, as well as 91.59% and 43.87% on the FPHA and GUN-71 datasets, respectively.
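To make the bilinear score pooling idea concrete, the sketch below shows one plausible reading of the fusion step: each modality's per-class probabilities are combined via an outer product, yielding a class-by-class score matrix that captures cross-modal co-activations. This is a minimal illustration, not the authors' exact implementation; the toy logits, the `softmax` helper, and the four-class setup are assumptions for demonstration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def bilinear_score_pooling(rgb_probs, depth_probs):
    """Outer product of the two modalities' class probabilities.

    Produces a C x C score matrix (C = number of classes) whose entry
    (i, j) is the joint evidence for class i from RGB and class j from
    depth; strong agreement concentrates mass on the diagonal.
    """
    return np.outer(rgb_probs, depth_probs)

# Hypothetical classification probabilities from RGB and depth CNNs
# for a 4-class toy problem.
rgb = softmax(np.array([2.0, 0.5, 0.1, -1.0]))
depth = softmax(np.array([1.5, 0.8, 0.0, -0.5]))

score_matrix = bilinear_score_pooling(rgb, depth)  # shape (4, 4)
```

Because both inputs are probability vectors summing to one, the fused score matrix also sums to one; in the full pipeline, such a matrix would then pass through the sparse low-rank recovery block before frame-level classification.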
Index Terms
- Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition