Multimodal Score Fusion with Sparse Low Rank Bilinear Pooling for Egocentric Hand Action Recognition

Online AM: 2 April 2024

Abstract

With the advent of egocentric cameras, new challenges have emerged that traditional computer vision techniques are not sufficient to handle. Moreover, egocentric cameras often provide multiple modalities, which must be modeled jointly to exploit complementary information. In this paper, we propose a sparse low-rank bilinear score pooling approach for egocentric hand action recognition from RGB-D videos. It consists of five blocks: a baseline CNN that encodes RGB and depth information to produce classification probabilities; a novel bilinear score pooling block that generates a score matrix; a sparse low-rank matrix recovery block that reduces the redundant features common in bilinear pooling; a one-layer CNN for frame-level classification; and an RNN for video-level classification. Rather than fusing traditional CNN features from the RGB and depth modalities, we fuse their classification probabilities, using a simple yet effective sparse low-rank bilinear score pooling to produce a fused RGB-D score matrix. To demonstrate the efficacy of our method, we perform extensive experiments on two large-scale hand action datasets, THU-READ and FPHA, and two smaller datasets, GUN-71 and HAD. The proposed method outperforms state-of-the-art methods, achieving accuracies of 78.55% and 96.87% on THU-READ in the cross-subject and cross-group settings, respectively, and accuracies of 91.59% and 43.87% on FPHA and GUN-71, respectively.
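
For illustration, the following is a minimal PyTorch sketch of the two fusion steps described above: bilinear score pooling of the per-modality classification probabilities, followed by one sparse low-rank recovery step on the pooled matrix. The class count, thresholds, and all variable names are our own illustrative assumptions, not the authors' released code.

    # A minimal sketch of the score-fusion pipeline described in the abstract,
    # assuming per-frame classification probabilities from the RGB and depth
    # streams. Class count, thresholds, and names are hypothetical.
    import torch

    C = 40  # number of action classes (illustrative)

    # Hypothetical per-frame classification probabilities from each modality.
    p_rgb = torch.softmax(torch.randn(C), dim=0)    # RGB-stream scores
    p_depth = torch.softmax(torch.randn(C), dim=0)  # depth-stream scores

    # Bilinear score pooling: the outer product of the two score vectors
    # yields a C x C matrix of pairwise cross-modal score interactions.
    S = torch.outer(p_rgb, p_depth)

    # One proximal step toward a sparse low-rank recovery of S:
    # singular-value soft-thresholding encourages low rank, and entrywise
    # soft-thresholding prunes redundant interactions; a full solver
    # (e.g., ADMM-based) would iterate such steps to convergence.
    U, sigma, Vh = torch.linalg.svd(S, full_matrices=False)
    L_mat = U @ torch.diag(torch.clamp(sigma - 0.01, min=0.0)) @ Vh
    S_hat = torch.sign(L_mat) * torch.clamp(L_mat.abs() - 1e-3, min=0.0)

    # S_hat would then feed a one-layer CNN for frame-level classification,
    # with an RNN aggregating frame scores into a video-level prediction.
    print(S_hat.shape)  # torch.Size([40, 40])

One apparent advantage of fusing C-dimensional probability vectors rather than high-dimensional CNN features is that the pooled matrix stays at C x C, so a per-frame SVD-based recovery step remains inexpensive.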


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
ISSN: 1551-6857
EISSN: 1551-6865
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.


Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Online AM: 2 April 2024
      • Accepted: 25 March 2024
      • Revised: 18 December 2023
      • Received: 27 July 2023

