Perceiving Actions via Temporal Video Frame Pairs
ACM Transactions on Intelligent Systems and Technology (IF 5) Pub Date: 2024-03-17, DOI: 10.1145/3652611
Rongchang Li, Tianyang Xu, Xiao-Jun Wu, Zhongwei Shen, Josef Kittler

Video action recognition aims to classify the action category in a given video. In general, semantically relevant video frame pairs reflect significant action patterns, such as object appearance variation, as well as abstract temporal concepts such as speed and rhythm. However, existing action recognition approaches tend to extract spatiotemporal features holistically. Though effective, this risks neglecting crucial action features that occur across frames separated by a long temporal span. Motivated by this, in this paper we propose to perceive actions directly via frame pairs and devise a novel Nest Structure with frame pairs as its basic units. Specifically, we decompose a video sequence into all possible frame pairs and organize them hierarchically according to temporal frequency and order, thus transforming the original video sequence into a Nest Structure. By naturally decomposing actions, the proposed structure can flexibly adapt to diverse action variations, such as changes in speed or rhythm. Next, we devise a Temporal Pair Analysis (TPA) module to extract discriminative action patterns based on the proposed Nest Structure. The TPA module consists of a pair calculation part that computes pair features and a pair fusion part that hierarchically fuses them for action recognition. The proposed TPA can be flexibly integrated into existing backbones, serving as a side branch that captures various action patterns from multi-level features. Extensive experiments show that the proposed TPA module achieves consistent improvements over several typical backbones, matching or surpassing CNN-based state-of-the-art results on several challenging action recognition benchmarks.
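
The abstract does not specify the concrete operators used for pair calculation or fusion, so the following is a minimal, illustrative PyTorch sketch of the frame-pair idea: per-frame backbone features are decomposed into all ordered frame pairs, grouped by temporal span (larger spans correspond to lower temporal frequencies in the Nest Structure), and fused for classification. The class name TemporalPairSketch, the concatenate-and-project pair operator, and the mean-based fusion are hypothetical stand-ins, not the paper's actual TPA design.

import torch
import torch.nn as nn

class TemporalPairSketch(nn.Module):
    """Illustrative only: decompose T per-frame features into all
    ordered frame pairs, group them by temporal span, and fuse the
    per-span pair features for classification."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # Hypothetical pair-calculation operator: project the
        # concatenated features of the two frames in each pair.
        self.pair_proj = nn.Linear(2 * channels, channels)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C) per-frame features from a backbone
        T, _ = frames.shape
        span_features = []
        # Group pairs by temporal span d = j - i, for 1 <= d <= T-1;
        # a larger span corresponds to a lower temporal frequency.
        for d in range(1, T):
            # All pairs (frame_i, frame_{i+d}): shape (T-d, 2C)
            pairs = torch.cat([frames[:-d], frames[d:]], dim=-1)
            pair_feats = torch.relu(self.pair_proj(pairs))  # (T-d, C)
            # Fuse pair features within this span
            span_features.append(pair_feats.mean(dim=0))
        # Fuse across spans (simple mean here; the paper's fusion
        # is hierarchical)
        fused = torch.stack(span_features).mean(dim=0)
        return self.classifier(fused)

# Usage: 8 frames of 256-d backbone features, 174 action classes
model = TemporalPairSketch(channels=256, num_classes=174)
logits = model(torch.randn(8, 256))

In this sketch, grouping pairs by temporal span is one plausible reading of organizing pairs "according to temporal frequency and order"; under a speed change, the discriminative pairs simply shift toward shorter or longer spans, which mirrors the flexibility the abstract attributes to the Nest Structure.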



Updated: 2024-03-17