Enhancing temporal action localization in an end-to-end network through estimation error incorporation
Image and Vision Computing ( IF 4.7 ) Pub Date : 2024-03-27 , DOI: 10.1016/j.imavis.2024.104994
Mozhgan Mokari , Khosrow Haj Sadeghi

Temporal action localization presents a significant challenge in computer vision, as an efficient method for this task remains elusive. The objective is to identify human activities within untrimmed videos, determining when each action occurs and which action it is. While using trimmed videos could sidestep the localization problem and improve classification accuracy, it is impractical for real-world applications because the trimming process itself requires human intervention. This highlights the importance of temporal localization. Because several successful approaches exist for action recognition in trimmed videos, conventional multi-stage methods for untrimmed videos commonly employ one network to generate activity proposals, followed by a separate network for classification. These disjoint networks are optimized individually and thus typically deviate from the global optimum, leading to less precise candidate action proposals. To address this challenge, we propose a novel end-to-end neural network that utilizes error estimation for precise action localization and recognition in untrimmed videos. The proposed method performs the localization and classification of action instances simultaneously, thereby optimizing the corresponding networks concurrently. To increase the precision of the action proposal boundaries, a Regression module is innovatively incorporated into the proposed end-to-end network alongside the Evaluation and Classification modules. This module estimates the potential error in proposal time boundaries and enhances the accuracy of the results. We have conducted experiments on THUMOS 14 and ActivityNet-1.3, which are considered the most challenging datasets for temporal action localization. The novel, yet fairly simple, proposed network achieves a remarkable performance improvement over other state-of-the-art methods.
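The abstract describes a Regression module that estimates the error in each proposal's temporal boundaries and uses it to correct them. A minimal sketch of that refinement idea, under the assumption that the head predicts signed start/end offsets (proposal time minus ground-truth time); the function name and interface here are hypothetical, not the authors' API:

```python
# Hypothetical sketch: correct a proposal's (start, end) boundaries by
# subtracting the regression head's estimated errors. All times in seconds.

def refine_proposal(start, end, est_start_err, est_end_err):
    """Refine temporal boundaries using estimated boundary errors.

    est_start_err / est_end_err are assumed to be predicted offsets
    (proposal boundary minus ground-truth boundary), so subtracting
    them moves the proposal toward the ground truth.
    """
    new_start = start - est_start_err
    new_end = end - est_end_err
    # Guard against a degenerate (inverted or empty) segment.
    if new_end <= new_start:
        return start, end
    return new_start, new_end

print(refine_proposal(10.0, 20.0, 0.5, -0.8))  # → (9.5, 20.8)
```

In practice such offsets would be predicted per proposal by a small regression branch and supervised against the ground-truth boundaries; this snippet only illustrates how an estimated error translates into a boundary correction.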
This improvement, which is more pronounced at high temporal intersection with the ground truth, is achieved without requiring extra data or a complicated architecture. By incorporating error estimation, we achieve an improvement in mean Average Precision (mAP). The proposed approach is particularly effective for localizing challenging activities in the complex and diverse ActivityNet-1.3 dataset. For instance, for the “drinking coffee” activity, mAP improved fivefold compared to the best previously reported results.
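The phrase "high temporal intersection with ground truth" refers to the temporal intersection-over-union (tIoU) criterion used to score detections in these benchmarks: a predicted segment counts as correct only if its tIoU with a ground-truth segment exceeds a threshold, so gains at high thresholds indicate tighter boundaries. A minimal sketch of the metric:

```python
# Temporal IoU between two 1-D time segments given as (start, end) pairs.

def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Overlap of 8 s over a union of 12 s.
print(temporal_iou((10.0, 20.0), (12.0, 22.0)))  # ≈ 0.667
```

mAP is then computed per class at one or more tIoU thresholds (e.g. 0.5–0.95 on ActivityNet-1.3), which is why boundary-error correction matters most at the strict end of that range.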

Updated: 2024-03-27