PPTtrack: Pyramid pooling based Transformer backbone for visual tracking
Expert Systems with Applications (IF 8.5) Pub Date: 2024-03-20, DOI: 10.1016/j.eswa.2024.123716
Jun Wang, Shuai Yang, Yuanyun Wang, Guang Yang

In visual tracking, a Convolutional Neural Network (CNN) is usually used as the feature extractor; it can fully explore local dependencies among image patches, which helps improve tracking performance. However, CNNs ignore global dependencies among image patches, and global modeling is crucial in visual tracking. Recently, the Transformer has gained attention for its ability to fully explore global dependencies in sequential data. However, the Transformer's multi-head self-attention mechanism results in high computational complexity. In this paper, we design a pyramid pooling based Transformer backbone network for visual tracking. Pyramid pooling applies multiple pooling operations with different receptive fields and strides to a feature map, and the outputs of the pooling layers are concatenated to form the final pooled feature map. On the one hand, after the feature map is pooled in this way and flattened, its sequence length is greatly reduced, which effectively lowers the computational complexity of multi-head self-attention. On the other hand, pyramid pooling extracts multi-scale features, so the feature maps contain more global context information. Finally, we propose a novel tracker built from the designed pyramid pooling based Transformer backbone network and a Transformer based model predictor. We train the proposed tracker end-to-end and evaluate it on seven tracking benchmarks: UAV123, NFS, TrackingNet, LaSOT, GOT-10K, VOT2020 and RGBT2019. The proposed tracker achieves 79.8% robustness and 35 FPS on the VOT2020 dataset. Experiments demonstrate that the proposed tracker achieves superior tracking performance compared with state-of-the-art trackers.
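
The core mechanism described in the abstract, pooling the feature map at several scales and attending over the concatenated, much shorter token sequence, can be sketched as below. This is a minimal illustrative sketch only: the pooling ratios, dimensions, and class/parameter names are assumptions, not the paper's actual implementation.

```python
# Minimal PyTorch sketch of pyramid-pooling-based self-attention.
# Pooling ratios, dims, and names are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidPoolingAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, pool_ratios=(1, 2, 3, 6)):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # One pooling branch per ratio; outputs are concatenated along the
        # token dimension to form a much shorter key/value sequence.
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(r) for r in pool_ratios])

    def forward(self, x, H, W):
        # x: (B, N, C) flattened feature map with N = H * W tokens
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Pyramid pooling: pool the 2-D feature map to several output sizes,
        # flatten each result, and concatenate -> (B, M, C) with M << N.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = torch.cat([p(feat).flatten(2) for p in self.pools], dim=2).transpose(1, 2)

        kv = self.kv(pooled).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, M, C/heads)

        # Standard scaled dot-product attention, but over the shortened
        # key/value sequence, so the cost is O(N * M) instead of O(N^2).
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, a 14x14 feature map with 256 channels would be processed as `PyramidPoolingAttention()(torch.randn(1, 196, 256), 14, 14)`, where the 196 query tokens attend to only 1 + 4 + 9 + 36 = 50 pooled key/value tokens.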
