CLDE-Net: crowd localization and density estimation based on CNN and transformer network,Multimedia Systems


Multimedia Systems (IF 3.9), Pub Date: 2024-04-08, DOI: 10.1007/s00530-024-01318-8
Yaocong Hu, Yuanyuan Lin, Huicheng Yang, Bingyou Liu, Guoyang Wan, Jinwen Hong, Chao Xie, Wei Wang, Xiaobo Lu

Given a crowd image, a human can approximate the count in two ways: by exactly locating head points in each local region, or by directly estimating the total number of people from the whole image. CNNs and transformers, the two mainstream models for crowd counting, mirror these two modes of human visual perception: a CNN is strong at extracting locality-oriented features, whereas a transformer is suited to modeling global dependencies. Based on this observation, this paper proposes CLDE-Net, the first study to achieve both exact localization and direct estimation with a hybrid of CNN and transformer. Specifically, the CNN searches for all candidate head points in each local region, and the transformer learns the crowd density map with a global receptive field. Furthermore, we adopt two techniques to further boost crowd counting performance: (1) a cross-layer feature interaction module facilitates information transmission between the CNN and transformer branches, and (2) a dynamic factor generator adaptively fuses the results of head point localization and density map estimation. Extensive experiments show that the proposed CLDE-Net framework achieves state-of-the-art performance on multiple crowd counting data sets.
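
To make the fusion idea concrete, here is a minimal sketch of how a localization count and a density-map count could be blended by a single factor. It is not the authors' implementation: the abstract does not specify how the dynamic factor is computed or applied, so the convex combination below, the confidence threshold, and every name (`count_from_points`, `count_from_density`, `fuse_counts`, `alpha`) are assumptions made for illustration. In the paper the factor is produced by a learned generator; here it is simply passed in.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a candidate head point is (x, y, confidence).
Point = Tuple[float, float, float]


def count_from_points(points: Sequence[Point], threshold: float = 0.5) -> int:
    """Count candidate head points whose confidence reaches the threshold."""
    return sum(1 for _, _, conf in points if conf >= threshold)


def count_from_density(density_map: Sequence[Sequence[float]]) -> float:
    """Sum the density map; its integral is the estimated crowd count."""
    return sum(sum(row) for row in density_map)


def fuse_counts(loc_count: float, density_count: float, alpha: float) -> float:
    """Blend both counts with a factor alpha in [0, 1] (assumed fusion rule)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loc_count + (1.0 - alpha) * density_count


# Toy inputs, invented for this example only.
points = [(10.0, 12.0, 0.9), (40.0, 8.0, 0.8), (25.0, 30.0, 0.3)]
density = [[0.5, 0.5], [1.0, 1.0]]

loc_count = count_from_points(points)
density_count = count_from_density(density)
fused = fuse_counts(loc_count, density_count, alpha=0.5)
print(loc_count, density_count, fused)
```

A factor near 1 trusts the point localizations, which tends to suit sparse scenes where heads are individually visible, while a factor near 0 leans on the density estimate; that reading of the trade-off is also an assumption rather than something the abstract states.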


Updated: 2024-04-09
