Abstract
Real-time object detection is important in both industrial and research fields. On edge devices, a giant model can hardly meet the real-time detection requirement, while a lightweight model built from a large number of depth-wise separable convolutional layers cannot achieve sufficient accuracy. We introduce a new lightweight convolutional technique, GSConv, to lighten the model while maintaining accuracy. GSConv accomplishes an excellent trade-off between accuracy and speed. Furthermore, we provide a design suggestion based on GSConv, slim-neck (SNs), to achieve higher computational cost-effectiveness for real-time detectors. The effectiveness of the SNs was robustly demonstrated in over twenty sets of comparative experiments. In particular, the real-time detectors ameliorated by the SNs obtain state-of-the-art results (70.9% AP50 on SODA10M at a speed of ~ 100 FPS on a Tesla T4) compared with the baselines. Code is available at https://github.com/alanli1997/slim-neck-by-gsconv.
References
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA 23–28 June 2014, pp. 580–587. https://doi.org/10.1109/CVPR.2014.81
Girshick, R.: Fast R-CNN. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile 07–13 December 2015, pp. 1440–1448. https://doi.org/10.1109/ICCV.2015.169
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal. Mach. Intel. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA 27–30 June 2016, pp. 779–788. https://doi.org/10.1109/CVPR.2016.91
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017, pp. 6517–6525, arXiv:1612.08242. [Online]. Available: https://arxiv.org/abs/1612.08242. https://doi.org/10.1109/CVPR.2017.690
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv eprints (2018). arXiv:1804.02767. [Online]. https://arxiv.org/abs/1804.02767
Bochkovskiy, A., Wang, C.Y., Liao, H-Y. M.: Yolov4: optimal speed and accuracy of object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, arXiv:2004.10934. [Online]. https://arxiv.org/abs/2004.10934
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision (ECCV), Sep. 2016, pp. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: deconvolutional single shot detector. arXiv eprints 2017, arXiv:1701.06659. [Online]. Available: https://arxiv.org/abs/1701.06659. https://doi.org/10.48550/arXiv.1701.06659
Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017, pp. 1800–1807. [Online]. Available: https://arxiv.org/abs/1610.02357v1. https://doi.org/10.1109/CVPR.2017.195
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv eprints 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, arXiv:1801.04381. [Online]. Available: https://arxiv.org/abs/1801.04381. https://doi.org/10.1109/CVPR.2018.00474
Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for MobileNetV3. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2019, arXiv:1905.02244. [Online]. Available: https://arxiv.org/abs/1905.02244. https://doi.org/10.1109/ICCV.2019.00140
Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, arXiv:1707.01083. [Online]. Available: https://arxiv.org/abs/1707.01083v1. https://doi.org/10.1109/CVPR.2018.00716
Ma, N., Zhang, X., Zheng, H., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of European Conference on Computer Vision (ECCV), 2018, arXiv:1807.11164. [Online]. Available: https://arxiv.org/abs/1807.11164v1. https://doi.org/10.1007/978-3-030-01264-9_8
Zablocki, É., Ben-Younes, H., Pérez, P., et al.: Explainability of deep vision-based autonomous driving systems: review and challenges. Int. J. Comput. Vis. (2022). https://doi.org/10.1007/s11263-022-01657-x
Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Mar. 2020, arXiv:1911.11907. [Online]. Available: https://arxiv.org/abs/1911.11907. https://doi.org/10.1109/CVPR42600.2020.00165
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Assoc. Comput. Mach. 25, 84–90 (2012). https://doi.org/10.1145/3065386
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR 2015; arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
Niu, W., Ma, X., Lin, S., Wang, S., Qian, X. Lin, X., Wang, Y. Ren, B.: PatDNN: achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020, pp. 907–922
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017; pp. 936–944. https://doi.org/10.1109/CVPR.2017.106
Wang, C.-Y., Bochkovskiy, A., Liao, H.-Y. M.: Scaled-YOLOv4: scaling cross stage partial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021, pp. 13024–13033. https://doi.org/10.1109/CVPR46437.2021.01283
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: International Conference on Computer Vision. (ICCV), Seoul, Korea (South) 27 October 2019–02 November 2019; pp. 9626–9635. https://doi.org/10.1109/ICCV.2019.00972
Zhu, C., He, Y., Savvides, M.: Feature selective anchor-free module for single-shot object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA 15–20 June 2019; pp. 840–849. https://doi.org/10.1109/CVPR.2019.00093
He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015). https://doi.org/10.1109/TPAMI.2015.2389824
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.1109/TPAMI.2019.2913372
Woo, S., Park, J., Lee, J., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of European Conference on Computer Vision (ECCV), Jul. 2018, arXiv:1807.06521. [Online]. Available: https://arxiv.org/abs/1807.06521v1. https://doi.org/10.1007/978-3-030-01234-2_1
Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, arXiv:2103.02907. [Online]. Available: https://arxiv.org/abs/2103.02907. https://doi.org/10.1109/CVPR46437.2021.01350
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017, pp. 2261–2269. https://doi.org/10.1109/CVPR.2017.243
Lee, Y., Hwang, J.-w., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA 16–17 June 2019, pp. 752–760. https://doi.org/10.1109/CVPRW.2019.00103
Wang, C.-Y., Mark Liao, H.-Y., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., Yeh, I.-H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA 14–19 June 2020; pp. 1571–1580. https://doi.org/10.1109/CVPRW50498.2020.00203
Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.: UnitBox: An advanced object detection network. Association for Computing Machinery, New York, NY, USA Oct. 2016; pp. 516–520. https://doi.org/10.1145/2964284.2967274
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019; pp. 658–666. https://doi.org/10.1109/CVPR.2019.00075
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU Loss: faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. (AAAI) 34(7), 12993–13000 (2020). https://doi.org/10.1609/aaai.v34i07.6999
Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q., Zuo, W.: Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. (2021). https://doi.org/10.1109/TCYB.2021.3095305
Zhang, Y., Ren, W., Zhang, Z., Jia, Z., Wang, L., Tan, T.: Focal and efficient IoU loss for accurate bounding box regression. arXiv eprints 2021, arXiv:2101.08158. [Online]. Available: https://arxiv.org/abs/2101.08158. https://doi.org/10.1016/j.neucom.2022.07.042
Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv eprints 2017, arXiv:1710.05941. [Online]. https://doi.org/10.48550/arXiv.1710.05941
Misra, D.: Mish: a self-regularized non-monotonic activation function. arXiv eprints 2020, arXiv:1908.08681. [Online]. https://arxiv.org/abs/1908.08681
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15, pp. 315–323 (2011)
Glenn, J.: Yolov5, 2022. https://github.com/ultralytics/yolov5
Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023, pp. 7464–7475
Zhang, S., Xie, Y., Wan, J., Xia, H., Li, S.Z., Guo, G.: WiderPerson: a diverse dataset for dense pedestrian detection in the wild. IEEE Trans. Multimedia 22(2), 380–393 (2020). https://doi.org/10.1109/TMM.2019.2929005
Everingham, M., Ali Eslami, S.M., Gool, L.V., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136 (2015). https://doi.org/10.1007/s11263-014-0733-5
Han, J., Liang, X., Xu, H., Chen, K., Hong, L., Ye, C., Zhang, W., Li, Z., Liang, X., Xu, C.: Soda10m: towards large-scale object detection benchmark for autonomous driving. arXiv eprints 2021, arXiv: 2106.11118. https://doi.org/10.48550/arXiv.2106.11118
Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, P., Zhang, L.: DOTA: a large-scale dataset for object detection in aerial images. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA. https://doi.org/10.1109/CVPR.2018.00418
Acknowledgements
The authors would like to thank Ms. Xinxin Liu (The Intelligent Transportation Big Data Center of the CQJTU) for providing the experimental GPUs.
Author information
Authors and Affiliations
Contributions
Methodology was done by Hulin Li, Jun Li, Hanbing Wei and Zheng Liu; code was done by Hulin Li; experimental design was done by Hulin Li, Jun Li and Zhenfei Zhan; validation was done by Hulin Li, Jun Li, Zheng Liu and Qiliang Ren; writing—original draft preparation was done by Hulin Li; writing—review and editing was done by Hanbing Wei, Zhenfei Zhan, Zheng Liu and Qiliang Ren. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Conflict of interests
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Comparison of the FLOPs and parameter counts of the DSC, SC and GSConv
The effect of DSC on reducing the parameter count of detection networks is obvious. The parameter count of a standard convolutional (SC) layer is C1 × K1 × K2 × C2, and that of a DSC layer is C1 × K1 × K2 + 1 × 1 × C1 × C2, where C1 and C2 are the numbers of channels of the input and output feature maps and K1 × K2 is the convolutional kernel size. The computational cost of an SC layer is W × H × C1 × K1 × K2 × C2, and that of a DSC layer is W × H × C1 × K1 × K2 + W × H × 1 × 1 × C1 × C2, where W and H are the width and height of the feature maps. The reason why the DSC operation is cheaper than standard convolution can be seen from the following ratios:

ratio_p = (C1 × K1 × K2 + 1 × 1 × C1 × C2) / (C1 × K1 × K2 × C2) = 1/C2 + 1/(K1 × K2)

ratio_c = (W × H × C1 × K1 × K2 + W × H × 1 × 1 × C1 × C2) / (W × H × C1 × K1 × K2 × C2) = 1/C2 + 1/(K1 × K2)

where ratio_p is the ratio of the parameter counts of the DSC and SC layers, and ratio_c is the ratio of their computational costs. Since both ratios are far below 1 for typical values of C2 and K1 × K2, the parameter count and computational cost of the DSC are indeed much lower than those of the SC. For the GSConv, which applies an SC producing C2/2 channels followed by a depth-wise convolution on those C2/2 channels, the ratio of its computational cost to that of the SC is:

ratio_GSConv = (W × H × K1 × K2 × (C2/2) × C1 + W × H × K1 × K2 × (C2/2)) / (W × H × K1 × K2 × C2 × C1) = (C1 + 1) / (2 × C1) ≈ 1/2
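The cost ratios above can be checked numerically. The following minimal sketch implements the three cost formulas directly; the feature-map and channel sizes are arbitrary illustrative values, not taken from the paper:

```python
def sc_cost(W, H, C1, K1, K2, C2):
    # Standard convolution: every output channel mixes all C1 input channels.
    return W * H * C1 * K1 * K2 * C2

def dsc_cost(W, H, C1, K1, K2, C2):
    # Depth-wise separable convolution: a depth-wise pass over C1 channels
    # plus a 1x1 point-wise pass from C1 to C2 channels.
    return W * H * C1 * K1 * K2 + W * H * 1 * 1 * C1 * C2

def gsconv_cost(W, H, C1, K1, K2, C2):
    # GSConv cost as described above: an SC producing C2/2 channels,
    # followed by a depth-wise convolution on those C2/2 channels.
    half = C2 // 2
    return W * H * K1 * K2 * C1 * half + W * H * K1 * K2 * half

W, H, C1, K1, K2, C2 = 40, 40, 128, 3, 3, 256
print(dsc_cost(W, H, C1, K1, K2, C2) / sc_cost(W, H, C1, K1, K2, C2))
# = 1/C2 + 1/(K1*K2), roughly 0.115 here
print(gsconv_cost(W, H, C1, K1, K2, C2) / sc_cost(W, H, C1, K1, K2, C2))
# = (C1 + 1) / (2*C1), roughly 0.504 here
```

For these values the DSC costs about one ninth of the SC (dominated by the 1/(K1 × K2) term for large C2), while the GSConv costs about half of the SC, consistent with the closed-form ratios.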
Appendix 2: Comparison of remote sensing image detection
We train the Yolov4 and the SNs-Yolov4 with an A40 on DOTA1.0 [46] (using the same hyperparameters) to compare the ability of the two detectors to detect small objects. Figures 6 and 7 show the test results; the source image is the 23rd image of the test set of the ITCVD dataset (https://eostore.itc.utwente.nl:5001/fsdownload/zZYfgbB2X/ITCVD).
Appendix 3: Comparison of SNs-Yolo-tiny and state-of-the-art lightweight detectors at night in low-light
We test the effectiveness of our approach using a 20-s field video that was captured by a dash cam at night in low light, and show four frames from the test video as an intuitive comparison in Fig. 8.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, H., Li, J., Wei, H. et al. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J Real-Time Image Proc 21, 62 (2024). https://doi.org/10.1007/s11554-024-01436-6