Slim-neck by GSConv: a lightweight-design for real-time detector architectures


Abstract

Real-time object detection matters for both industrial and research applications. On edge devices, a giant model can hardly meet the real-time detection requirement, while a lightweight model built from many depth-wise separable convolution layers cannot reach sufficient accuracy. We introduce a new lightweight convolution technique, GSConv, to lighten the model while maintaining accuracy. GSConv achieves an excellent trade-off between accuracy and speed. Furthermore, based on GSConv we provide a design suggestion, the slim-neck (SNs), to raise the computational cost-effectiveness of real-time detectors. The effectiveness of the SNs was robustly demonstrated in more than twenty sets of comparative experiments. In particular, real-time detectors improved by the SNs achieve state-of-the-art results (70.9% AP50 on SODA10M at a speed of ~ 100 FPS on a Tesla T4) compared with the baselines. Code is available at https://github.com/alanli1997/slim-neck-by-gsconv.
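
To make the idea more concrete, the following is a minimal PyTorch-style sketch of a GSConv-like block, written to be consistent with the cost analysis in Appendix 1 (a standard convolution produces half of the output channels, a depth-wise convolution processes that half, and the two halves are concatenated and channel-shuffled). The kernel sizes, normalization, activation and shuffle details here are illustrative assumptions; the authors' exact module is available at the repository linked above.

import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """Hypothetical GSConv-style block (illustrative sketch, not the official implementation)."""

    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        # standard ("dense") convolution: c_in -> c_out / 2 channels
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half),
            nn.SiLU(),
        )
        # depth-wise convolution on the dense half: c_out / 2 -> c_out / 2 channels
        self.dw = nn.Sequential(
            nn.Conv2d(c_half, c_half, k, 1, k // 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half),
            nn.SiLU(),
        )

    def forward(self, x):
        x1 = self.dense(x)
        y = torch.cat((x1, self.dw(x1)), dim=1)   # c_out channels in total
        # channel shuffle: interleave the dense and depth-wise halves
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

# quick shape check
x = torch.randn(1, 3, 320, 320)
print(GSConvSketch(3, 16)(x).shape)  # torch.Size([1, 16, 320, 320])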


References

  1. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA 23–28 June 2014, pp. 580–587. https://doi.org/10.1109/CVPR.2014.81

  2. Girshick, R.: Fast R-CNN. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile 07–13 December 2015, pp. 1440–1448. https://doi.org/10.1109/ICCV.2015.169

  3. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031

  4. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA 27–30 June 2016, pp. 779–788. https://doi.org/10.1109/CVPR.2016.91

  5. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017, pp. 6517–6525, arXiv:1612.08242. [Online]. Available: https://arxiv.org/abs/1612.08242. https://doi.org/10.1109/CVPR.2017.690

  6. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv eprints (2018). arXiv:1804.02767. [Online]. https://arxiv.org/abs/1804.02767

  7. Bochkovskiy, A., Wang, C.Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, arXiv:2004.10934. [Online]. https://arxiv.org/abs/2004.10934

  8. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision (ECCV), Sep. 2016, pp. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2

  9. Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., Berg, A.C.: DSSD: deconvolutional single shot detector. arXiv eprints 2017, arXiv:1701.06659. [Online]. Available: https://arxiv.org/abs/1701.06659. https://doi.org/10.48550/arXiv.1701.06659

  10. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017, pp. 1800–1807. [Online]. Available: https://arxiv.org/abs/1610.02357v1. https://doi.org/10.1109/CVPR.2017.195

  11. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv eprints 2017, arXiv:1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861

  12. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, arXiv:1801.04381. [Online]. Available: https://arxiv.org/abs/1801.04381v4. https://doi.org/10.1109/CVPR.2018.00474

  13. Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for MobileNetV3. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2019, arXiv:1905.02244. [Online]. Available: https://arxiv.org/abs/1905.02244. https://doi.org/10.1109/ICCV.2019.00140

  14. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, arXiv:1707.01083. [Online]. Available: https://arxiv.org/abs/1707.01083v1. https://doi.org/10.1109/CVPR.2018.00716

  15. Ma, N., Zhang, X., Zheng, H., Sun, J.: ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of European Conference on Computer Vision (ECCV), 2018, arXiv:1807.11164. [Online]. Available: https://arxiv.org/abs/1807.11164v1. https://doi.org/10.1007/978-3-030-01264-9_8

  16. Zablocki, É., Ben-Younes, H., Pérez, P., et al.: Explainability of deep vision-based autonomous driving systems: review and challenges. Int. J. Comput. Vis. (2022). https://doi.org/10.1007/s11263-022-01657-x

  17. Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, arXiv:1911.11907. [Online]. Available: https://arxiv.org/abs/1911.11907. https://doi.org/10.1109/CVPR42600.2020.00165

  18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386

  19. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR 2015; arXiv:1409.1556. [Online]. Available: https://arxiv.org/abs/1409.1556

  20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90

  21. Niu, W., Ma, X., Lin, S., Wang, S., Qian, X., Lin, X., Wang, Y., Ren, B.: PatDNN: achieving real-time DNN execution on mobile devices with pattern-based weight pruning. In: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 907–922

  22. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 July 2017; pp. 936–944. https://doi.org/10.1109/CVPR.2017.106

  23. Wang, C.-Y., Bochkovskiy, A., Liao, H.-Y.M.: Scaled-YOLOv4: scaling cross stage partial network. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021, pp. 13024–13033. https://doi.org/10.1109/CVPR46437.2021.01283

  24. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Seoul, Korea (South) 27 October–02 November 2019, pp. 9626–9635. https://doi.org/10.1109/ICCV.2019.00972

  25. Zhu, C., He, Y., Savvides, M.: Feature selective anchor-free module for single-shot object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA 15–20 June 2019; pp. 840–849. https://doi.org/10.1109/CVPR.2019.00093

  26. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916 (2015). https://doi.org/10.1109/TPAMI.2015.2389824

  27. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.1109/TPAMI.2019.2913372

  28. Woo, S., Park, J., Lee, J., Kweon, I.S.: CBAM: convolutional block attention module. In: Proceedings of European Conference on Computer Vision (ECCV), Jul. 2018, arXiv:1807.06521. [Online]. Available: https://arxiv.org/abs/1807.06521v1. https://doi.org/10.1007/978-3-030-01234-2_1

  29. Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, arXiv:2103.02907. [Online]. Available: https://arxiv.org/abs/2103.02907. https://doi.org/10.1109/CVPR46437.2021.01350

  30. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA 21–26 2017, pp. 2261–2269. https://doi.org/10.1109/CVPR.2017.243

  31. Lee, Y., Hwang, J.-w., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA 16–17 June 2019, pp. 752–760. https://doi.org/10.1109/CVPRW.2019.00103

  32. Wang, C.-Y., Mark Liao, H.-Y., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., Yeh, I.-H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA 14–19 June 2020; pp. 1571–1580. https://doi.org/10.1109/CVPRW50498.2020.00203

  33. Yu, J., Jiang, Y., Wang, Z., Cao, Z., Huang, T.: UnitBox: An advanced object detection network. Association for Computing Machinery, New York, NY, USA Oct. 2016; pp. 516–520. https://doi.org/10.1145/2964284.2967274

  34. Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019; pp. 658–666. https://doi.org/10.1109/CVPR.2019.00075

  35. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., Ren, D.: Distance-IoU Loss: faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. (AAAI) 34(7), 12993–13000 (2020). https://doi.org/10.1609/aaai.v34i07.6999

  36. Zheng, Z., Wang, P., Ren, D., Liu, W., Ye, R., Hu, Q., Zuo, W.: Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. (2021). https://doi.org/10.1109/TCYB.2021.3095305

  37. Zhang, Y., Ren, W., Zhang, Z., Jia, Z., Wang, L., Tan, T.: Focal and efficient IoU loss for accurate bounding box regression. arXiv eprints 2021, arXiv:2101.08158. [Online]. Available: https://arxiv.org/abs/2101.08158. https://doi.org/10.1016/j.neucom.2022.07.042

  38. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions. arXiv eprints 2017, arXiv:1710.05941. [Online]. https://doi.org/10.48550/arXiv.1710.05941

  39. Misra, D.: Mish: a self-regularized non-monotonic activation function. arXiv eprints 2020, arXiv:1908.08681. [Online]. https://arxiv.org/abs/1908.08681

  40. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 15, pp. 315–323 (2011)

  41. Jocher, G.: YOLOv5 (2022). https://github.com/ultralytics/yolov5

  42. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023, pp. 7464–7475

  43. Zhang, S., Xie, Y., Wan, J., Xia, H., Li, S.Z., Guo, G.: WiderPerson: a diverse dataset for dense pedestrian detection in the wild. IEEE Trans. Multimedia 22(2), 380–393 (2020). https://doi.org/10.1109/TMM.2019.2929005

  44. Everingham, M., Eslami, S.M.A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136 (2015). https://doi.org/10.1007/s11263-014-0733-5

  45. Han, J., Liang, X., Xu, H., Chen, K., Hong, L., Ye, C., Zhang, W., Li, Z., Liang, X., Xu, C.: SODA10M: towards large-scale object detection benchmark for autonomous driving. arXiv eprints 2021, arXiv:2106.11118. https://doi.org/10.48550/arXiv.2106.11118

  46. Xia, G.S., Bai, X., Ding, J., Zhu, Z., Belongie, S., Luo, J., Datcu, M., Pelillo, M., Zhang, L.: DOTA: a large-scale dataset for object detection in aerial images. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018. https://doi.org/10.1109/CVPR.2018.00418

Acknowledgements

The authors would like to thank Ms. Xinxin Liu (The Intelligent Transportation Big Data Center of the CQJTU) for providing the experimental GPUs.

Author information

Contributions

Methodology was done by Hulin Li, Jun Li, Hanbing Wei and Zheng Liu; code was done by Hulin Li; experimental design was done by Hulin Li, Jun Li and Zhenfei Zhan; validation was done by Hulin Li, Jun Li, Zheng Liu and Qiliang Ren; writing—original draft preparation was done by Hulin Li; writing—review and editing was done by Hanbing Wei, Zhenfei Zhan, Zheng Liu and Qiliang Ren. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Hulin Li.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Comparison of FLOPs and parameter counts between DSC, SC and GSConv

The effect of DSC in reducing the number of parameters of detection networks is obvious. The parameter count of a conventional (standard) convolutional layer is C1 × K1 × K2 × C2, while that of a DSC is C1 × K1 × K2 + 1 × 1 × C1 × C2, where C1 and C2 are the numbers of channels of the input and output feature maps and K1 × K2 is the convolution kernel size. The computational cost of an SC is W × H × C1 × K1 × K2 × C2, and that of a DSC is W × H × C1 × K1 × K2 + W × H × 1 × 1 × C1 × C2, where W and H are the width and height of the feature maps. Why the DSC operation is cheaper than conventional convolution can be explained by assuming the following conditions:

$$\left\{ \begin{aligned} &size_{\text{input}} = W \times H \times C_{1} = 320 \times 320 \times 3 \\ &size_{\text{output}} = W \times H \times C_{2} = 320 \times 320 \times 16 \\ &size_{\text{kernel}} = K_{1} \times K_{2} = 3 \times 3 \\ &ratio_{\text{p}} = \frac{C_{1} \times K_{1} \times K_{2} + 1 \times 1 \times C_{1} \times C_{2}}{C_{1} \times K_{1} \times K_{2} \times C_{2}} = \frac{1}{C_{2}} + \frac{1}{K_{1} \times K_{2}} \approx 0.174 \\ &ratio_{\text{c}} = \frac{W \times H \times C_{1} \times K_{1} \times K_{2} + W \times H \times 1 \times 1 \times C_{1} \times C_{2}}{W \times H \times C_{1} \times K_{1} \times K_{2} \times C_{2}} = \frac{1}{C_{2}} + \frac{1}{K_{1} \times K_{2}} \approx 0.174 \end{aligned} \right.$$

where ratio_p is the ratio of the parameter counts of the DSC and the conventional convolutional layer, and ratio_c is the ratio of their computational costs. Obviously, both the parameter count and the computational cost of the DSC are much lower than those of the SC. For GSConv, the ratio of its computational cost to that of the SC is:

$$ratio_{{\text{c}}}^{*} = \frac{{\frac{1}{2}W \times H \times K_{1} \times K_{2} \times C_{2} \times (C_{1} + 1)}}{{W \times H \times C_{1} \times K_{1} \times K_{2} \times C_{2} }} = \frac{{C_{1} + 1}}{{2C_{1} }} \to \frac{1}{2}$$
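
To make the comparison concrete, the short Python script below is an illustrative arithmetic check, not code from the paper: it plugs the example sizes above into the cost formulas of this appendix and prints the resulting ratios (the helper names are assumptions made for illustration).

# Illustrative check of the parameter/FLOP ratios above (helper names are assumed, not from the paper).
def sc_params(c1, c2, k1, k2):
    return c1 * k1 * k2 * c2                        # standard convolution (SC)

def dsc_params(c1, c2, k1, k2):
    return c1 * k1 * k2 + 1 * 1 * c1 * c2           # depth-wise + point-wise (DSC)

def sc_cost(w, h, c1, c2, k1, k2):
    return w * h * c1 * k1 * k2 * c2

def dsc_cost(w, h, c1, c2, k1, k2):
    return w * h * c1 * k1 * k2 + w * h * 1 * 1 * c1 * c2

def gsconv_cost(w, h, c1, c2, k1, k2):
    return 0.5 * w * h * k1 * k2 * c2 * (c1 + 1)    # cost model used in this appendix

W, H, C1, C2, K1, K2 = 320, 320, 3, 16, 3, 3
print(dsc_params(C1, C2, K1, K2) / sc_params(C1, C2, K1, K2))            # ratio_p ~= 0.174
print(dsc_cost(W, H, C1, C2, K1, K2) / sc_cost(W, H, C1, C2, K1, K2))    # ratio_c ~= 0.174
print(gsconv_cost(W, H, C1, C2, K1, K2) / sc_cost(W, H, C1, C2, K1, K2))
# ratio_c* = (C1 + 1) / (2 * C1): about 0.667 here because C1 = 3; it approaches 0.5 as C1 grows.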

Appendix 2: Comparison of remote sensing image detection

We train the Yolov4 and the SNs-Yolov4 on DOTA1.0 [46] with an A40 GPU (using the same hyperparameters) to compare the two detectors' ability to detect small objects. Figures 6 and 7 show the test results; the source image is the 23rd image of the test set of the ITCVD dataset (https://eostore.itc.utwente.nl:5001/fsdownload/zZYfgbB2X/ITCVD).

Fig. 6 Detection result of the SNs-Yolov4

Fig. 7 Detection result of the original Yolov4

Appendix 3: Comparison of SNs-Yolo-tiny and state-of-the-art lightweight detectors at night in low light

We test the effectiveness of our approach using a 20-s field video captured by a dash cam at night in low light and show four frames from the test video as an intuitive comparison in Fig. 8.

Fig. 8 Visual results of the SNs-Yolo in field traffic (on a Jetson Nano)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, H., Li, J., Wei, H. et al. Slim-neck by GSConv: a lightweight-design for real-time detector architectures. J Real-Time Image Proc 21, 62 (2024). https://doi.org/10.1007/s11554-024-01436-6

