
Scene text detection using structured information and an end-to-end trainable generative adversarial networks

  • Original Paper
Pattern Analysis and Applications

Abstract

Scene text detection poses a considerable challenge due to the diverse nature of text appearance, backgrounds, and orientations. Enhancing robustness, accuracy, and efficiency in this context is vital for several applications, such as optical character recognition, image understanding, and autonomous vehicles. This paper explores the integration of a generative adversarial network (GAN) and a variational autoencoder (VAE) to create a robust and potent text detection network. The proposed architecture comprises three interconnected modules: the VAE module, the GAN module, and the text detection module. In this framework, the VAE module generates diverse and variable text regions. The GAN module then refines and enhances these regions, ensuring heightened realism and accuracy. Finally, the text detection module identifies text regions in the input image by assigning a confidence score to each region. The entire network is trained end to end by minimizing a joint loss function that combines the VAE loss, the GAN loss, and the text detection loss: the VAE loss encourages diversity in the generated text regions, the GAN loss enforces realism and accuracy, and the text detection loss drives high-precision identification of text regions. The proposed method employs an encoder-decoder structure within the VAE module and a generator-discriminator structure in the GAN module. Rigorous testing on diverse datasets, including Total-Text, CTW1500, ICDAR 2015, ICDAR 2017, ReCTS, TD500, COCO-Text, SynthText, Street View Text, and KAIST Scene Text, demonstrates the superior performance of the proposed method compared to existing approaches.
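The abstract states only that the joint objective combines the VAE, GAN, and detection losses; a minimal sketch of such an objective, assuming hypothetical scalar weights $\lambda$ and the standard evidence-lower-bound form of the VAE term (neither of which is specified in the article), is

$$\mathcal{L}_{\text{total}} = \lambda_{\mathrm{VAE}}\,\mathcal{L}_{\mathrm{VAE}} + \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{det}}\,\mathcal{L}_{\mathrm{det}}, \qquad \mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\!\left[-\log p_\theta(x \mid z)\right] + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),$$

where $q_\phi$ and $p_\theta$ denote the VAE encoder and decoder, and $\mathcal{L}_{\mathrm{GAN}}$ and $\mathcal{L}_{\mathrm{det}}$ stand for the adversarial generator-discriminator loss and the confidence-score detection loss, respectively.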


Data availability

No datasets were prepared exclusively for this manuscript.


Funding

Not Applicable.

Author information


Contributions

Both authors contributed equally to this work.

Corresponding author

Correspondence to Palanichamy Naveen.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Naveen, P., Hassaballah, M. Scene text detection using structured information and an end-to-end trainable generative adversarial networks. Pattern Anal Applic 27, 33 (2024). https://doi.org/10.1007/s10044-024-01259-y

