Abstract
Scene text detection is challenging because text appearance, backgrounds, and orientations vary widely. Improving robustness, accuracy, and efficiency is vital for applications such as optical character recognition, image understanding, and autonomous vehicles. This paper explores the integration of a generative adversarial network (GAN) and a variational autoencoder (VAE) to create a robust text detection network. The proposed architecture comprises three interconnected modules: the VAE module, the GAN module, and the text detection module. The VAE module, built on an encoder-decoder structure, generates diverse text regions. The GAN module, built on a generator-discriminator structure, then refines these regions to make them more realistic and accurate. Finally, the text detection module identifies text regions in the input image by assigning a confidence score to each region. The entire network is trained by minimizing a joint loss function that combines the VAE loss, the GAN loss, and the text detection loss: the VAE loss promotes diversity in the generated text regions, the GAN loss promotes realism and accuracy, and the text detection loss promotes high-precision identification of text regions. Experiments on Total-Text, CTW1500, ICDAR 2015, ICDAR 2017, ReCTS, MSRA-TD500, COCO-Text, SynthText, Street View Text, and KAIST Scene Text show that the proposed method outperforms existing approaches.
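The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of each term or how the terms are weighted, so everything below is an assumption made for illustration: a mean-squared reconstruction error plus a Gaussian KL term for the VAE loss, a generator-side binary cross-entropy for the GAN loss, a binary cross-entropy on confidence scores for the text detection loss, and the hypothetical weights `w_vae`, `w_gan`, and `w_det`.

```python
import math


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


def joint_loss(recon, target, mu, log_var, d_fake, det_scores, det_labels,
               w_vae=1.0, w_gan=1.0, w_det=1.0):
    """Weighted sum of the three loss terms (weights are hypothetical)."""
    # Assumed VAE loss: reconstruction error plus KL regularizer.
    vae_loss = mse(recon, target) + kl_divergence(mu, log_var)
    # Assumed GAN loss (generator side): push discriminator outputs toward 1.
    gan_loss = bce(d_fake, [1.0] * len(d_fake))
    # Assumed detection loss: cross-entropy on per-region confidence scores.
    det_loss = bce(det_scores, det_labels)
    total = w_vae * vae_loss + w_gan * gan_loss + w_det * det_loss
    return total, {"vae": vae_loss, "gan": gan_loss, "det": det_loss}


if __name__ == "__main__":
    total, parts = joint_loss(
        recon=[0.5], target=[0.0], mu=[1.0], log_var=[0.0],
        d_fake=[0.5], det_scores=[0.5], det_labels=[1.0],
    )
    print(total, parts)
```

In an actual training setup the three terms would be computed from network outputs by an automatic-differentiation framework and the discriminator would have its own objective; the sketch only shows how the terms might be combined into one scalar.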
Data availability
No data were prepared exclusively for this manuscript.
Funding
Not Applicable.
Author information
Contributions
Both authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not Applicable.
About this article
Cite this article
Naveen, P., Hassaballah, M. Scene text detection using structured information and an end-to-end trainable generative adversarial networks. Pattern Anal Applic 27, 33 (2024). https://doi.org/10.1007/s10044-024-01259-y
DOI: https://doi.org/10.1007/s10044-024-01259-y