Printed Ottoman text recognition using synthetic data and data augmentation

  • Special Issue Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The Ottoman script, an Arabic alphabet-based writing system, was in use for over five centuries. It became obsolete after the change of alphabet in Turkey. A large body of Ottoman documents survives, the overwhelming majority printed in the Naskh style. This work presents a deep learning-based character recognition system for the printed Ottoman script. We first generate a synthetic text image dataset from a text corpus and then augment it using image processing methods. We develop a hybrid convolutional neural network and bidirectional long short-term memory (CNN-BiLSTM) recognizer and train it with the original and the augmented datasets. Finally, we apply a transfer learning procedure to adapt the system to real image data. The proposed system obtains a character error rate (CER) of 0.11 on synthetic data and 0.16 on real data comprising line images from a printed historical Ottoman book.
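The results above are reported as a character error rate (CER). As a minimal sketch of how such a figure is commonly computed, the example below takes the Levenshtein edit distance between each recognized line and its ground truth, then divides the total number of edits by the total number of reference characters. The abstract does not state the exact evaluation protocol, so the corpus-level aggregation and the function names `edit_distance` and `cer` are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: errors are pooled over all lines rather than averaged per line.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Because Python strings are sequences of Unicode code points, the same functions apply unchanged to Arabic-script transcriptions, although combining marks would each count as a separate character under this sketch.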

Notes

  1. The book is publicly accessible at https://archive.org/details/dagackankurt0001ad.

Author information

Contributions

All authors contributed equally.

Corresponding author

Correspondence to Esma F. Bilgin Tasdemir.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bilgin Tasdemir, E.F. Printed Ottoman text recognition using synthetic data and data augmentation. IJDAR 26, 273–287 (2023). https://doi.org/10.1007/s10032-023-00436-9

  • DOI: https://doi.org/10.1007/s10032-023-00436-9
