SUSTEM: An Improved Rule-Based Sundanese Stemmer

Online AM: 05 April 2024

Abstract

Existing Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Because reduplicated words are common in Sundanese, high stemming precision cannot be achieved without addressing them. This paper presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique supported by a Sundanese root word list. In our approach, every stem produced by the affix removal or normalization processes is added to a stem list. Using a stem list can increase stemmer accuracy by reducing stemming errors caused by an incorrect affix removal sequence or by morphological issues. The existing Sundanese stemmer, RBSS, was used for comparison. Two datasets with 8218 unique affixed and reduplicated words were evaluated. The results show that our stemmer's strength and accuracy improved noticeably. The stem list and the word reduplication rules improved our stemmer's recognition of affix types and allowed us to achieve up to 99.30% accuracy.
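
The idea of collecting every intermediate stem and checking it against a root word list can be illustrated with a minimal sketch. This is not the SUSTEM rule set: the root list, the affix inventory, the partial-reduplication heuristic, and the names `ROOTS`, `PREFIXES`, `SUFFIXES`, `strip_once`, and `stem` are all assumptions chosen for illustration. The sketch only shows why trying rules in several orders and keeping all candidates can avoid errors tied to one fixed affix removal sequence.

```python
# Illustrative data only: a tiny hypothetical root list and affix inventory,
# not the resources or rules used by SUSTEM.
ROOTS = {"imah", "baca", "tangkal", "dahar"}
PREFIXES = ("di", "ka", "pa", "sa")
SUFFIXES = ("keun", "eun", "an", "na")


def strip_once(word):
    """Yield every form reachable from `word` by applying a single rule."""
    # Full reduplication: "imah-imah" -> "imah".
    if "-" in word:
        left, right = word.split("-", 1)
        if left == right:
            yield left
    # Partial reduplication of the first two letters: "tatangkal" -> "tangkal".
    if len(word) > 4 and word[:2] == word[2:4]:
        yield word[2:]
    # Prefix and suffix removal, keeping at least three letters of the word.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            yield word[len(prefix):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[:-len(suffix)]


def stem(word, max_rounds=3):
    """Return the first candidate found in ROOTS, else the word unchanged."""
    word = word.lower()
    stem_list = [word]   # every candidate stem produced so far
    frontier = [word]    # candidates produced in the previous round
    for _ in range(max_rounds + 1):
        for candidate in frontier:
            if candidate in ROOTS:
                return candidate
        next_frontier = []
        for candidate in frontier:
            for shorter in strip_once(candidate):
                if shorter not in stem_list:
                    stem_list.append(shorter)
                    next_frontier.append(shorter)
        frontier = next_frontier
    return word


for w in ("imah-imah", "dibaca", "tatangkalan", "dahareun"):
    print(w, "->", stem(w))
```

Because each round applies every rule to every candidate found so far, "tatangkalan" reaches "tangkal" whether the suffix or the reduplication is removed first. A word with no candidate in the root list is returned unchanged.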

        • Published in

          ACM Transactions on Asian and Low-Resource Language Information Processing, Just Accepted
          ISSN:2375-4699
          EISSN:2375-4702

          Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Online AM: 5 April 2024
          • Accepted: 1 April 2024
          • Revised: 15 March 2023
          • Received: 18 April 2022

          Qualifiers

          • research-article