Abstract
Current Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Because Sundanese contains a significant number of reduplicated words, high stemming precision cannot be achieved without addressing them. This paper presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique together with a Sundanese root word list. In our approach, every stem produced by the affix removal or normalization processes is added to a stem list. Using a stem list can increase stemmer accuracy by reducing the stemming errors caused by an incorrect affix removal sequence or by morphological issues. The existing Sundanese stemmer, RBSS, was used as the baseline for comparison. Two datasets with 8218 unique affixed and reduplicated words were evaluated. The results show that our stemmer noticeably improves on both strength and accuracy. The use of the stem list and the word reduplication rules improved our stemmer's recognition of affix types and allowed us to achieve up to 99.30% accuracy.
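The stem-list idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the affix tuples, the three-word root list, the minimum stem length, and the function names `candidate_stems`, `stem_affixed`, and `stem` are all assumptions chosen for illustration, and the reduplication rule covers only hyphenated forms whose halves reduce to the same stem.

```python
# Illustrative data only: a tiny subset of affixes and roots (assumption).
PREFIXES = ("di", "ka", "pa", "sa")
SUFFIXES = ("keun", "eun", "an", "na")
ROOTS = {"dahar", "imah", "budak"}
MIN_STEM_LEN = 3  # assumed guard against over-stripping


def candidate_stems(word):
    """Build a stem list: every form reachable by stripping affixes in any order."""
    stems = [word]
    seen = {word}
    i = 0
    while i < len(stems):
        current = stems[i]
        i += 1
        stripped = []
        for prefix in PREFIXES:
            if current.startswith(prefix) and len(current) - len(prefix) >= MIN_STEM_LEN:
                stripped.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) - len(suffix) >= MIN_STEM_LEN:
                stripped.append(current[:-len(suffix)])
        for candidate in stripped:
            if candidate not in seen:
                seen.add(candidate)
                stems.append(candidate)
    return stems


def stem_affixed(word):
    """Return the first candidate found in the root list, else the word unchanged."""
    for candidate in candidate_stems(word):
        if candidate in ROOTS:
            return candidate
    return word


def stem(word):
    """Handle hyphenated reduplication first, then ordinary affixed words."""
    word = word.lower()
    if "-" in word:
        parts = [stem_affixed(part) for part in word.split("-")]
        if len(set(parts)) == 1:  # all parts reduce to the same stem
            return parts[0]
    return stem_affixed(word)


for example in ("didahar", "dahareun", "imah-imah", "budak-budakna"):
    print(example, "->", stem(example))
```

Because every intermediate form is kept in the list, the lookup does not depend on guessing the correct affix removal order in advance, which is the benefit the abstract attributes to the stem list.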
Index Terms
- SUSTEM: An Improved Rule-Based Sundanese Stemmer