SUSTEM: An Improved Rule-Based Sundanese Stemmer

Authors:

Irwan Setiawan (ORCID: 0000-0002-4161-1495)
Intelligent Knowledge Management Lab, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan
Department of Computer and Informatics Engineering, Politeknik Negeri Bandung, Bandung, Indonesia

Hung-Yu Kao (ORCID: 0000-0002-8890-8544)
Intelligent Knowledge Management Lab, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan
ACM Transactions on Asian and Low-Resource Language Information Processing. Accepted April 2024. https://doi.org/10.1145/3656342

Online AM: 05 April 2024

Abstract

Current Sundanese stemmers either ignore reduplicated words or define rules that handle only affixes. Reduplicated words are common in Sundanese, so a stemmer cannot achieve high precision without addressing them. This paper presents an improved stemmer for the Sundanese language that handles both affixed and reduplicated words. We use a rule-based stemming technique together with a Sundanese root word list. In our approach, every stem produced by the affix removal or normalization processes is added to a stem list. Using a stem list helps increase stemmer accuracy by reducing stemming errors caused by an incorrect affix removal sequence or by morphological issues. We compared our stemmer against RBSS, the existing Sundanese stemmer, on two datasets containing 8218 unique affixed and reduplicated words. The results show that our stemmer's strength and accuracy improved noticeably. The stem list and the word reduplication rules improved the stemmer's recognition of affixed word types and allowed it to achieve up to 99.30% accuracy.
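
The stem-list idea can be illustrated with a minimal Python sketch. This is not the SUSTEM implementation: the root list, the affix sets, the two reduplication rules, and the names `ROOTS`, `PREFIXES`, `SUFFIXES`, `MIN_STEM`, `candidates`, and `stem` are all assumptions chosen for illustration. The sketch only shows why keeping every intermediate stem can help: if one affix removal order fails, another stem in the list may still lead to a root.

```python
# Toy data: these roots and affixes are illustrative assumptions,
# not the word list or rules used in the paper.
ROOTS = {"dahar", "imah", "sapu"}
PREFIXES = ("di", "ka", "pa")
SUFFIXES = ("keun", "eun", "an", "na")
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def candidates(word):
    """Return the stems reachable from `word` by applying one rule."""
    found = []
    # Full reduplication: "imah-imah" -> "imah".
    parts = word.split("-")
    if len(parts) == 2 and parts[0] == parts[1]:
        found.append(parts[0])
    # Partial reduplication, simplified to a repeated two-letter
    # first syllable: "sasapu" -> "sapu".
    if len(word) > 4 and word[:2] == word[2:4]:
        found.append(word[2:])
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            found.append(word[len(prefix):])
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            found.append(word[:-len(suffix)])
    return found


def stem(word):
    """Search the growing stem list and return the first stem found in ROOTS."""
    stem_list = [word]
    index = 0
    while index < len(stem_list):
        current = stem_list[index]
        if current in ROOTS:
            return current
        # Every stem produced by a rule is kept, so later rules can
        # continue from any of them, not only from the most recent one.
        for candidate in candidates(current):
            if candidate not in stem_list:
                stem_list.append(candidate)
        index += 1
    return word  # no root reached: leave the word unchanged


print(stem("didahareun"))  # dahar
print(stem("imah-imah"))   # imah
print(stem("sasapu"))      # sapu
```

In this sketch the stem list doubles as a search queue, so the order in which affixes are tried does not decide the outcome by itself. A real stemmer would need a complete root list and the full Sundanese affix and reduplication rules.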


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing Just Accepted
ISSN:2375-4699
EISSN:2375-4702

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Online AM: 5 April 2024
- Accepted: 1 April 2024
- Revised: 15 March 2023
- Received: 18 April 2022

Author Tags
Sundanese language
reduplication word
affixed word
stemming
Qualifiers
- research-article