1 Introduction

Question answering (QA) is a natural language processing (NLP) task in which a system is given a question in a particular natural language, such as English, and must return a correct answer based on some textual reference corpus. Training a model that solves this task requires a high-quality QA dataset. Each element of such a dataset consists of a natural language (NL) question, a piece of context text, and the location in that text where an answer to the question can be found.

Traditionally, constructing a QA dataset requires human annotators to perform tasks such as writing candidate questions and sourcing the context text. For instance, the English SQuAD dataset was constructed using Stanford’s Daemo crowdsourcing platform, paying each annotator $10.50 per hour (Rajpurkar et al., 2018). However, this is often unattainable for researchers who study under-resourced languages, for various reasons, such as the lack of a crowdsourcing platform: the Amazon Mechanical Turk platform is unavailable in continental Africa and most Asian countries, including Indonesia (Turk, 2017). Funding constraints pose a further obstacle. These issues explain why even the most extensive Indonesian dataset, TydiQA, has a limited size of only six thousand entries (Clark et al., 2020).

Previous researchers have tried to create datasets without human help. For instance, translation algorithms have been used to translate an English QA dataset into the target language. An alignment algorithm then determines which English word is translated into which word in the new language. With the words aligned, one can trace which substring in the new language corresponds to the correct answer (Carrino et al., 2020). However, this strategy is inaccessible for languages that lack a well-trained translation model, either because no sufficiently large parallel corpus exists between the two languages or because relying on a translation API such as Google Translate can be expensive.

Lewis et al. (2019) used a different strategy based on a web dump. They trained an unsupervised translation algorithm that turns a context paragraph into a question: if a translation algorithm is supposed to convert text from language A to language B, one can pretend that the context sentences are written in language A and the questions in language B. While this scheme eliminates the need for annotators, our attempt to reproduce the paper suggests that more effort is needed due to the scarcity of the corpus. For instance, the source code from Lewis et al. (2019) indicates that it requires twenty million example questions, whereas the corpus of Wilie et al. (2020), the most prominent web dump for the Indonesian language, contains only 2 million unique questions. This is problematic when one considers that Indonesian is the tenth most spoken language in the world, with 200 million speakers (Ghosh, 2020). If one of the top 10 most-spoken languages suffers from a dataset-size issue, this raises concern for less widely used languages.

Furthermore, one needs access to the correct answers to construct a QA dataset. This causes a chicken-and-egg problem: obtaining the ground truth requires reading and understanding the context text, yet reading and understanding text is precisely what the trained model is supposed to do. One can use a publicly available knowledge graph (KG) to circumvent this issue.

A KG contains a set of facts like “Paris is the capital of France.” Typically, such a fact is represented as a triple, say (Paris, capital of, France). These facts can be used to automatically build a QA dataset by systematically generating NL questions about them. Several publicly accessible KGs can be used for this purpose, for example, DBpedia, which leverages infoboxes from Wikipedia articles (Lehmann et al., 2015), and Wikidata, whose data are crowd-sourced (Vrandecic & Krötzsch, 2014).

We follow this idea to address the challenge of creating a QA dataset for an under-resourced language. Specifically, we propose a novel Indonesian QA dataset created by leveraging Wikidata as a source of facts and generating candidate questions with the help of a set of grammar rules that conform to Indonesian grammatical patterns. Note that Wikidata is a multilingual KG that stores the proper nouns of each entity in many languages, including Indonesian. Since Wikidata items are typically connected to a corresponding Wikipedia page, we use the Indonesian Wikipedia corpus to attach suitable context sentences to the candidate questions. A proxy model then verifies that these pairs are grammatically sound and attached to the correct text. Finally, our approach ensures dataset diversity through deduplication as a post-processing step.

The generated dataset is called AC-IQuAD. Each row of AC-IQuAD consists of a question, a context paragraph, and the location of the substring that contains the correct answer to the question. Both the question and the context text are in Indonesian. Furthermore, as answers are obtained from Wikidata facts using SPARQL queries, each Indonesian question in the dataset has an equivalent SPARQL query. This allows AC-IQuAD to serve a secondary purpose as a dataset for the knowledge graph question answering (KGQA) task, where a model converts a natural language question into a SPARQL query that can be run against the KG to obtain the answer. However, the evaluation presented in this paper focuses only on the natural language QA task; the evaluation concerning the KGQA task is left as future work.

The evaluation of our dataset comprises both manual and automated evaluation. We argue that it is essential for a dataset to be evaluated manually by native speakers, while having it evaluated automatically addresses the scalability issue. We present our human annotators with a sample of 100 entries from each of six types of questions and have them either approve each question or disapprove it with a pre-determined explanation. As for the automated evaluation, we fine-tune M-BERT with AC-IQuAD and other benchmark QA datasets and compare their accuracy. A good dataset will have a high approval rate from human annotators and enable state-of-the-art models to achieve accuracy competitive with that obtained on other datasets.

The paper is organized as follows. Section 2 discusses related work relevant to our study. Section 3 gives an overview of the AC-IQuAD dataset. Section 4 outlines the four steps used to generate the dataset. Section 5 describes the evaluation method in more detail, Sect. 6 covers the evaluation results, and Sect. 7 concludes.

2 Related work

2.1 Natural language question answering datasets

The largest English QA dataset is SQuAD 2.0 (Rajpurkar et al., 2018), comprising 150 thousand QA items. One-third of these items are impossible items, where the context does not provide any substring that can be the correct answer. These impossible items are necessary to ensure that models do not overfit to simple semantic text patterns but instead demonstrate a deeper understanding of the text and robustness against distracting sentences (Weissenborn et al., 2017).

There are two Indonesian QA datasets. The most prominent native dataset is TydiQA (Clark et al., 2020), which covers eleven languages, including Indonesian. Based on our count, its Indonesian portion has 6 thousand entries for training and 2 thousand entries for development and testing. The dataset was constructed by showing a first group of annotators a snippet of a Wikipedia article and asking them to write a genuine question that the snippet does not answer; a second group of annotators then searched for a relevant text passage that answers the question.

There is also an Indonesian SQuAD dataset (Muis & Purwarianti, 2020), obtained by translating the English SQuAD to Indonesian using the Google Translate API. The authors employed the token alignment algorithm of Carrino et al. (2020) to track which token is translated into which, allowing them to work out the location of the correct answer substring in the context text. However, Clark et al. (2020) noted that this approach is not ideal, as the generated text is too “translationese”Footnote 1 and not native enough.

2.2 Knowledge graph question answering datasets

The KGQA task refers to answering a given question with an answer obtained from entities in some knowledge graph. To solve this task, one usually has to construct a KG query, e.g., in SPARQL, that captures the intent of the question, as illustrated by Fig. 1. The most comprehensive English KGQA dataset is LC-QuAD 2.0 (Dubey et al., 2019). It is notable for the variety of its question styles. Besides straightforward questions, it offers True-False questions, such as “Is Juan José Ibarretxe a chairperson of FC Barcelona?”, which require a SPARQL ASK query; transitive questions (“The movie Hellboy is produced by which man who directed Shape of Water?”); and questions that require access to more than one triple to answer (“Who are the writers of The Second Coming, whose death place is Menton?”). These questions are constructed by starting from a pre-determined list of entities and relations, traversing the KG to find as many matching triples as possible, and converting those triples to text using several templates. The templates are not designed to be grammatically accurate; they merely produce draft sentences, which are then rewritten by human annotators. No such dataset exists yet for Indonesian.

Fig. 1
figure 1

An example of an entry of a KGQA dataset where the SPARQL query conveys the intent of the question: an answer to the query coincides with an answer to the question. The namespace prefixes wdt: and wd: refer to the URIs http://www.wikidata.org/prop/direct/ and http://www.wikidata.org/entity/, respectively

2.3 Automated QA dataset construction

To the authors’ best knowledge, there has been no attempt to build a QA dataset automatically for Indonesian. The following papers are efforts in the English language. We consider two approaches for automated question generation: deriving questions from a knowledge graph triple or from a context text.

Serban et al. (2016) scraped Yahoo Answers to find typical question patterns. Their program searches for frequent n-gram sequences in the corpus. Since the proper noun appearing in a question varies, similar n-gram sequences are grouped into one template, with the proper noun blanked out by a placeholder symbol; for instance, one template is “Who is the wife of #.” To form a question from a context, a model trained on a dataset from Yahoo Answers takes a context text and selects the most suitable template.

Lewis et al. (2019) leveraged Lample et al. (2018)’s unsupervised translation algorithm. They realized that instead of using the algorithm to translate English to another language, they could use it to translate a context text into a question. The relevant noun is clozed and replaced with the proper NER tag to guide the algorithm toward the desired question. For instance, given the context “[PLACE] is the capital city of Poland, whose currency is Zloty.”, the expected question is “What is the capital of Poland?” and not “What is the currency of Poland?” because it is Warsaw that is being clozed.

Heilman and Smith (2010) performed an exhaustive analysis and came up with heuristics to manipulate the grammar of a context sentence into a question sentence. They achieved this by representing the sentence as a constituency tree and applying Tregex rules to its grammar with Tsurgeon.

Like Heilman and Smith, we choose this explicit grammatical manipulation strategy. More precisely, we employ a context-free grammar to exploit the grammatical patterns of the Indonesian language. This is necessary because we do not have a large corpus, so an approach that can operate with minimal input is preferable.

2.4 Performance evaluation via proxy models

When one aims for a gold-standard dataset, the standard practice is to evaluate the dataset’s fitness for the problem with the help of human annotators. In the case of a QA problem, the dataset contains examples, each of which is a question-answer pair. To evaluate the dataset, the annotators manually check every single example and must ensure that the answer part of the example is indeed an answer to the question part.

However, this technique is hard to scale due to the time and resources needed. A possible alternative is to automate it with a proxy model. In this case, we first pick a model that is known to perform well on the QA task based on past evaluations on some benchmark dataset. We then evaluate it on our newly created dataset. If this evaluation achieves at least a comparable level of performance to that on the benchmark dataset, we take it as evidence that our newly created dataset is at least as good as the benchmark dataset.

Eyal et al. (2019) explored this possibility for the text summarization problem. They suggested that a good text summarization model should retain all the necessary information. From this observation, a QA model is trained as a proxy model and then quizzed on the summarized text. High accuracy indicates that the summary is grammatically sound and has not lost the necessary information; otherwise, the proxy model would be “confused” or lack the necessary substring to answer the question.

3 AC-IQuAD: dataset overview

Here and henceforth, we use the IRI namespace prefixes wd: and wdt:, which refer to http://www.wikidata.org/entity/ and http://www.wikidata.org/prop/direct/, respectively.

Our dataset AC-IQuAD contains two types of questions: simple and complex. A simple question expresses a SPARQL query consisting of one triple pattern, possibly with an additional triple pattern defining the entity type of the queried subject or object. For example, “Apa diproduseri oleh Guillermo Del Toro?” (“What is produced by Guillermo Del Toro?”) is a simple question since it can be expressed by a single triple pattern: ?A wdt:producer wd:Guillermo_Del_Toro, where ?A is a variable, as indicated by the question mark prefix. A complex question corresponds to a SPARQL query that consists of two triple patterns plus an optional typing triple for the queried subject or object entity. We divide complex questions into four types, as shown in Table 1.
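
To make the triple-pattern view concrete, the following minimal sketch (not part of our construction pipeline) runs the single-triple-pattern query behind this simple question against the public Wikidata endpoint using the SPARQLWrapper library; the IDs wdt:P162 (‘producer’) and wd:Q219124 (Guillermo del Toro) follow the examples used later in the paper.

```python
# Minimal sketch: the single-triple-pattern query behind
# "Apa diproduseri oleh Guillermo Del Toro?", run on the public Wikidata endpoint.
# wdt:P162 = producer, wd:Q219124 = Guillermo del Toro.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?A ?ALabel WHERE {
  ?A wdt:P162 wd:Q219124 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "id,en". }
}
LIMIT 10
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["A"]["value"], row["ALabel"]["value"])
```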

Table 1 Complex question definition

The dataset contains 134,645 simple questions and 60,387 complex questions, as shown in Fig. 2. Table 2 shows that the majority of the simple questions are what-questions with a type specifier, e.g., “film apa ...” (“what movie ...”) instead of “apa ...” (“what ...”). The dataset has a total of 1002 unique specifiers for the what-questions, and the 10 most widely used are listed in Table 3. The majority of these type specifiers concern Indonesian districts, sub-districts, and villages.Footnote 2 Upon further investigation, we found that this happens because the Indonesian Wikipedia has a low-content article, most likely created by a bot, for every district and sub-district in Indonesia. Our procedure picked up these articles to populate AC-IQuAD, which skewed the topic distribution.

Fig. 2
figure 2

An example of a complex question entry from AC-IQuAD. The question (given in Indonesian) asks which city has Finnish and Swedish as its official languages. The query part gives a SPARQL query formed by triple patterns that represent the question. The answer field denotes the entity that answers the question, with the key ’r’ denoting its Wikidata entity ID. The context field provides the context text in which the answer to the question appears. As in the SQuAD dataset, the field ‘answerline’ stores the substring that is the correct answer, with ‘start’ and ‘end’ indicating the location of the answer substring within the context text
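
For concreteness, the sketch below mirrors the entry structure described in the caption of Fig. 2 as a Python dictionary. The key names follow the caption, but the exact schema is defined by the released JSON files, and the abridged question, entity IDs, and offsets shown here are illustrative placeholders only.

```python
# Illustrative only: a complex-question entry shaped after the fields in Fig. 2.
# Key names, IDs, and offsets are approximations, not copied from the dataset.
entry = {
    "question": "Kota apa ... ?",               # the Indonesian question (abridged)
    "query": ("SELECT ?x WHERE { ?x wdt:P37 wd:Q1412 . "
              "?x wdt:P37 wd:Q9027 . }"),        # P37 = official language
    "answer": {"r": "Q1757"},                    # Wikidata entity ID of the answer
    "context": "Helsinki adalah ibu kota Finlandia. ...",
    "answerline": {                              # as in SQuAD: the answer substring
        "text": "Helsinki",
        "start": 0,                              # character offsets within `context`
        "end": 8,
    },
}
print(entry["context"][entry["answerline"]["start"]:entry["answerline"]["end"]])  # Helsinki
```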

Table 2 The WH-word breakdown of simple question
Table 3 The top 10 most widely used specifier for simple what-questions

As Table 4 shows, the majority of the complex questions are of type shr?. This phenomenon is an artifact of our pipeline design: merging knowledge from two different articles often yields a number of sentences with the same subject, so the pipeline tends to generate shr? questions. Section 5.2 discusses why the train-test ratio is not uniform across question types.

Table 4 The breakdown of complex question types by their fold

Figure 3 shows the cumulative distribution of the relative location of the answer span, with 0 being the first character and 1 the last character of the context text. For the simple questions, most answer spans reside in the first couple of sentences; among all entries, 20% of the answer spans lie within the first percent of the text. This makes sense, as a good portion of the context is taken from the first few introductory sentences of Wikipedia articles, which place the essential proper nouns as early as possible. For the complex questions, however, the answer span locations are more spread out. This happens because complex questions are created by merging two simple question entries: since each entry has its own context text, the concatenated context text contains answer spans at more varied positions.

Fig. 3
figure 3

Cumulative distribution of the relative location of the answer span. The x-axis shows the relative location of the answer span inside the context text

4 Method

As outlined by Fig. 4, with some variation in the detail, the method to create both simple and complex questions consists of the following four steps.

  1. 1.

    Candidate Question Generation We convert a triple from Wikidata into a simple question by using a grammar; for complex questions, we convert two triples instead. Heuristics are used to decide the proper WH-word for the question. In both the simple and the complex case, the instance-of type of the queried subject or object is considered when deciding the proper WH-word.

  2. 2.

    Discovery of relevant context sentence We extract a context text for the question from the subject entity’s Indonesian Wikipedia page. If we do not find one, the question is dropped. This step also doubles as a way to reject grammatically unsound questions.

  3. 3.

    Question verification with proxy model We use a trained QA model to verify that the questions are proper by having it answer them. If the proxy model comes up with a different answer, we have reasonable cause to believe that the question is bad, and we discard it.

  4. 4.

    Deduplication As this process may create more than one question for the same content, deduplication is performed to reduce potential noise from bad questions that still passed the previous step, keeping only the best ones.

Fig. 4
figure 4

Workflow diagram with an example where we process

To simplify the discussion, we describe the workflow only for the simple questions. Section 4.5 explains how the steps are modified for the complex question case.

4.1 Question construction via grammar

This step takes a triple and converts it into a question. The aim is to create as many valid questions as possible; questions that are grammatically incorrect can be dropped in subsequent steps. This step consists of two parts.

  1. 1.

    Selection of an appropriate WH-word for the input triple.

  2. 2.

    Conversion of the triple into a natural language question with the selected WH-word.

4.1.1 WH-word selection

Let \(t = \) (S P O) be the input triple. From this triple, one could consider two query patterns, (?x P O) and (S P ?y), asking for the subject and the object entity, respectively. Suppose we focus on the former, i.e., the asked entity is the subject of t. Then, we select an appropriate WH-word as follows:

  1. 1.

    If the triple (S wdt:P31 wd:Q5) holds in Wikidata, then ?x represents an instance of human, and thus, the WH-word “siapa” or “who” is selected.

  2. 2.

    If the query (S wdt:P625 ?coord) has an answer for ?coord in Wikidata, then ?x is a geospatial place because the property wdt:P625 represents a coordinate location. In this case, “di mana” or “where” is selected as the WH-word. In addition, if the query (S wdt:P31 ?Type) has an answer for ?Type, then we also generate a typed what-question. For example, if ?Type is wd:Q515, which corresponds to ‘city’, then the WH-phrase is “kota apa” or “what city”.

  3. 3.

    If neither (S wdt:P31 wd:Q5) holds nor (S wdt:P625 ?coord) has an answer in Wikidata, then “apa” or “what” is an appropriate WH-word, leading to an untyped what-question. In addition, a typed what-question is also generated as above.

The above rules are also applied to t with the focus on O, the object of t, as the asked entity. Note that the above rules allow one triple to result in the selection of multiple WH-words; a short code sketch of the selection logic is given at the end of this subsection. As a more concrete example, consider the following set of triples:

$$\begin{aligned}&\texttt {wd:Q26698156 wdt:P57 wd:Q219124.}\\ &\texttt {wd:Q26698156 wdt:P31 wd:Q11424.} \end{aligned}$$

which express the fact that the movie Shape of Water (wd:Q26698156) was directed by Guillermo del Toro (wd:Q219124). Here, the first triple generates the questions, while the second triple is an instance-of triple used to provide additional type context. From these triples, we choose all of the following WH-words/phrases:

  • “siapa” or “who” if the question is asking about wd:Q219124.

  • “film apa” or “what movie” and “apa” or “what” if the question is asking about wd:Q26698156, i.e., Shape of Water. One should note that if wd:Q26698156 has more than one Indonesian label, all possible alternatives become acceptable specifiers.
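
A minimal sketch of the selection rules above is given below; the helpers has_triple and type_labels are hypothetical stand-ins for Wikidata lookups and are not part of the released code.

```python
# Sketch of the WH-word selection rules of Sect. 4.1.1. `has_triple(s, p, o)`
# checks whether a triple holds in Wikidata (None acts as a wildcard) and
# `type_labels(e)` returns the Indonesian labels of e's instance-of types.
HUMAN, COORDINATE, INSTANCE_OF = "wd:Q5", "wdt:P625", "wdt:P31"

def select_wh_phrases(entity, has_triple, type_labels):
    typed_what = [f"{label} apa" for label in type_labels(entity)]  # e.g. "kota apa"
    if has_triple(entity, INSTANCE_OF, HUMAN):       # rule 1: the entity is a person
        return ["siapa"]                             # "who"
    if has_triple(entity, COORDINATE, None):         # rule 2: a geospatial place
        return ["di mana"] + typed_what              # "where" + typed what-question
    return ["apa"] + typed_what                      # rule 3: untyped + typed what

# Toy facts: Guillermo del Toro is a human; Shape of Water is a film (wd:Q11424).
facts = {("wd:Q219124", "wdt:P31", "wd:Q5"), ("wd:Q26698156", "wdt:P31", "wd:Q11424")}
has = lambda s, p, o: any(f[0] == s and f[1] == p and (o is None or f[2] == o)
                          for f in facts)
labels = {"wd:Q26698156": ["film"]}
print(select_wh_phrases("wd:Q219124", has, lambda e: labels.get(e, [])))    # ['siapa']
print(select_wh_phrases("wd:Q26698156", has, lambda e: labels.get(e, [])))  # ['apa', 'film apa']
```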

4.1.2 Candidate question generation

For this step, let \(t = \) (S P O) be the triple being considered. Also, suppose Ws represents the appropriate WH-word/phrase for S and Wo the WH-word/phrase for O. Then, we generate candidate questions Q for t according to the following four grammar rules, where the first two are used when S is being asked, while the remaining two are used when O is being asked:

  1. R1:

    Q \(\rightarrow \) Ws P O ?

  2. R2:

    Q \(\rightarrow \) O P Ws ?

  3. R3:

    Q \(\rightarrow \) S P Wo ?

  4. R4:

    Q \(\rightarrow \) Wo P S ?

The concrete candidate questions generated from the above grammar are obtained by replacing S, P, and O by the corresponding human-readable labels in Wikidata.

For instance, suppose we have the triple (wd:Q26698156 wdt:P57 wd:Q219124) and we are querying the subject. According to Wikidata, we know the following Indonesian labels:

  1. 1.

    The label for wd:Q219124 is “Guillermo Del Toro”

  2. 2.

    The labels for wdt:P57 are (a) “sutradara” or “director”, (b) “disutradarai oleh” or “directed by”, and (c) “sutradara film” or “movie director”.

  3. 3.

    We also know that the acceptable WH-words for wd:Q26698156, i.e., “Shape of Water”, are (a) “film apa” or “what movie” and (b) “apa” or “what”.

From this observation, we have 12 possible combinations.

  1. Q1.

    (Rule R1, 2a, 3a) Film apa sutradara Guillermo Del Toro?

  2. Q2.

    (Rule R1, 2a, 3b) Apa sutradara Guillermo Del Toro?

  3. Q3.

    (Rule R1, 2b, 3a) Film apa disutradarai oleh Guillermo Del Toro?

  4. Q4.

    (Rule R1, 2b, 3b) Apa disutradarai oleh Guillermo Del Toro?

  5. Q5.

    (Rule R1, 2c, 3a) Film apa sutradara film Guillermo Del Toro?

  6. Q6.

    (Rule R1, 2c, 3b) Apa sutradara film Guillermo Del Toro?

  7. Q7.

    (Rule R2, 2a, 3a) Guillermo Del Toro sutradara film apa?

  8. Q8.

    (Rule R2, 2a, 3b) Guillermo Del Toro sutradara apa?

  9. Q9.

    (Rule R2, 2b, 3a) Guillermo Del Toro disutradarai oleh film apa?

  10. Q10.

    (Rule R2, 2b, 3b) Guillermo Del Toro disutradarai oleh apa?

  11. Q11.

    (Rule R2, 2c, 3a) Guillermo Del Toro sutradara film film apa?

  12. Q12.

    (Rule R2, 2c, 3b) Guillermo Del Toro sutradara film apa?

Q2 “What is directed by Guillermo Del Toro?”, Q3 “What movie is directed by Guillermo Del Toro?”, Q4 “What is directed by Guillermo Del Toro?”, Q8 “Guillermo Del Toro directs what?”, as well as Q7 and Q12 “Guillermo Del Toro directs which movie?”, are semantically and grammatically sound. Nevertheless, all twelve questions go to the next step.

It is possible for two combinations to produce the same sentence. For example, Q7 and Q12 are identical, despite arising from different rule applications: Q7 has “sutradara” as P and “film apa” as Ws, whereas Q12 has “sutradara film” as P and “apa” as Ws.

Note that this example only covers querying the subject. To construct a question for the object, a similar process is used with rules R3 and R4 instead.
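
The combinatorial generation described above can be sketched as follows; the label lists mirror the running example, while in the real pipeline they are retrieved from Wikidata. Rules R3 and R4 would be instantiated analogously when the object is asked.

```python
# Sketch of candidate-question generation (rules R1 and R2, subject asked).
# Duplicated surface forms such as Q7/Q12 collapse automatically in the set.
from itertools import product

def subject_candidates(pred_labels, object_label, wh_phrases):
    questions = set()
    for p, wh in product(pred_labels, wh_phrases):
        questions.add(f"{wh.capitalize()} {p} {object_label}?")  # R1: Ws P O ?
        questions.add(f"{object_label} {p} {wh}?")               # R2: O P Ws ?
    return questions

preds = ["sutradara", "disutradarai oleh", "sutradara film"]
whs = ["film apa", "apa"]
for q in sorted(subject_candidates(preds, "Guillermo Del Toro", whs)):
    print(q)   # 11 unique strings out of the 12 combinations listed above
```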

4.2 Discovery of relevant context sentence

We search for the relevant sentence for the question in Indonesian Wikipedia articles. We always start from the article of the triple’s subject, even if the triple is querying the object. From our empirical experience, going through the subject’s article provides a higher hit rate, even when we are querying the object.

For instance, if the triple is (wd:Q26698156 wdt:P57 wd:Q219124), then we go to the Wikipedia article of Shape of Water (i.e., the entity wd:Q26698156), regardless of whether the question is about the movie or the director. The sentences of the article are tokenized, and each is examined as a candidate for the ideal context sentence.

A sentence is considered an ideal context sentence if:

  • it contains an exact match of the subject, predicate, and object of the question;

  • the order of the subject, predicate, and object in it is consistent with the question.

As an example, consider the following actual snippet from the Indonesian Wikipedia article of Shape of Water:

The Shape of Water adalah film drama fantasi romantis Amerika Serikat tahun 2017 yang disutradarai oleh Guillermo del Toro dan diproduseri oleh Guillermo del Toro dan J. Miles Dale.

The Shape of Water is a 2017 American romantic fantasy drama film directed by Guillermo del Toro and produced by Guillermo del Toro and J. Miles Dale.

For this particular snippet, Q2 and Q3 from Sect. 4.1.2 are the consistent questions:

  • Both questions use the label “Guillermo del Toro” and the predicate label “disutradarai oleh”, which the snippet also contains, rather than the alternatives “sutradara” and “sutradara film”.

  • Both questions are consistent with the snippet’s subject-object order: “disutradarai oleh” comes before “Guillermo del Toro” in the questions and in the snippet.

In this step, we look for an ideal context sentence in the chosen Wikipedia article for each candidate question generated in the previous step. Hence, different candidate questions may be associated with different text snippets. Only questions to which we successfully attach a snippet proceed to the next step; the rest are discarded. In addition, more than one text snippet may be attached to a single question; in that case, each attachment results in a separate entry in the dataset.

One may notice that grammatically invalid questions, such as Q1 (“Film apa sutradara Guillermo Del Toro?”), which suffer from subject-object order confusion, would not obtain an associated text snippet in this step. This is because, unless the author of the Wikipedia article committed the same mistake, the confused question will never satisfy the second condition of an ideal context sentence.
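
The two conditions above can be checked roughly as follows; case-insensitive matching is an assumption of this sketch, and the actual tokenization and matching details of our pipeline may differ.

```python
# Sketch of the ideal-context-sentence test (Sect. 4.2). A sentence qualifies if
# it contains the answer label, the predicate label, and the other entity's
# label, and if their relative order mirrors the question, with the WH-phrase
# standing in for the answer.
def is_ideal_context(sentence, question, answer_label, pred_label,
                     other_label, wh_phrase):
    s, q = sentence.lower(), question.lower()
    labels = {"answer": answer_label.lower(), "pred": pred_label.lower(),
              "other": other_label.lower()}
    s_pos = {k: s.find(v) for k, v in labels.items()}
    if min(s_pos.values()) < 0:                      # a label is missing
        return False
    q_pos = {"answer": q.find(wh_phrase.lower()),    # WH stands in for the answer
             "pred": q.find(labels["pred"]),
             "other": q.find(labels["other"])}
    order = lambda pos: sorted(pos, key=pos.get)     # names sorted by offset
    return order(s_pos) == order(q_pos)

snippet = ("The Shape of Water adalah film drama fantasi romantis Amerika Serikat "
           "tahun 2017 yang disutradarai oleh Guillermo del Toro dan diproduseri "
           "oleh Guillermo del Toro dan J. Miles Dale.")
print(is_ideal_context(snippet, "Film apa disutradarai oleh Guillermo Del Toro?",
                       "Shape of Water", "disutradarai oleh",
                       "Guillermo del Toro", "film apa"))   # True, matching Q3
```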

4.3 Question verification with proxy model

This step aims to verify that the generated questions are grammatically and semantically sound. While we are confident that our grammar rules cover a good proportion of scenarios, the Wikidata entity and property labels used in the previous steps may not be suitable. In this step, a proxy model takes a question and the context produced by the previous step. A question passes the verification if the proxy model returns an exact match of the intended answer (i.e., the label of the asked entity) with at least 70% confidence. The entire entry is discarded if the returned answer is not an exact match or is associated with a lower confidence.

The proxy model we use is an M-BERT fine-tuned with the English SQuAD dataset and the Indonesian part of TydiQA. M-BERT is a multilingual BERT trained on Wikipedia corpora across dozens of languages, including Indonesian (Devlin et al., 2019). The model works because languages often share the same word for the same meaning; thus, it can infer the meaning of a word that exists in only one language by seeing how this isolated word interacts with shared words. Since it is a multilingual model, it can operate on a bilingual dataset, including training in one language and predicting in another (Siblini et al., 2019). For this study, however, we decided to combine the two datasets. Our intuition is that the large size of SQuAD provides the volume needed to ensure the model fits, while the native TydiQA provides the refinement.
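
A minimal sketch of this filter, using the Hugging Face transformers QA pipeline, is shown below; the checkpoint name is a placeholder, since the model we actually use is the M-BERT fine-tuned on SQuAD and TydiQA as described above.

```python
# Sketch of the proxy-model filter (Sect. 4.3). The checkpoint below is only a
# placeholder; the paper fine-tunes M-BERT on SQuAD + TydiQA first.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-base-multilingual-cased")   # placeholder checkpoint

def passes_verification(question, context, intended_answer, threshold=0.70):
    pred = qa(question=question, context=context)
    # Keep the entry only if the model's answer exactly matches the intended
    # answer (the asked entity's label) with at least 70% confidence.
    return (pred["answer"].strip() == intended_answer.strip()
            and pred["score"] >= threshold)
```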

4.4 Deduplication

More than one question may be associated with the same snippet. In this case, we only keep the questions with the best chance of being grammatically and semantically valid, using the following rules:

  • Among typed what-questions with the same context text, keep the one with the highest probability by the proxy model.

  • Among untyped what-questions with the same context text, keep the one with the highest probability by the proxy model.

Assuming both questions Q2 and Q3 survive the previous step, they are not deduplicated as the former is an untyped what-question, while the latter is a typed what-question.
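
A minimal sketch of this rule is shown below; the typed flag and the proxy-model score are assumed to be stored on each entry under hypothetical field names.

```python
# Sketch of deduplication (Sect. 4.4): per context text, keep the highest-scoring
# typed what-question and the highest-scoring untyped what-question.
def deduplicate(entries):
    best = {}
    for e in entries:
        key = (e["context"], e["typed"])      # "typed"/"score" are illustrative names
        if key not in best or e["score"] > best[key]["score"]:
            best[key] = e
    return list(best.values())
```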

4.5 Refinement for the complex question

Complex questions express two triples that share the same subject or object. We adapt the question generation and text discovery steps to generate these questions, as they require more elaborate grammar and text discovery techniques.

Let \(t_1 = \) (S1 P1 O1) be the first triple being considered and \(t_2 = \) (S2 P2 O2) the second one. The two triples share either the same subject or the same object, but not both. We use Ws and Wo to denote an appropriate WH-word/phrase for the shared subject and the shared object, respectively. We then propose two grammar rules for the ?shr and shr? types, as shown in Tables 5 and 6, respectively, and four pairs of CFG rules for ?unq and unq?, as shown in Tables 7 and 8.

For the second step, we do not go back to the Wikipedia corpus. Instead, we go through the triples and contexts already collected for the simple question dataset. If two triples overlap in one of the four fashions described above, they are combined to make a complex question. If the pair shares the same context text, we use it directly; if the pair has different texts, we concatenate them into one. Multiple occurrences of the answer substring in the concatenated text are all accepted as part of the solution.

Take the example of Table 9, which shows two different simple question entries. The triples wd:Q26698156 wdt:P57 wd:Q219124 and wd:Q461540 wdt:P162 wd:Q219124 share the same object but have different predicates. This makes them a candidate for an unq? question. They are then combined as shown in Table 10.
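
A minimal sketch of the merge is shown below; the field names are illustrative, and the surface form of the complex question itself would come from the grammar rules of Tables 5, 6, 7 and 8.

```python
# Sketch of merging two simple entries into one complex entry (Sect. 4.5).
# If the contexts differ they are concatenated, and every occurrence of the
# answer substring in the merged context is accepted as an answer span.
def merge_entries(e1, e2):
    if e1["context"] == e2["context"]:
        context = e1["context"]
    else:
        context = e1["context"] + " " + e2["context"]
    answer = e1["answer"]                        # label of the queried entity
    spans, start = [], context.find(answer)
    while start != -1:
        spans.append((start, start + len(answer)))
        start = context.find(answer, start + 1)
    return {"context": context, "answer": answer, "answer_spans": spans}
```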

Table 5 Grammar rules for complex question type ?shr. Input triples and their labels are given below. The token dan is the Indonesian word for ‘and’
Table 6 Grammar rules for complex question type shr?. Input triples and their labels are given below. The token dan is the Indonesian word for ‘and’
Table 7 Grammar rules for question type ?unq. Input triples and their labels are given below. The token dan is the Indonesian word for ‘and’, while the phrase yang juga roughly translates to ‘that is also’
Table 8 Grammar rules for unq?. Input triples and their labels are given below. The token dan is the Indonesian word for ‘and’, while the phrase yang juga roughly translates to ‘that is also’
Table 9 Two different entries
Table 10 Newly created complex entries

5 Evaluation methods

We argue that to measure the fitness of a dataset, it needs to be evaluated by both humans and a QA model. Human evaluators who are native speakers are naturally the ideal judges, but they cannot cover the massive size of the dataset. QA-model evaluation, where we compare accuracy under different training configurations, addresses the volume issue.

5.1 Human evaluation

We hired three annotators to evaluate six groups of data. The first five groups are part of the AC-IQuAD dataset and consist of one group of simple questions and four groups, one for each complex question type. As a comparison, the sixth group is a sample of the English SQuAD dataset translated into Indonesian.

For each group, each annotator annotates fifty entries. The first twenty-five questions are shared among the annotators, giving a sample on which we can measure annotator agreement. The second twenty-five are different for each annotator, so each group contains a total of one hundred entries. Once we can show high annotator agreement on the sample of 25 \(\times \) 6 = 150 shared questions, we can be confident about the quality of the remaining (25 \(\times \) 3) \(\times \) 6 = 450 unique questions.

Each annotator is given the question, the answer, and the context text, and must assign one of the following labels:

  • Correct.

  • Flawed evidence. The question is grammatically and semantically valid, but the context does not answer the question.

  • Problematic grammar. A question is given this label if it is grammatically or semantically invalid. The annotators are given the following rule of thumb: if the question sounds right when read out loud, then it is not grammatically problematic. This guidance is provided to ensure that questions are flagged for genuine problems and not for minor concerns such as slightly imprecise word choice.

  • Ambiguous question. The annotator could not work out what the question asks without looking at the answer or the context text.

  • Invalid for other reasons.

We are interested in the percentage of questions approved by the annotators, i.e., given the “Correct” label. The pre-determined explanations attached to disapproved questions allow us to analyze any systematic issues.

5.2 Computer evaluation

For this evaluation, an M-BERT model is trained with different variations of the training dataset. The predicted answer and the actual answer are compared using two metrics: Exact Match (EM) and F1. Exact Match credits the model only when it returns an identical substring; if the question has more than one correct substring, the model only has to return one of them to get the credit. F1 gives partial credit if the returned substring has extra tokens or is missing some necessary tokens; if the model gets partial credit from multiple correct substrings, the largest partial credit is kept.
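
The sketch below illustrates the two metrics; unlike the official SQuAD evaluation script, it omits answer normalization such as lower-casing and punctuation stripping.

```python
# Sketch of Exact Match and token-level F1 with multiple gold substrings.
from collections import Counter

def exact_match(prediction, golds):
    return float(any(prediction == g for g in golds))

def f1(prediction, golds):
    def score(pred, gold):
        p, g = pred.split(), gold.split()
        overlap = sum((Counter(p) & Counter(g)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)
    return max(score(prediction, g) for g in golds)   # keep the best partial credit

print(exact_match("Guillermo del Toro", ["Guillermo del Toro"]))        # 1.0
print(round(f1("oleh Guillermo del Toro", ["Guillermo del Toro"]), 2))  # 0.86
```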

For the simple questions, two test datasets are used: TydiQA’s test fold and AC-IQuAD’s simple question test fold. Six training datasets are tested:

  • TydiQA’s train fold

  • AC-IQuAD’s simple question training fold

  • English SQuAD’s train fold

  • The translated Indonesian SQuAD’s train fold

  • TydiQA combined with the English SQuAD.

  • AC-IQuAD’s simple question training fold combined with the English SQuAD.

To divide AC-IQuAD’s simple questions into training and test folds while avoiding knowledge leaking from the training fold into the test fold through overly similar questions, the following steps are performed (a minimal code sketch of this procedure follows the list):

  1. 1.

    From the set of all unique context texts, we randomly sample 50% of them.

  2. 2.

    Any entry whose context text is in the 50%-list goes into the training fold. Since some entries may share the same context text, the training fold would contain more entries than the test fold.

  3. 3.

    As one triple may generate two or more questions, this step aims to keep all of them either in the training fold or in the test fold, but not both. We move any entry from the test fold to the training fold if it shares the same (non instance-of) triple with another entry that is already in the training fold. For example, both “Washington DC is the capital of which country?” and “What is the capital of the US?” are derived from (wd:Q30 wdt:P36 wd:Q61), i.e., (United States, capital, Washington DC). Therefore, if one of them is in the training fold, the other is transferred to the training fold as well. We call this specific step the “absorption step”.

  4. 4.

    Any remaining entry goes to the test fold.
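
Below is a minimal sketch of this splitting procedure; each entry is assumed to carry its context text and its (non instance-of) source triple under illustrative field names.

```python
# Sketch of the train/test split with the absorption step (Sect. 5.2).
import random

def split_simple_questions(entries, seed=0):
    rng = random.Random(seed)
    contexts = sorted({e["context"] for e in entries})
    train_contexts = set(rng.sample(contexts, len(contexts) // 2))   # step 1

    train = [e for e in entries if e["context"] in train_contexts]   # step 2
    test = [e for e in entries if e["context"] not in train_contexts]

    # Step 3 (absorption): a test entry that shares its triple with any
    # training entry is moved to the training fold.
    train_triples = {e["triple"] for e in train}
    absorbed = [e for e in test if e["triple"] in train_triples]
    test = [e for e in test if e["triple"] not in train_triples]     # step 4
    return train + absorbed, test
```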

For the complex questions, the test dataset used is AC-IQuAD’s complex question test fold. Three training datasets are tried:

  • AC-IQuAD’s simple questions,

  • AC-IQuAD’s combined simple question and complex question training fold, and

  • the translated SQuAD dataset.

We point out that as each complex question “merges” two simple questions, there are three potential combinations:

  • It merges two simple questions which are both in the training fold.

  • It merges two simple questions, one from the training fold and the second from the test fold.

  • It merges two simple questions which are both in the test fold.

We intend only questions from the third group of the AC-IQuAD complex questions to go to the test fold, while the rest go to the training fold. To achieve this, we compare the complex question’s triples, and an entry goes to the complex question test fold only if neither of its triples appears in the simple question training fold.

6 Results

6.1 Human evaluation

Table 11 shows two kappas. The first kappa measures the annotators’ agreement on whether a question is good or not; the second measures their agreement on the specific reason for rejection. The interpretation of how good a kappa is depends on the context: for medical purposes, given the human lives at stake, these values would be too low to base a decision on, but for our purposes they represent a perfectly reasonable inter-annotator agreement (McHugh, 2012). Table 12 shows that, with the exception of the question type querying the subject with a different predicate, our dataset has the quality to compete with a translated dataset.

Table 11 Kappa agreement between annotators
Table 12 Percentage of questions approved by annotators

As Table 13 shows, a significant proportion of the rejected questions suffered from problematic grammar. Our exploration shows that the most prevalent issue among flagged unq? questions is that our pipeline struggles to choose the best WH-word when the question uses the “yang juga” rule. We present some examples to illustrate the issue:

  • Manila ibu kota dari di mana yang juga negara Angkatan Udara Filipina? or “Manila is the capital of where, which is also the country of the Philippine Air Force?” The triples used by the question are:

    $$\begin{aligned} \begin{array}{l} \texttt {wd:Q252 wdt:P36 wd:Q1461} \text { (Manila, capital of, Philippines)} \\ \texttt {wd:Q2327398 wdt:P17 wd:Q1461} \text { (Philippine Air Force, country, Philippines)} \end{array} \end{aligned}$$
    • Besides the wrong WH-word, the problem with this combination is that the property whose Indonesian label is “negara” (‘country’) does not quite have the right rdfs:label to express this question properly.

  • Joshua Suherman lahir di kota sejuta apa dan provinsi Terminal Purabaya? or “Joshua Suherman was born in which million city and province Terminal Purabaya?” The triples used for this question are:

    $$\begin{aligned} \begin{array}{l} \texttt {wd:Q1393191 wdt:P19 wd:Q11462} \text {(Joshua Suherman, place of birth, Surabaya)} \\ \texttt {wd:Q7260794 wdt:P131 wd:Q11462} \text {(Purabaya bus station, located in, Surabaya)} \\ \end{array} \end{aligned}$$
    • Surabaya’s entity types are city and million city. Million city is Wikidata’s entity type for a city whose population is at least one million people. For this question, “what city” would have been a better WH-word selection than “what million city”.

Table 13 Breakdown of why questions are disapproved by the annotators

6.2 Computer evaluation

Table 14 shows the accuracy of the models when trained with the various training datasets and tested on either the test fold of TydiQA or the test fold of our simple question dataset; the train and test fold division includes the absorption step to prevent test leakage.

Table 14 Model accuracy on simple questions

The table shows mixed results. The small size of the Indonesian TydiQA dataset explains why its model underperforms. However, it is surprising that TydiQA combined with the English SQuAD is outperformed by our dataset combined with the English SQuAD and by the translated dataset. One would assume that both of the latter datasets contain noise from their automatic construction and that this noise causes the model to underfit; both datasets underperforming the English SQuAD alone confirms this expectation. One would also expect, however, that the nativeness of TydiQA would complement the volume of the English SQuAD to create the best model. We attribute this to domain shift. Domain shift happens when a model is tested on a dataset sampled from a different distribution than the training dataset: the nature of the label remains the same, but the features change with the translation and the context source. Jia and Liang (2017) showed that something as innocent as adding one similar sentence to the context can cause a significant accuracy drop, and subtle changes in writing style can also be a disturbance. This is consistent with our dataset achieving the best accuracy on our own test fold. The results suggest that our dataset yields competitive accuracy; however, future users of this dataset must exercise caution regarding domain shift, especially for cross-domain applications (Table 14).

For the complex questions, our training datasets give the best accuracy. The performance of the simple-question-only training set and of the combined simple and complex training set is approximately the same (Table 15).

Table 15 Model accuracy on complex questions

6.3 Domain shift hypothesis

To demonstrate our domain shift hypothesis, we employ SentenceBert (Reimers & Gurevych, 2019). SentenceBert pools token embeddings into a unified sentence embedding and refines the BERT embedding by further training it with a task-specific loss function. In our case, we are concerned with whether two questions are semantically similar. For this purpose, the original paper used the natural language inference datasets SNLI and MNLI, in which a model is given a pair of sentences and must tell whether they contradict each other, entail each other, or neither. During this pre-training, two BERT models, twinned in a Siamese network, embed each sentence; the element-wise difference of the two embeddings is concatenated with the embeddings of both sentences, and this concatenated feature is used for classification. The idea, especially with the element-wise difference feature in mind, is to pull the embeddings of similar sentences as close together as possible and push the embeddings of different sentences as far apart as possible.

For this analysis, a pre-trained multilingual SentenceBert model is used. The corpus consists of 5700 Indonesian questions from TydiQA’s training fold, a sample of 5700 simple questions from our training fold, 565 Indonesian questions from TydiQA’s test fold, and a sample of 565 simple questions from our test fold. We retrieve their embeddings and pass them to an autoencoder trained on the M-SentenceBert embeddings of 250,000 Indonesian sentences taken from Wilie et al. (2020)’s Indonesian web dump. The autoencoder produces an embedding with two elements, which are plotted on a graph; to better visualize the contour, the distribution is plotted with KDE.
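
As a sketch of the numerical comparison, the code below embeds questions with a multilingual SentenceBert checkpoint (the model name is a placeholder) and computes a Gaussian-kernel MMD; the autoencoder-based two-dimensional projection used for the plots is omitted here, and the questions are toy stand-ins rather than dataset samples.

```python
# Sketch: Gaussian-kernel Maximum Mean Discrepancy between two question samples
# embedded with a multilingual SentenceBert model (placeholder checkpoint name).
import numpy as np
from sentence_transformers import SentenceTransformer

def mmd_rbf(X, Y, gamma=1.0):
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # placeholder
tydiqa_sample = ["Siapa presiden pertama Indonesia?",
                 "Kapan Perang Dunia II berakhir?"]                   # toy stand-ins
aciquad_sample = ["Film apa disutradarai oleh Guillermo Del Toro?",
                  "Apa diproduseri oleh Guillermo Del Toro?"]
X = np.asarray(model.encode(tydiqa_sample))
Y = np.asarray(model.encode(aciquad_sample))
print(mmd_rbf(X, Y))
```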

Fig. 5
figure 5

KDE plot of the distribution of the autoencoded SentenceBert embedding of dataset entry

Fig. 6
figure 6

KDE plot when the two training datasets are combined

Figure 5 shows how different the distributions of the datasets are. The central mass of the TydiQA training set almost coincides with that of the test set, while the distribution of our AC-IQuAD training set is more haphazard. When the training datasets are merged, as shown in Fig. 6, the distributions of the train and test sets no longer match “quite nicely,” demonstrating the domain shift in action. To get a numerical feel for this difference, we computed the Maximum Mean Discrepancy (MMD), with the following results:

  • The MMD distance between the train and the test of TydiQA is 0.005.

  • The MMD distance between TydiQA’s training fold and our training fold is 0.074.

  • When the training folds are combined, the MMD distance between the combined fold and TydiQA is 0.034.

As the MMD values show, the distribution shifts when the training datasets are combined.

7 Conclusion

We produced an Indonesian QA dataset by leveraging the Wikidata knowledge graph. The dataset is available online in JSON format.Footnote 3 We showed that it has potential, as indicated by both manual evaluation by humans and automatic evaluation via QA model training. However, we also highlighted issues such as the possibility of domain shift and the skewed topic distribution caused by the underlying corpus. Future research should focus on addressing these limitations.