1 Introduction

Nowadays, a considerable volume of news articles is produced daily by news sites around the world. With the advent of the Internet and the growing use of technology in everyday life, news content has been digitized and made available online. In addition to current news, news from past decades is also published on web pages as news archives. Internet users read the news either to stay up to date with world events or to find answers to questions they have in mind. They use search engines such as Google and Bing to find the news articles relevant to their queries.

Although the invention of search engines has been a turning point in the history of the Internet and the world of technology, they are not efficient enough to find the exact answers to the users’ questions.

In order to find the exact answer to a user's question, a search engine receives the query and retrieves the documents most relevant to it from the massive number of documents on the World Wide Web. The users then review the returned documents themselves and find the appropriate answer to their questions. Since there are many news articles on the WWW and users pose questions of varying complexity, finding answers in this way is time-consuming and sometimes practically impossible. An appropriate solution to this problem is to develop Question Answering (QA) systems for news articles. QA systems are powerful platforms that receive users' questions in natural language and automatically find the exact answers in structured databases or sets of natural language documents (Calijorne et al., 2020). For example, a QA system receives the question "Who was the Secretary-General of the United Nations in 2021?" from the user and answers it with "António Guterres".

The development of QA systems is currently one of the main tasks in computational linguistics and artificial intelligence. QA systems deploy natural language processing (NLP) and information retrieval (IR) tools to find answers to users' questions. They first extract the meaning of a user's question using NLP and then find the answer among a set of relevant web pages using IR methods.

Current QA systems have mostly focused on factoid questions. According to the definition given by Jurafsky and Martin (2000), factoid questions are questions that have short, definite, and unique answers. For example, the question "Who was the Secretary-General of the United Nations in 2021?", mentioned earlier, has a short, clear, and unique answer; therefore, it is a factoid question. In contrast, the question "Who has been the best Secretary-General of the United Nations in the last two decades?" is not factoid, because there is no unique, specific answer to it, and everyone can answer it according to their personal opinion.

The first QA systems were rule-based (Ishwari et al., 2019). Rule-based QA systems use syntactic rules to find the answer to a question in a paragraph of text. These rules are usually written manually based on the lexical and syntactic structure of the question and of the paragraph that contains the answer. Producing such rules requires a deep understanding of the language (Humphrey et al., 2009). As the volume of documents on the Internet grew, statistical approaches became the dominant approach to the QA problem. Statistical approaches deploy learning methods and use large amounts of annotated data to train QA systems. Deep learning techniques, which show significant results on many computational linguistics tasks, have also been applied to QA systems (Huang et al., 2020). For example, Lei et al. (2018) and Xia et al. (2018) use a CNN and an LSTM neural network, respectively, to classify the questions in a QA system. Nishida et al. (2018) and Karpukhin et al. (2020) use deep neural networks to find the relevant documents in a QA system. While deep learning techniques outperform conventional machine learning algorithms, they require large-scale datasets for training.

In recent years, many datasets have been created for QA tasks. The instances in QA datasets are "(P, Q, A)" triplets, where Q, A, and P denote the question, the answer to the question, and the paragraph that contains the answer, respectively.
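As a concrete illustration, such a triplet can be represented as follows. The example text and the SQuAD-style field names are ours, not drawn from any particular dataset:

```python
# A hypothetical (P, Q, A) instance in SQuAD-style notation; the field
# names and the example text are illustrative, not from a real dataset.
instance = {
    "paragraph": "António Guterres became UN Secretary-General in 2017.",
    "question": "When did António Guterres become UN Secretary-General?",
    "answer": {"text": "2017", "answer_start": 48},  # character offset of A in P
}

# The answer can be recovered from the paragraph via its character offset.
a = instance["answer"]
assert instance["paragraph"][a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
```

Storing the answer as an offset into P, rather than as free text, is what lets span-extraction QA models be trained and evaluated directly against the paragraph.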

Most of the existing QA datasets and systems are exclusively in English. An English QA system is designed and implemented for the English language and cannot be deployed to find answers to questions in Persian.

The Persian language, also known as Farsi, belongs to the Indo-Iranian group of the Indo-European language family. It is the official language of Iran, Tajikistan, and Afghanistan and has over 110 million speakers worldwide. Persian is the fifth most common language among the top 10 million websites on the WWW, and there are many Persian web pages available on the Internet. While there have been significant advances in QA for the English language, the amount of research on Persian QA is relatively small. Given the lack of studies on Persian QA systems and the importance and wide applications of QA systems in the news domain, this research aims to design and implement a QA system for Persian news articles. To the best of our knowledge, this is the first QA system for the Persian news domain. The results of the present study answer the following research questions:

  1. What is the question type distribution over the Persian news domain?

  2. What is the complexity of the users' questions about the news?

  3. Do the new technologies such as BERT, ALBERT, and ParsBERT work well for finding the answers to the questions about the Persian news?

  4. What is the performance of the Persian news QA system for answering each type of question?

In order to address the research questions, we first created FarsNewsQuAD: a Persian QA dataset for the news domain. To the best of our knowledge, this is the first QA dataset for the Persian news domain. Then we implemented FarsNewsQA: a QA system for answering questions about the Persian news, and used FarsNewsQuAD to evaluate it. FarsNewsQA offers an F1 score of \(75.61\%\), which is comparable with that of QA systems on the English SQuAD dataset (Rajpurkar et al., 2016) prepared by Stanford University.

The remainder of this paper is organized as follows. In Sect. 2, we discuss the related work and put this work in the appropriate context. Section 3 presents the process of creating FarsNewsQuAD and developing FarsNewsQA, including news articles collection, participants, question-answer creation, and FarsNewsQA architecture. Section 4 contains the experiments carried out to answer the research questions. This is followed by an in-depth analysis of the results in Sect. 5. Finally, we outline conclusions in Sect. 6.

2 Literature review

Question Answering (QA) is a well-known task in the fields of Natural Language Processing (NLP), Information Extraction (IE), and Information Retrieval (IR). The first studies on QA were performed in the field of IR. BaseBall (Green et al., 1961), developed in 1961, was the first QA system and answered a limited number of questions about American baseball games. It read questions from punched cards and found the answers in stored data about the games, using linguistic patterns to comprehend each question's meaning. In 1973, a QA system called LUNAR (Woods, 1973) was designed to help lunar geologists find answers to their questions in the information obtained from NASA's Apollo moon missions. In 1999, the Text REtrieval Conference (TREC) introduced a QA track and provided researchers with a QA dataset containing a collection of news articles and a set of questions and answers about them (Voorhees et al., 2000). The TREC QA tracks continued in later years and drove significant advances in the QA task.

As mentioned earlier, in the TREC QA tasks, the information source for finding the answers was a collection of unstructured documents extracted from newspaper articles. With the advent of the World Wide Web and the growing number of web pages, researchers began to use documents on the web as the information source for QA systems. Modern QA systems are built on the web and try to find the answers to questions among online documents (Zhu et al., 2021).

Nowadays, research on QA has witnessed significant progress, and many QA systems have been developed in recent years. Modern QA systems, which work based on deep learning techniques, require large amounts of data for training. It is worth noting that most of the developed QA systems and datasets are exclusively in English. The most famous English QA dataset is the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016), which contains about 100,000 questions on Wikipedia articles. MS MARCO (Nguyen et al., 2016) and WikiQA (Yang et al., 2015) are other QA datasets for the English language; they were collected by sampling questions that people searched in the Bing search engine and contain 100,000 and 3,000 samples, respectively. The Natural Questions dataset (Kwiatkowski et al., 2019) was built by collecting questions searched in the Google search engine and consists of 300,000 examples. NewsQA (Trischler et al., 2017) is a QA dataset for the English news domain collected from CNN news web pages; it contains over 100,000 instances and is used to train QA systems for English news.

Several QA systems have been developed for non-English languages. Since building QA systems requires large-scale datasets, and such datasets are rarely available for non-English languages, the first step in building a QA system for a non-English language is creating a dataset for that language. Two main approaches can be taken to create QA datasets for non-English languages: (1) translating SQuAD from English to the desired language using machine translation, or (2) building the QA dataset for the target language from scratch, i.e., building a native QA dataset.

Some research translates the SQuAD and uses the translated dataset for building a QA system for a new language. Carrino et al. (2020) translated the SQuAD to Spanish and created a Spanish QA system. Mozannar et al. (2019) deployed machine translation to translate 48,000 instances from SQuAD into Arabic and developed a QA system for the Arabic language. Lee et al. (2018) translated SQuAD into Korean by machine translation and then built a QA system for the Korean language. Croce et al. (2018) developed a QA system for Italian by semi-supervised translation of the SQuAD into Italian.

Other works build native QA datasets and use them to develop QA systems for the target language. Efimov et al. (2020) created a native QA dataset containing 50,000 instances for the Russian language and built a Russian QA system. Shao et al. (2018) built a Chinese QA system using a native Chinese QA dataset with 30,000 instances. Lim et al. (2019) and Keraron et al. (2020) built QA systems for the Korean and French languages, respectively.

Only a few works have targeted building QA systems for the Persian language. Veisi and Shandi (2020) developed a Persian medical QA system to answer questions about diseases and drugs. Abadani et al. (2021a, 2021b) translated SQuAD into Persian and built a QA system for the general Persian domain. Kazemi et al. (2022) created PersianQuAD, a large-scale Persian QA dataset created by native annotators and containing about 20,000 questions, and implemented a QA system to evaluate this dataset. A native answer selection dataset and a deep-learning-based approach for QA in the Persian language are introduced in Lim et al. (2019). Boreshban et al. (2018) created a religious QA dataset and used it to develop a QA system for answering questions in the Persian religious domain. Etezadi and Shamsfard (2020) created PeCoQ, a QA dataset for answering complex Persian questions over the FarsBase knowledge graph. To the best of our knowledge, there is no QA system developed to answer questions about the Persian news.

3 Methodology

In this section, we explain the process of creating the FarsNewsQuAD dataset and FarsNewsQA system.

3.1 FarsNewsQuAD

To ensure the quality of FarsNewsQA, we evaluate it using a Persian news QA dataset. Since no news QA dataset exists for the Persian language, we created FarsNewsQuAD: a QA dataset for the Persian news domain. In this section, we explain the creation process of FarsNewsQuAD.

3.1.1 News articles collection

We need a set of news articles to create FarsNewsQuAD. For this purpose, we randomly selected 500 news articles from Persian news websites such as Hamshahri and the Young Journalist Club. The selected news articles are from different categories such as Politics, Sports, and Social. We extracted about 1000 paragraphs from the selected articles and kept only those with at least 500 characters. Finally, we used the obtained paragraphs in the FarsNewsQuAD creation process. Figure 1 shows the process of news article selection.

Fig. 1 The process of news article selection

3.1.2 Participants

We asked 10 participants to pose questions about the Persian news articles and create FarsNewsQuAD. The participants were studying Linguistics or Computer Engineering, and all of them were native Persian speakers. First, we explained the dataset collection process to the participants through written guidelines and oral explanations. We also gave the participants a set of examples of good and bad questions. Then we asked each participant to pose 20 questions on the news articles. Two experts validated the obtained questions; if at least \(90\%\) of the questions posed by a participant were correct according to the provided guidelines, that participant could take part in the dataset collection process.

3.1.3 Question-answer creation

We used SAJAD to create FarsNewsQuAD. SAJAD stands for "Samaneh Jam Avari Dataset" (سامانه جمع آوری ديتاست, "dataset collection system") and is a platform for creating QA datasets for the Persian language. SAJAD was designed and developed in the BigData Lab at the University of Isfahan. Figure 2 shows a snapshot of the SAJAD platform.

Fig. 2 A snapshot of the SAJAD platform for creating the QA dataset

We asked the participants to log in to SAJAD and create questions on the news paragraphs. When a participant enters the SAJAD page, the system shows a random paragraph from the pool of extracted news paragraphs. The participant then reads the paragraph, writes some questions about the paragraph's text, and specifies the corresponding answer within the paragraph. We asked the participants to spend at least one minute creating each question and to pose 3-5 questions per paragraph. The created question, along with its answer and the paragraph's text, is saved as an instance of FarsNewsQuAD.

Figure 3 shows the process of question-answer collection for FarsNewsQuAD. As Fig. 3 shows, the participant poses a question about the paragraph and types it in the question section. Then they specify the answer to the question by highlighting it within the paragraph text. If a participant cannot pose any questions on a specific paragraph, they can skip it and pose questions on the next paragraphs. To ensure the quality and correctness of FarsNewsQuAD, we asked two validators to check the correctness of the questions. The validators randomly chose a number of questions posed by each annotator and verified their quality with respect to the following criteria: (1) Is the question grammatically correct and fluent in Persian? (2) Does the answer to the question exist in the paragraph text? (3) Is the selected answer the correct answer to the question? Questions that failed to satisfy these criteria were removed from the dataset. In total, we collected about 600 question-answer instances on Persian news articles to create FarsNewsQuAD.

Fig. 3 The process of question-answer collection and making FarsNewsQuAD

3.2 FarsNewsQA

In this section, we describe FarsNewsQA: a QA system developed for the Persian news domain. To better understand how FarsNewsQA works, we first define the QA problem. A QA system receives a question Q and a paragraph P as inputs and finds the answer A to question Q in paragraph P. The paragraph text must contain the answer to the question, and hence the answer is a span of the paragraph text. The answer's span is represented by a start and an end token, marking where the answer begins and ends within the paragraph text. An example of a paragraph, a question, and the corresponding answer with its start and end tokens is shown in Table 1.
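For instance, with the paragraph tokenized into words, the answer span is recovered from its start and end token indices. The toy sketch below uses whitespace tokenization and an example of our own for illustration (BERT itself operates on subword tokens):

```python
# Toy illustration of span extraction: the answer is the sub-sequence of
# paragraph tokens between the start and end indices (inclusive).
paragraph = "António Guterres became UN Secretary-General in 2017."
tokens = paragraph.split()   # whitespace tokens, for illustration only
start, end = 0, 1            # start/end token indices of the answer span
answer = " ".join(tokens[start:end + 1])
print(answer)  # António Guterres
```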

Table 1 An example of a paragraph, a question, and the corresponding answer with its start and end tokens

The QA system tries to estimate the start and end tokens of the answer and thereby find the answer. In line with modern QA systems, we use deep learning techniques to predict the start and end tokens of the answer. We deploy the BERT language model (Devlin et al., 2019) to develop FarsNewsQA. BERT stands for Bidirectional Encoder Representations from Transformers and is a deep-learning-based language model presented by Google. Multilingual BERT is pre-trained for 104 languages, including Persian. It performs very well on a wide range of natural language processing tasks such as part-of-speech tagging, named entity recognition, and question answering.

In order to deploy BERT in FarsNewsQA, we fine-tune it on a Persian QA dataset. In this way, BERT learns how to find the start and end tokens of the answer within the paragraph text. Figure 4 shows the architecture of FarsNewsQA. As Fig. 4 shows, FarsNewsQA first tokenizes the paragraph text and the question sentence using the BERT tokenizer. The generated tokens are then passed to the BERT language model, which predicts the start and end tokens of the answer within the paragraph text.
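The prediction step can be sketched as follows: the model outputs one start score and one end score per input token, and the answer span is read off the two argmaxes. The token sequence and scores below are synthetic stand-ins for BERT's actual output, used only to illustrate the decoding:

```python
# Synthetic start/end scores standing in for BERT's output logits; the
# predicted span runs from the argmax of the start scores to the argmax
# of the end scores.
tokens = ["[CLS]", "when", "?", "[SEP]", "X", "was", "founded", "in", "1998", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 0.2, 0.1, 0.3, 0.4, 5.2, 0.0]
end_logits   = [0.1, 0.0, 0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 6.0, 0.0]

start = max(range(len(start_logits)), key=start_logits.__getitem__)
end = max(range(len(end_logits)), key=end_logits.__getitem__)
answer = " ".join(tokens[start:end + 1])
print(answer)  # 1998
```

In practice, a real decoder also constrains the span to lie inside the paragraph and enforces start ≤ end; we omit those checks here for brevity.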

Fig. 4 The architecture of FarsNewsQA

To implement FarsNewsQA, we used Python as the programming language and PyTorch as our deep learning library. We used Google Colab, with an NVIDIA Tesla P100 GPU and 12 GB of RAM, to fine-tune and test FarsNewsQA. The built-in BERT tokenizer is used to tokenize the question, the answer, and the paragraph text. The models were fine-tuned with a learning rate of \(3\times 10^{-5}\) and a batch size of 12, using the AdamW optimizer for 2 epochs.

As the Persian QA dataset, we use PersianQuAD. PersianQuAD (Kazemi et al., 2022) is a large-scale native QA dataset for the Persian language and is freely available at https://github.com/BigData-IsfahanUni/PersianQuAD. It contains about 20,000 questions created by native annotators on a set of Wikipedia articles. Each question is about a paragraph, and the answer to the question is a segment of the paragraph text. Since the style of Wikipedia articles is similar to that of news articles, we used PersianQuAD as the training set of FarsNewsQA.
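A loader along the following lines can iterate over the training instances. This is a sketch assuming the widely used SQuAD v1 JSON schema ("data" → "paragraphs" → "qas"); the actual file layout of PersianQuAD may differ slightly:

```python
import json

# Iterate over (paragraph, question, answers) triplets in a SQuAD-format
# JSON file; assumes the SQuAD v1 schema ("data" -> "paragraphs" -> "qas").
def iter_instances(path):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for article in data["data"]:
        for para in article["paragraphs"]:
            for qa in para["qas"]:
                yield para["context"], qa["question"], qa["answers"]
```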

4 Results

We designed and performed a set of experiments to answer the research questions mentioned in Sect. 1. In this section, we explain the experiments, then report and analyze the results.

4.1 Question type distribution over the Persian news domain

We analyze the question types over FarsNewsQuAD to show which question types occur most and least in the Persian news domain. Knowing the question type distribution over the Persian news domain helps us design and implement better QA systems to answer these questions. In line with other research on QA systems (Clark et al., 2020), we classified the Persian questions into seven types: What, How, When, Where, Who, Which, and Why. Table 2 shows the mapping of Persian interrogative words to the defined question types, with an example question for each type.

Figure 5 shows the distribution of question types over FarsNewsQuAD. As Fig. 5 shows, What and Who questions have the most occurrences, and Why and Which questions have the least, in the Persian news domain.

Table 2 Mapping the question types to Persian interrogative words
Fig. 5 Question types distribution over FarsNewsQuAD
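A mapping of this kind suggests a simple rule for assigning a type to a question from its interrogative word. The sketch below illustrates the idea with a partial, illustrative word list of our own, not the paper's exact mapping from Table 2:

```python
# Partial, illustrative mapping from Persian interrogative words to
# question types (not the paper's exact list from Table 2).
INTERROGATIVES = {
    "چه زمانی": "When",   # "what time"
    "چه کسی": "Who",      # "what person"
    "چگونه": "How",
    "چطور": "How",
    "کدام": "Which",
    "کجا": "Where",
    "چرا": "Why",
    "چه": "What",
    "چی": "What",
}

def question_type(question: str) -> str:
    # Try longer interrogatives first, so "چه کسی" (Who) wins over "چه" (What).
    for word in sorted(INTERROGATIVES, key=len, reverse=True):
        if word in question:
            return INTERROGATIVES[word]
    return "Other"

print(question_type("چرا قیمت نفت افزایش یافت؟"))  # Why
```

Matching longer interrogatives first matters because several Persian question phrases share the prefix "چه"; a naive first-match rule would mislabel Who and When questions as What.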

4.2 Complexity of the users’ questions about the Persian news

We measure the complexity of the users' questions to get an intuition about the complexity of the QA problem in the Persian news domain. The lexical similarity between the question and the answer sentence is an indicator of the question's complexity (Rajpurkar et al., 2016): the more the question differs from the answer sentence, the more difficult it is to find the answer. Table 3 shows an example of two questions and an answer sentence. The answer to both questions is Barack Obama, and the lexical overlap between each question and the answer sentence is underlined. Question 1 has more lexical overlap with the answer sentence than Question 2; hence, it is easier for a QA system to answer Question 1 than Question 2.

Table 3 An Example of the similarity between the Questions and the Answer Sentence. Question 1 has more lexical similarity to answer sentence than Question 2

We used the Jaccard coefficient (Jaccard, 1912) to measure the similarity of the answer sentence to the question. The Jaccard coefficient calculates the number of common words between the question and the answer sentence divided by the total number of words in the question and the answer sentence. It takes values between 0 and 1, where 0 indicates that there are no common words between the question and the answer sentence, and 1 indicates that the question and the answer sentence are the same. The lower the Jaccard coefficient, the less similar the question and the answer sentence, and the more difficult it is for a QA system to find the answer.

For example, Eqs. 1 and 2 show the Jaccard coefficient for Question 1 and Question 2 in Table 3, respectively. As expected, the Jaccard coefficient for Question 1 is larger than that of Question 2.

$$\begin{aligned} Jaccard\,(Answer\ sentence,\,Question\ 1)=\frac{9}{15+10}=0.36 \end{aligned}$$
(1)
$$\begin{aligned} Jaccard\,(Answer\ sentence,\,Question\ 2)=\frac{3}{15+12}=0.11 \end{aligned}$$
(2)
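A minimal sketch of this computation, following Eqs. 1 and 2 (which divide the number of common word types by the sum of the two sentence lengths), could look as follows. The function name and the whitespace tokenization are our own simplifications:

```python
# Word-overlap coefficient as computed in Eqs. 1 and 2: number of common
# word types divided by the total word count of the two texts.
def overlap_coefficient(answer_sentence: str, question: str) -> float:
    a_words, q_words = answer_sentence.split(), question.split()
    common = len(set(a_words) & set(q_words))
    return common / (len(a_words) + len(q_words))

print(overlap_coefficient("the cat sat on the mat", "where did the cat sit"))  # ≈ 0.18
```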

Table 4 reports the lexical similarity between the questions and the answer sentences in FarsNewsQuAD, based on the Jaccard coefficient, indicating the complexity of the users' questions about the Persian news. As Table 4 shows, for \(92\%\) of the questions, the lexical similarity between the question and the answer sentence is less than 0.3 in terms of the Jaccard coefficient. This demonstrates that users pose complex questions about the Persian news in terms of the Jaccard coefficient.

Table 4 The lexical similarity between the questions and the answers in FarsNewsQuAD in terms of the Jaccard coefficient

4.3 Performance of the BERT, ALBERT, and ParsBERT for the Persian news QA system

As described earlier, we developed FarsNewsQA: a BERT-based QA system for the Persian news. We developed three versions of FarsNewsQA. In the first version, we used BERT. In the second and third versions, we used ALBERT (Lan et al., 2019) and ParsBERT (Farahani et al., 2021) instead of BERT. ALBERT is a light version of BERT with fewer parameters and higher training speed; it shows better performance than BERT on the English QA task (Lan et al., 2019). ParsBERT is a version of BERT trained on a massive amount of Persian text.

We measure the performance of the FarsNewsQA versions by evaluating them on FarsNewsQuAD. The F1 metric is commonly used to measure the performance of QA systems (Rajpurkar et al., 2016). F1 measures the word overlap between the answer found by the QA system and the correct answer, as the harmonic mean of precision and recall over the common words. Larger F1 values indicate better performance of the QA system.
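The standard SQuAD-style computation of this metric is sketched below. This is our rendering of the usual formulation; the paper's exact preprocessing (e.g., text normalization before tokenization) may differ:

```python
from collections import Counter

# Token-level F1 between a predicted and a gold answer: harmonic mean of
# precision and recall over the overlapping tokens (SQuAD-style).
def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("Barack Obama", "Barack Obama"))  # 1.0
```

Unlike exact match, this metric gives partial credit when the predicted span overlaps the gold answer without matching it word for word.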

The performance of the three versions of FarsNewsQA in terms of F1, with BERT, ParsBERT, and ALBERT, is \(74.34\%\), \(75.61\%\), and \(70.93\%\), respectively. The best version of FarsNewsQA, which uses ParsBERT, achieves an F1 score of \(75.61\%\). This result is comparable with that of the English QA systems on SQuAD, created by Stanford University, where the F1 score is \(80.8\%\). We thus observe that new technologies such as ParsBERT work well for QA systems in the Persian news domain.

4.4 Error analysis of FarsNewsQA on answering each type of question

Figure 6 shows the F1 of FarsNewsQA on answering each type of question in the Persian news domain. As Fig. 6 shows, FarsNewsQA delivers its best F1 score, \(92.07\%\), on Where questions. Figure 6 also shows that FarsNewsQA performs worst on What questions, with an F1 score of \(67.32\%\). This indicates that working on finding the correct answers to What questions would be a good step towards improving the overall performance of the system.

Fig. 6 The F1 of FarsNewsQA on answering each type of question

To better understand the performance of FarsNewsQA, Table 5 shows an example of each type of question, along with its correct answer and the answer found by FarsNewsQA.

Table 5 An example of each type of question, along with its correct answer and the answer found by FarsNewsQA

5 Discussion

In this research, we design and implement FarsNewsQA: a QA system for Persian news articles. In order to evaluate FarsNewsQA, we build FarsNewsQuAD: a Persian QA dataset for the news domain. We analyze the question type distribution over the Persian news domain; knowing this distribution helps researchers implement suitable QA systems to answer Persian news questions. We classified the Persian questions into seven types: What, How, When, Where, Who, Which, and Why. The results show that What and Who questions have the most occurrences and Why and Which questions the least in the Persian news domain. This demonstrates that when users ask questions about the Persian news, they usually ask for information about things or events, or about which person performed a task, and rarely ask about reasons.

In order to have an intuition about the complexity of the QA problem in the Persian news domain, we measure the complexity of the users' questions. The lexical similarity between the question and the answer sentence indicates the question's complexity (Rajpurkar et al., 2016).

Hence, we calculate the lexical similarity to estimate the complexity of the questions. For \(92\%\) of the questions, we observed that the lexical similarity between the question and the answer sentence is less than 0.3 in terms of the Jaccard Coefficient. It shows that the users usually ask complex questions about the Persian news, and the researchers should design strong models for finding the answers to these questions.

We compared the performance of new technologies such as BERT, ALBERT, and ParsBERT in answering questions about the Persian news. We derive the following observations from the results:

  • The best performance in terms of F1 is obtained using ParsBERT, and BERT and ALBERT are in the next positions.

  • Since ParsBERT is trained on a larger number of Persian articles than BERT, it performs better than BERT on answering the questions on the Persian news domain.

  • ALBERT shows better performance than the BERT for the English QA task. However, this is not the case for the Persian QA task, and BERT delivers better performance than ALBERT.

  • The best QA model is the version of FarsNewsQA that uses ParsBERT, delivering an F1 score of \(75.61\%\). This result is comparable with that of the English QA system on SQuAD (Rajpurkar et al., 2016) and shows that new BERT-based technologies work well for the Persian news QA systems.

We measured the performance of the FarsNewsQA on answering each type of question. The results show that FarsNewsQA delivers its best F1 score on Where questions and achieves an F1 score of \(92.07\%\) on this type of question. This is because the answer to the Where questions is a place, and finding places, such as city or country names, is relatively straightforward for the QA systems.

We also observed that FarsNewsQA shows its worst performance on What questions, with an F1 score of \(67.32\%\). We hypothesize that this is because What questions ask for information about things or events and include a wide range of questions, such as What color?, What time?, What day?, What job?, etc. Finding the answers to such a wide range of questions is not straightforward for QA systems.

6 Conclusion

In this paper, we presented FarsNewsQA, a question answering system for Persian news articles. To develop FarsNewsQA, a QA dataset called FarsNewsQuAD was first created over Persian news articles. To the best of our knowledge, this research is the first attempt to create a QA dataset and QA system for the Persian news domain. After preparing FarsNewsQuAD, it was analyzed to specify the type and complexity of the questions that users ask about the Persian news. To develop the QA system, three QA models based on BERT, ParsBERT, and ALBERT were built and evaluated over FarsNewsQuAD, and their performance was compared in terms of the F1-score measure. The efficiency of FarsNewsQA in automatically answering questions was also investigated for different question types. The results of the dataset analysis show that What and Who questions have the highest and Why and Which questions the lowest occurrence frequencies in the Persian news domain. It was also revealed that most of the collected questions are complex questions that require correspondingly strong systems to answer. Analyzing the developed QA models showed that the new BERT-based technologies work well for Persian news QA systems. The best version of FarsNewsQA offers an F1 score of \(75.61\%\), which is comparable with that of QA systems on the English SQuAD dataset made by Stanford University. Furthermore, the results show that FarsNewsQA achieves its best and worst performance on Where and What questions, respectively. As future work, we plan to improve the model performance on What questions.