1 Introduction

Conversational systems have experienced a significant increase in popularity in recent years, advancing into various fields like entertainment [1, 2], customer service [3], and even coding [4]. Furthermore, dialogue systems can be used as tools for acquiring knowledge, either in a direct manner by answering questions [5] or indirectly through user interaction such as tutoring [6,7,8] or debating systems [9,10,11]. A domain of growing interest within this field has been argumentative dialogue systems, which are mostly designed to persuade users to adopt a particular point of view [12,13,14]. Although some topics may have a definitive scientific or societal consensus, ethically challenging issues often remain subjective and lack a unanimous position [15, 16].

For these topics, it is important that individuals understand both sides of an argument to develop an informed opinion. Humans tend to believe information that supports their position and beliefs (confirmation bias) [17]. We therefore assume that people who are more familiar with one side of a debate will tend to argue against the opposing perspective.

In this regard, our research on an argumentative dialogue system follows a different direction. Rather than solely influencing users’ opinions towards a particular position, we aim to provide a comprehensive perspective on the arguments in domains that are subject to ethical challenges. Thus, our primary goal is to convey a broad spectrum of arguments through natural language discussions and to help users develop a broader understanding by providing an opposing view.

As artificial intelligence (AI) is set to play an increasingly important role in our everyday lives [18], we present a German-language chatbot designed to engage in discussions about ethically challenging future applications of AI systems. We have selected three topics that we believe to be of interest to the public debate: autonomous decision-making in medical, legal, and autonomous driving scenarios. Our discussions are based on the hypothesis that a future AI would operate in these contexts with the same skill as an experienced human professional, deliberately omitting the less ethically controversial support roles of AI.

We built three knowledge bases corresponding to these scenarios and conducted an experiment using our German-language chatbot with 178 student participants from diverse disciplines. Participants were asked to state their stance on a given scenario before and after interaction with the bot.

The primary objective of this study was to determine whether a conversational system, equipped with diverse arguments, could widen a user’s perspective. We refer to this potential widening or alteration in viewpoint as the Argumentation Effect. The study centres around three pivotal questions:

First, the chatbot’s ability to provide users with fresh insights: (1) Can our system successfully provide users with new perspectives on the chosen topics? Second, the user’s reception of the bot is critical: (2) Do users accept and value interacting with our system? Finally, at the heart of our research is the transformative potential of the dialogue system: (3) Can our chatbot influence or broaden users’ perspectives on these ethically complex issues?

Our study prioritizes the facilitation of a comprehensive exploration of ethically challenging topics. By evaluating whether participants acquire additional information after engaging with the chatbot, we aim to measure their engagement with the topic. User acceptance of the chatbot is essential; it suggests that the design of the system and the argumentation framework are conducive to a productive exchange of ideas. Although a change in the user’s stance may be an indication of the chatbot’s influence, our goal is not to direct opinions but to confront users with a range of different arguments, thus broadening their perspective. Therefore, whether or not participants change their original stance is secondary to the broader goal of improving their understanding of complex issues. By addressing these nuanced research questions, we seek to explore the role of AI in facilitating more informed discussions without biasing the outcomes.

This paper is organized as follows: Section 2 introduces a background to conversational systems, and Sect. 3 reviews related work on argumentative chatbots. Section 4 discusses the design and implementation of the chatbot and the structure of the knowledge bases. Section 5 describes the experimental setup. Section 6 presents the results of the study, which are discussed from a technical and argumentative point of view in Sect. 7, along with the limitations of the study. Section 8 provides an overview of the paper and an outlook for future work.

2 Background: overview of recent advances in chatbots

In this section, we delve into a comprehensive review of the extant literature pertinent to our area of study, aiming to situate our research within the broader scientific discourse. We first present an overview of recent advances in the chatbot field, followed by a focused exploration of argumentative dialogue systems and research progress in German-language chatbots.

Over the past years, chatbot development has seen various advances. With the application spreading to different domains, recent years have witnessed a surge in innovative methodologies and diverse implementations. In 2022, researchers explored chatbots in areas such as customer service, entertainment, healthcare, and software engineering, briefly summarized in Table 1. An extensive literature review of deep learning-based dialogue systems has been conducted by Ni et al., covering works until mid-2022 [2].

In the following, we highlight the most recent advances of the last 2 years. Hugo et al. employed a Word2Vec model with clustering to build a customer service chatbot capable of identifying unknown customer intents, achieving an intent classification accuracy of 0.851 [3]. Concurrently, Cai et al. [19] demonstrated chatbots’ potential in entertainment, developing a music recommendation system leveraging DialogFlow, which received a user rating of 4.21 out of 5. Additionally, He et al. [20] leveraged a knowledge base combined with a transformer approach to build a context-enriched recommendation system that engages users in a dialogue.

The healthcare sector has also witnessed considerable attention in chatbot research. For instance, Shah et al. [21] devised a chatbot using decision trees and rule-based systems to assist individuals with eating disorders. Concurrently, Rebelo et al. [22] developed a chatbot-based patient education system focusing on radiotherapy, employing IBM Watson as their platform of choice. Further in healthcare, Shan et al. [23] created a multilingual health information dissemination chatbot, also built upon DialogFlow. On a different note, Abdellatif et al. [4] explored chatbots’ potential in software engineering, developing a system for sourcing useful code samples from Stack Overflow. Lastly for 2022, Merkouris et al. [24] tackled gambling addiction using a chatbot to provide help, their approach also being built with IBM Watson.

Furthermore, in 2023, El-Ansari et al. addressed the e-commerce sector by devising a personalized chatbot capable of providing tailored information. Their methodology compared three advanced language models (BERT, GPT, and ELMo), showcasing the potential of these models in chatbot development [25].

In the same year, Medeiros et al. [26] presented a chatbot-based approach for providing emotional support to stressed individuals, highlighting the chatbot’s therapeutic potential. Finally, Zarouali et al. [27] introduced an application of chatbots to social science research, utilizing them as tools for data collection for social science evaluations.

2.1 German-language chatbots

Research in the domain of chatbots and dialogue systems has primarily focused on English, but there have been notable contributions in German as well. The following works illustrate existing work within the German context but are not an exhaustive survey. Görtz et al. [28] have implemented a dialogue system, aiming at educating users on prostate cancer, and Bielitzke [29] has integrated a chatbot into a university’s campus management system. Demaeght et al. [30] have investigated the impact of chatbots on university-student communication, underscoring the potential for chatbots in academic environments. Another approach in the educational domain is the Gramabot by Kharis et al. to help students learn German grammar via a chatbot [8].

Table 1 Overview of the most recent papers from the last 2 years related to our approach

3 Related work

For our research, we use a conversational system to engage users in discussions on ethically challenging topics related to future AI applications. Therefore, we review relevant literature on argumentative dialogue systems and AI ethics education.

3.1 Argumentative dialogue systems

The field of argumentative dialogue systems has seen significant advancements in recent years. A groundbreaking work in this area is the “Project Debater,” a spoken argumentative dialogue agent, designed to compete in a debate against a professional debater. Developed by Slonim et al. [9], this system was a pioneering effort that marked a new era in the application of artificial intelligence in debate.

Within this context, various other important contributions have been made. Rakshit et al. [31] introduced “Debbie the Debate Bot of the future,” which builds on a database of arguments for three topics of discussion, retrieving counter-arguments as needed. Le et al. [11] expanded on argumentative dialogue with both retrieval-based and generative methods using RNNs.

Bistarelli et al. [32] explored argumentative views on given topics through a chatbot employing Google Dialogflow and an argumentation framework. Altay et al. [33] took this further by developing a chatbot aimed at fostering positive attitudes towards GMOs, equipping it with researched counterarguments to popular objections.

Shi et al.’s [34] exploration into persuasive dialogue systems sought to determine the efficacy of various dialogue strategies in convincing individuals to donate to charity, while Trzebiński et al. [35] focused on the influence of a pro-COVID-19 vaccination chatbot on Belgian adults. Brand et al. [14] conducted similar research with a UK audience, aiming to mitigate vaccine hesitancy.

Aicher et al. have contributed to the field of argumentative dialogue systems, enriching it with diverse research and innovative methodologies. They introduced an Elaborateness Score, a metric aimed at enhancing the adaptability of dialogue agents to different users’ communication styles, moving towards personalized interactions [36].

In a separate study, Aicher et al. [37] delved into users’ beliefs and opinions, characterizing them as self-imposed filter bubbles. Recognizing the challenge these bubbles pose in limiting perspective and understanding, they developed a method to reduce the probability of users falling into such a bubble. Furthermore, they developed and evaluated a spoken argumentative dialogue system via a user study [38].

Significant to our research are the works of Chalaguine et al., who developed methods for crowdsourcing argument graphs and applied them to various persuasive contexts [10, 12, 39]. They furthered this by creating a dialogue system that incorporates opponents’ concerns in arguments for a vegetarian diet [40]. Similar to parts of our work, they evaluated their system via a user study measuring participants’ change of stance.

Farag et al.’s [41] research closely aligns with our approach. The study involved experimentation with chatbots that aimed to challenge users’ opinions on divisive topics like Brexit, veganism, and COVID-19 vaccination. The goal was to broaden perspectives instead of persuading users, which is similar to our aim of engaging in a German-language dialogue on AI ethics.

Although argumentative dialogue systems have made significant progress, their connection to AI ethics education has received little attention. If researchers can establish a useful synergy between persuasive discussion and ethical reasoning, dialogue systems could have a significant impact on informing and furthering users’ ethical understanding of AI.

3.2 Education on AI ethics

In educational research, Shih et al. [42] proposed a situated learning-based instructional design for non-engineering students to understand and apply AI through hands-on exercises, assessing their grasp of AI ethics. Zhang et al. [43] conducted workshops to familiarize middle school students with AI concepts, including ethical considerations.

Skirpan et al. [44] integrated ethical issues into a CS course, while DiPaola et al. [45] employed design activities to engage students with AI ethics. Zhou et al. [46] provided a pedagogical framework for AI education.

Such educational initiatives, though valuable, are resource-intensive, necessitating human oversight and comprehensive programs. In contrast, our conversational system presents a scalable, efficient preliminary step to engage individuals with AI ethics, offering initial insights and stimulating interest in the subject.

Additionally, insufficient attention has been given to interactive conversational systems that accommodate individual learning paths, particularly in the German language, even though such systems can help bridge the divide between ethical theory education and day-to-day ethical decision-making. Our study aims to fill this gap by creating a chatbot that provides users with information on ethically challenging AI topics and helps them broaden their perspective in argumentative dialogues.

4 Material and methods

Fig. 1 Systematic overview of our argumentative dialogue system, which comprises a Web UI, a backend server, a text processing unit, and a knowledge base

The current study introduces an Argumentative Dialogue System developed to facilitate robust discussions on ethical issues surrounding potential future applications of artificial intelligence (AI). Our system, which we used to conduct the study, operates exclusively in German and is schematically represented in Fig. 1. It comprises four main elements: a Web User Interface (Web UI), a Backend Server, a Text Processing Unit, and a Knowledge Base.

Upon entering the Web UI, users initiate their interaction with the system by registering with an arbitrary name and selecting a discussion topic related to AI ethics. The system’s chatbot then takes a stance opposite to that of the user, promoting the exploration of different viewpoints. User inputs are processed in the text processing unit, which employs natural language processing techniques to identify arguments and formulate appropriate responses. The Backend Server manages these operations and maintains dialogue coherence, while the Knowledge Base serves as the foundation for topic-related content.

The forthcoming subsections present a comprehensive breakdown of these components, elaborating their respective functionalities within the system. We suggest referring to Fig. 1 throughout, which provides a visual summary of the system’s architecture and flow.

4.1 Knowledge base

The knowledge base serves as the thematic foundation for our argumentative dialogue system, supplying it with a wealth of content tailored towards various ethical issues in potential future applications of artificial intelligence. Specifically, our system engages users in German-language discussions revolving around three predetermined scenarios which we chose on the assumption that they would be of wider interest to the public debate:

  • MedAI (MedicalAI) Permission of a future AI system to decide on diagnoses and therapies on its own.

  • LawAI AI judges in civil law processes.

  • CarAI Cars that drive completely autonomously without human intervention.

The scenario for each discussion assumes that a future AI would operate in this context with the same skill as an experienced human professional. To facilitate structured and coherent discussions, each of these scenarios is defined within a distinct knowledge base which includes for each topic of discussion:

  • Scenario and Question of Discussion A hypothetical future setting, e.g. “Imagine a future where AI systems autonomously make medical diagnoses and therapeutic decisions within medical centres where medical staff performs medical investigations and communicates with the patients.”, with a central question to determine user stance, e.g. “Should AI systems be allowed to make autonomous medical decisions without human oversight, provided they have proven to be at the same level as humans in their field of expertise?”

  • Argument Graph A hierarchically structured representation of pro and con arguments, curated by PhD students specializing in internet and computer science law.

  • FAQs Common queries about the scenario, derived from preliminary studies (cf. Sect. 4.4, "Wizard of Oz").

Each topic’s graph follows the definition of Hadoux and Hunter [47]:

Definition 1

An argument graph is a pair \(\mathcal{G} = (\mathcal{A}, \mathcal{R})\) where \(\mathcal{A}\) is a set and \(\mathcal{R}\) is a binary relation over \(\mathcal{A}\) (in symbols, \(\mathcal{R} \subseteq \mathcal{A} \times \mathcal{A}\)). Let Nodes(\(\mathcal{G}\)) be the set of nodes in \(\mathcal{G}\) (i.e. Nodes(\(\mathcal{G}\)) = \(\mathcal{A}\)) and let Arcs(\(\mathcal{G}\)) be the set of arcs in \(\mathcal{G}\) (i.e. Arcs(\(\mathcal{G}\)) = \(\mathcal{R}\)).

So, an argument graph is a directed graph. Each element \( A \in \mathcal {A} \) is called an argument and \( (A_i, A_j ) \in \mathcal {R} \) means that \( A_i \) attacks \( A_j \) (accordingly, \( A_i \) is said to be an attacker of \( A_j \)). So \( A_i \) is a counterargument for \( A_j \) when \( (A_i, A_j ) \in \mathcal {R} \) holds.

If an argument does not attack any other argument, we call it a main argument, while arguments that attack and therefore refer to other arguments are called counterarguments. For the MedAI scenario, a main argument (with a pro stance) is that “MedAI is cheaper with higher quality”. In the graph, this can be countered with “But there should also be enough money to provide good medicine”.

The graphs, designed with expert input from our project partners at the Chair for Internet Law, vary in size. The MedAI graph, for instance, includes 25 main arguments and 33 counterarguments, while the LawAI graph contains 22 main arguments and 45 counterarguments, and the CarAI graph 29 and 50, respectively.

Each node in our argument graphs comprises a label, a summary text, a full text, and a rating. An example can be found in Fig. 2. The label acts as an identifier, the summary text encapsulates the argument succinctly, and the full text provides a comprehensive expression of the argument that the bot can utilize during discussions. The rating, on a scale from 0 to 5, influences the proactive selection of arguments by the bot. Complementing these nodes are the Argument Representatives that provide examples of how the corresponding argument can be articulated.
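To make this structure concrete, the following Python sketch shows one possible way to represent such nodes and the attack relation; the class and field names are illustrative and do not reproduce our actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ArgumentNode:
    label: str                     # identifier of the node
    summary: str                   # short paraphrase used by the bot
    full_text: str                 # complete formulation of the argument
    stance: str                    # "pro" or "con" with respect to the scenario question
    rating: float = 0.0            # 0-5, steers the proactive selection of arguments
    representatives: list[str] = field(default_factory=list)  # example phrasings (ARs)


@dataclass
class ArgumentGraph:
    nodes: dict[str, ArgumentNode] = field(default_factory=dict)
    attacks: set[tuple[str, str]] = field(default_factory=set)  # (attacker, attacked)

    def main_arguments(self) -> list[ArgumentNode]:
        # main arguments do not attack any other argument
        attackers = {a for a, _ in self.attacks}
        return [n for lbl, n in self.nodes.items() if lbl not in attackers]

    def counterarguments(self, label: str) -> list[ArgumentNode]:
        # all arguments that attack the given argument
        return [self.nodes[a] for a, b in self.attacks if b == label]


# Toy example based on the MedAI argument mentioned above
graph = ArgumentGraph()
graph.nodes["cheaper_quality"] = ArgumentNode(
    "cheaper_quality", "MedAI is cheaper with higher quality", "…", stance="pro", rating=4.0)
graph.nodes["enough_money"] = ArgumentNode(
    "enough_money", "There should also be enough money to provide good medicine", "…", stance="con")
graph.attacks.add(("enough_money", "cheaper_quality"))
```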

Fig. 2 Example of an argument node’s content, translated into English. AR: argument representative

Furthermore, the Knowledge Base includes Conversational Links across topics, such as phrases of agreement (“I agree with you, but have you considered...”) or acknowledgment (“I understand your point, but...”). These links are intended to improve the conversational receptiveness [48] and coherence of the responses generated by the bot (described in Sect. 4.4).

4.2 Backend server

The Backend Server, implemented using the Flask framework, forms the nerve centre of our Argumentative Dialogue System. This component manages the communication flow between the Web UI and the Text Processing Unit, ensuring seamless interactions between the user and the chatbot.

Central to the Backend Server are the User Manager and the Dialogue Manager. The User Manager oversees user registration and management, assigning unique identifiers to users when they interact with the system and ensuring that their inputs and corresponding responses are tracked appropriately. In contrast, the Dialogue Manager monitors the dialogue flow, synchronizing user inputs with chatbot responses to maintain a logical and coherent conversation.

To preserve the interaction history, a dialogue storage module is incorporated within the Backend Server. This module logs all user-chatbot interactions, registering every user input and the chatbot’s corresponding response, as well as the bot’s predictions. This feature enables a comprehensive review of the system’s performance and the users’ dialogue behaviour.
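To illustrate how these components can be wired together, the following is a minimal Flask sketch; the routes, the in-memory stores, and the `process_message` placeholder are assumptions for illustration and not our actual server code.

```python
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)

users = {}          # User Manager: user_id -> registration data
dialogue_log = []   # Dialogue storage: every input, prediction, and response


def process_message(user, text):
    """Placeholder for the text processing unit of Sect. 4.4 (ARU + SRC)."""
    return "Ich verstehe Ihren Punkt, aber ...", None


@app.post("/register")
def register():
    data = request.get_json()
    user_id = str(uuid.uuid4())  # assign a unique identifier to the user
    users[user_id] = {"name": data["name"], "topic": data["topic"], "stance": data["stance"]}
    return jsonify({"user_id": user_id})


@app.post("/message")
def message():
    data = request.get_json()
    user_id, text = data["user_id"], data["text"]
    # Dialogue Manager: synchronize the user input with the chatbot response
    reply, predicted_label = process_message(users[user_id], text)
    # log the full interaction, including the bot's prediction
    dialogue_log.append({"user": user_id, "input": text,
                         "prediction": predicted_label, "response": reply})
    return jsonify({"response": reply})
```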

4.3 Web user interface

The Web User Interface constitutes the user’s gateway into our Argumentative Dialogue System. Designed with a user-centric approach, the Web UI facilitates smooth navigation and interaction with the chatbot across both mobile and desktop platforms.

The dialogue initiation process starts when users log into the Web UI, thus registering with the system. Users then select a topic of discussion from a dropdown menu. Alongside the topic selection, a brief description of the scenario and the question of discussion are displayed to provide the user with context and guide the ensuing conversation. As mentioned in Sect. 4.1, we also present a set of frequently asked questions (FAQs) gathered from our preliminary pilot studies to ensure clarity and maintain focus on the ethical aspects within the dialogue.

4.3.1 Design of questionnaire

In alignment with our research objectives outlined in Sect. 1, we developed a questionnaire to be administered before and after participants interacted with our chatbot. The overall aim is to evaluate the system’s ability to convey new information, users’ acceptance of the chatbot, and its influence on users’ opinions. Consequently, we structured our questionnaire around three categories that reflect these objectives: Information Gain, User Satisfaction/Acceptance, and Opinion Influence.

The literature contains substantial work on the impact of dialogue systems on user opinion, typically measured by a comparison of pre- and post-dialogue stances [14, 33,34,35, 40, 49, 50]. Although our primary intention is not to change users’ opinions, we recognize the importance of stance change as an indicator of the argumentative impact of the bot. Therefore, we included similar stance assessment questions in our survey.

For user satisfaction and acceptance, methodologies vary considerably between studies. While there are comprehensive instruments such as CUQ [51], BUS-15 [52], UES [53] and other questionnaires [54], their length is incompatible with the brevity of our study—where participants engage in a short dialogue lasting no more than 10 min. To respect the participants’ time, we opted for a short set of questions on user satisfaction and acceptance. We complemented this quantitative assessment with two optional open-ended questions to allow for richer qualitative feedback.

Information gain has not been commonly measured in works dealing with argumentative dialogue systems. We argue that the assessment of users’ pre-dialogue knowledge is crucial, as it may influence the perceived benefit from the interaction. Following Altay et al. [49], we inquired about participants’ prior engagement with the discussed topics and their evaluation of new knowledge acquired through the dialogue.

Table 2 summarizes the survey questions and their associated objectives and response types. The rationale behind each question is directly tied to our research objectives: to measure information gain, user satisfaction and acceptance, and changes in opinion.

Table 2 Summary of pre- and post-dialogue survey questions with the associated goals: information gain (IG), user acceptance (UA), and opinion influence (OI)

4.4 Text processing

The text processing unit forms the cognitive core of our argumentative dialogue system, encompassing two major components: the argument recognition unit (ARU) and the sentence response constructor (SRC).

4.4.1 Argument recognition unit

The argument recognition unit (ARU) utilizes a SentenceBERT (SBERT) model [55]. This neural network architecture is designed to predict sentence similarity. It is derived from the pre-trained BERT model [56] and can be fine-tuned for specific downstream tasks [55]. We use a pre-trained German BERT model [57] from the Hugging Face library.
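For illustration, a German BERT checkpoint can be wrapped into an SBERT bi-encoder with the sentence-transformers library roughly as follows; the model identifier is an assumption and may differ from the checkpoint we used.

```python
from sentence_transformers import SentenceTransformer, models

# Wrap a pre-trained German BERT checkpoint (identifier illustrative) into an
# SBERT bi-encoder: contextual token embeddings followed by mean pooling.
word_embedding = models.Transformer("bert-base-german-cased", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
sbert = SentenceTransformer(modules=[word_embedding, pooling])

# Sentences are embedded into a vector space in which cosine similarity reflects relatedness.
embeddings = sbert.encode(["KI-Diagnosen sind günstiger.", "MedAI senkt die Kosten."])
```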

For training data collection, we utilized two methods. Initially, as a proof of concept, we employed the Wizard of Oz approach [58, 59], similar to [41], to engage in dialogues on the MedAI scenario with department employees and students from two lectures. The wizard used a web interface and arguments from the Argument Graphs to engage in online discussions with the students. Given the time-intensive nature of this method, we opted for a different strategy for the other two topics: we asked two students to write five alternate phrasings for each argument in the related Argument Graph. We refer to each phrase gathered through these methods that can be assigned to a node in the Argument Graph as an Argument Representative.

Our goal is to develop a model that can detect the similarity between two arguments. We selected SentenceBERT due to its effectiveness in textual similarity tasks [60,61,62]. For training, we adopt a strategy similar to Agirre et al. [63], where argumentative sentences are paired and rated on a similarity scale:

  • 4: Texts from the same node (full text, summary text)

  • 3: Texts sharing the same label (all Argument Representatives)

  • 2: Arguments with a shared arc on the graph

  • 1: Arguments sequentially linked via intermediary nodes

  • 0: Unrelated arguments

We construct a balanced dataset with five classes for training the model, addressing the overrepresentation of unrelated argument pairs (score 0) by applying random undersampling [64]. This results in three datasets, comprising 1500 class samples for the MedAI scenario, 900 for LawAI, and 1000 for CarAI.
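The sketch below illustrates how such scored training pairs can be generated from an argument graph (reusing the classes sketched in Sect. 4.1) and how the score-0 class can be undersampled; it is a simplified reconstruction, and scores 1 and 0 would require a graph-distance computation that is only hinted at here.

```python
import random
from itertools import combinations


def build_scored_pairs(graph):
    """Generate (sentence_a, sentence_b, score) pairs following the 0-4 similarity scale."""
    pairs = []
    for node in graph.nodes.values():
        # score 4: texts from the same node (full text vs. summary text)
        pairs.append((node.full_text, node.summary, 4.0))
        # score 3: argument representatives sharing the same label
        for a, b in combinations(node.representatives, 2):
            pairs.append((a, b, 3.0))
    # score 2: arguments connected by a shared arc
    for attacker, attacked in graph.attacks:
        pairs.append((graph.nodes[attacker].summary, graph.nodes[attacked].summary, 2.0))
    # scores 1 (linked via intermediary nodes) and 0 (unrelated) are derived from
    # the path structure of the graph in the same fashion (omitted here)
    return pairs


def undersample_unrelated(pairs, keep):
    """Random undersampling of the overrepresented score-0 pairs to balance the classes."""
    zeros = [p for p in pairs if p[2] == 0.0]
    rest = [p for p in pairs if p[2] != 0.0]
    return rest + random.sample(zeros, min(keep, len(zeros)))
```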

Training involves using mean-squared error loss with cosine similarity of the sentence embeddings, an Adam optimizer with a learning rate of 2e−05, and a batch size of 16, following the method in Reimers et al. [55]. We split the dataset into 70% for training, 10% for development, and 20% for testing. Model selection is based on Spearman’s rank correlation and Pearson’s correlation coefficient during a 5-epoch training period.
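In terms of the sentence-transformers API, this training recipe corresponds roughly to the sketch below; variables such as `train_pairs` and `dev_pairs` are assumed to hold the scored pairs described above, and we normalize the 0-4 scores to [0, 1] for the cosine objective.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Scores are normalized from the 0-4 scale to [0, 1].
train_examples = [InputExample(texts=[a, b], label=s / 4.0) for a, b, s in train_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Mean-squared error between the cosine similarity of the embeddings and the gold score.
train_loss = losses.CosineSimilarityLoss(sbert)

# Spearman/Pearson correlation on the development split guides model selection.
dev_evaluator = EmbeddingSimilarityEvaluator(
    [a for a, _, _ in dev_pairs],
    [b for _, b, _ in dev_pairs],
    [s / 4.0 for _, _, s in dev_pairs])

sbert.fit(train_objectives=[(train_loader, train_loss)],
          evaluator=dev_evaluator,
          epochs=5,
          optimizer_params={"lr": 2e-5},
          output_path="sbert-argument-recognition")
```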

In our system, user utterances are mapped to argument graph nodes based on the highest cosine similarity score of semantic embeddings. A classification is deemed successful only if the similarity score exceeds 0.8. Scores below this threshold indicate unrecognized arguments.

Fundamentally, this is a classification problem with 50 to 70 classes. We improve performance by reducing the number of candidate classes via a method we call knowledge-based filtering. We consider the dialogue context and assume that:

  • a repetition of arguments in a conversation is unlikely and

  • users are unlikely to present arguments contradicting their stance.

Therefore, our knowledge-based filter excludes any arguments previously stated in the dialogue by either the bot or the user, as well as any opposing arguments to the user’s stance. This significantly reduces the number of potential classification candidates.
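A hedged sketch of this recognition step, including the 0.8 threshold and the knowledge-based filter, is shown below; the `DialogueState` container and the use of the summary text as the node embedding are simplifications on our part.

```python
from dataclasses import dataclass, field
from sentence_transformers import util

THRESHOLD = 0.8


@dataclass
class DialogueState:
    user_stance: str                                     # "pro" or "con"
    used_labels: set[str] = field(default_factory=set)   # arguments already stated by bot or user
    awaiting_rephrase: bool = False


def recognize_argument(model, utterance, graph, state):
    """Map a user utterance to an argument node after knowledge-based filtering."""
    # Keep only arguments that were not yet stated and that support the user's stance.
    candidates = [n for n in graph.nodes.values()
                  if n.label not in state.used_labels and n.stance == state.user_stance]
    if not candidates:
        return None, 0.0
    cand_emb = model.encode([n.summary for n in candidates], convert_to_tensor=True)
    utt_emb = model.encode(utterance, convert_to_tensor=True)
    sims = util.cos_sim(utt_emb, cand_emb)[0]
    best = int(sims.argmax())
    score = float(sims[best])
    # Below the threshold the utterance counts as an unrecognized argument.
    return (candidates[best], score) if score >= THRESHOLD else (None, score)
```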

4.4.2 Sentence response constructor

Upon argument recognition and label identification, the label and dialogue history are passed to the SRC, where a detailed response is formulated: paraphrasing the user’s argument using the summary text, selecting a counter-argument from the graph, and leveraging conversational links to ensure a smooth transition between the paraphrase and the counter-argument.
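As a concrete illustration, the assembly of a response could look like the following sketch; the ordering and phrasing are simplified, the conversational links come from the Knowledge Base of Sect. 4.1, and the fallback assumes that at least one main argument opposing the user exists.

```python
import random


def construct_response(node, graph, conversational_links):
    """Assemble a bot turn: paraphrase of the user's argument, link phrase, counterargument."""
    paraphrase = node.summary                              # active-listening paraphrase
    link = random.choice(conversational_links)             # e.g. "I understand your point, but..."
    counters = graph.counterarguments(node.label)
    if counters:
        counter = max(counters, key=lambda n: n.rating)    # prefer higher-rated counterarguments
    else:
        # fallback: a new main argument that contradicts the user's position
        counter = max((n for n in graph.main_arguments() if n.stance != node.stance),
                      key=lambda n: n.rating)
    return f"{paraphrase}. {link} {counter.full_text}"
```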

The interaction between the argument recognition unit and the sentence response constructor represents our system’s strategy for handling argumentative dialogue. Since the chatbot’s discourse policy is part of our experiments, we will discuss it in the following section.

5 Experimental setup

To study the argumentation effect of our system, we designed our experiments to answer the research questions proposed in Sect. 1.

Addressing these questions, we contribute to both the academic understanding and practical application of chatbot technologies. The following subsections detail the chatbot’s interaction strategy with the users, the design of the user study, and how these components work together towards answering our research questions.

5.1 Chatbot interaction design

The chatbot’s design revolves around an interaction strategy and an argument identification method aimed at promoting meaningful dialogues. The purpose of these interactions is to test the bot’s capacity to provide new information, gain users’ acceptance, and broaden their perspective on discussion topics, addressing our research questions.

The chatbot’s interaction strategy, as shown in Algorithm 1, outlines its dialogue exchanges, including identifying arguments, paraphrasing, presenting counter-arguments, and furthering the conversation. Our dialogue strategy employs active listening techniques [65], which, according to recent findings, enhance conversational satisfaction more effectively than mere acknowledgment in responses [66]. We achieve this by paraphrasing the user’s argument. This is important for two reasons: first, it signals to the user that the bot has comprehended their intention, and second, it identifies the part of the user’s statement to which the response refers. The bot achieves this by incorporating the summary text of the identified graph node into its response. Additionally, the bot presents a counter-argument or, if the argument graph does not provide one, a new main argument that contradicts the user’s position. When selecting a new argument, the argument rating is taken into account, with the bot preferring higher-rated arguments. This approach aims to introduce new information to the user while ensuring that the conversation continues smoothly and that the key arguments are provided as the dialogue progresses. The dialogue ends when the user clicks the end chat button, which takes them to the post-dialogue questionnaire.
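Algorithm 1 below specifies this strategy; as a hedged approximation, the control flow can be sketched in Python as follows, reusing the `recognize_argument`, `construct_response`, and `DialogueState` helpers sketched in Sect. 4.4. The exact wording of the fallback prompts is illustrative, and the dialogue itself only terminates when the user presses the end chat button.

```python
def dialogue_turn(model, graph, state, utterance, links):
    """Produce one bot turn following the interaction strategy (simplified)."""
    node, _score = recognize_argument(model, utterance, graph, state)
    if node is not None:
        # paraphrase the recognized argument and counter it
        state.used_labels.add(node.label)
        state.awaiting_rephrase = False
        return construct_response(node, graph, links)
    if not state.awaiting_rephrase:
        # first failure: ask the user to rephrase the argument
        state.awaiting_rephrase = True
        return "Das habe ich leider nicht verstanden. Können Sie Ihr Argument umformulieren?"
    # second failure: keep the conversation going with a new, high-rated main argument
    state.awaiting_rephrase = False
    unused = [n for n in graph.main_arguments()
              if n.label not in state.used_labels and n.stance != state.user_stance]
    fresh = max(unused, key=lambda n: n.rating, default=None)
    if fresh is None:   # defensive fallback, not part of the described strategy
        return "Vielen Dank für die Diskussion! Sie können den Chat über den Button beenden."
    state.used_labels.add(fresh.label)
    return fresh.full_text
```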

To illustrate the response strategy, Fig. 3 shows an excerpt of a dialogue from the study on the topic MedAI. The dialogue begins with the bot greeting the user, repeating the user’s selected stance, and asking for a first argument on the topic (B1). Upon receiving a user message, the system maps the utterance to a node of the argument graph. If the prediction confidence is above the threshold, the consequent response contains the summary text of the graph node (first bold text in B2, B3, B4, B5) and the respective counterargument (second bold text in B2, B3, B4, B5). In case the confidence is below the threshold, the bot asks the user to rephrase the argument (U5, B6). If the rephrased argument still cannot be identified, the bot moves on with a new main argument to keep the conversation going (U6, B7).

Algorithm 1 Chatbot Dialogue Strategy

Fig. 3 Translated excerpt from a MedAI user study dialogue. User utterances are labelled with U1 to U5, bot utterances with B1 to B5. Arguments within bot utterances are bold. The first argument in a bot utterance is the summary of the identified argument node. B1 and B4 are correct classifications, B2 and B3 are incorrect

5.2 User study design

The user study was designed to investigate users’ initial and post-interaction stances, their knowledge of the topic, and the dynamics of their interaction with the chatbot. The specific questionnaires used to collect this data are described in detail in Sect. 4.3 and are integral to addressing the research questions outlined in Sect. 1.

Participants for the research were recruited from students at the University of Würzburg. To promote accessibility and encourage participation, the subjects were not supervised during their interaction with our system. This created a realistic environment in which they could engage with the system for as long as they desired. Clear instructions were provided along with a QR code directing users to the chatbot’s web interface.

To meet diverse interests, the study presented participants with a choice among three distinct discussion topics. Each topic contains a scenario, a central question for discussion, and an accompanying list of frequently asked questions, as detailed in Sect. 4.1.

Recruitment took place in the campus canteen at the University of Würzburg, targeting the student population. Students were invited to participate in a dialogue with the chatbot, which required them to present at least six arguments over an estimated 10 min, in exchange for a beverage.

Demographic assumptions The collection of individual demographic data was not part of the study design. However, given the nature and location of the recruitment process, it is reasonable to infer certain demographic characteristics of the participant pool. The primary demographic group was young adults, with an age range of 18–25 years—the typical age range for undergraduates and possibly some postgraduates. The gender distribution appeared to be roughly balanced between males and females, based on non-systematic observational data. These demographic details are presumptive and have not been empirically verified; therefore, any interpretation of the study results should be approached with an understanding of these underlying assumptions. A more detailed analysis of the implications of these demographic estimates is provided in Sect. 7.

6 Results

The results of our experiment provide insights into the chatbot’s impact on users’ perspectives on ethical AI applications in medical, legal, and autonomous driving scenarios. These results are directly related to our research questions, highlighting the system’s effectiveness in providing novel information, its acceptance by users, and its impact on users’ opinions.

The evaluation is based on pre- and post-dialogue questionnaires that aim to measure user acceptance, the chatbot’s capability to deliver novel information, and changes in users’ opinions.

Of the 178 participants who interacted with the chatbot, 151 completed the post-dialogue questionnaire. The results for the single-choice questions are shown in Table 3. We categorize opinion changes as follows:

  • Change in opinion strength A user has increased or decreased their rating on the 1 to 5 Likert scale concerning the stance question but has not crossed the midpoint to change the fundamental stance (remaining pro or con) or has moved to a neutral position (3 on the scale).

  • Stance change A user has shifted their stance from con (1 or 2 on the Likert scale) to pro (4 or 5 on the Likert scale), or vice versa, regardless of the intensity of the stance.

We further examine the nature of these changes. For a Change in Opinion Strength, we distinguish between “softening” (moving towards the neutral position for a more moderate stance) and “intensifying” (moving towards the ends of the scale for a more extreme stance). In the case of Stance Change, we differentiate between “intensifying” (shifting from a moderate position on one side to an extreme position on the opposite side), “softening” (shifting from an extreme position on one side to a moderate position on the opposite side), and “equivalent shift” (moving from a moderate to the corresponding moderate on the opposite side, or from one extreme to the corresponding extreme on the opposite side).
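To make this categorization reproducible, the following sketch classifies a pre/post pair of Likert ratings according to the rules above; the treatment of moves starting from the neutral midpoint is our own assumption, as it is not specified in the text.

```python
def classify_opinion_change(pre: int, post: int):
    """Classify a pre/post pair of 1-5 Likert ratings (3 = neutral midpoint)."""
    def side(x):
        return "con" if x <= 2 else "pro" if x >= 4 else "neutral"

    if pre == post:
        return "unchanged", None
    if "neutral" not in (side(pre), side(post)) and side(pre) != side(post):
        # stance change: the user crossed from con to pro or vice versa
        pre_extreme, post_extreme = pre in (1, 5), post in (1, 5)
        if post_extreme and not pre_extreme:
            return "stance change", "intensifying"
        if pre_extreme and not post_extreme:
            return "stance change", "softening"
        return "stance change", "equivalent shift"
    # otherwise: change in opinion strength (same side, or a move to/from the neutral midpoint)
    direction = "softening" if abs(post - 3) < abs(pre - 3) else "intensifying"
    return "change in opinion strength", direction


# Example: a user moving from "strong con" (1) to "moderate pro" (4)
assert classify_opinion_change(1, 4) == ("stance change", "softening")
```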

6.1 User acceptance and information gain

Table 3 shows the results for pre- and post-dialogue questionnaires, revealing user Acceptance and perceived Information Gain. Prior Knowledge Scores, on a scale from 1 (low) to 5 (high), indicate moderate-to-low familiarity with the topics of discussion. Post-interaction, users reported an average Information Gain Score above 3 (1 low, 5 high) for all topics. The average Acceptance Score (1 low, 5 high) is just above 3, indicating moderate user satisfaction with the chatbot.

Table 3 Summary of single choice question responses in the pre- and post-dialogue questionnaires

6.2 Opinion changes

Analysis of the change in opinions before and after the dialogue with the bot reveals a subtle effect. While users exhibited a change in opinion, with the percentage varying across topics from 40.4% to 50%, the changes did not show a significant trend towards more extreme positions. Instead, the movement was towards a more moderate view (scores of 2, 3, and 4 on the Likert scale), as shown by the proportion of users with a softened change in opinion strength and a softened change of stance in Table 3.

6.3 Qualitative feedback

To evaluate the open-ended questions, we used conventional qualitative content analysis [67] to identify topics in participants’ responses. The occurrences of these topics were then counted across each dataset. We report these findings in Table 4 as relative and absolute frequencies across the corresponding datasets.

Participants’ qualitative feedback highlights strengths such as persuasive and new arguments, as well as quick responses. However, a common criticism was the chatbot’s occasional failure to comprehend user input, with over 50% of participants across all scenarios noting this issue.

Table 4 Responses to the free-text questions of the post-dialogue survey

6.4 Argument recognition performance

The argument classification data, detailed in Table 5, show a disparity in the chatbot’s ability to recognize arguments. Although for the MedAI scenario the share of correctly identified main arguments is substantially larger than that of incorrect predictions, the majority of arguments remain unrecognized across all scenarios and argument types.

Table 5 Summary of argument classification

7 Discussion

In this section, we discuss the results of our study with respect to our research questions. We start by analysing the bot’s ability to provide information and the user acceptance of the system before evaluating the bot’s influence on user opinions and its performance on argument classification. Finally, we conclude this section by addressing the limitations of the study.

7.1 Information gain

To assess whether our bot can help users broaden their perspective on the topics of discussion, we need to analyse its ability to provide users with new information. As shown in Table 4, users appreciated the persuasive and novel arguments presented by the chatbot. This appreciation was reflected in 25.8%, 34.62%, and 33.33% of the dialogues with MedAI, LawAI, and CarAI, respectively. Furthermore, the overall information gain, rated on a scale from 1 (no information gain) to 5 (significant information gain), was moderately high (Table 3). On the other hand, a few users criticized the nature of the bot’s arguments. While users found some arguments persuasive and novel, there seems to be room for improvement in ensuring that the chatbot provides comprehensive information.

Furthermore, the survey results suggest that users’ prior knowledge did not have a clear influence on opinion change. Those with self-rated high knowledge levels did not shift their stances, but the small sample size of such users (5) limits the conclusions that can be drawn. Overall, our findings do not support a strong correlation between the level of prior knowledge and the likelihood of changing one’s opinion after interacting with the chatbot. Further research with a larger and more diverse sample could provide additional insights into this relationship.

7.2 User acceptance

The User Acceptance Scores for the bot across all three discussion scenarios showed moderate-to-high averages, with values of 3.1, 3.3, and 3.1, respectively. These scores indicate a generally positive reception, which aligns with the free-text responses from the post-dialogue questionnaire where users appreciated the persuasive and new arguments presented by the bot. Notably, the percentages of users who liked persuasive arguments were comparable across domains, ranging from 17.74 to 23.33%. The bot’s quick responses were also highlighted as a positive attribute, suggesting that responsiveness contributed to user satisfaction.

However, the bot faced criticism in terms of comprehension, with over half of the users across all domains pointing out issues with the bot’s understanding of user input. This concern was substantially more pronounced than other negative aspects such as the quality of the bot’s arguments or dialogue strategy, suggesting a critical area for improvement.

The difference between the positive mentions for interaction quality and the negative feedback on comprehension challenges suggests that while users found the bot engaging, its ability to process and respond to input effectively is crucial for User Acceptance.

Out of 178 participants, 151 completed the post-dialogue survey. The resulting non-completion rate of approximately 15% needs to be taken into account whilst evaluating overall user acceptance, as it may reflect different levels of user engagement or satisfaction with the bot. Although we lack knowledge regarding the reasons for their withdrawal, frustration with the system may have been a contributing factor.

In conclusion, the bot demonstrates a potential for user acceptance with strengths in argument quality and responsiveness, but improvements in understanding user input are imperative. The completion rate of the post-dialogue questionnaire further underscores the need to address user engagement comprehensively.

7.3 Opinion changes

Exploring the influence of our chatbot on user opinions, our study sought to understand whether conversational AI can reshape or expand perspectives on ethically challenging topics.

We observe noteworthy shifts in user opinions towards the chatbot’s stance. On average, there was a change of 0.51 on the Likert scale for MedAI and 0.44 for CarAI, based on 53 and 75 analysed dialogues, respectively, while a more substantial mean shift of 1.04 was observed for LawAI from just 24 dialogues. These changes indicate the chatbot’s influence on participants’ views, particularly on the LawAI topic.

Our research provides important insights into the understanding of argumentative dialogue systems and their ability to help users learn new perspectives. Similar to our efforts, Farag et al. conducted a study aimed at increasing people’s open-mindedness towards controversial topics such as veganism, Brexit, and COVID-19 vaccination [41]. Their approach was to use a questionnaire designed to measure the broadening of users’ perspectives, with ’open-mindedness questions’ designed to test participants’ acceptance of the validity, morality, and intellectual basis of views contrary to their own. The ultimate objective was to develop a greater receptiveness to these divergent views through interactive dialogue with their systems.

The effectiveness of such engagement was quantified by the observed shifts towards greater openness, with reported changes ranging from 16% to 28% in the direction of increased openness and from 10.7% to 18% towards decreased openness, their best system being the Wiki-Bot. These shifts towards greater acceptance of opposing views may be analogous to what we have termed the “softening” of one’s stance, essentially denoting an increased willingness to consider alternative perspectives.

In comparison with the findings of Farag et al., our system showed a slightly lower likelihood of softening opinions, with 20.8%, 19.2%, and 13.3% for LawAI, MedAI, and CarAI, respectively, when combining softened stance changes and softened changes in opinion strength. Notably, in our study the trend towards more extreme positions, what we call “intensifying” shifts, was far less widespread. Specifically, no such changes were reported for LawAI, only 3.8% for MedAI, and 10.7% for CarAI.

The comparative analysis of these outcomes suggests that while our system may be somewhat more conservative in facilitating opinion softening relative to the Wiki-Bot, it is markedly more effective at mitigating the intensification of extreme positions. However, it is important to consider the difference between their approach to measuring perspective changes and ours.

In another work, Chalaguine et al. demonstrated the persuasive capabilities of a chatbot by documenting a 46% change in opinion against tuition fees among 50 participants [10]. In a subsequent study, the authors also highlighted the capacity of dialogue systems by documenting a 20% change in attitude towards COVID-19 vaccination [12]. In a wider study, Brand and Stafford recorded a 16% shift in opinion regarding COVID-19 vaccinations among a sample of 571 participants who were initially neutral or against vaccination [14]. While our findings indicate opinion changes ranging between 40 and 50%, our system differs from these studies, adopting a distinct approach to engaging users. Unlike the dialogue systems designed to advocate for a specific stance, our chatbot is intended to counter the user’s initial perspective, being able to argue in either direction. This approach aims to introduce users to alternative viewpoints rather than to persuade them, enriching the discourse and potentially broadening their outlook. Such a strategy is based on the principle demonstrated by Stanley et al. that exposure to opposing arguments can lead to a reduction in negative attributions towards ideological opponents and foster open-mindedness [68]. Therefore, our chatbot is inherently argumentative, but to expand comprehension rather than to advocate a singular perspective. This distinction suggests an alternative measure for assessing shifts in users’ perspectives—one that prioritizes the broadening of viewpoints over movement towards a particular stance.

Our research on argumentative dialogue systems aligns with broader efforts to show how these technologies can shape discourse and perspectives [10, 12, 32,33,34, 40, 41]. However, with this potential comes an inherent responsibility to consider the ethical dimensions of their use.

These systems, including ours, are capable of expressing a vast range of arguments, potentially extending into contentious areas. While our research actively avoids addressing socially unacceptable issues, we recognize that the methods developed could theoretically be used in less ethical ways. This underscores the need for stringent ethical guidelines and proper oversight during the development and implementation of argumentative technologies. It is important to ensure that these technologies are used for educational and informative purposes and not to mislead or polarize.

In the context of our findings, we take care to utilize argumentation as a means to expand users’ perspectives, only providing socially acceptable views. This approach is reflective of our commitment to fostering constructive dialogue and critical thinking.

With these ethical principles in mind, we turn to examine the concrete impact our system has had on users’ opinions. Analysing the initial distribution and subsequent shifts, as illustrated in Table 3, our data suggest that after engaging in dialogue with the chatbot, users generally shifted towards more moderate views. This trend towards moderation is demonstrated by Fig. 4, which details the stance transition flow for each discussion topic from the pre- to the post-dialogue questionnaires. These diagrams reveal a decrease in the extremity of positions, with a reduction in ‘strong con’ or ‘strong pro’ stances, and an evident shift towards the more moderate positions. Notably, the balance between pro and con stances remains relatively constant, suggesting that the chatbot facilitates a balanced reassessment of views rather than a shift towards a particular argument.

Fig. 4 Transition of user stances before and after interaction with the bot. On the left, we observe the initial distribution of stances, quantified by the number of users in parentheses. The right side represents the stance distribution post-dialogue. Notably, the flow patterns indicate a shift from extreme positions (strong) towards more moderate ones

Table 6 User survey results classified by groups (unchanged, change in opinion strength, stance change, i.e. from pro to con or vice versa, and no post-survey)

An analysis of the pre-dialogue stances of the participants who did not complete the post-dialogue questionnaire, represented in Fig. 4, does not indicate a clear trend or commonality that might explain their departure. Their varied initial positions suggest that the premature exits were not influenced by a particular stance but may be attributed to other factors.

This movement towards more centrist viewpoints among the remaining participants may indicate that the chatbot is successful in stimulating a reflective process, prompting users to critically and open-mindedly reassess their stances. This is a desirable outcome, reflective of our system’s aim: to broaden understanding and provide a platform for users to engage with the complexity of issues thoughtfully.

Diving into the reasons behind opinion shifts, users who recognized new arguments from the chatbot were more likely to report a change in stance, particularly if their attitude changed (16.13% recognition of new arguments; see Table 6). Interestingly, although users with changed opinions had slightly lower pre-dialogue knowledge scores, suggesting a possible openness to new information, there was no direct correlation between prior knowledge and opinion change.

The presence of good arguments, as recognized by the users, did not necessarily lead to an opinion change. This suggests that argument quality was not the sole driver of change; rather, the introduction of new perspectives might be more influential. Even though users occasionally felt misunderstood by the bot, a substantial number still shifted their opinions, implying that exposure to novel, well-structured arguments can be a key factor.

7.4 Argument classification

Despite the ARU’s limitations, it was able to achieve acceptable accuracy on recognized user utterances in the MedAI scenario, indicating the potential for improvement and effectiveness with adequate training data and well-constructed argument representations.

However, a consistent issue across all scenarios was the high rate of No Recognition, particularly for main arguments. This means that for these user utterances, the ARU predicted a confidence score below our set threshold (refer to Sect. 4.4). Therefore, the constructed response does not address the user’s argument. This issue was most severe in the CarAI scenario, which demonstrates a clear need for improving the ARU’s performance in terms of confident classifications.

Additionally, the ARU’s performance on counterarguments and FAQs was relatively poor across all scenarios, further emphasizing the need for robust training data and refined classification techniques. As for the Miscellaneous category, which included inputs that did not fit into our defined arguments, all inputs were unrecognized as we did not have an existing algorithm for handling non-argumentative phrases. We intend to address this issue in future work.

While these results are far from ideal, they represent an important step towards refining AI capabilities in understanding and responding to user arguments. Even with a high rate of misclassifications, the chatbot was able to facilitate meaningful dialogues, broaden user perspectives, and gather valuable data for future improvements.

Given the large proportion of unrecognized arguments, particularly among main arguments, we analysed the relation between our research questions and our system’s argument classification performance.

7.5 The impact of recognized arguments on user acceptance and opinion change

Our results indicate that the effectiveness of our argument-based chatbot in terms of User Acceptance and change of user opinions is significantly influenced by the ability of the bot’s Argument Recognition Unit (ARU) to identify and correctly recognize arguments.

To illustrate this, we calculated correlation coefficients between the relative number of recognized arguments in a dialogue and measures of change of stance, information gain, and user acceptance. These correlations are summarized in Table 7.
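For transparency, the correlations in Table 7 can be computed along the following lines; the choice of Pearson’s coefficient via scipy and the per-dialogue field names are illustrative assumptions rather than our exact analysis code.

```python
from scipy.stats import pearsonr


def correlations_for_min_recognized(dialogues, min_recognized):
    """Correlate the share of recognized arguments with the outcome measures,
    restricted to dialogues with at least `min_recognized` recognized arguments."""
    subset = [d for d in dialogues if d["n_recognized"] >= min_recognized]
    ratio = [d["n_recognized"] / d["n_user_arguments"] for d in subset]
    results = {}
    for measure in ("change_of_stance", "information_gain", "user_acceptance"):
        coefficient, p_value = pearsonr(ratio, [d[measure] for d in subset])
        results[measure] = (coefficient, p_value)
    return results
```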

Table 7 Correlation coefficient between the relative number of recognized arguments in dialogue and change of stance/information gain/user acceptance

One notable observation from the table is that as the minimum number of recognized arguments increases, the correlation with Change of Stance shifts from negative to positive. This trend suggests that the ability of our bot to recognize and respond to a higher number of user arguments increases the likelihood of a change in users’ opinions.

However, the correlation between the number of recognized arguments and Information Gain remains ambiguous, showing no clear trend as the number of recognized arguments increases. This may indicate that the perception of Information Gain is not strongly tied to the number of recognized arguments. Instead, the quality and novelty of the bot’s arguments, as well as the users’ pre-dialogue knowledge, might play a more crucial role here. Less informed users will learn new perspectives during the dialogue even if they are not perfectly understood by the bot.

The correlation between user acceptance and the number of recognized arguments strengthens as the bot recognizes more arguments. This consistent pattern reaffirms the importance of the ARU’s capability to recognize user arguments for achieving a higher level of User Acceptance.

However, it is essential to highlight that as the minimum number of recognized arguments increases, the number of dialogues available for analysis decreases. Although correlations based on smaller samples can provide preliminary insights, they should be interpreted with caution.

7.6 Limitations

Our research has yielded insights into the influence of argumentative dialogue systems on user perspectives on ethically complex topics. Nevertheless, it is important to acknowledge limitations that affect the interpretation of our results and suggest directions for future research.

A limitation of this study is the demographic breadth of our participant pool, which predominantly comprised students and therefore does not reflect the broader population. This introduces potential biases, particularly in terms of technological receptiveness or skepticism towards information relayed by the chatbot. Future research should incorporate a more diverse demographic to ensure the generalizability of our findings, although recruiting a group of participants truly representative of the entire population is challenging. Moreover, the temporal dimension of the opinion changes documented in this study remains uncertain. We administered the post-dialogue questionnaire immediately after the interactions, so it is unclear whether the observed changes persist over time. We did not measure the long-term impact, which means that participants could either return to their initial stance or continue to evolve in their viewpoints. Further studies ought to involve subsequent evaluations to determine the prolonged effects of the chatbot’s impact on users’ viewpoints.

Another challenge we faced was the chatbot’s weakness in recognizing user arguments. This limitation could detract from user satisfaction and diminish the quality of the dialogue, thereby affecting their willingness to engage and potentially altering their openness to change opinions, which is also suggested by the results in Table 7.

Moreover, the proportion of participants who did not complete the post-dialogue survey is a limiting factor. While our study did not supervise participants during their interaction with the bot, resulting in a more realistic situation, the early drop-off of 15% could potentially bias our results in favour of those who remained engaged or had a positive experience with the chatbot. Future research should aim to decrease participant withdrawal and investigate the underlying reasons for disengagement.

Lastly, the questionnaires were designed to be succinct to encourage brief interactions. However, this brevity may have inadvertently led to an oversimplification of participant opinions and experiences. Subsequent research iterations might find a more balanced approach, capturing a richer set of data while maintaining user engagement.

In essence, while our study provides a foundational understanding of the effects of an argumentative dialogue system on ethically challenging topics, addressing these limitations is crucial for the evolution of this research field. By examining these limitations, upcoming research could refine these systems to develop more meaningful and engaging dialogues that increase users’ openness to different perspectives.

8 Conclusion and future work

In this paper, we examined the Argumentation Effect of a chatbot that was designed to engage users in ethically charged discussions about the use of autonomous AI in medical, legal, and automotive domains. We conducted an experiment, using a German-language chatbot to demonstrate the impact of conversational AI technology in encouraging informed and meaningful conversations on relevant societal topics.

Our research was driven by three questions, namely (1) whether the chatbot could offer new perspectives on the chosen topics, (2) if users would accept our system, and (3) whether the chatbot was capable of influencing or broadening users’ perspectives on these complex issues.

We found that our system successfully provided new perspectives to a group of 178 student participants, presenting novel insights into the topics discussed. The quality and innovation of the arguments proposed by our chatbot were recognized among users, although the correlation between opinion change and users’ prior knowledge was not evident.

The acceptance of the chatbot by users was promising, with moderate-to-high scores recorded in the post-dialogue questionnaire. Users appreciated the arguments and the overall design of the system. However, challenges were identified in the chatbot’s ability to comprehend users’ arguments, revealing room for improvement in our argument recognition unit.

To measure the effect of our arguments, we evaluated the shifts in participants’ opinions, finding that 40–50% of participants changed their opinions after engaging with the chatbot. This substantial proportion indicates a clear Argumentation Effect revealing our system’s capacity to effectively challenge and expand users’ viewpoints on the topics at hand. The nature of these opinion changes was particularly revealing: not only did users alter their initial positions, but they also tended to adopt more moderate perspectives overall. These results underscore the chatbot’s role in diversifying the discourse, encouraging a more balanced and informed exchange of views on ethically sensitive issues.

The implications of this research are timely, given the rapid advancements in AI and the intensifying public discourse. Our approach offers a valuable resource for individuals seeking to diversify their informational intake and widen their perspectives.

Nonetheless, the conclusions drawn here must be contextualized. The participant pool, comprising university students, does not reflect the broader population, limiting the generalizability of our findings. The absence of long-term effect data also precludes conclusions about the chatbot’s enduring impact on user perspectives.

Future research directions should thus include studies with a more diverse and representative sample of the population to ascertain the reproducibility of our findings across different demographics. The correlation between the change in opinions and the number of recognized arguments points to the necessity for enhanced technical capabilities in argument recognition, an area we aim to advance in subsequent research. Additionally, as the system currently does not identify non-argumentative user input, leveraging recent advancements in large language models could be instrumental in overcoming this limitation. Such improvements will be crucial for evolving our system into a more robust tool for public enlightenment and discussion.

Through continuous technical refinement and broader participant engagement, we aim to realize the full potential of argumentative chatbots as facilitators of balanced and informed dialogues in ethically complex domains.