Introduction

Recent technological advancements in intelligent and natural language-based information systems (IS) transform everyday life, work, and interactions (Brynjolfsson & McAfee, 2014; Davenport & Kirby, 2016; Diederich et al., 2020). As a result of ongoing developments in artificial intelligence (AI) and improvements in the underlying machine learning (ML) and natural language processing (NLP) algorithms, conversational agents (CAs) are becoming increasingly relevant in organizations as essential gateways to digital services and information (Følstad et al., 2021; Gnewuch et al., 2018). In this context, a recent analysis valued the global market for CAs at $3.49 billion in 2021 and expects it to grow to $22.9 billion by 2030, indicating their increasing importance (Research & Markets, 2022). Primarily operating in external or internal organizational environments (Patel et al., 2021), CAs interact with users (e.g., customers and employees) via natural language to provide convenient access to information from multiple connected systems and data sources. Moreover, CAs can perform standardizable processes and (cost-)effectively automate or assist tasks conventionally performed by employees (Meyer von Wolff et al., 2020). In terms of automation, Gartner predicts that CAs will automate one in ten agent interactions by 2026 (Rimol, 2022). Consequently, CAs are expected to deliver significant economic value in existing and future applications, businesses, and digital ecosystems (Seeger et al., 2021; Seiffer et al., 2021).

Due to their potential, an extensive stream of research has focused on these AI-based systems (Cui et al., 2017; Zierau et al., 2020b). Since 2016, known as the “year of the chatbot” (Dale, 2016, p. 811), interdisciplinary research has explored various aspects related to CAs (Diederich et al., 2019a; Janssen et al., 2020), leading to a significant increase in both scientific and practical knowledge (Zierau et al., 2020a). More specifically, previous research has examined technical aspects (e.g., NLP improvements) as well as framework and platform selection (Diederich et al., 2019a, 2019b; Følstad et al., 2021). Scholars have also investigated user attitudes toward CAs, including acceptance, motivation, and behavioral implications, such as user trust (e.g., Brandtzaeg & Følstad, 2017; Go & Sundar, 2019; Seeger et al., 2017). In addition, prior studies have also focused on interaction design (e.g., Bittner et al., 2019; Gnewuch et al., 2018) and user preferences for visual cues and conversational design of CAs (e.g., Feine et al., 2019a; Schuetzler et al., 2021). Furthermore, social, ethical, and privacy challenges associated with CAs’ implementation and use have been explored (e.g., Ischen et al., 2020; Ruane et al., 2019; Wambsganss et al., 2021).

Despite the steep increase in CA research and the vast opportunities of CAs, adopting them in organizational environments does not always have a positive impact because the technology is still error-prone and fails in interactions (Gnewuch et al., 2017; Riquel et al., 2021). These deficiencies concern CAs of varying maturity levels and regularly result in incorrect responses and conversational breakdowns (Weiler et al., 2022). Owing to such inadequate CAs, employees have developed negative feelings toward CAs and their providers in recent years (Diederich et al., 2020; Feine et al., 2019b; Schuetzler et al., 2021). To date, several potential reasons for the shortcomings and moderate success of CAs have been identified.

First, a primary reason for the limited success of CAs is their premature deployment, often driven by high expectations and management pressure, usually combined with little knowledge of the CA development process in general and of CA quality in particular. This practice often leads to non-use, dissent, or complete failure, as highlighted by Janssen et al. (2021) and Lewandowski et al. (2022b). Unsatisfactory CA design and limited capabilities can result in a frustrating user experience that triggers resistance and a loss of trust in the CA, further hindering its successful adoption in real-world environments such as organizations (Weiler et al., 2022). The failure of CAs is frustrating not only for employees, but also for the CA vendor, who has invested significant effort, time, and money in developing the CA (Janssen et al., 2021; van der Goot et al., 2021).

Second, CAs are only marginally or not continuously evaluated to ensure their improvement, successful operation, and overall progress in organizations (Janssen et al., 2021; Meyer von Wolff et al., 2021). Therefore, previous research has proposed continuous evaluation (e.g., via monitoring (Corea et al., 2020) or chatlog data (Kvale et al., 2019)) and operation and improvement processes (Lewandowski et al., 2022b; Meyer von Wolff et al., 2022) to regularly assess their use, quality, and added value (Brandtzaeg & Følstad, 2018; Meyer von Wolff et al., 2022). However, little is known about how to systematically organize this operation and improvement process to improve CAs. Previous studies have focused on single perspectives, such as continuous technical adaptations (e.g., retraining the NLP algorithm (Meyer von Wolff et al., 2022)), adjusting the knowledge base (Janssen et al., 2021; Jonke & Volkwein, 2018), and improving individual CA functionalities and the dialog flow based on previous failures identified by chatlogs (e.g., Kvale et al., 2019). Despite these insights, there is a lack of knowledge on how CAs can be evaluated with criteria to test and improve their quality throughout their lifecycle (Lewandowski et al., 2022b). In addition, the current findings are often fragmented across disciplines and application domains and therefore lack transferability for sustained practical use (Elshan et al., 2022a; Følstad et al., 2021; Li & Suh, 2022). In this regard, experts in the field call for more collaboration and aggregation in interdisciplinary research on CAs and encourage further research on measurement, modeling, and evaluation approaches for CAs, as outlined, for example, in the CA research agenda by Følstad et al. (2021). To the best of our knowledge, there is no holistic overview of criteria and approaches for researchers and practitioners to continuously evaluate, improve, and sustain CAs that would support organizations in this problem context. Therefore, this article explores the following research question:

What are relevant criteria for continuously evaluating the quality of CAs, and how can they be applied?

By addressing the research question, this article aims to systematize the continuous evaluation and improvement of CAs to counteract CA failure in organizational environments. To successfully operate a CA, measurements or criteria are needed for orientation to adapt CAs to user needs (Følstad et al., 2021; Meyer von Wolff et al., 2022). Therefore, we pursued a twofold contribution: (1) a set of relevant criteria to evaluate the quality of CAs and (2) a procedure model as part of the instantiation of the quality criteria set in an IT organization, prescribing its application and evaluation activities. The criteria set and procedure model define a cyclical criteria-based evaluation process that can be triggered by different impulses. These results address the identified research gap and present an approach for practice. Specifically, our proposed quality criteria set addresses this lack of knowledge about the successful operation of CAs. In this context, the evaluated set of quality criteria and the procedure model can serve as an initial overview for organizations to structure CA evaluations systematically and discover areas for improvement. Following design science research (DSR) activities mapped to the three-cycle view by Hevner (2007), we approach the derivation of these results with the following structure: First, we present the related CA research and delineate the research gap in more detail. Next, we describe our research approach to developing our artifact. We then present the findings of our work, including an overview of our final quality criteria set. Subsequently, we outline the instantiation of the quality criteria set using a real-life case in an IT organization. Finally, we discuss our findings as well as their implications and conclude with our limitations and potential future research.

Related research

Text-based conversational agents as specific AI-based IS

Research on AI-based IS has attracted substantial attention (Elshan et al., 2022b; Felderer & Ramler, 2021) and transformed from a technical trend to a pervasive phenomenon in our daily lives (Maedche et al., 2019). AI-based systems proliferate in various application domains and contribute to multiple innovations (Wang et al., 2020). One application area that has seen renewed interest and increasingly utilizes AI is communication with computers via natural language, which has been a topic of research and practice for several decades (Gnewuch et al., 2017). Since the 1960s, researchers have worked on text-based and later speech-based CAs to automate procedures and assist users with various tasks (Følstad et al., 2021). An early example is ELIZA, which allowed rudimentary natural language-based interactions with a computer (Weizenbaum, 1966). However, technical limitations (e.g., computational power and storage capacity) and overly simplistic capabilities (e.g., non-learning algorithms) restricted early attempts at CAs, as they could not meet the high expectations (Diederich et al., 2019a; Gnewuch et al., 2017). According to Dale (2016) and Klopfenstein et al. (2017), ELIZA and other previously developed CAs used simple rule-based mechanisms to generate responses.

Nevertheless, in recent decades, technological progress has allowed the development of more sophisticated CAs that utilize novel AI, ML, and NLP algorithms and models (Gnewuch et al., 2017). In this context, the CA attempts to understand the user’s intention behind the input prompt to provide an adequate response. In particular, the techniques of supervised learning, unsupervised learning, and human-in-the-loop (where humans are involved in the training process) have led to increasingly capable CAs (Radziwill & Benton, 2017; Wiethof & Bittner, 2021). As a result, they have gained widespread adoption and can now better address the needs of the general public and the mass market (Maedche et al., 2019).

CAs support the ongoing digitalization and automation of organizations by performing various activities, such as filtering information or efficiently assisting employees in their daily tasks (Zierau et al., 2020a). Hence, with their scalability and 24/7 availability (Gnewuch et al., 2017; Xu et al., 2017), CAs can have a transformative impact on business operations by acting as a central service platform and first point of contact for customers, providing a convenient way to handle service requests more individually before human intervention (Zierau et al., 2020a), and reducing information overload for users (Xu et al., 2017). Accordingly, employees can concentrate on more complex, creative, and non-routine tasks.

The widespread use of CAs has generated significant research interest, with a rapidly growing body of contributions. However, CA research has a strong interdisciplinary character and is fragmented into several research streams (Følstad et al., 2021): Multiple perspectives and disciplines, including “informatics, management and marketing, media and communication science, linguistics and philosophy, psychology and sociology, engineering, design, and human–computer interaction” are employed to study CAs (Følstad et al., 2021, p. 2916). This interdisciplinary research has introduced numerous designations, such as chatbots (e.g., Dale, 2016), conversational (user) interfaces (e.g., Herrera et al., 2019), or dialog systems (e.g., McTear, 2021), leading to debates in the literature about their terminology and classifications. In this context, the authors Gnewuch et al. (2017), for example, have divided these AI-based IS into two subclasses: text-based CAs (e.g., chatbots or natural dialog systems) and speech-based CAs (e.g., smart speakers or virtual assistants).

In this article, we use the term “conversational agent” to refer to all AI- and text-based representations, such as chatbots. Although some research indicates that the distinction between text- and speech-based CAs is marginal from a technical viewpoint, since speech-based input can be converted to text-based input and vice versa (Diederich et al., 2019b), research has also revealed that evaluating speech-based CAs requires distinct criteria compared to text-based CAs. For instance, evaluating the quality of a smart speaker involves design elements, such as overall (hardware) appearance (including styling elements and imagery), as discussed in Su and Hsia (2022). In addition, privacy handling is an important issue, for instance, when referring to the proactive (i.e., listening continuously to react) or reactive (i.e., reacting only to specific keywords) activation of speech-based CAs, as discussed by Burbach et al. (2019). Furthermore, the ability of smart speakers to process audio speech and handle different dialects, tonalities, and noise in different input environments is crucial (e.g., Bisio et al., 2018), as is robust output generation (e.g., text-to-speech generation and perception of understandability and naturalness (Schmitt et al., 2021)). In summary, while text- and speech-based CAs share some commonalities, evaluating their quality requires considering specific criteria unique to each modality.

Continuous evaluation and improvement of conversational agents

While the development of a CA has become much more accessible, the underlying IS is complex by nature (Maroengsit et al., 2019). Besides the described possibilities and applications of CAs, the management, evaluation, and improvement of these AI-based systems pose new challenges for organizations. These activities are essential because disregarding them can result in high failure and discontinuation rates (Diederich et al., 2020; Janssen et al., 2021). Many CAs have failed in real-world environments due to, among other reasons, frustrating user experiences (Følstad et al., 2018a). As a result, multiple organizations have taken their CAs offline since they lack knowledge of how to ensure continuous evaluation and improvement, leading to an uncoordinated and highly exploratory development process (Janssen et al., 2021). As CAs represent a novel form of IS with distinct characteristics that differentiate them from traditional IS and other AI-based systems, they require new approaches for design, evaluation, and improvement.

One unique characteristic of CAs is their sociability. As social IS, they are capable of interacting with users via natural language, representing a new sociotechnical application class (Maedche et al., 2019). These AI-based systems impact traditional service delivery and enable new individualized and convenient sociotechnical interactions (Klaus & Zaichkowsky, 2020), requiring humanlike, user-centered, and socially interactive IS design (Lewandowski et al., 2022a). Contrary to the classification of AI-based CAs as IS from a technological perspective, the existing literature shows that the organizational adoption and practical use of CAs must be viewed in fundamentally different ways (e.g., Corea et al., 2020; Lewandowski et al., 2021). As a result, CA teams are designing chatbots differently from traditional IS and from multiple new perspectives, equipping them with social features, names, avatars, and communicative behaviors to attract users’ attention and simulate natural conversation (McTear et al., 2016). Nonetheless, enhancing the user experience of CAs remains a crucial challenge owing to the absence of a comprehensive overview to determine whether they are well-designed and useful, and because of the lack of widely applied approaches to evaluate and improve them, as described in the interdisciplinary chatbot agenda by Følstad et al. (2021).

Another unique characteristic of current CAs is their level of intelligence and ability to learn and improve via naturalistic interactions. As such, they can be classified as a form of learning and intelligent IS whose ongoing development and introduction entail so far unsolved challenges (Lewandowski et al., 2021; Zierau et al., 2020a). CAs often have limited skills initially, and learning progress depends on the application area and the actors’ engagement in training these systems. Accordingly, CAs’ learning progress is highly context-driven and thus dependent on actual application and usage (Clark et al., 2019; Zierau et al., 2020c). The learning nature of CAs indicates the necessity for novel approaches to their evaluation and improvement (Lewandowski et al., 2022b; Meyer von Wolff et al., 2021).

Consequently, the greatest effort must be invested in the operation phase, in which CAs require continuous evaluation as well as subsequent training and improvement in a real-world context. This endeavor is complicated by rapid changes and high dynamics, which make it generally impossible to predict how users will interact and what information will be retrieved in the long term (Janssen et al., 2021). CAs have gained a great deal of research attention, with perspectives ranging from specific conceptual- or usability-related aspects to technical design. However, detailed theoretical and practical knowledge is lacking for the operation in general and the continuous improvement process of CAs in particular (Lewandowski et al., 2022b; Meyer von Wolff et al., 2021). Hence, a comprehensive and systemized criteria-based approach to continuously evaluate CAs’ quality can help to improve and sustain them.

Evaluation criteria for conversational agents

In recent years, the overall user experience and improvement of CAs have been prominent topics in research endeavors. There is a growing body of knowledge on methods and measures to evaluate the overall user experience with CAs, resulting in initial factors contributing to a positive or negative user experience (Følstad et al., 2021; Zarouali et al., 2018). In addition, authors have examined various effects of CAs at the individual level, for example, on perceived human likeness, trust, perceived social support, enjoyment, and affordances (Lee & Choi, 2017; Stoeckli et al., 2019; Zierau et al., 2020b), or in the broader context of IS acceptance theories, such as the “Technology Adoption Model” (e.g., Pillai & Sivathanu, 2020). However, there is little research on concrete quality criteria that can be applied to ensure systematic CA evaluation and improvement. Accordingly, scholars call for establishing convergence in interdisciplinary CA research regarding measurements, models, and approaches for evaluating CAs (Følstad et al., 2021).

Contributions referring to the design and evaluation of CAs are beginning to emerge. According to Følstad et al. (2021), there is a rapidly growing body of work on CA interaction design (e.g., Ashktorab et al., 2019), CA personalization (e.g., Laban & Araujo, 2020; Shumanov & Johnson, 2021), use of interaction elements (e.g., Jain et al., 2018), social cues (e.g., Feine et al., 2019a; Seeger et al., 2021), and capability representation. However, current research is often confined to (1) single design issues or the effects of dedicated design elements (e.g., Seeger et al., 2021), (2) technical measurements or technical performance (e.g., Alonso et al., 2009; Goh et al., 2007), (3) other agent classes, such as embodied or speech-based CAs (e.g., Kuligowska, 2015; Meira & Canuto, 2015), and (4) individual design aspects (e.g., Seeger et al., 2021), while (5) being scattered across the interdisciplinary CA research landscape. In addition, CA-oriented research has (6) focused on satisfaction issues, such as human behavior or ethical aspects (e.g., Neff & Nagy, 2016; Radziwill & Benton, 2017), affect and emotions, such as mood adjustment, entertainment, and authenticity (e.g., Meira & Canuto, 2015; Pauletto et al., 2013; Radziwill & Benton, 2017), and (7) initial classifications and typologies for high-level analysis and guidance on interaction design (Følstad et al., 2018b), which play only an overarching role in development.

Important preliminary work includes ISO 9241-oriented CA evaluation criteria sets, such as those of Radziwill and Benton (2017), Casas et al. (2020), and Johari and Nohuddin (2021), representing first CA quality criteria sets and approaches. However, they tend to focus on improvements at a high meta-level, such as those regarding efficiency (e.g., robustness to manipulation or unexpected input), effectiveness (e.g., whether the CA passes the Turing test), impact and accessibility (e.g., meeting neurodiverse needs), trustworthiness and transparency (e.g., security and intrusiveness), and humanity and empathy (e.g., the realness of a CA or personalization). While these criteria may provide valuable guidance in the initial evaluation and improvement of CAs by addressing technical concerns, such as increasing the accuracy of NLP components or conducting user surveys to gauge initial perceptions, they have limited utility for CA teams in organizations seeking to ensure the long-term success of CAs within an application environment. This limitation necessitates a comprehensive system-wide perspective, for example, with respect to the overall input processing, the output presentation, representation elements, or the design of the dialog flow. To fill this research gap, it is essential to develop a more detailed, multi-perspective, and comprehensive set of quality criteria for researchers and practitioners that addresses a broader range of requirements for the long-term success of CAs.

Research approach

Our objective is to create a set of quality criteria for CAs as a central artifact that allows organizations to continuously evaluate and improve their CAs. To achieve this objective, we used the DSR paradigm and applied the three-cycle view presented by Hevner (2007). DSR is well-established in IS research and appropriate for our research because we aim to create an artifact that addresses a real-world problem and enables the continuous improvement of CAs to counteract their failure (Gregor & Hevner, 2013). Following the classification of contribution types in the DSR of Gregor and Hevner (2013), this research contributes knowledge at different levels. We contribute to level two by creating an operational artifact in the form of a set of quality criteria, including a procedure model (design knowledge). We also contribute to level one (artifact instantiation) by applying the quality criteria in a real-world context. We aim to derive and generate prescriptive knowledge from the descriptive knowledge extracted and evaluated from the knowledge base (Drechsler & Hevner, 2018). This knowledge will serve as a normative blueprint for practitioners and a starting point for further research. To structure our research endeavor according to the established ground rules of DSR, we conducted seven research steps, as illustrated in Fig. 1.

Fig. 1 DSR three-cycle view and our research steps based on Hevner (2007)

Step 1 of the DSR approach refers to the identification and formulation of a pervasive real-world problem. The initial situation was investigated through two semi-structured interviews following a prepared interview guide (see Table 1), revealing that the overall quality and usage rate of the IT organization’s CA (ExpertBot) were insufficient and that, at the same time, the organization lacked concrete criteria and an improvement process for it. Supplementing these insights, we examined the successful and failed use cases of organizational CAs in the current body of literature to highlight the practical relevance of the problem beyond our specific case. This status quo demonstrates the need for a solution approach for the overarching problem class we address. Therefore, our research is based on the current knowledge gap regarding how a criteria-based endeavor could sustain the operation and continuous improvement of CAs to ensure their long-term success. This knowledge gap was grounded and described in the Introduction and Related research sections. As a result, we adopted a problem-centered perspective at the beginning of our research, based on Peffers et al. (2007).

Table 1 List of interviewees of Step 1 and the first relevance cycle

Based on the formulated problem, in Step 2, we conducted a structured literature review (SLR) to derive the initial criteria for evaluating CA quality. We followed the five-step process of vom Brocke et al. (2009) in the databases of AISeL, ACM DL, IEEE Xplore, EBSCO, and ProQuest ABI/INFORM. We defined the scope of our SLR using the taxonomy proposed by Cooper (1988), as shown in Table 2. We focus on research outcomes, practices, and applications of quality criteria for CAs. Our goal is to address the lack of anchor points that allow continuous evaluation and improvement of CAs by synthesizing the relevant literature. We adopted a neutral perspective by paying attention to different existing (interdisciplinary) criteria sets of CAs and methods of measuring their effectiveness. Our coverage was representative, focusing on essential and influential literature relevant to our research question. A conceptual organization was chosen to cluster the existing research contributions. The results of our literature review are intended for IS researchers and interdisciplinary researchers concerned with CAs. Furthermore, practitioners can apply the derived quality criteria and procedures to improve their CAs.

Table 2 Applied taxonomy of literature reviews by Cooper (1988)

We first identified the central terms in our research question and decomposed them into related concepts to construct a search term (Brink, 2013; Xiao & Watson, 2019). Next, we used the resulting terms to conduct an initial unstructured literature search of the databases: “conversational agent,” “evaluation,” “criteria,” and “qualit*.” We extracted keywords, synonyms, and homonyms from the relevant papers found (Rowley & Slack, 2004; vom Brocke et al., 2009) and used them to form the following search string: (“chatbot” OR “dialogue system” OR “conversational agent” OR “virtual assistant” OR “cognitive assistant”) AND (“qualit*” OR “design” OR “criteria” OR “effectiveness” OR “evaluation” OR “usability”). We applied the search string to the aforementioned databases, resulting in 1895 articles. After screening the titles, abstracts, and keywords of each article, we selected 180 articles for in-depth analysis. To further filter the literature corpus, we established exclusion criteria to ensure that only relevant articles were included in the dataset. Two researchers independently used these criteria to screen the articles and reduce potential selection biases. Subsequently, we removed articles that addressed (1) technical or architectural aspects, (2) physical machines or robotics and their interfaces, or (3) no specific use cases for CAs. We also removed duplicates. During this process, we reduced our literature dataset to 94 articles by examining their research questions and results sections. In a final rigorous full-text analysis, we identified 67 articles as relevant, consisting primarily of journal and conference articles. Figure 2 illustrates the literature review process.

Fig. 2 Literature review process according to vom Brocke et al. (2009)

In Step 3, we embarked on the first design cycle to establish a quality criteria set, version 1 (V1). To do so, we followed a multi-step procedure. Initially, we independently extracted appropriate quality criteria by conducting a full-text analysis of the final 67 articles from Step 2. Next, we integrated the extracted criteria into a shared document containing 221 criteria, with brief descriptions and references. We then refined and streamlined the criteria based on three aspects. First, we sorted all criteria by topic and removed criteria not relevant to our research scope (Step 2). For example, we excluded non-CA-specific criteria and criteria irrelevant to the evaluation of text-based CAs (e.g., those relevant only to speech-based or embodied assistants). Second, we combined criteria that were indistinguishable and removed redundant criteria. Third, we weighted the criteria based on their frequency in the reviewed literature. Due to the quantity and complexity of the collated quality criteria set, we developed a multi-level model consisting of three levels: meta-criteria, criteria, and sub-criteria. This hierarchical arrangement allows for the holistic or selective application of the quality criteria set, enabling the evaluation of specific (topic-based) areas without needing to use the entire set.

In Step 4, we evaluated the initial literature-based quality criteria set (V1) through semi-structured interviews to expand the set in a second design cycle. We used Venable et al.’s (2016) Framework for Evaluation in Design Science (FEDS) throughout this process to define the overarching evaluation strategy. Our primary goal was to review and improve the quality criteria set developed for evaluating and improving CAs. Therefore, we chose a formative ex-ante approach to evaluate the quality criteria set for this design cycle. To prepare for the interviews, a semi-structured interview guide with questions about all quality criteria was created to ensure a systematic procedure and comparably gathered data. We then conducted and recorded seven interviews with experts from an IT organization with professional experience in CA projects and external researchers, following the methods of Gläser and Laudel (2009) and Meuser and Nagel (2009). We first discussed possible CA quality criteria with the interviewees based on their expertise, then presented and assessed our identified criteria individually, and finally asked the interviewees to extend the existing quality criteria set and point out missing aspects. Table 3 presents the list of interviewees of this second design cycle.

Table 3 List of interviewees of Step 4 and the second design cycle

Building on the insights gained from the interviews in Step 4, we developed V2 of our quality criteria set in Step 5. During the interviews, we received feedback from experts and gathered valuable input on the initial criteria set (V1). We decided whether each criterion should be retained or revised and whether additional criteria should be added to the set. In this context, we considered the experts’ suggestions on the wording and structural arrangement of the criteria for reasons of comprehensibility, leading to design adjustments in V2. Furthermore, the experts’ experience with CAs in real-world contexts led to the identification of additional quality criteria, which were integrated into V2 in a complementary manner where appropriate.

In Step 6, we conducted a summative naturalistic ex-post evaluation of the quality criteria set (V2) by supervising its case-based instantiation in an IT organization using the FEDS (Venable et al., 2016). Our goal was to verify whether the criteria set could be used to evaluate CA quality and whether it could help organizations improve their CAs in a structured and normative way by emphasizing its usefulness and relevance. To achieve this, we developed a procedure model for the application and instantiation of the quality criteria set and conducted two interview rounds. The first round included seven experts, three of them from Step 4 and four new participants; in the second round, one of the new participants was not available, so 13 interviews were conducted in total, as shown in Table 4. During the first round, we asked the experts about the current state version of ExpertBot and its problems and potential for improvement before transitioning to the individual criteria from our set to create suitable scenarios. We used scenarios as flexible containers, each bundling several quality criteria from our set that lend themselves to a collective evaluation. In this context, mockups were created with Figma (2022) as prototypes to simulate each scenario with a current state version and a modified version of ExpertBot, incorporating altered criteria aligned with our criteria set (V2). In the second round, we used the created prototypes in A/B tests, following Young (2014), to simulate each previously defined scenario. This allowed us to determine which criteria were considered highly influential and most important to the experts. In addition, we paid attention to whether the experts mentioned new criteria in the instantiation that were not yet included in our quality criteria set. The procedure is described in more detail in the “Case-based instantiation of the quality criteria set” section below.

Table 4 List of interviewees of Step 6 and the second relevance cycle

Finally, in Step 7, we incorporated the evaluation results from Step 6, the naturalistic case-based instantiation, into the quality criteria set and developed the final version presented and documented in the next section. Following Gregory and Muntermann’s (2014) theorizing framework, we iteratively developed and improved an abstracted artifact version that addressed the larger problem class derived in Step 1. We communicate the quality criteria set for CAs as a rigorously elaborated prescriptive artifact that contributes applicable knowledge to the knowledge base: a solution design entity that offers practitioners an adaptable framework for situational instantiations so that they can improve their CAs by applying the derived quality criteria (Drechsler & Hevner, 2018). In addition, our set provides descriptive knowledge as an observation and classification concept for researchers, with new insights and starting points for further research on evaluating, understanding, and improving CAs for their long-term success.

Quality criteria set for conversational agents

Based on the DSR research activities, we derived the final criteria set for evaluating and improving the quality of CAs. This set incorporates a hierarchical structure consisting of 6 meta-criteria, 14 criteria, and 33 sub-criteria, enabling a systematic and rigorous evaluation process. The meta-criteria are the highest level of abstraction, representing the overarching evaluation areas of a CA. The criteria at the second level break these areas down further. These can be used, for example, to create responsibilities in a CA team for (meta-)criteria areas, ensuring that accountability is clearly defined and understood. This structure also supports informed decision-making (e.g., prioritizing specific criteria of the CA). Although (meta-)criteria provide logical and structural clarity and classification, they are not sufficiently granular for evaluation purposes. Therefore, at the third level, sub-criteria have been defined as specific elements that can be evaluated using qualitative or quantitative methods. Overall, this approach allows for a detailed and comprehensive evaluation of CAs. The following section presents the quality criteria along the six meta-criteria and their hierarchical structures depicted in Table 5.

Table 5 Final CA quality criteria set
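To make the three-level structure more tangible for tooling, the following Python sketch models the hierarchy as simple data classes. The meta- and sub-criteria names shown are taken from the set described below, whereas the intermediate criterion labels and all class and function names are illustrative assumptions rather than part of the published criteria set.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubCriterion:
    """Third level: a concrete element evaluable with qualitative or quantitative methods."""
    name: str

@dataclass
class Criterion:
    """Second level: breaks a meta-criterion down into assignable areas."""
    name: str
    sub_criteria: List[SubCriterion] = field(default_factory=list)

@dataclass
class MetaCriterion:
    """First level: an overarching evaluation area of a CA."""
    name: str
    criteria: List[Criterion] = field(default_factory=list)

# Illustrative excerpt only: the intermediate criterion labels are assumptions,
# while meta- and sub-criteria names such as "Output", "Detail of knowledge",
# and "Task completion time" appear in the article.
criteria_set = [
    MetaCriterion("Output", [
        Criterion("Response content", [
            SubCriterion("Detail of knowledge"),
            SubCriterion("Solution convergence and justification"),
        ]),
    ]),
    MetaCriterion("Performance", [
        Criterion("Efficiency", [
            SubCriterion("Task completion time"),
            SubCriterion("Average number of turns"),
        ]),
    ]),
]

def sub_criteria_of(meta_name: str) -> List[str]:
    """Selective application: list all sub-criteria of one meta-criterion."""
    return [
        sub.name
        for meta in criteria_set if meta.name == meta_name
        for crit in meta.criteria
        for sub in crit.sub_criteria
    ]

print(sub_criteria_of("Performance"))  # ['Task completion time', 'Average number of turns']
```

Such a representation supports both the holistic and the selective (topic-based) use of the set described above.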

Input

Input comprises criteria that focus on creating and submitting requests to the CA. In this context, the diverse interaction abilities of CAs can be evaluated (e.g., Kowald & Bruns, 2020). Many CA teams employ existing communication channels (e.g., messenger front ends, such as Microsoft Teams or websites), ensuring that users are comfortable and familiar with their basic functions (Feng & Buxmann, 2020). However, reflecting, exchanging, or expanding channels with progressive development is essential. Moreover, various input control elements can be evaluated and integrated to facilitate dialog flow. For example, it may be helpful to allow users to interact with CA responses via buttons (Kowald & Bruns, 2020). The interviews emphasized the need to continuously evaluate and refine the selection and functionality of control elements (e.g., text, buttons, reactions, and carousel selections). In addition, the context awareness of CAs should be evaluated. The ability to grasp dialog-oriented contexts allows CAs to incorporate previous user utterances to conduct sophisticated conversations with users. These conversations should be evaluated to ensure that users do not have to enter input repetitively (Saenz et al., 2017). Connected to this, resumption and return points in the dialog tree are fundamental aspects of evaluation. A well-structured dialog flow helps users provide the correct input, achieve their goals, and avoid deadlocks (Diederich et al., 2020). Moreover, the technical environment needs to be established to enable unrestricted usage, especially in complex use cases. From the first to the last user touchpoint, the background systems that resolve requests and provide information should be conveniently accessible (e.g., via single sign-on).
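As a minimal, channel-agnostic sketch of how input control elements and dialog context awareness might be represented, consider the following; the payload fields, slot names, and helper functions are illustrative assumptions rather than the interface of any specific messenger platform.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Prompt:
    """A CA turn that offers buttons as input control elements besides free text."""
    text: str
    buttons: List[str] = field(default_factory=list)

@dataclass
class DialogContext:
    """Keeps earlier user utterances so that users need not repeat their input."""
    slots: Dict[str, str] = field(default_factory=dict)
    return_point: str = "main_menu"  # resumption point in the dialog tree

def next_prompt(ctx: DialogContext, user_input: str) -> Prompt:
    # Context awareness: reuse a previously captured slot instead of asking again.
    if "topic" not in ctx.slots:
        ctx.slots["topic"] = user_input.strip()
        return Prompt(
            text=f"Which area of '{ctx.slots['topic']}' do you need help with?",
            buttons=["Access request", "Documentation", "Contact an expert"],
        )
    ctx.slots["detail"] = user_input.strip()
    return Prompt(text=f"Looking into '{ctx.slots['detail']}' for you ...")

ctx = DialogContext()
print(next_prompt(ctx, "VPN setup").buttons)  # buttons offered for the second turn
```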

Output

Output refers to criteria related to the CA-generated response provided in return to the user request. Regarding output, the format of the CA responses should be considered. The responses require an appropriate selection of suitable visual elements in terms of a user- and content-oriented presentation (e.g., with texts, images, and tiles), as well as high readability (Kowald & Bruns, 2020). Especially in the context of CAs, consistency in language and terminology is important for avoiding complexity and confusion for users (Edirisooriya et al., 2019). In terms of content, the CA should transparently disclose its capabilities and limitations to evoke appropriate user expectations that are consistent with the nature of the CA as a learning IS (Diederich et al., 2020). Furthermore, CA answers should be reviewed to evaluate whether users’ (information) needs have been fulfilled. The relevance and meaningfulness of the presented information and the up-to-dateness of the knowledge base for information retrieval should be checked to determine whether background knowledge must be updated or expanded (Diederich et al., 2020). Apart from recognizing the user’s intent and presenting the correct output, Feng and Buxmann (2020) emphasized the evaluation of different representations and levels of detail of the knowledge. Especially for more complex CAs (e.g., those that combine numerous background systems as a central platform), it is challenging to present the often complex solutions in an abstract and convergent way that provides users with appropriate answers to their concerns. The interview experts highlighted that solutions sorted by the relevance and justification of the CAs’ answers could increase user trust in these outputs. For example, a CA could refer to the background system or source to make it transparent from where the knowledge was obtained (e.g., clickable link below the answer). Closely related, the CAs’ calibration of response appropriateness should be evaluated to provide concise and manageable CA answers. In this context, CAs’ response accuracy (e.g., also referred to as response quality (Jiang & Ahuja, 2020)) needs to be evaluated to present knowledge correctly (e.g., length, tonality, fluency) to the target audience. Regarding the timing of responses, technical response time is considered a relevant factor for CAs. For example, Edirisooriya et al. (2019) identified quick responses—within two to five seconds of the user’s request—as essential. In addition, the criterion of balancing proactivity and interruption, which refers to the fact that CAs’ proactive utterances may interrupt users, indicates that this behavior and its effects on users should be evaluated.
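The following sketch illustrates how two of these output criteria could be operationalized: response appropriateness (a manageable answer length) and solution justification (disclosing the source below the answer). The length threshold and function name are illustrative assumptions, not values from the cited studies.

```python
MAX_ANSWER_CHARS = 500  # illustrative threshold for a "manageable" answer length

def format_answer(answer: str, source_url: str) -> str:
    """Abridge overly long answers and justify them with the source they came from."""
    if len(answer) > MAX_ANSWER_CHARS:
        # Cut at the last full word and mark the abridgment.
        answer = answer[:MAX_ANSWER_CHARS].rsplit(" ", 1)[0] + " ..."
    # Solution justification: disclose where the knowledge was obtained.
    return f"{answer}\n\nSource: {source_url}"

print(format_answer("Long answer " * 100, "https://example.org/kb/article-42"))
```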

Anthropomorphism

Anthropomorphism refers to the attribution of human characteristics, such as emotions, to nonhuman objects (Schuetzler et al., 2021). Anthropomorphism can positively affect the use of CAs and can be divided into three aspects: humanlike identity, verbal cues, and nonverbal cues (Seeger et al., 2021). First, evaluable criteria in the context of humanlike identity represent aspects that strengthen CA identity (e.g., profile pictures or avatars) and other characteristics, such as demographic information, including gender, age, or name (Seeger et al., 2021). In addition, the general visual representation was highlighted during several interviews. A CA team should reflect on how the CA can be easily detected as the first contact point with the user, including, for example, its integration into a website, such as its position, size, responsive (humanlike) appearance, and colors. Furthermore, CAs’ verbal cues should be reviewed. Besides the ability to engage in social dialogs, called “chitchat,” emotional expressions (e.g., apologizing by the CA), verbal style, and self-reference (e.g., the CA referring to itself as “I” or “me”), or context-sensitive responses, tailored personality and lexical alignment (e.g., by the CA adapting its responses to the users’ utterances (Saenz et al., 2017)) can also be used to make CAs seem more humanlike (Schuetzler et al., 2021; Seeger et al., 2021). In particular, chitchat and character definition were emphasized in the interviews, since many users first test the CA’s social capabilities and quickly lose interest if it fails even at simple initial social interactions. Further possibilities of humanlike design are nonverbal cues, such as emoticons, or artificially induced typing delays and indicators, such as typing dots (Gnewuch et al., 2018). However, researchers have also noted that a humanlike CA can be off-putting to users (e.g., Grudin & Jacques, 2019). Seeger et al. (2021) indicated that the different anthropomorphism criteria must be combined and evaluated practically.
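For instance, an artificially induced typing delay can be scaled to the length of the upcoming response, as in the following sketch; the typing speed and delay cap are illustrative assumptions, not values prescribed by the cited studies.

```python
import time

def send_with_typing_delay(message: str, chars_per_second: float = 30.0,
                           max_delay_seconds: float = 3.0) -> None:
    """Simulate humanlike typing before delivering a CA response."""
    delay = min(len(message) / chars_per_second, max_delay_seconds)
    print("...")       # placeholder for a typing indicator (typing dots) in the UI
    time.sleep(delay)  # artificially induced typing delay
    print(message)     # placeholder for delivering the message to the channel

send_with_typing_delay("Sure, I can help you with that request.")
```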

Dialog control

For successful dialog control, CAs’ understanding of users’ requests, along with their intentions and goals, should be evaluated (Clark et al., 2019). However, CAs are learning IS and, therefore, initially error-prone. In particular, user input in lengthy and complex sentences poses a challenge for CAs (Michaud, 2018). Thus, proactive dialog handling in regular operations and reactive handling in failure operations should be evaluated to ensure that CAs avoid, reduce, or recover from failures. In regular operations, organizations should continuously reflect on whether the CA proactively avoids error scenarios by, for example, asking the user to reformulate the request (Diederich et al., 2020) or prompting the user for more information (Chaves & Gerosa, 2021). Further, the interviews revealed the expectation that if no appropriate answer can be elicited, the CA should proactively point out misunderstandings or reintroduce its skills. Afterward, the CA could provide alternative responses to keep the conversation alive (Chaves & Gerosa, 2021). Another way is to provide conversational prompts. Through the use of prompts, the CA provides suggestions for prospective requests in addition to its responses (e.g., in the case of a long response time by the user). The aim is to predict the user’s intentions (e.g., by offering suggestions on text buttons) and proactively avoid error cases when processing a user’s text input (Li et al., 2020). In failure operations, it is crucial to define and evaluate (e.g., proactive and resilient) repair strategies to overcome conversational breakdowns, since their existence can result in a negative experience for users and impair future CA success (Benner et al., 2021). In the case of a breakdown, the CA should fail gracefully in order to maintain user trust (Feng & Buxmann, 2020). For instance, the CA can apologize and propose new solutions (Benner et al., 2021). However, if repair attempts fail repeatedly and the CA’s capabilities are exceeded, the CA should encourage fallbacks or a handover to a service representative (Poser et al., 2021, 2022).
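The following sketch illustrates one such escalating repair strategy: after a low-confidence intent match, the CA first asks for a reformulation, then offers conversational prompts, and finally hands over to a human agent. The confidence threshold, attempt limit, and example topics are illustrative assumptions.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.6  # illustrative cut-off for a reliable intent match
MAX_REPAIR_ATTEMPTS = 2     # afterwards, hand over to a human agent

@dataclass
class TurnResult:
    intent: str
    confidence: float

def handle_turn(result: TurnResult, failed_attempts: int) -> str:
    """Escalating repair strategy for low-confidence turns."""
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return f"<answer for intent '{result.intent}'>"  # regular operation
    if failed_attempts == 0:
        return "Sorry, I did not quite get that. Could you rephrase your request?"
    if failed_attempts < MAX_REPAIR_ATTEMPTS:
        # Conversational prompts: reintroduce skills and suggest prospective requests.
        return "I can help with these topics: expert search, skill overview. Which one do you mean?"
    return "I will hand you over to a colleague who can help you further."  # handover

print(handle_turn(TurnResult("find_expert", 0.3), failed_attempts=2))
```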

Performance

A holistic evaluation of CA performance represents a strong predictor of CA success (Peras, 2018). By combining design- and technically-oriented principles, CAs’ performance relates directly to user satisfaction (Liao et al., 2016). The performance demonstrates the effective and efficient completion of tasks between the user and the CA (Peras, 2018). Regarding CAs’ effectiveness, the task success rate and the task failure rate could be used to collect the number of successful tasks and the number of default fallback intents to trigger appropriate countermeasures (Peras, 2018). In the interviews, retention and feedback rates were mentioned, referring to recording returning users and continuously evaluating users’ average ratings to uncover weaknesses and derive improvement potential. Furthermore, it is necessary to consider CAs’ efficiency, because adequate task performance alone provides only limited insight into whether the CA also performs the tasks in a resource-friendly way. Given this perspective, evaluating the time required to complete a task (task completion time) and the (average) number of rounds of dialog required (average number of turns) is essential to capture efficiency (Holmes et al., 2019; Peras, 2018). In addition, the human handover rate is significant in evaluating at which points the CA cannot complete a task (Wintersberger et al., 2020).
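Assuming conversation logs in which each session records its outcome, number of turns, duration, and whether the user has interacted with the CA before (the log fields and example values below are illustrative assumptions), these sub-criteria can be computed in a few lines:

```python
from statistics import mean

# Illustrative chatlog records; the field names and values are assumptions.
sessions = [
    {"outcome": "success",  "turns": 4, "seconds": 35,  "returning_user": True},
    {"outcome": "fallback", "turns": 7, "seconds": 80,  "returning_user": False},
    {"outcome": "handover", "turns": 9, "seconds": 120, "returning_user": True},
]

n = len(sessions)
task_success_rate   = sum(s["outcome"] == "success"  for s in sessions) / n
task_failure_rate   = sum(s["outcome"] == "fallback" for s in sessions) / n
human_handover_rate = sum(s["outcome"] == "handover" for s in sessions) / n
retention_rate      = sum(s["returning_user"] for s in sessions) / n
avg_completion_time = mean(s["seconds"] for s in sessions)   # task completion time
avg_number_of_turns = mean(s["turns"] for s in sessions)     # average number of turns

print(f"Success rate: {task_success_rate:.0%}, handover rate: {human_handover_rate:.0%}")
```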

Data privacy

Data privacy includes criteria related to the realization and communication of data protection endeavors. One important aspect is ensuring that conversations with the CA are kept as private and anonymous as possible, particularly when the CA deals with confidential or personal data (Feng & Buxmann, 2020). During the interviews, we received feedback emphasizing the importance of minimizing the storage of conversational data and ensuring that any stored data is anonymized to the greatest extent feasible, especially when such data is necessary to improve a CA’s performance. The communication of data protection comprises the criterion of transparency toward users, that is, disclosing which user data are processed. In this context, it is helpful to provide data protection policies (Rajaobelina et al., 2021).
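A minimal sketch of the kind of anonymization step the interviewees described, redacting obvious personal identifiers before a conversation turn is stored for later retraining; the two regular expressions cover only e-mail addresses and phone-like numbers and are an illustrative assumption, not a complete anonymization solution.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s/-]{6,}\d")

def anonymize(utterance: str) -> str:
    """Redact obvious personal identifiers before storing a chatlog entry."""
    utterance = EMAIL.sub("[email]", utterance)
    utterance = PHONE.sub("[phone]", utterance)
    return utterance

print(anonymize("Please contact jane.doe@example.com or +49 40 123456."))
# -> "Please contact [email] or [phone]."
```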

Case-based instantiation of the quality criteria set

After the research activities of the DSR project in Steps 1 to 5, the final quality criteria set was instantiated in Step 6 in an IT organization to investigate, evaluate, and improve the quality of an existing AI- and text-based CA. Due to the organization’s limited in-depth knowledge of a systematic CA evaluation procedure, including methods, a systemized procedure model was initiated and documented. It comprises three main phases and was applied to utilize the final CA quality criteria set throughout each phase (see Fig. 3).

Fig. 3 Procedure model for the evaluation and improvement of CAs using our quality criteria set

Case setting for applying the procedure model

The DSR project considered the following case setting to apply the procedure model, evaluate CAs’ quality, and address an existing real-world problem: (1) The procedure model requires a suitable use case to evaluate its applicability and feasibility in indicating a CA’s quality. To this end, an existing AI- and text-based CA (ExpertBot) was investigated, evaluated, and improved in an IT organization. Based on our interview analysis (as outlined in Step 1 of our DSR project), ExpertBot was deemed to be a suitable case for a root cause analysis, since the overall quality and usage rate were insufficient. The IT organization uses ExpertBot within organizational boundaries to identify, prioritize, and select needed experts. Therefore, ExpertBot participates in chat conversations and accesses various data sources, such as skill databases, document management systems, and internal chat forums, to provide fitting recommendations for experts and their skills. ExpertBot is integrated into an existing text-based communication channel in Microsoft Teams and works intent-based, using Microsoft Language Understanding (LUIS) and Azure Cognitive Services in the background. (2) Furthermore, forming an expert team with varying experience levels and backgrounds regarding CAs and their application field is crucial to provide a multi-perspective view that enables a broad discussion of the quality criteria and shortcomings of CAs. In our case, to implement the procedure model for evaluating the quality of ExpertBot through all phases, the existing CA team formed an interdisciplinary team of experts from the IT organization (e.g., CA developers, product owners, management assistants responsible for staffing, and employees from other departments, as outlined in Step 6 of our DSR project). (3) Finally, an expert team requires an appropriate data basis to evaluate CAs. For this purpose, prepared data, such as the user retention rate or other criteria-related measures, can be used. In this context, the newly formed expert team evaluated ExpertBot based on our quality criteria set and derived improvement potentials. Overall, this case setting served as the starting point for instantiating the CA quality criteria set through the procedure model, as shown in Fig. 3.
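To make the intent-based setup more tangible, the following sketch shows a generic routing step from a recognized intent to one of the connected data sources. It deliberately abstracts from the actual LUIS and Azure Cognitive Services calls, and all function, intent, and data-source names are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Illustrative stand-ins for the connected data sources (skill databases,
# document management systems, internal chat forums).
def search_skill_database(query: str) -> List[str]:
    return [f"Expert A ({query})", f"Expert B ({query})"]

def search_documents(query: str) -> List[str]:
    return [f"Document about {query}"]

ROUTES: Dict[str, Callable[[str], List[str]]] = {
    "find_expert": search_skill_database,
    "find_document": search_documents,
}

def answer(intent: str, query: str) -> List[str]:
    """Route a recognized intent (e.g., returned by an NLU service) to a data source."""
    handler = ROUTES.get(intent)
    if handler is None:
        return ["Sorry, I cannot help with that yet."]  # default fallback
    return handler(query)

print(answer("find_expert", "machine learning"))
```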

Utilization of the procedure model

Our procedure model is designed with three main phases and several sub-phases to provide a fine-grained approach that fosters comprehensiveness and traceability. The sub-phases enable us to (1) create progressive guidance for each phase of the procedure model, facilitating the evaluation of CAs; (2) ensure that every aspect of the procedure is thoroughly documented, which is crucial for properly evaluating CAs; and (3) create a more detailed and extensive procedure that helps the expert team to ensure a systematic CA evaluation.

Phase 1: General evaluation

In Phase 1 (general evaluation), we performed a quality criteria-based analysis to identify problems with the current CA version. More specifically, in Sub-phase 1.1, the derived meta-criteria were used to provide a starting point for the CA team’s initial evaluation of the ExpertBot and to identify possible problem areas (see Fig. 3). In Sub-phase 1.2, the corresponding criteria of these problem areas served as a more detailed level to narrow the scope of analysis. Thereby, in Sub-phase 1.3, the sub-criteria belonging to the criteria could be used as indicators of potential problems. Based on these phases and the analysis of appropriate data related to the corresponding sub-criteria, specific problem indicators of the ExpertBot were identified in Sub-phase 1.4.

In our illustrated example from our instantiation (see Fig. 3), the general evaluation revealed that the overarching meta-criteria “output” and “performance” of the ExpertBot needed to be improved. Six problem indicators, such as “detail of knowledge,” “solution convergence and justification,” and “task completion time,” were considered throughout the criteria-based analysis to start an in-depth evaluation. As a result, we initiated an improvement project to address the identified indicators.
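A sketch of how this drill-down from meta-criteria to concrete problem indicators might be supported with data follows; the observed values, thresholds, and metric names are illustrative assumptions loosely based on the example in Fig. 3.

```python
# Observed values and target thresholds per (meta-criterion, sub-criterion);
# all numbers and metric names are illustrative assumptions.
observations = {
    ("Performance", "Task completion time"): 95.0,  # seconds, averaged over sessions
    ("Performance", "Task success rate"):    0.55,  # share of successfully resolved tasks
    ("Output", "Detail of knowledge"):       2.1,   # mean user rating on a 1-5 scale
}

thresholds = {
    ("Performance", "Task completion time"): ("max", 60.0),
    ("Performance", "Task success rate"):    ("min", 0.7),
    ("Output", "Detail of knowledge"):       ("min", 3.5),
}

def problem_indicators():
    """Sub-phase 1.4: flag sub-criteria whose observed values miss their targets."""
    flagged = []
    for key, value in observations.items():
        direction, target = thresholds[key]
        if (direction == "max" and value > target) or (direction == "min" and value < target):
            flagged.append(key)
    return flagged

print(problem_indicators())
```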

Phase 2: In-depth evaluation

As part of Phase 2 (in-depth evaluation), we first conducted Sub-phase 2.1. In cooperation with the IT organization, the CA’s quality was evaluated, and the potentials for improvement were determined based on the identified problem indicators from Sub-phase 1.4. Using appropriate evaluation methods, the expert team (comprising members from the CA team; see “case setting”) found the consideration of these improvement potentials to be beneficial.

To gain this insight, we conducted seven semi-structured interviews with the expert team members. We presented the live version of the ExpertBot and asked the participants about the general implementation, problems, and relevance for improvement, along with the corresponding criteria from our set. The resulting evaluated improvement potentials of the ExpertBot were then transformed into coherent scenarios in an aggregation process. Thereby, a collective evaluation of multiple quality criteria in each scenario could be conducted. In the single scenario we outlined, as shown in Sub-phase 2.1 of Fig. 3, all six specific problem indicators were identified as improvement potentials during the interviews. Specifically, the scenario was called “manageable length of answers” and included the improvement potentials “visual elements,” “readability and consistency,” “detail of knowledge,” “solution convergence and justification,” “response appropriateness,” and “task completion time.”

In Sub-phase 2.2, we created mockup prototypes for the transformed scenarios to demonstrate, investigate, and evaluate the identified improvement potentials. In this context, the prototypes enabled a well-founded comparison between the current state version of the CA and the proposed modified CA version(s). The expert team provided valuable feedback to verify whether the identified improvement potentials would be beneficial if implemented or needed to be revised or discarded.

For the creation of prototypes, we employed the Figma (2022) design tool in combination with the Microsoft Teams UI Kit (2023) to ensure a familiar and consistent visual representation during the demonstration. Furthermore, the prototypes were designed based on the previously evaluated improvement potentials corresponding to the analyzed ExpertBot. Subsequently, we conducted A/B tests involving six participants by presenting them with two prototypes for each scenario during semi-structured interviews to obtain a data basis for deciding whether to implement the proposed changes. One prototype contained the current CA version, while the other represented the assumed improvements (modified version, as depicted in Fig. 3). For each scenario, questions were asked in three areas during the interviews. First, we asked participants to evaluate which of the two prototypes was more effective at first glance and which aspects were crucial to this impression (e.g., perceptions of the prototype features and differences). Second, the improvement potentials were addressed individually, and the participants were asked to determine which sub-criteria they considered suitable for increasing CA quality. Third, we asked which of the addressed sub-criteria were rated most important for improving CA quality in order to prioritize the highest-ranked improvement potentials (e.g., by number of mentions) in preparation for the last phase.
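Prioritizing the addressed sub-criteria by number of mentions amounts to a simple tally, as in this sketch; the participant answers shown are illustrative, not actual interview data.

```python
from collections import Counter

# Sub-criteria each participant rated as most important (illustrative answers).
mentions = [
    "Response appropriateness", "Detail of knowledge", "Task completion time",
    "Detail of knowledge", "Response appropriateness", "Detail of knowledge",
]

for sub_criterion, count in Counter(mentions).most_common():
    print(f"{sub_criterion}: {count} mention(s)")
```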

Phase 3: Implementation

Finally, in Phase 3 (implementation), the improvement potentials identified in Sub-phase 2.1 and evaluated as beneficial in Sub-phase 2.2 for increasing the quality of the CA were implemented in a revised CA live version. These improvements were communicated to the users to ensure their visibility in the organization. After Phase 3, the procedure should be repeated to improve the CA on a long-term basis, for instance, when problems are identified based on existing data or as part of a general cyclical evaluation that examines the quality of the new CA live version as a whole or in defined segments; this repetition, however, was not part of the instantiation.

Discussion

Organizations strive to implement CAs due to their potential to increase business value with their ability to assist or automate processes, tasks, and activities (Lewandowski et al., 2021). However, despite their strengths in improving organizational efficiency (Zierau et al., 2020c), many CAs across industries are still error-prone and fail during interactions (Gnewuch et al., 2017), leading to a high discontinuation rate (Janssen et al., 2021). To strengthen the management of CAs in an organizational context and improve their success, we have developed a quality criteria set and procedure model for conducting holistic evaluations and improvements of CAs. In a multi-step DSR project, criteria were identified, aggregated, and evaluated ex-ante for applicability and operationality in real-world environments. In addition, a procedure model for the application of the quality criteria set was determined as part of a naturalistic ex-post evaluation. The conducted evaluation activities demonstrate that the incorporated criteria provide an integrated view of a CA evaluation. Regarding the procedure model, the results indicate that it supports a systematic analysis of the problems, requirements, and status quo of a specific CA in order to identify and improve its most relevant aspects. In combination, these findings have implications for research and practice.

Theoretical implications

First, our quality criteria set and procedure model contribute to CA research by providing a synthesized and systematized approach to improving the success of contemporary CAs. To achieve this, we contribute the quality criteria set derived from strongly dispersed CA research streams (Følstad et al., 2021). A large share of this research has focused on specific design and technical issues to elevate the user experience (e.g., Seeger et al., 2021), as these issues were considered the main challenges in the implementation of CAs (Følstad et al., 2018a; Janssen et al., 2021; van der Goot et al., 2021). However, CAs are inherently complex IS (Maroengsit et al., 2019) with distinctive characteristics that require a comprehensive view and analysis, as failures can arise from multiple (interrelated) factors (Janssen et al., 2021; Meyer von Wolff et al., 2021). Therefore, we extend the focus of current CA research to a consolidated set of essential quality criteria that should be considered to support the prevention of CA failure. We also provide a starting point for a more structured CA evaluation with our procedure model, as recommended by Følstad et al. (2021). The quality criteria set addresses AI- and text-based CAs in general domains, as classified by Gnewuch et al. (2017). Nevertheless, the results may also apply to other types of CAs. Additionally, our work complements other preliminary efforts, such as the evaluation criteria sets of Radziwill and Benton (2017) and Casas et al. (2020), to provide a better understanding of CAs in the improvement process with a system-wide view.

Recent technical advancements in the field of NLP and ML applications should also be highlighted, especially the emergence of large language models (LLMs). These models are pre-trained on billions of text samples from data sources on the Internet and can generate diverse types of content (Brown et al., 2020; Jiang et al., 2022). In particular, these models have become widely available to (non-technical) users through the release of intuitive, conversational interfaces such as OpenAI's ChatGPT or Google's Bard (Jiang et al., 2022; Teubner et al., 2023). These releases have sparked a remarkable movement in CA research toward exploring new application scenarios and their potential, which Schöbel et al. (2023) also refer to as a new "AI wave." Consequently, the question arises as to which quality criteria are affected by this new wave and which requirements result for CAs and their (further) development. We therefore expect a paradigm shift in the perception and utilization of the criteria in our set with regard to novel LLM applications, which could lead to high dynamics and flexibility in their adaptation and use.

Second, this article contributes to management research on CAs, which encompasses various aspects of the CA lifecycle, such as critical phases, factors, or tasks within CA development (Lewandowski et al., 2022b; Meyer von Wolff et al., 2022). However, these initial studies neither provide deeper insights into the evaluation and improvement of CAs nor explain how a cyclical evaluation process can be executed. In this regard, our research provides an approach for evaluating and improving CAs that can serve as a meta-model for other researchers using different qualitative and quantitative methods within the lifecycle of a CA. While researchers focus on the design of CAs by targeting specific aspects, such as increasing user trust or anthropomorphism (e.g., Seeger et al., 2017), they often disregard the importance of evaluating CAs on an ongoing basis, as elaborated in the "Related research" section. We thus provide knowledge on structured and continuous CA evaluation to ensure the improvement of CAs during their operation in organizations (Janssen et al., 2021; Meyer von Wolff et al., 2021). In addition, the quality criteria set and procedure model can assist in other lifecycle phases, for instance, by providing an overview of initial design issues in the initiation phase. Furthermore, the quality criteria set can support a multi-perspective and comprehensive development process and the detection of problems before going live in the integration phase, thus avoiding immediate failure. Our article aggregates design knowledge, supplemented by practical insights, and introduces a structured approach that provides initial insights into activities, people, and data, which can foster operations and enhance the performance of CAs.

Third, from a DSR lens, we contribute prescriptive design knowledge in the form of the quality criteria set and the procedure model for its application, which together form our developed artifact. This artifact provides a foundation that can be applied to the identified higher-level problem class in other solution spaces (Hevner et al., 2004). Applying the artifact can help address and explore the problem class in more depth, further improve its generalizability, or inform the design of more sophisticated artifacts as tools for similar problems.

Practical implications

In addition to its theoretical contributions, our article has practical implications for organizations. By providing a systematic approach to evaluation and improvement, our artifact can guide CA teams in various ways to support the successful development, operation, and evolution of CAs. First, the combination of the quality criteria set and procedure model allows practitioners to obtain a comprehensive overview of relevant criteria and to narrow down the evaluation of their CA in order to identify specific problems and improve its overall quality. Second, the procedure model can serve as a blueprint for CA teams to systematize the evaluation process. The delineation of content and the sequence of relevant steps provide a feasible approach for practitioners to structure their evaluation and improvement activities for existing CAs. In addition, the criteria set can serve as a basis for CA teams to decide whether a CA project should be established and whether the prerequisites (e.g., prepared data, an interdisciplinary team) for a comprehensive and multi-perspective evaluation of CA quality are in place. This could accelerate the execution of evaluation and improvement tasks. Beyond describing relevant criteria and evaluation steps, the artifact's application may positively affect organizations. For instance, following the systematized procedure to improve CAs could increase perceived user satisfaction, resulting in improved acceptance and usage rates and consequently counteracting the discontinuation rate of CAs. Moreover, the evolution of CAs and the related positive effects need not be limited to the CA domain: their success could foster an organization's overall AI transformation, as the increased quality and use of CAs can influence other learning and AI-based IS.

Limitations and future research

Our research is not without limitations, which have implications for further research. The developed artifact comprises a comprehensive set of quality criteria. However, its application does not guarantee success in the deployment and continuous improvement of CAs. To achieve this broader goal, additional aspects must be considered, such as technical requirements (e.g., AI, ML, and NLP algorithms and tools), the fit of the technology to the use case (e.g., using a CA for complex tasks or in emotionally sensitive environments), design (e.g., the human–computer interface), and organizational communication to users (e.g., tutorials, highlighting benefits and restrictions). Only the interplay of all these factors lays the foundation for successful CA operation. In pursuit of this goal, the quality criteria set and procedure model can be considered one piece of a larger puzzle.

The instantiation of the quality criteria set revealed several challenges and aspects that need further research. First, in Phase 1 of our instantiation, we determined the need for a quality criteria-based, in-depth evaluation of CAs' output and performance. These overarching meta-criteria proved to be valid starting points for exploring the improvement potentials of the ExpertBot. Nevertheless, further investigation is required to identify additional triggers that warrant an in-depth evaluation. Broadening the perspective, triggers from outside the organizational boundaries, such as customer feedback, are also conceivable. However, we did not identify any such triggers in our project due to the inward-facing use case of the ExpertBot.

Second, further research is needed on how organizations can devise a CA evaluation strategy, including aspects such as evaluation intervals, criteria selection, and suitable evaluation methods. In our real-world instantiation, we resorted to semi-structured interviews and A/B testing as evaluation methods, which are not necessarily suitable for all criteria. Overall, a general framework could assist organizations and researchers in selecting suitable evaluation and data analysis methods for relevant areas (such as our meta-criteria). In particular, longitudinal studies that explore the application of the quality criteria set and procedure model in real-world environments can provide deeper insights into evaluation strategies and the impact of their use.
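For illustration, quantitative evidence from an A/B test of two CA versions could be analyzed with a simple two-proportion z-test, as in the following sketch; the success counts and variable names are hypothetical and not taken from our instantiation.

```python
from math import sqrt, erf

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for the difference of two success rates (sketch)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided p-value
    return z, p_value

# Hypothetical A/B test: 180 of 240 dialogues handled successfully by the
# revised CA version (A) versus 150 of 250 by the current live version (B).
z, p = two_proportion_z_test(180, 240, 150, 250)
print(f"z = {z:.2f}, p = {p:.4f}")  # a small p-value favors the revision
```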

Third, we observed that different quality criteria in our set have varying levels of impact on CA quality. The criteria exhibit an indeterminate degree of interdependence, as they influence each other. In addition, we found that the skills of the participants (e.g., the CA team) can influence these effects. We derive three directions for further research: (1) the conception, design, and evaluation of a modular, context-adaptive procedure model that can be tailored to arbitrary CA application environments and their essential quality criteria, including individual conditions; (2) investigations to determine the needs of AI and data literacy experts for designing, continuously evaluating, and improving CAs, as well as the needs of (non-expert) users for utilizing and validating the information output, in order to counteract CA failure; (3) the identification of criteria in our set that may need to be evaluated more or less frequently. A combined classification or ranking of the influence and importance of the criteria (e.g., using an empirical research approach) offers additional potential.
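As a minimal illustration of such a ranking, the sketch below orders a few quality criteria by the mean importance ratings of evaluation participants; the criteria, scale, and ratings shown are assumptions for demonstration purposes only.

```python
from statistics import mean

# Hypothetical importance ratings (1 = low, 5 = high) given by members of a
# CA team for a small subset of quality criteria.
ratings = {
    "response accuracy":        [5, 5, 4, 5],
    "visual representation":    [2, 3, 2, 3],
    "tailored personalization": [4, 3, 4, 4],
}

# Rank the criteria by mean rating, e.g., to prioritize evaluation frequency.
ranking = sorted(ratings.items(), key=lambda item: mean(item[1]), reverse=True)
for rank, (criterion, scores) in enumerate(ranking, start=1):
    print(f"{rank}. {criterion}: mean importance {mean(scores):.2f}")
```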

Moreover, we expect further technical progress and research in the context of customizable CAs, as also described by Schöbel et al. (2023). Aspects such as social presence and anthropomorphism, as well as the personalization and empathy of human-AI interactions, need to be considered. In particular, the new wave of AI technologies and LLMs could lead to improvements in customization and contextualization. By exploring these areas, further research could counteract users' skepticism toward conventional CAs, which are often perceived as unnatural, impersonal, or deceptive (Schöbel et al., 2023), and reduce the overall failure of CAs (Gnewuch et al., 2017). In this regard, our criteria set contains anthropomorphism criteria, such as identity, visual representation, and tailored personalization. However, these criteria need to be further explored in light of the recent customization and contextualization capabilities of LLMs, which could make CAs more adaptive to users' emotional states, for example, by tailoring responses to individual needs and preferences, thereby fostering wider acceptance in the future.

In addition to information retrieval scenarios, CAs augmented with LLM capabilities could act in a broader spectrum of use cases. In conjunction with our demonstrated procedure model, new approaches for evaluating and improving CAs extended by LLMs may become necessary. For example, generative activities (e.g., content created based on statistical methods and available data) should be handled differently from information retrieval activities (e.g., content extracted unchanged from a connected data source). The question arises whether the generated content corresponds to the truth or contains false and misleading information. The range of application scenarios in practice and the exploration of these technologies are still in their infancy and will be an engaging field of research (Schöbel et al., 2023).
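To illustrate this distinction, the following sketch shows one simple, assumed way an evaluation step could treat the two activity types differently: retrieved answers are checked for verbatim containment in the connected source, whereas generated answers are only screened with a rough token-overlap heuristic and flagged for manual fact-checking; neither the heuristic nor the threshold is a validated method.

```python
def evaluate_answer(answer: str, source_text: str, generative: bool) -> str:
    """Route an answer to a different check depending on how it was produced."""
    if not generative:
        # Information retrieval: the content should appear unchanged in the
        # connected data source.
        return "ok" if answer in source_text else "mismatch with source"
    # Generative content: token overlap only indicates how much of the answer
    # is covered by the source; low coverage is flagged for a manual fact
    # check rather than treated as automatically false.
    answer_tokens = set(answer.lower().split())
    source_tokens = set(source_text.lower().split())
    coverage = len(answer_tokens & source_tokens) / max(len(answer_tokens), 1)
    return "ok" if coverage >= 0.8 else "flag for manual fact check"

source = "The VPN client must be updated to version 5.2 before remote login."
print(evaluate_answer("The VPN client must be updated to version 5.2",
                      source, generative=False))  # retrieval: "ok"
print(evaluate_answer("Update the VPN app to 6.0 for remote login",
                      source, generative=True))   # generative: flagged
```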

While our article provides insights and an approach for the evaluation and improvement of CAs, it has methodological limitations. Although our artifact proved applicable when instantiated in an IT organization, its transferability to other application environments with CAs for different use cases, other CA teams, and other conditions remains to be proven in order to further address the overarching problem class. In this vein, our set of quality criteria could serve as a building block for adaptation. Overall, several foundations are laid for research on the design, validation, and adaptation of the quality criteria set and procedure model.

Conclusion

CAs have become increasingly relevant in facilitating convenient access to information and services, representing essential gateways for organizations to interact with customers or employees (Følstad et al., 2021). However, due to their frequent premature deployment and varying maturity levels, CAs can be error-prone and fail to meet the requirements of their intended use cases, ultimately leading to their abandonment. To address this challenge, we conducted a DSR project that demonstrates how organizations can leverage a systematized, criteria-based procedure model to foster the continuous evaluation and improvement of CAs. Our article provides guidance for organizations to better understand and evaluate the quality of their CAs, thereby laying the foundation for their long-term success. As a result, this article expands the knowledge base on CAs and emphasizes that evaluating and improving them is an ongoing challenge due to their complex and unique nature. Additional research is needed to further explore how organizations can conduct criteria-based evaluations of their CAs and develop effective evaluation strategies.