Introduction

Colonoscopy is regarded as a tier-1 screening modality, which has significantly contributed to the reduction in the incidence of colorectal cancer (CRC) and its associated mortality [1]. Adenoma detection rate (ADR), defined as the percentage of screening colonoscopies that reveal at least one adenoma, is a quality marker for endoscopists. It is worth noting that every 1% improvement in ADR corresponds to a 3% reduction in CRC cases [2]. To further emphasize the importance of adenoma detection, the US Multi-Society Task Force (MSTF) recommends striving for a minimum ADR of 30% for male and 20% for female patients [3].
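As a minimal illustration of the ADR definition above, the following Python sketch computes ADR from per-procedure adenoma counts and compares it with the MSTF minimum targets quoted in the text. The function names and the example counts are hypothetical and are not taken from any included study.

```python
def adenoma_detection_rate(adenomas_per_colonoscopy):
    """ADR (%) = share of screening colonoscopies with at least one adenoma."""
    if not adenomas_per_colonoscopy:
        raise ValueError("at least one colonoscopy is required")
    positive = sum(1 for n in adenomas_per_colonoscopy if n >= 1)
    return 100.0 * positive / len(adenomas_per_colonoscopy)


# Minimum ADR targets quoted in the text (MSTF): 30% for men, 20% for women.
MSTF_MINIMUM_ADR = {"male": 30.0, "female": 20.0}


def meets_mstf_target(adr_percent, sex):
    """Return True if the ADR reaches the minimum target for the given sex."""
    return adr_percent >= MSTF_MINIMUM_ADR[sex]


# Hypothetical example: adenoma counts from 10 screening colonoscopies.
example_counts = [0, 2, 0, 1, 0, 0, 3, 0, 0, 0]
example_adr = adenoma_detection_rate(example_counts)
print(f"ADR = {example_adr:.1f}%")
```

Note that ADR counts procedures with at least one adenoma, not the total number of adenomas, so a colonoscopy with three adenomas contributes the same as one with a single adenoma.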

To improve this important quality marker, numerous strategies have been proposed, including the use of artificial intelligence (AI), second observers, and distal attachment devices [4, 5].

The use of a second observer, whether a trainee, nurse, or technician, is a time-proven method that has consistently improved both ADR and polyp detection rate (PDR) during colonoscopy [6]. However, as we enter the era of AI, there is growing interest in whether this technology can improve ADR independently, eliminating the need for a second observer. In a comprehensive network meta-analysis of 94 randomized controlled trials (RCTs), we previously demonstrated the superiority of AI in improving both ADR and PDR compared with various modalities, including different methods of virtual chromoendoscopy and add-on devices [5]. Notably, that analysis did not include the second observer [5].

Because AI and a second observer operate in a similar way, there has been recent interest in comparing these two modalities [7]. Consequently, we performed a network meta-analysis to evaluate the effectiveness of AI, a second observer, and a single observer in improving ADR.

Methods and Materials

Search Strategy

A comprehensive search was conducted from inception until April 24, 2023, across the following databases: MEDLINE (PubMed platform, NCBI); Embase (Embase.com, Elsevier); Web of Science Core Collection, Korean Citation Index, and SciELO (Clarivate); Global Index Medicus (World Health Organization); and Cochrane Central Register of Controlled Trials (Cochrane Library, Wiley). Key reference and bibliographic lists were additionally screened for further studies. The keywords and subject terms used in the search were ‘colonoscopy,’ ‘adenoma,’ ‘artificial intelligence,’ ‘adenoma detection rate,’ ‘single blind,’ ‘double blind,’ and ‘dual observer,’ along with their corresponding medical subject heading terms, in various combinations. No language restrictions or filters were applied. The search strategy was created by an experienced librarian (WLS) and reviewed by another investigator (MKG). The detailed search strategy can be reviewed in Supplementary Table 1.

Selection and Data Collection Process

Upon completion of the search process, all results were exported to EndNote 20 citation management software (Clarivate, Philadelphia, Penn, USA), where duplicates were removed via EndNote’s duplicate detection algorithms. Subsequently, manual screening was undertaken to identify and eliminate any remaining duplicates. Title and abstract screening was conducted independently by two investigators (J.D. and D.S.D.), and disputes were resolved by a third senior author (M.A.). Full-text screening was conducted in the same manner by the same investigators. Where full-text articles were not readily available or further data were required, the corresponding authors of the studies were contacted and the missing or additional data were requested. Throughout the process, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were strictly followed [8].

Inclusion and Exclusion Criteria

We included all studies that involved adult patients aged 18 and above undergoing colonoscopy with a single observer, double observer, or AI. Our inclusion criteria comprised relevant randomized studies and prospective and retrospective studies. Conversely, we excluded case reports, case series with fewer than 10 patients, editorials, guidelines, and review articles.

Data Extraction and Study Outcomes

The results were tabulated using Microsoft Excel (Microsoft, Redmond, Wash, USA). The extracted data items included author names, date of publication, type of study design, age, sex, the total number of patients, and adenoma detection rate in the single-observer, double-observer (operator plus observer), and AI groups.

Data Synthesis and Statistical Analysis

Categorical data were summarized as counts and percentages. We conducted a direct head-to-head comparator analysis and network meta-analysis of all available groups. The direct meta-analysis was performed using the DerSimonian-Laird method and random-effects model on Open Meta Analyst (CEBM, University of Oxford, Oxford, United Kingdom) [9].
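The DerSimonian-Laird random-effects pooling named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Open Meta Analyst implementation: the function names, the continuity correction for empty cells, and the example study counts are all assumptions introduced here.

```python
import math


def log_odds_ratio(events_a, total_a, events_b, total_b):
    """Log odds ratio (arm A vs. arm B) and its variance from 2x2 counts."""
    a, b = events_a, total_a - events_a
    c, d = events_b, total_b - events_b
    if 0 in (a, b, c, d):
        # Continuity correction for empty cells (an assumption of this sketch).
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def dersimonian_laird(effects, variances):
    """Pool study effects with the DerSimonian-Laird random-effects estimator."""
    weights = [1 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w
    # Cochran's Q: weighted dispersion around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    scale = total_w - sum(w * w for w in weights) / total_w
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / scale) if scale > 0 else 0.0
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return {"estimate": pooled, "se": se, "tau2": tau2,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se)}


# Hypothetical studies: (adenoma+ single, n single, adenoma+ AI, n AI).
studies = [(120, 500, 160, 500), (80, 400, 110, 400), (200, 700, 230, 700)]
pairs = [log_odds_ratio(*s) for s in studies]
result = dersimonian_laird([p[0] for p in pairs], [p[1] for p in pairs])
pooled_or = math.exp(result["estimate"])
ci_low, ci_high = (math.exp(x) for x in result["ci"])
print(f"Pooled OR (single vs. AI): {pooled_or:.3f} ({ci_low:.3f}-{ci_high:.3f})")
```

Pooling is done on the log scale and exponentiated at the end. With the single-observer arm entered first, a pooled OR below 1 means lower odds of adenoma detection with a single observer than with the comparator.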

Comparison of interventions and visual display of the findings were performed by network meta-analysis using the random-effects model in the ‘R’ package ‘netmeta’ (Bell Labs, Murray Hill, USA) [10]. The odds ratio (OR) for each outcome was calculated with a 95% confidence interval (CI). Microsoft Excel (Microsoft, Redmond, Wash, USA) was used to tabulate and create tables for this study. A p-value < 0.05 was considered statistically significant. The “frequentist method” was used to rank the interventions, and a P-score was generated for each. Study heterogeneity was assessed using the I² statistic as defined by the Cochrane Handbook for systematic reviews, and a value > 50% was considered substantial heterogeneity [11]. Disagreement between direct and indirect evidence was assessed using the node-splitting technique [10].
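The I² statistic used for heterogeneity can be derived from Cochran's Q, as in the hedged Python sketch below. The function names and the input values are hypothetical; only the 50% cut-off for substantial heterogeneity comes from the text.

```python
def cochran_q(effects, variances):
    """Cochran's Q: inverse-variance weighted dispersion of study effects."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(effects, variances):
    """I² (%) = max(0, (Q - df) / Q) * 100, with df = number of studies - 1."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    if df <= 0 or q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def is_substantial(i2_percent, threshold=50.0):
    """Apply the cut-off stated in the text: I² > 50% is substantial."""
    return i2_percent > threshold


# Hypothetical log effect sizes and variances for two studies.
example_i2 = i_squared([0.0, 1.0], [0.1, 0.1])
print(f"I² = {example_i2:.0f}%, substantial: {is_substantial(example_i2)}")
```

I² expresses the share of the observed variation across studies that exceeds what sampling error alone would produce, so it is truncated at zero when Q falls below its degrees of freedom.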

Bias Assessment

The risk of bias in the included studies was assessed using the Newcastle–Ottawa Scale (NOS) for observational studies [12] and the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach for RCTs [13]. The GRADE approach was also used to evaluate the strength of evidence for results from the NMA. Five domains that affect the level of confidence in the NMA results were considered: (i) risk of bias, (ii) inconsistency, (iii) indirectness, (iv) imprecision, and (v) heterogeneity. Due to the nature of the trials evaluating the effects of a second observer and AI, participants and personnel could not be blinded. Therefore, the blinding domain was not used to calculate the overall risk of bias for included studies. GRADEPro GDT was used to construct the summary of findings (SoF) table presented in Supplementary Table 5. Publication bias was assessed visually using a funnel plot and statistically using Egger’s regression analysis. A p-value < 0.05 was considered to indicate statistically significant publication bias.
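Egger’s regression, mentioned above, regresses each study's standardized effect on its precision and examines the intercept for funnel-plot asymmetry. The sketch below is a simplified, assumption-laden Python version: it is not the routine used in the analysis, the function name and inputs are hypothetical, and it returns the t statistic only (the p-value would be read from a t distribution with k - 2 degrees of freedom).

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns the intercept (asymmetry estimate), the slope, and the t statistic
    of the intercept, which is NaN when the fit has no residual variance.
    """
    k = len(effects)
    if k < 3:
        raise ValueError("Egger's regression needs at least three studies")
    x = [1 / se for se in std_errors]                     # precision
    y = [e / se for e, se in zip(effects, std_errors)]    # standardized effect
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must differ between studies")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = math.sqrt(residual_ss / (k - 2) * (1 / k + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return {"intercept": intercept, "slope": slope, "t": t_stat}


# Hypothetical log effects in which smaller studies (larger SE) report larger
# effects, the pattern that produces an asymmetric funnel plot.
asymmetric = egger_regression([0.1, 0.2, 0.4, 0.8], [0.05, 0.1, 0.2, 0.4])
print(f"Egger intercept: {asymmetric['intercept']:.2f}")
```

An intercept near zero is consistent with a symmetric funnel plot, whereas an intercept far from zero suggests small-study effects, of which publication bias is one possible cause.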

Results

The search strategy identified 2952 articles, of which 260 unique studies were subjected to full-text review. After applying strict inclusion and exclusion criteria (Supplementary Fig. 1), a final selection of 26 studies was made. Our study included 20 randomized controlled trials, 3 retrospective studies, and 3 prospective studies involving 22,560 subjects [7, 14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]. These studies spanned the years 2008 to 2023 and comprised 12,596 ‘single’ observer, 4187 ‘double’ observer, and 5777 ‘AI’ observer cases. The mean age across the studies ranged from 46.0 to 66.4 years (Table 1). Table 2 provides details on adenoma detection rate categorized by single, double, and AI operators.

Table 1 Demographic characteristics of included studies
Table 2 Intervention outcomes

Direct Meta-Analysis

In the direct comparative analysis, single-observer ADR was compared with AI ADR. The results indicated a statistically significant difference in adenoma detection rate between single operator and AI, favoring AI (OR: 0.668, 95% CI 0.595–0.749, p < 0.001), as depicted in Fig. 1. Similarly, a significant difference in ADR was observed between single observer and second observer, with second observer demonstrating a higher ADR (OR: 0.771, 95% CI 0.688–0.865, p < 0.001), as depicted in Fig. 2.

Fig. 1 Forest plot showing direct comparison between single operator vs. AI

Fig. 2 Forest plot showing direct comparison between single operator vs. double operator

Network Meta-Analysis

Adenoma detection rate in all modalities is summarized in Table 3. AI (RR: 1.26, 95% CI 1.17–1.35) and second observer (RR: 1.19, 95% CI 1.09–1.30) exhibited statistically significantly higher rates of adenoma detection when compared to single operator. The network forest plot is shown in Fig. 3A. No statistically significant difference was noted when comparing AI to second observer (RR: 1.1, 95% CI 0.9–1.2, p = 0.3).

Table 3 Network meta-analysis outcomes evaluating adenoma detection rate in cohort studies
Fig. 3 A Network forest plot of adenoma detection rate in cohort studies. B Network forest plot of adenoma detection rate in RCTs

Random-effects models of second observer versus single observer had low heterogeneity at 18%, supporting high confidence in the level of evidence and low risk of bias, while single operator versus AI had high heterogeneity at 63%, suggesting high variability among included studies. Net split models were also consistent in direct and indirect evidence when evaluating adenoma detection rates (Supplementary Table 2). This consistency reinforces confidence in the overall findings of the network meta-analysis and strengthens the conclusions drawn regarding the comparative effectiveness of the interventions. Corresponding net split plots (Supplementary Fig. 2A) and net graphs (Supplementary Fig. 2B) are provided in the supplementary materials.

Randomized Controlled Trials

In RCTs, adenoma detection rate in all modalities is summarized in Table 4. Similarly, AI (RR: 1.27, 95% CI 1.18–1.37) and second observer (RR: 1.21, 95% CI 1.07–1.37) exhibited statistically significantly higher rates of adenoma detection when compared to single operator. No statistically significant difference was noted when comparing AI to second observer (RR: 1.0, 95% CI 0.9–1.2, p = 0.49). The network forest plot is shown in Fig. 3B.

Table 4 Network meta-analysis outcomes evaluating adenoma detection rate in RCTs

Random-effects models of second observer versus single observer had low heterogeneity at 0%, supporting high confidence in the level of evidence and low risk of bias, while AI versus single operator had high heterogeneity at 65%, suggesting high variability among included studies. Net split models were also consistent in direct and indirect evidence when evaluating adenoma detection rates (Supplementary Table 3). Corresponding net split plots (Supplementary Fig. 3A) and net graphs (Supplementary Fig. 3B) are provided in the supplementary materials.

Ranking of Interventions

P-scores were used to rank AI, second observer, and single observer for adenoma detection. AI achieved the highest score of 0.92, followed by second observer with a score of 0.58. Single operator scored 0.1. These findings were similar to those for RCTs. A higher P-score indicates greater certainty that an intervention is better than the competing interventions at detecting adenomas. The net ranking in Supplementary Fig. 5 illustrates these findings.
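To make the ranking concrete, the following Python sketch computes frequentist P-scores as the average, over all competitors, of the one-sided probability that an intervention is better. It is an approximation under stated assumptions rather than the netmeta computation: the function names are hypothetical, standard errors are back-calculated from the rounded 95% CIs reported in the network meta-analysis, and the output therefore will not reproduce the published P-scores exactly.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def from_ci(rr, low, high):
    """Log relative risk and standard error recovered from a 95% CI."""
    return math.log(rr), (math.log(high) - math.log(low)) / (2 * 1.96)


def p_scores(treatments, pairwise):
    """P-score per treatment; pairwise[(a, b)] = (log RR of a vs. b, SE).

    A positive log RR means a detects more adenomas than b.
    """
    scores = {}
    for t in treatments:
        probs = []
        for other in treatments:
            if other == t:
                continue
            if (t, other) in pairwise:
                est, se = pairwise[(t, other)]
            else:
                est, se = pairwise[(other, t)]
                est = -est  # reverse the direction of the comparison
            probs.append(normal_cdf(est / se))
        scores[t] = sum(probs) / len(probs)
    return scores


# Rounded summary estimates taken from the network meta-analysis results.
arms = ["AI", "second observer", "single observer"]
comparisons = {
    ("AI", "single observer"): from_ci(1.26, 1.17, 1.35),
    ("second observer", "single observer"): from_ci(1.19, 1.09, 1.30),
    ("AI", "second observer"): from_ci(1.1, 0.9, 1.2),
}
ranking = p_scores(arms, comparisons)
for name in sorted(ranking, key=ranking.get, reverse=True):
    print(f"{name}: {ranking[name]:.2f}")
```

Because each pairwise probability and its reverse sum to 1, the P-scores of three interventions always sum to 1.5, which is why a gain for one intervention must be offset by a loss for another.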

Publication Bias

The Newcastle–Ottawa Scale was used to rate the cohort studies, with scores ranging from 5 to 6, indicating moderate quality (Supplementary Table 4). GRADE assessment was used for RCTs and showed high to low quality of evidence (Supplementary Table 5). Egger’s test (p = 0.0404) and the Thompson–Sharp test (p = 0.0475) revealed significant evidence of publication bias, as shown in Supplementary Fig. 4A. Similarly, for RCTs, Egger’s test (p = 0.0035) and the Thompson–Sharp test (p = 0.0046) showed evidence of publication bias (Supplementary Fig. 4B).

Discussion

This systematic review and network meta-analysis demonstrated that both AI and a second observer are superior to a single observer in improving ADR. When comparing AI to a second observer, we did not find a statistically significant difference, although the ranking analysis suggested a potential preference toward AI.

Our meta-analysis confirmed the adage that “two pairs of eyes are better than one,” whether those pairs belong to humans or machines. As Wallace aptly suggested, three mechanisms contribute to missing an adenoma: it may not be within the visual field, it might go unrecognized, or it may be unrecognizable [39]. The second observer aids in recognizing “not recognized” polyps and potentially even those in the first scenario if they encourage better technique. In contrast, AI has the potential to assist in identifying “not recognizable” polyps, in addition to the aforementioned scenarios. The effectiveness of a second human observer, however, depends on their experience and training. Prior studies have demonstrated that when a trainee serves as the second observer, ADR increases with each year of the trainee’s fellowship training [29]. Furthermore, studies have shown that experienced nurses, when acting as second observers, increase ADR compared to their less experienced counterparts [40]. The studies included in our meta-analysis involved different settings and observers with varying levels of experience. Therefore, standardizing training for second observers, reducing heterogeneity, and evaluating these effects become imperative.

As for AI, with the ever-increasing quality of endoscopic imaging, visual field, and the computational power of processors, AI should theoretically surpass the human eye in diagnosing adenomas. AI can provide real-time pixel-level analysis of every frame, overcoming the human eye’s limitations, such as its propensity to miss briefly visible or partially blocked adenomas. In addition, the human eye is susceptible to inherent defects such as “inattentional blindness,” when distractions lead to missed polyps, and “change blindness,” when alterations are missed during eye movement [41,42,43]. These intrinsic limitations cannot be fully addressed by another human observer. Despite these limitations, the presence of another human observer may have unaccounted benefits, as Rex et al. showed an increase in withdrawal time when the endoscopist was being observed [44]. This may be due to competition and increased attentiveness, leading to improved inspection techniques, such as inspecting behind folds and spending more time in withdrawal. Meanwhile, endoscopists may become overreliant on AI, inadvertently reducing the quality of inspection, highlighting the need for a second observer.

AI also offers solutions to mitigate the reduced exam quality that may result from endoscopists’ overconfidence in it. By reminding endoscopists to inspect behind folds and improve withdrawal time, Computer-Aided Quality (CAQ) AI has shown efficacy in improving exam quality [45]. While the studies in our analysis primarily used Computer-Aided polyp Detection (CADe) AI, combinations of CADe and CAQ AI have shown superior results compared to CADe alone [45]. Future research is required to study these effects, especially in head-to-head comparisons with second observers.

A notable issue with AI in diagnostics is its high rate of false positives, in which benign findings are mistakenly flagged as adenomas and then disproven upon further examination [46]. Nevertheless, evidence suggests that AI outperforms its human counterpart in this respect; Wang et al.’s recent study found that second observers were significantly more prone to false alarms than AI [7]. We speculate that with advancements in polyp characterization (CADx) AI, the rate of false positives in the AI group will further decrease. This may potentially make AI more efficient than second observers, an area requiring further investigation.

Another aspect to explore is how these modalities benefit endoscopists based on their expertise. Some studies suggest that second observers exclusively benefit inexperienced endoscopists and offer no additional advantage to expert endoscopists [30, 31]. This may be due to the marginal benefit of second observers for experts. Another potential cause might be the reluctance of trainees or nurses to point out missed lesions to expert endoscopists. This is important as the included studies are from different countries with cultural heterogeneity. Another potential source of heterogeneity is the level of experience of the second observer; some studies included trainees, while others included nurses. Despite this variation, all studies showed effectiveness regardless of the type of second observer. With regard to AI, while some studies show greater benefits for inexperienced endoscopists, others do not find significant differences in AI’s benefit between high and low detectors [45, 47]. Although subgroup analysis was not possible due to insufficient data, future studies may help identify the subgroups that benefit most from these modalities.

Our study had some limitations. First, the study combines both RCTs and observational studies, with the latter carrying inherent bias. However, most of the data were derived from RCTs, which carry a lower risk of bias. Second, the non-blinded methodology in the studies could affect endoscopist behavior due to the presence of another observer or overconfidence in the assisting modality [48]. The involvement of multiple endoscopists should help reduce this bias. Third, heterogeneity exists in terms of patient populations and timing. In our systematic review, the second-observer studies were published predominantly prior to 2018, while all AI studies emerged after 2019; however, all colonoscopies utilized HD endoscopes. Fourth, due to the limited follow-up duration, long-term outcome data on interval CRC and mortality are scarce and could not be analyzed. Lastly, we were unable to conduct subgroup analysis based on polyp morphology and location, which is essential given that serrated lesions in the right colon are an important cause of interval CRCs.

In conclusion, both AI and a second observer improved ADR compared to single-observer colonoscopy. More standardized RCTs are required to compare AI with second observers, as current data suggest AI’s superiority even though statistical significance was not achieved. As the technology evolves, we recommend the use of AI, where feasible, to improve ADR.