1 Introduction

Well-designed randomised experiments such as randomised controlled trials (RCTs) are considered the “gold standard” for evaluating and comparing interventions at the end of the study (Akobeng 2005; Rosenberger et al. 2019; Kim et al. 2021). The standard protocol for conducting an RCT is based on following a predefined design that does not allow modifications of the trial without approved amendments (Pallmann et al. 2018). For example, interventions are assigned with prespecified (typically equal) randomisation probabilities that cannot be changed during the course of the study to direct participants to the most promising options. This approach ensures strong statistical guarantees in terms of bias and type-I error control, among other properties (see e.g., Villar et al. 2015; Williams et al. 2021; Robertson et al. 2023), but allows no flexibility for data-driven online alterations. Adaptively-randomised experiments have the potential to utilise accumulating information within the trial to modify its course (randomisation probabilities, sample size, etc.) in accordance with predetermined rules (Pallmann et al. 2018). To illustrate, response-adaptive randomisation schemes may use participants’ responses to skew the allocation toward more efficient or informative interventions with the aim of assigning superior ones to as many participants as possible (Robertson et al. 2023). While collecting high-quality data, these types of experiments can result in a more flexible, efficient, and ethical alternative compared to traditional randomised studies (Bothwell et al. 2018; Pallmann et al. 2018).

As a concrete example, consider a platform such as Netflix or Spotify: instead of simply showing users content matching their own TV show/music preferences (or displaying random content), the system's goal is to continuously interact with users so as to learn their tastes and offer recommendations that may enhance their overall experience and engagement with the platform. Clearly, to provide the most suitable content efficiently, the outcomes of interest (e.g., some measure of appreciation or engagement) must be promptly observed so that the system can adjust to users' preferences on a continuous basis. It is thus becoming increasingly common to incorporate preference information that may provide insights into a longer-term outcome of interest such as engagement. Netflix, for instance, has found significant business value in collecting users' feedback to personalise their movie experience (Amatriain and Basilico 2015). Similarly, university instructors have found value in using student ratings to select better or preferred explanations of a concept and enhance student understanding (Williams et al. 2018). In the healthcare setting, relying on feedback collected through mobile-health technologies is increasingly practised for healthcare delivery (Figueroa et al. 2022; Liu et al. 2023) and for increasing medication adherence (Gandapur et al. 2016).

Starting from the pioneering work of Thompson (1933), multi-armed bandit (MAB) algorithms (Lattimore and Szepesvári 2020) have long been advocated as useful tools to adaptively randomise experiments. This framework provides a succinct abstraction of the trade-off between exploration (learning enough information about the different options, or arms) and exploitation (selecting the most promising arm(s) so far), inherent in many online sequential decision-making problems with incomplete knowledge. Specifically, in MAB problems, a learner or decision-maker must repeatedly select an arm from a given set of alternatives, say \(A_t \in {\mathscr {A}}\), in an online manner at each round \(t=1,\dots ,T\), with T finite or infinite. After selecting an arm \(A_t\), a numerical reward \(Y_t(A_t) \in {\mathbb {R}}\) associated with that arm is observed, ideally before the next round. A typical goal in MAB problems is to learn how to efficiently use observations from previous rounds to improve decision making so as to maximise the expected total reward \({\mathbb {E}}\left[ \sum _{t=1}^T Y_t(A_t)\right]\) under uncertainty about the best arm.

Several algorithms have been proposed for the MAB problem; for a survey, we refer to Lattimore and Szepesvári (2020). In this work, we focus on a highly interpretable, computationally efficient, and asymptotically optimal (Agrawal and Goyal 2017) strategy originally introduced in the clinical trial arena: Thompson sampling (TS; Thompson 1933; Russo et al. 2018). Due to its competitive empirical and theoretical properties (see e.g., Chapelle and Li 2011; Agrawal and Goyal 2017, for details), TS is nowadays receiving renewed attention in several domains, including online recommendation (Chapelle and Li 2011), economics/finance (Charpentier et al. 2023), education (Williams et al. 2018), and healthcare (Deliu et al. 2023; Figueroa et al. 2021).

There exists a wide array of TS variants (see e.g., Russo 2016; Kasy and Sautmann 2021; Li et al. 2022); however, with a few exceptions, most of them are based on binary or continuous outcomes, with a persistent shortage of appropriate solutions for rating scale data. In such cases, common practice is either to dichotomise the ordinal reward variable according to an (often arbitrary) cutoff and then use a binary model (Williams et al. 2018), or to use a Normal model directly on the rating outcome (Deliu et al. 2021; Parapar and Radlinski 2021). Such practices have long been recognised as suboptimal in terms of both reward efficiency (Williamson and Villar 2020) and statistical inference (Altman and Royston 2006).

Motivating example This work is directly motivated by a series of exploratory studies aimed at evaluating the feasibility of MAB algorithms for developing intelligent adaptive systems that continuously improve user experiences through their ratings. Without loss of generality and for illustrative purposes only, here we report on a two-armed behavioural experiment (MTurk I) we conducted on Amazon Mechanical Turk, an online platform widely used by academics as a quick and inexpensive means of collecting experimental data (Mason and Suri 2012). Participants were recruited online and invited to complete the survey, with a compensation of USD 10 per user. Among other personal information (such as age and gender), participants were asked to provide a rating, defined on a 7-point Likert scale, for two types of messages related to mood and mental health. In the MTurk I experiment, a traditional (balanced) randomised design was implemented with the aim of learning about arm effectiveness. Overall, \(T=110\) users participated in the study, with \(T_1 = 58\) and \(T_2 = 52\) users receiving arm 1 (“Today is a new day to start fresh”) and arm 2 (“Let the past make you better, not bitter”), respectively. A summary of the ratings provided to each arm is reported in Fig. 1, which shows a small superiority of arm 1 (sample mean \({\hat{\mu }}_1 = 5.81\) vs \({\hat{\mu }}_2 = 5.08\)). Subsequent experiments were conducted adaptively using TS but, despite the rating nature of the outcome, they were all based on a conventional Normal model, thereby motivating this work.

Fig. 1: Arm ratings distribution in the two-armed MTurk I experiment, defined on a 7-point Likert scale. A small superiority of arm 1 over arm 2 is shown: sample means \({\hat{\mu }}_1 = 5.81\) vs \({\hat{\mu }}_2 = 5.08\), sample variances \({\hat{\sigma }}^2_1 = 1.04\) vs \({\hat{\sigma }}^2_2 = 1.72\), for an overall sample size \(N=110\)

Our contribution and related work The contribution of this paper is threefold. First, we introduce Multinomial Thompson sampling (Multinomial-TS), a TS version specifically designed for rating scale data. The existence of a conjugate prior for this exponential family, namely the Dirichlet distribution (Kotz et al. 2000), allows an easy implementation of the proposed algorithm through its closed-form posteriors. Second, guided by the empirical behaviour of Multinomial-TS under particular skewed distributions, we also introduce Multinomial-TS with augmented support, a variation of Multinomial-TS that, based on a simple modification of the prior's support, improves the ability of the algorithm to balance arm allocation when a unique optimal arm does not exist. Further investigations on how to better calibrate uncertainty in such a scenario are conducted by studying the role of the prior distribution. As we discuss in Sects. 3 and 4, these considerations offer a potential solution for mitigating the so-called incomplete learning phenomenon (see e.g., Kalvit and Zeevi 2021), which refers to the insufficient exploration of the arms that many MAB algorithms, including TS, may exhibit. Finally, we evaluate the possibility of redesigning the motivating MTurk I experiment using the different Multinomial-TS variants and discuss their benefits and drawbacks. This paper is an extended version of the short paper presented at the 51st Scientific Meeting of the Italian Statistical Society in June 2022 (Deliu 2022).

It should be noted that a few other works have studied TS under a multinomial model (Agrawal et al. 2022; Riou and Honda 2020; Zhang et al. 2021); however, they all differ from our work in several aspects, including the problem of interest. Specifically, Zhang et al. (2021) and Agrawal et al. (2022) both study a selection problem, where a decision system is faced with one choice among a set of K arms (online products and radio codebooks, respectively). Although these K arms may be considered analogous to the points of a Likert scale, we emphasise that in our (rating scale) problem each arm is itself defined on a scale, and the scale has an ordered nature. Riou and Honda (2020) cover a more general framework that can also accommodate our problem, but their work assumes bounded rewards in [0, 1]. Furthermore, it is worth noting that all of these works are primarily interested in regret analysis, performed under the assumption that there exists a unique optimal arm. No evaluations or considerations are made in scenarios where multiple arms are equally optimal, a common situation in many realistic applications, from image classification to personalised medicine (see e.g., Berry et al. 1997, for examples).

Structure of the work The paper is organised as follows: in Sect. 2, we introduce the general adaptive K-armed experimental setup along with the TS algorithm. The proposed Multinomial-TS strategy is detailed in Sect. 2.1 and its empirical performance is evaluated in Sect. 3. Section 4 is devoted to examining two potential strategies for mitigating the incomplete learning phenomenon and calibrating uncertainty within Multinomial-TS: one based on considerations about the skewness of the distribution (Sect. 4.1) and another based on the information carried by the prior distribution (Sect. 4.2). The potential of applying the proposed Multinomial-TS variants to the MTurk I experiment is discussed in Sect. 5. Finally, Sect. 6 presents concluding remarks and some research directions for future work.

2 Problem setting and methods

Experimental setup Consider a general K-armed experiment defined over a finite horizon T, in which N participants are accrued in a fully-sequential way, with an experimental size \(N = T\). At accrual, each participant \(t=1,\dots ,T\) is assigned to one of the K available arms \(A_{t} \in \{1,\dots , K\}\), and subsequently an outcome \(Y_{t}(A_{t})\) associated with the assigned arm \(A_{t}\) is observed before the next round \(t+1\). We assume that \(Y_{t}(A_{t}), t=1,\dots ,T\), does not depend on individual characteristics but only on arms, although our results may be extended to a more general contextual MAB setting (Lattimore and Szepesvári 2020). The arms are drawn according to a policy \(\varvec{\rho _t} \doteq \{\rho _{t,k}, k=1,\dots ,K\}\), where \(\rho _{t,k}\) is the allocation probability of arm k at round t. Given the history of selected arms and associated rewards, say \({\mathscr {H}}_{t} \doteq \{A_{\tau }, Y_{\tau }(A_{\tau }), \tau = 1,\dots ,t \}\), the goal is to find an (optimal) allocation policy that maximises the expected cumulative reward over the horizon T. This experimental setup mirrors the MAB paradigm, which is framed around the efficient use of observations from previous rounds to estimate arm reward distributions and choose which arm to select next. The most common measure of performance in MABs is the ability to maximise, under uncertainty about the best arm, the expected total reward \({\mathbb {E}}\left[ \sum _{t=1}^TY_t(A_t)\right]\), e.g., cumulative ratings provided by users over time. Equivalently, the goal is to minimise the so-called expected total regret (more simply, total regret), i.e., how much we expect to regret not knowing/selecting the optimal arm, when one exists. Formally, considering a stationary setting in which the mean reward associated with each arm does not change over rounds, i.e., \({\mathbb {E}}[Y_t(A_t)] = {\mathbb {E}}[Y(A_t)] = \mu _{A_t}\), and the optimal arm is \(A_t^* = A^* \doteq {{\,\mathrm{arg\,max}\,}}_{k \in {\mathscr {A}}}{\mathbb {E}}(Y (A_t = k)) = {{\,\mathrm{arg\,max}\,}}_{k \in {\mathscr {A}}}\mu _k, \forall t\), we have

$$\begin{aligned} \text {Total Regret} = {\mathbb {E}}\left( \sum _{t=1}^T Y_{t} (A^*)\right) - {\mathbb {E}}\left( \sum _{t=1}^T Y_{t} ( A_t)\right) = T\mu ^* - \sum _{t=1}^T\mu _{A_t}, \end{aligned}$$
(1)

with \(\mu ^* = \mu _{A^*} = \max _{k \in {\mathscr {A}}} \mu _k\). In words, regret is defined as the difference between the maximum expected reward attainable and the expected reward resulting from the arms actually selected over the horizon T.
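For concreteness, the following minimal R sketch computes the total regret of Eq. (1) for an arbitrary allocation sequence; the vectors mu and arms are hypothetical placeholders chosen purely for illustration, not data from the paper.

```r
# Minimal sketch: total regret of an allocation sequence, following Eq. (1):
# T * mu_star - sum_t mu_{A_t}.
total_regret <- function(arms, mu) {
  mu_star <- max(mu)                 # mean reward of the optimal arm (unknown to the learner)
  length(arms) * mu_star - sum(mu[arms])
}

# Hypothetical example: K = 2 arms with means 5.80 and 5.08, T = 5 rounds
mu   <- c(5.80, 5.08)
arms <- c(1, 2, 1, 1, 2)             # arms selected over the 5 rounds
total_regret(arms, mu)               # 2 * (5.80 - 5.08) = 1.44
```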

In such a setting, outcomes can be considered independent draws from a fixed but unknown distribution with the following conditional mean:

$$\begin{aligned} {\mathbb {E}}\left( Y_{t}(A_{t}) \mid {\mathscr {H}}_{t-1}\right) = {\mathbb {E}}(Y \mid A_{t}) = \sum _{k=1}^K\mu _{k}{\mathbb {I}}(A_{t}=k), \end{aligned}$$

where \(\{\mu _k, k=1,\dots ,K\}\) are the unknown arm parameters.

The focus of this work is on rating scale outcomes. Without loss of generality, we consider a J-point Likert scale with \(Y_{t} \in {\mathscr {Y}}\doteq \{1,2,\dots ,J\}\) for \(t=1,\dots ,T\), where higher values translate into better outcomes. Given the specific rating scale setting, we consider a data generation process that occurs according to a multinomial distribution such that \(\mu _k = \sum _{j=1}^Jjp_{j,k}\) for \(k=1,\dots ,K\), where \(p_{j,k}={\mathbb {P}}(Y(A_t=k)=j), j=1,\dots ,J, k=1,\dots ,K\) are the unknown parameters defining the multinomial family. This will be discussed in more detail in Sect. 2.1.

Thompson sampling Rooted in a Bayesian framework, TS defines arm allocations in terms of the posterior probability of each arm being associated with the maximum expected reward at each round \(t=1,\dots ,T\). In a K-armed setting, denoting by \(\rho _{t,k}^{\text {TS}}\) the TS probability of allocating arm k at round t, this is given by:

$$\begin{aligned} \rho _{t,k}^{\text {TS}}&= {\mathbb {P}}\Big ({\mathbb {E}}\big (Y_{t}(A_{t} = k)\big ) \ge {\mathbb {E}}\big (Y_{t}(A_{t} = k')\big ),\forall k' \mid {\mathscr {H}}_{t-1} \Big ) \nonumber \\ {}&= {\mathbb {P}}\Big (\mu _{k} \ge \mu _{k'},\forall k' \mid {\mathscr {H}}_{t-1} \Big ) \nonumber \\&= \int _{\Omega _1\times \cdots \times \Omega _K} {\mathbb {I}}[\mu _{k} \ge \mu _{k'}, \forall k'] \prod _{k=1}^K\pi _t(\mu _{k} \mid {\mathscr {H}}_{t-1}) d\mu _1 \times \dots \times d\mu _K, \end{aligned}$$
(2)

with \(\pi _t\) the posterior distribution of arm means, and \(\Omega _k\) the parameter space of \(\mu _k, \forall k\).

With a few exceptions (e.g., the Normal model), the exact computation of the quantity in Eq. (2) is not feasible. Thus, the typical way to implement TS (Russo et al. 2018) involves drawing at each round t a sample from the posterior distribution of the parameters of each of the unknown arms, and then selecting the arm associated with the highest posterior estimated mean reward, say \({\tilde{\mu }}_{tk} = {\mathbb {E}}({\tilde{Y}}_{t} \mid A_{t} = k)\), that is,

$$\begin{aligned} {\tilde{a}}_{t} \doteq \text {argmax}_{k=1,\dots ,K}{\mathbb {E}}({\tilde{Y}}_{t} \mid A_{t} = k) = \text {argmax}_{k = 1,\dots ,K}{\tilde{\mu }}_{tk}. \end{aligned}$$
(3)

To guarantee computationally efficient posterior sampling, given the repeated implementation of the strategy for each round t, a conjugate family is generally considered, with the most common distributional assumptions for the reward variable being the Bernoulli and the Normal model (Russo et al. 2018).
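To make this posterior-sampling recipe concrete, the following R sketch shows one round of TS under the Beta-Bernoulli conjugate model mentioned above; it is a minimal illustration, and all function and variable names are ours rather than taken from the paper's code.

```r
# One TS round under a Beta-Bernoulli model (minimal sketch, K arms).
# alpha[k], beta[k] store the current Beta posterior of arm k's success probability.
ts_round_bernoulli <- function(alpha, beta) {
  mu_tilde <- rbeta(length(alpha), alpha, beta)  # one posterior draw per arm
  which.max(mu_tilde)                            # Eq. (3): arm with the highest sampled mean
}

# Conjugate posterior update after observing a binary reward y in {0, 1} for the selected arm k
update_bernoulli <- function(alpha, beta, k, y) {
  alpha[k] <- alpha[k] + y
  beta[k]  <- beta[k] + 1 - y
  list(alpha = alpha, beta = beta)
}
```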

The TS algorithm is recognised as an asymptotically optimal strategy (Agrawal and Goyal 2017), meaning that it matches the asymptotic lower bound of the regret metric in Eq. (1) for adaptive allocation schemes. We refer to Lai and Robbins (1985) for the expression of the lower bound, which is beyond the scope of this paper.

2.1 Proposal: multinomial Thompson sampling

The multinomial distribution is the extension of the binomial distribution to categorical outcomes with more than two response categories (Agresti 2019). Consider a fixed number J of mutually exclusive categories of an outcome \(Y_t, t=1,\dots ,T\), among which one and only one category is observed at each round t, i.e., \(\sum _{j=1}^J{\mathbb {I}}(Y_t=j) = 1\), with \(\sum _{t=1}^T\sum _{j=1}^J{\mathbb {I}}(Y_t=j) = T\). Similarly to the binomial, the multinomial distribution models the probability of the counts \(X_{j} = \sum _{t=1}^T{\mathbb {I}}(Y_t=j), \forall j \in \{1,\dots ,J\}\), over T trials. Denoting it by \(\text {Multinom}(T;{\varvec{p}})\), with \(T>0\) the number of trials and \({\varvec{p}} = (p_1,\dots ,p_J) = \Big ({\mathbb {P}}(Y = 1),\dots ,{\mathbb {P}}(Y = J)\Big )\) the unknown model parameters belonging to the standard simplex \({\mathscr {P}}^{J} = \{ {\varvec{p}} \in [0,1]^J: \sum _{j=1}^{J}p_{j}=1 \}\), its probability mass function is given by

$$\begin{aligned} f(x_{1},\dots ,x_{J}; T; p_{1},\dots ,p_{J}) = \left( \frac{T!}{\prod _{j=1}^Jx_{j}!}\right) \prod _{j=1}^Jp_{j}^{x_{j}}, \end{aligned}$$
(4)

where \(x_{j} \in \{0,\dots ,T\}\), for all \(j=1,\dots ,J\), and \(\sum _{j=1}^J x_{j}=T\). The expected number of times category j is observed over T trials is \({\mathbb {E}}(X_{j}) = T p_j\), with variance \({\mathbb {V}}(X_{j}) = T p_j (1-p_j)\) and covariance \({\mathbb {C}}ov(X_{j}, X_i) = -T p_j p_i\), for \(i \ne j\). Theorem 1 bridges these properties of the multinomial family to the distribution of interest.

Theorem 1

Assuming a fixed number J of mutually exclusive, ordinal, and real-valued categories, and denoting with \(Y_t, t=1,\dots , T\), the category variable, the latter can be expressed as

$$\begin{aligned} Y_t = \sum _{j=1}^J j X_{j}, \quad {\textbf{X}}\sim \text {Multinom}(1;{\varvec{p}}),\quad t=1,\dots , T, \end{aligned}$$

with

$$\begin{aligned} \mu _t&= {\mathbb {E}}[Y_t] = \sum _{j=1}^J j p_j, \end{aligned}$$
(5)
$$\begin{aligned} {\mathbb {V}}(Y_{t})&= \sum _{j=1}^J j^2 p_j(1-p_j) - \sum _{i\ne j} ij p_ip_j. \end{aligned}$$
(6)

The proof of Theorem 1 is based on noticing that in each single trial (consider, for simplicity, \(T=1\)), \(X_{j} \in \{0,1\}\), for all \(j=1,\dots ,J\), with \(\sum _{j=1}^J X_{j} = 1\). The expectation in Eq. (5) as well as the variance in Eq. (6) then follow straightforwardly from the properties of the expectation and variance operators and known results on the multinomial distribution (Agresti 2019).
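As a quick numerical illustration of Theorem 1, the R sketch below compares the mean and variance in Eqs. (5) and (6) with their Monte Carlo counterparts; the probability vector p is an arbitrary example chosen for illustration.

```r
# Monte Carlo check of Eqs. (5)-(6) for a 7-point scale (sketch; p is a hypothetical example).
set.seed(1)
J <- 7
p <- c(0.05, 0.05, 0.10, 0.15, 0.15, 0.30, 0.20)   # must sum to 1

# Theoretical mean and variance from Theorem 1
mu_theory  <- sum((1:J) * p)                                      # Eq. (5)
var_theory <- sum((1:J)^2 * p * (1 - p)) -
  sum(outer(1:J, 1:J) * outer(p, p) * (1 - diag(J)))              # Eq. (6), off-diagonal covariance terms

# Empirical counterparts from simulated ratings Y_t = sum_j j * X_j, X ~ Multinom(1; p)
y <- apply(rmultinom(1e5, size = 1, prob = p), 2, function(x) sum((1:J) * x))
c(mu_theory, mean(y))      # should be close
c(var_theory, var(y))      # should be close
```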

We emphasise the key role of this result in light of the algorithm under study. Multinomial-TS is indeed implemented by using the result in Eq. (5) to directly compute the posterior mean outcome, as required by the definition in Eq. (2). Posterior updates exploit the conjugacy between the Dirichlet distribution and multinomial counts. In fact, Eq. (4) can also be expressed using the gamma function \(\Gamma\), directly showing its similarity to the Dirichlet distribution, its conjugate prior. Given a parameter vector \(\varvec{\alpha } = (\alpha _1,\dots ,\alpha _J)\), with \(\alpha _j > 0, \forall j=1,\dots ,J\), the Dirichlet distribution, denoted by \(\text {Dir}(\varvec{\alpha })\), models our knowledge of the unknown parameters \((p_1,\dots ,p_J)\) of the multinomial as:

$$\begin{aligned} f(p_1,\dots ,p_J; \alpha _1,\dots ,\alpha _J) = \left( \frac{\Gamma (\sum _{j=1}^J\alpha _j)}{\prod _{j=1}^J\Gamma (\alpha _j)}\right) \prod _{j=1}^Jp_{j}^{\alpha _{j}-1}, \end{aligned}$$
(7)

where again \({\varvec{p}}\) belongs to the standard \((J-1)\)-simplex \({\mathscr {P}}^{J} = \{ {\varvec{p}} \in [0,1]^J: \sum _{j=1}^{J}p_{j}=1 \}\).

We then iteratively update our beliefs about the unknown parameters \({\varvec{p}}\) by updating the Dirichlet hyperparameters \(\varvec{\alpha }\) based on the outcome \(Y_t\) observed at each round t. Specifically, for \(t=1,\dots ,T\), the posterior distribution of each arm is defined by the vector \(\varvec{\alpha }\) with elements \(\alpha _{j} \leftarrow \alpha _j + c_{j}\), \(\forall j=1,\dots ,J\), where \(c_{j} = {\mathbb {I}}(y_t=j)\) indicates whether category j was observed at time t. The update is made independently for each arm and changes occur only for the selected one. The pseudocode of Multinomial-TS is given in Algorithm 1.

Algorithm 1: Multinomial-TS (pseudocode)
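Since the pseudocode of Algorithm 1 is not reproduced here in executable form, the following R sketch gives a minimal implementation of Multinomial-TS as described above, with Dirichlet draws obtained from normalised Gamma variables so that only base R is needed; the data-generating probabilities p_true are hypothetical and serve purely as an example.

```r
# Minimal sketch of Multinomial-TS (Algorithm 1), for K arms on a J-point scale.
# Dirichlet sampling via normalised Gamma draws (base R only).
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

multinomial_ts <- function(T_horizon, p_true, alpha0 = NULL) {
  K <- nrow(p_true); J <- ncol(p_true)
  if (is.null(alpha0)) alpha0 <- matrix(1, K, J)     # uniform Dir(1,...,1) priors
  alpha <- alpha0
  arms <- integer(T_horizon); rewards <- integer(T_horizon)
  for (t in seq_len(T_horizon)) {
    # draw posterior category probabilities and compute each arm's sampled mean (Eq. 5)
    mu_tilde <- sapply(seq_len(K), function(k) sum((1:J) * rdirichlet1(alpha[k, ])))
    k <- which.max(mu_tilde)                          # Eq. (3)
    y <- sample(1:J, 1, prob = p_true[k, ])           # observe the rating for the chosen arm
    alpha[k, y] <- alpha[k, y] + 1                    # conjugate Dirichlet update
    arms[t] <- k; rewards[t] <- y
  }
  list(arms = arms, rewards = rewards, alpha = alpha)
}

# Illustrative run with two hypothetical arm distributions (rows) on a 7-point scale
p_true <- rbind(c(0.00, 0.02, 0.02, 0.05, 0.21, 0.45, 0.25),
                c(0.08, 0.06, 0.02, 0.06, 0.33, 0.27, 0.18))
out <- multinomial_ts(T_horizon = 1000, p_true = p_true)
table(out$arms) / 1000                                # empirical arm allocation
```

Averaging the resulting arm-allocation proportions and rewards over many such independent runs yields summaries of the kind reported in Sect. 3.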

3 Simulation studies

For our empirical evaluation, we start with a set of simulation studies based on the setup introduced in Sect. 2, with \(T=1000\), \(J=7\), and \(K=2\). We consider four scenarios defined by the following data-generation processes (a minimal simulation sketch is given after the list).

  • Scenario 1: Existence of a unique optimal arm (\(H_1\!: \mu _1 > \mu _2\)). In line with the MTurk I example in Fig. 1 (\(\mu _1 = 5.8\); \(\mu _2=5.08\)), we generate \(Y_t, t=1,\dots ,T\) as

    $$\begin{aligned} Y_t \mid (A_t=k)&= [1,\dots ,7]\times {\textbf{X}}_k,\quad {\textbf{X}}_k \sim \text {Multinom}(1;{\textbf{p}}_k), \\ {\textbf{p}}_k&= {\left\{ \begin{array}{ll} (0.00, 0.02, 0.02, 0.05, 0.21, 0.45, 0.24)\quad k=1,\\ (0.08, 0.06, 0.02, 0.06, 0.33, 0.27, 0.19)\quad k=2. \end{array}\right. } \end{aligned}$$
  • Scenario 2: Identical arms (\(H_0\!: {\textbf{p}}_1 = {\textbf{p}}_2\))—symmetric distribution. Following the normal approximation of the binomial distribution, we define \({\textbf{p}}_k\) so as to resemble a (symmetric) Normal distribution with parameters \(\mu _k = 4\) (to get symmetry over the 7 categories) and \(\sigma ^2_k = 1.7\) (aligned with the MTurk I study), for \(k=1,2\). Specifically, we generate \(Y_t, t=1,\dots ,T\) as

    $$\begin{aligned} Y_t \mid (A_t=k)&= [1,\dots ,7]\times {\textbf{X}}_k,\quad {\textbf{X}}_k \sim \text {Multinom}(1;{\textbf{p}}_k), \\ {\textbf{p}}_k&= (0.02, 0.09, 0.23, 0.31, 0.23, 0.09, 0.02)\quad k=1,2. \end{aligned}$$
  • Scenario 3: Identical arms (\(H_0\!: {\textbf{p}}_1 = {\textbf{p}}_2\))—right-skewed distribution. To obtain right-skewed data we consider \(\mu _1 = \mu _2 = 3\) and generate \(Y_t, t=1,\dots ,T\) as

    $$\begin{aligned} Y_t \mid (A_t=k)&= [1,\dots ,7]\times {\textbf{X}}_k,\quad {\textbf{X}}_k \sim \text {Multinom}(1;{\textbf{p}}_k), \\ {\textbf{p}}_k&= (0.2, 0.3, 0.15, 0.15, 0.1, 0.05, 0.05)\quad k=1,2. \end{aligned}$$
  • Scenario 4: Identical arms (\(H_0\!: {\textbf{p}}_1 = {\textbf{p}}_2\))—left-skewed distribution. To obtain left-skewed data we consider \(\mu _1 = \mu _2 = 5\) and generate \(Y_t, t=1,\dots ,T\) as

    $$\begin{aligned} Y_t \mid (A_t=k)&= [1,\dots ,7]\times {\textbf{X}}_k,\quad {\textbf{X}}_k \sim \text {Multinom}(1;{\textbf{p}}_k), \\ {\textbf{p}}_k&= (0.05, 0.05, 0.1, 0.15, 0.15, 0.3, 0.2)\quad k=1,2. \end{aligned}$$
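As anticipated above, a minimal R sketch of these data-generating processes is reported below; the probability vectors are copied from the four scenarios (rmultinom normalises them internally), while the function name draw_rating is ours and purely illustrative.

```r
# Sketch of the four data-generating processes of Sect. 3; vectors copied from the scenarios above.
p_scenarios <- list(
  s1 = rbind(c(0.00, 0.02, 0.02, 0.05, 0.21, 0.45, 0.24),   # arm 1
             c(0.08, 0.06, 0.02, 0.06, 0.33, 0.27, 0.19)),  # arm 2
  s2 = rbind(c(0.02, 0.09, 0.23, 0.31, 0.23, 0.09, 0.02),
             c(0.02, 0.09, 0.23, 0.31, 0.23, 0.09, 0.02)),  # identical arms, symmetric
  s3 = rbind(c(0.20, 0.30, 0.15, 0.15, 0.10, 0.05, 0.05),
             c(0.20, 0.30, 0.15, 0.15, 0.10, 0.05, 0.05)),  # identical arms, right-skewed
  s4 = rbind(c(0.05, 0.05, 0.10, 0.15, 0.15, 0.30, 0.20),
             c(0.05, 0.05, 0.10, 0.15, 0.15, 0.30, 0.20))   # identical arms, left-skewed
)

# Draw a rating Y_t for arm k under a given scenario: Y_t = [1,...,7] x X_k, X_k ~ Multinom(1; p_k)
draw_rating <- function(k, p_mat) sum((1:7) * rmultinom(1, size = 1, prob = p_mat[k, ]))
```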

To disentangle the individual role of mean, variance, and skewness, we relax the null cases of identical arms \(H_0\!: {\textbf{p}}_1 = {\textbf{p}}_2\). Specifically, an extended number of scenarios with different probability mass functions \({\textbf{p}}_1 \ne {\textbf{p}}_2\), but identical means \(\mu _1 = \mu _2\) are sampled from the polytope of discrete distributions with mean \(m \in \{1,2,\dots ,7\}\). These results are reported in Appendix B.

We evaluate the proposed Multinomial-TS using a uniform prior over the simplex for each of the two arms, i.e., \(Dir( \varvec{\alpha }_k = {\textbf{1}})\) for \(k=1,2\). Its performance is then compared with alternative modelling options that are part of current common practice (see Sect. 1). More specifically, we consider the following (a brief implementation sketch of the baselines is given after the list):

  • Normal-TS, assuming weakly-informative and identical priors \(N(\mu _k=4, \sigma _k^2=100),\ k=1,2\), in all scenarios. Posterior arm means are updated following the conjugacy of the Normal reward model, with unknown means and known variances. The variances are set according to the formulation reported in Eq. (6) for each scenario.

  • Binary-TS with a cutoff \(y_c=6\), meaning that a success occurs when \(Y_t \ge 6\). Uniform and identical \(\text {Beta}(\alpha _k=1, \beta _k=1),\ k=1,2\), priors are assumed in all scenarios. Posterior arm means are updated following the conjugacy of the Beta-Bernoulli model, with the outcome variable following a Bernoulli distribution.

  • Binary-TS with a cutoff \(y_c=7\), such that only the highest category \(j = 7\) is considered a success. Uniform and identical \(\text {Beta}(\alpha _k=1, \beta _k=1),\ k=1,2\), priors are assumed in all scenarios. Posterior arm means are updated following the conjugacy of the Beta-Bernoulli model, with the outcome variable following a Bernoulli distribution.

  • An Oracle that is assumed to know the underlying truth and always assigns the arm with the highest mean reward (Besbes et al. 2014) when one exists. Under identical arm scenarios, the Oracle allocates arms with equal probability.
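As mentioned before the list, the sketch below illustrates how the two baseline reward models could be implemented within the same posterior-sampling loop; the prior values follow the choices stated above, while the function names and code organisation are ours and purely illustrative.

```r
# Baseline posterior draws used for comparison (minimal sketch).

# Binary-TS: dichotomise the rating at a cutoff y_c and use a Beta-Bernoulli model.
# successes[k] = number of observed Y_t >= y_c on arm k; failures[k] the complement.
draw_binary_ts <- function(successes, failures) {
  rbeta(length(successes), 1 + successes, 1 + failures)   # Beta(1, 1) priors
}

# Normal-TS with known variance sigma2 and N(4, 100) prior on each arm mean:
# standard Normal-Normal conjugate update of the posterior mean and variance.
draw_normal_ts <- function(sum_y, n, sigma2, prior_mean = 4, prior_var = 100) {
  post_var  <- 1 / (1 / prior_var + n / sigma2)
  post_mean <- post_var * (prior_mean / prior_var + sum_y / sigma2)
  rnorm(length(n), post_mean, sqrt(post_var))
}
```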

Regret and Optimal Arm Allocation—Scenario 1 (\(H_1: \mu _1 > \mu _2\)) We evaluate standard bandit performance in a setting in which an optimal arm, defined as the one yielding the maximum expected outcome, exists and is unique. Note that in a setting with equal arm distributions, and thus equal arm means, regret becomes meaningless by definition, as \(T\mu ^* - \sum _{t=1}^T\mu _{A_t} = 0\) when \(\mu _k = \mu ^*\) for all k. Empirical results for this setting are shown in Fig. 2, highlighting the improved performance of the proposed Multinomial-TS over both Normal-TS and Binary-TS.

Fig. 2: Regret and proportion of optimal arm allocation in the proposed Multinomial-TS vs Normal-TS and Binary-TS. Values are obtained by averaging across \(10^4\) independent TS trajectories

Notably, Binary-TS turns out to be highly sensitive to changes in the cutoff value \(y_c\), with greater values having a marked impact on regret and best-arm allocation. In particular, when the extreme category \(j=7\) is chosen as the cutoff value, Binary-TS focuses on discriminating between the upper extreme value and all other values (that is, \(Y_t = 7\) vs \(Y_t < 7\)). All outcomes \(Y_t < 7\) are treated equally, although they contain important information on the arm distributions.

Arms Allocation—Scenario 2 (\(H_0: {\textbf{p}}_1 = {\textbf{p}}_2\); symmetric distribution) While in an identical arm setting with \({\textbf{p}}_1 = {\textbf{p}}_2\) regret is not of interest, it is helpful to understand how an algorithm balances the allocation of one arm over the other when neither should be exclusively preferred. Results in Fig. 3 show that, unlike the Oracle, which in such a case assigns arms with equal probability by definition, TS continuously searches for an optimal arm, with the aim, and often the result, of assigning one of them more often purely by chance. This intrinsic characteristic is reflected in all modelling strategies and is exacerbated in the Normal case (see Fig. 3; top-right plot).

Fig. 3: Empirical allocation (\(T_1/T\), with \(T_1\) being the number of times arm 1 is allocated) under the identical arm case (Scenario 2). Values are obtained by averaging across \(10^4\) independent TS trajectories

In fact, in the Normal case, the empirical allocation of arms not only fails to concentrate (i.e., it does not converge to a unique value), but its distribution closely resembles a uniform distribution on the probability interval [0, 1]. This behaviour results in heavily unbalanced allocations in favour of one of the two arms, eventually fixing on selecting a single arm, even when no underlying differences between arms exist. We emphasise that this is a well-studied phenomenon in MAB problems, referred to as incomplete learning (see e.g., Keskin and Zeevi 2018) and occurring when parameter estimates fail to converge to the true values. The main reason is insufficient exploration of the arms, although recent work has pointed to some consequences of the sequential nature of data collection (see e.g., Villar et al. 2015; Shin et al. 2019; Deshpande et al. 2018). As a result, when standard statistical estimators, such as the sample mean, are used on adaptively collected data, unbiasedness, consistency, and asymptotic normality are no longer guaranteed (Hadad et al. 2021), with negative impacts on hypothesis testing, that is, inflated type-I error and low power (Deliu et al. 2021).

Arms Allocation—Scenario 3 and Scenario 4 (\(H_0: {\textbf{p}}_1 = {\textbf{p}}_2\); skewed distribution) The skewness of the distribution and its direction can play a relevant role in the algorithm's behaviour and its resulting performance. This is particularly true for Multinomial-TS, which shows a marked sensitivity to the shape of the distribution in terms of its ability to balance arm allocation under a null (\(H_0\)) scenario. Specifically, in the case of positive skewness (Scenario 3; Fig. 4), the arm allocation seems to concentrate, although with high variability, around the ideal value \(T_1 / T = 1/2\), with \(T_1\) being the number of times arm 1 is allocated. We recall that this is the result we would observe under the Oracle in a two-armed setting. However, for negatively skewed distributions (Scenario 4; Fig. 4), the opposite trend occurs, exhibiting the incomplete learning phenomenon more markedly.

Fig. 4: Empirical allocation (\(T_1/T\), with \(T_1\) being the number of times arm 1 is allocated) under the identical arm case but with a non-symmetric distribution (Scenario 3: right-skewed distribution; Scenario 4: left-skewed distribution). Values are obtained by averaging across \(10^4\) independent TS trajectories

In fact, when looking at the empirical probabilities of extreme arm allocations in Scenario 4, these have values \({\mathbb {P}}\left( \frac{T_1}{T}>0.95\right) = 8.7\%\) and \({\mathbb {P}}\left( \frac{T_1}{T}<0.05\right) = 8.9\%\), which are substantial compared to the values \(<0.5\%\) characterising Scenario 3 (see Table 1).

Additional results reported in Appendix B support these findings and allow for a better understanding of the individual role of the standard deviation vs the skewness of the distribution. Specifically, the higher the variability of one arm compared to the other, the lower its empirical probability of being assigned an extreme number of times (this can be inferred from Table 2). Analogously, the higher the left skewness, the higher the chances of observing a large number of extreme allocations.

Normal-TS and Binary-TS show similar patterns in these scenarios; however, their overall tendency is detrimental rather than beneficial, particularly for Binary-TS (we refer to Figs. 16 and 17 in Appendix C).

Table 1 Empirical probabilities (in percentage) of observing an arm allocation within a given range. Values refer to Multinomial-TS and are obtained by averaging across \(10^4\) independent TS trials

Motivated by this distinctive nature of Multinomial-TS, in Sect. 4.1 we discuss a simple trick that may be useful in enhancing the safety of online data collection when using this algorithm.

4 Mitigating the incomplete learning problem via prior considerations

As discussed in Sect. 3, the incomplete learning phenomenon characterising TS reveals that, in a context with unknown parameters, full knowledge of the parameter values may be precluded in the long run, with only one parameter (i.e., that of the “optimal” arm) being consistently estimated. Although this behaviour may be considered acceptable when one optimal arm truly exists, it is detrimental and misleading in the opposite case, that is, when all arms are identical. Therefore, a good strategy should incorporate certain adjustments so that some active experimentation (Antos et al. 2008) is used to generate additional information about the undersampled parameters. We now explore two directions to address this challenge and better calibrate the uncertainty with the proposed Multinomial-TS.

4.1 Multinomial-TS with augmented prior support

As illustrated in Figs. 3 and 4, the more the reward distribution is positively skewed, the better Multinomial-TS is able to balance arm allocation under the null. This is reflected in: i) an allocation distribution with a peak at \(1/K = 1/2\); and ii) a very low probability of allocating arms with a proportion \(T_1/T\) equal or close to the extremes 0 or 1. This is shown both in Fig. 4 (left plot) and Table 1, where \({\mathbb {P}}(T_1/T < 0.05) = 0.02\) (in Scenario 3).

On the other hand, when the reward distribution is negatively skewed (Scenario 4), the arm allocation follows a U-shaped distribution (see Fig. 4, right plot), with peaks at the extremes, highlighting a relevant incomplete learning problem. In fact, in Scenario 4, we have that \({\mathbb {P}}(T_1/T < 0.05) = 0.08\), which is greater than the 0.05 probability that would occur under a uniformly-distributed allocation. Such behaviour suggests a high instability and sensitivity of the algorithm to relatively high values of the outcome variable with respect to its support.

Motivated by this particular behaviour of Multinomial-TS, we introduce a simple algorithm modification that alters the skewness of the prior and, in turn, the posterior distribution. Note that within TS, the underlying skewness of the reward distribution is reflected in the arm posterior distributions. For example, a \(Dir(\varvec{\alpha })\) with a parameter vector with identical elements (e.g., \(\varvec{\alpha } = {\varvec{1}}\)) indicates uniformity, thus symmetry, over the support. We recall that the support of a Dirichlet \(\{p_1,\dots ,p_J\}\), with \(p_j \in [0,1]\) for each j and \(\sum _{j=1}^Jp_j = 1\), is the standard \((J-1)\)-simplex specifying both the number of categories J and the set of their probability distributions. To induce some skewness in the distribution, it therefore suffices to increase the number of categories J, say from J to \(J+s\), with s a nonnegative integer and \(\alpha _i \ne \alpha _j\), for \(i=J+1,\dots , J+s\), \(j=1,\dots , J\). We term s the support-augmentation hyperparameter.

In the particular case of TS, where posteriors are iteratively updated as \(\alpha _{j} \leftarrow \alpha _j + c_{j}\), with \(c_{j} = {\mathbb {I}}(Y_t=j)\), the idea is to specify an augmented support of order \((J-1)+s\) for the \(Dir(\varvec{\alpha })\) prior. Given that \(Y_t \in \{1,\dots ,J\}\) for each \(t=1,\dots ,T\), its possible realisations guarantee that the Dirichlet posteriors will be iteratively skewed to the right. In fact, ignoring for now the arm index, we have that:

$$\begin{aligned} \alpha _{j} + c_{j}&\ge \alpha _{j},\quad j=1,\dots ,J,\\ \alpha _{j} + c_{j}&= \alpha _{j},\quad j=J+1,\dots ,J+s, \end{aligned}$$

as \({\mathbb {P}}(Y_t>J) = 0, \forall t\), thus \(c_{j} = 0, \forall j \in [J+1,J+s]\).

We emphasise that, although this augmentation trick requires a change in the support of the prior distribution, and thus in the theoretical support of the reward model, the latter remains practically unaltered by virtue of \({\mathbb {P}}(Y_t>J) = 0, \forall t\). Also note that Multinomial-TS is recovered as a special case of this modified version, which we name Multinomial-TS with augmented support, when \(s=0\). The pseudocode of the latter is provided in Algorithm 2.
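Since Algorithm 2 is not reproduced here, the following R sketch shows one possible implementation of the support-augmentation idea; it mirrors the Multinomial-TS sketch of Sect. 2.1 (the Dirichlet helper is repeated for self-containment), and all names are illustrative.

```r
# Minimal sketch of Multinomial-TS with augmented support (Algorithm 2).
# The Dirichlet prior is defined over J + s categories, but observed ratings never
# exceed J, so counts for categories J+1, ..., J+s remain at their prior values.
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

multinomial_ts_aug <- function(T_horizon, p_true, s = 3) {
  K <- nrow(p_true); J <- ncol(p_true)
  alpha <- matrix(1, K, J + s)                       # uniform prior on the augmented support
  arms <- integer(T_horizon)
  for (t in seq_len(T_horizon)) {
    mu_tilde <- sapply(seq_len(K), function(k)
      sum((1:(J + s)) * rdirichlet1(alpha[k, ])))    # sampled means use the augmented scale
    k <- which.max(mu_tilde)
    y <- sample(1:J, 1, prob = p_true[k, ])          # rewards still live on {1, ..., J}
    alpha[k, y] <- alpha[k, y] + 1                   # only categories 1..J ever get updated
    arms[t] <- k
  }
  arms
}
# With s = 0, the standard Multinomial-TS of Algorithm 1 is recovered.
```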

The plausibility of our solution is supported by an existing theoretical result (see Theorem 2) for Binary-TS, which is itself a particular case of Multinomial-TS, with \(J=2\) and \(Y_t \in \{0,1\}\) for all t.

Theorem 2

(Kalvit and Zeevi (2021)) In a two-armed model where both arms yield rewards distributed as Bernoulli(p), the following holds under TS as \(n \rightarrow \infty\):

  (I) If \(p=0\), then \(\frac{T_1}{T} \rightarrow \frac{1}{2}\);

  (II) If \(p=1\), then \(\frac{T_1}{T} \rightarrow Unif[0,1]\).

We conjecture that a result similar to Case (I) of Theorem 2 applies to Multinomial-TS when the parameters are of the form \(\mathbf {p_k} = (1,0,\dots ,0), \forall k\). Notice that, since our aim is to balance allocation under the null scenarios, the asymptotic limit in Case (I) represents our main interest. Specifically, in a K-armed setting with identical arms yielding rewards that follow a \(Multinom({\textbf{p}})\) distribution, with \({\textbf{p}}=(1,0,\dots ,0)\), we expect that, under Multinomial-TS, the allocation proportions will be such that \(T_k/T \rightarrow 1/K\) as \(n \rightarrow \infty\), for \(k = 1,\dots , K\). We support our conjecture with empirical evaluations for \(K=2, 3, 4\), where \(T_k / T\) is expected to converge to 1/2, 1/3, and 1/4, respectively. We refer to Fig. 5 and point to additional simulation studies explored in Appendix B (in particular, Fig. 15 and Table 7).

Fig. 5: Empirical allocation (\(T_1/T\), with \(T_1\) being the number of times arm 1 is allocated) under \(H_0: \mathbf {p_1} = \dots = \mathbf {p_K}\), with \(\mathbf {p_k}=(1,0,\dots ,0)\) and \(K=2, 3, 4\). Values refer to Multinomial-TS and are obtained by averaging across \(10^4\) independent TS trajectories

In light of this result, the augmentation trick detailed in this section is explored in Fig. 6, where we show the distribution of arm allocations of the modified Multinomial-TS for different values of the augmentation size s, including \(s=0\). Empirical results are based on Scenario 4, which represents the worst-case scenario in terms of arm allocation under the null. As expected, the higher the hyperparameter s, the more balanced the overall allocation of the two arms. For \(s=20\), we already achieve more than satisfactory results, with \({\mathbb {P}}(T_1/T \in [0.45, 0.55]) = 78\%\) and \({\mathbb {P}}(T_1/T < 0.1) = {\mathbb {P}}(T_1/T > 0.9) = 0\%\), highlighting the value of the augmented-support solution to the incomplete learning phenomenon.

Fig. 6: Empirical allocation (\(T_1/T\), with \(T_1\) being the number of times arm 1 is allocated) under Scenario 4: identical arm means with a left-skewed distribution. Values refer to Multinomial-TS with augmented support (see Algorithm 2) and are obtained by averaging across \(10^4\) independent TS trajectories

The computational cost in terms of running time of both Multinomial-TS and Multinomial-TS with augmented support is detailed in Appendix D. As depicted in Fig. 19 and Table 8, when compared to standard Multinomial-TS (\(s=0\)), the increase in (median) running time of Multinomial-TS with augmented support is negligible up to \(s=5\) (25.04 vs 26.79 milliseconds) and more than doubles for \(s=100\) (25.04 vs 69.85 milliseconds). A similar trend is observed when J increases; notice that this is expected, as J and s play the same role: they define the number of scale categories. Finally, the time complexity also increases with the number of arms K: in line with existing literature on TS (see e.g., Min et al. 2019), the running time is linear in K, i.e., of order \({\mathscr {O}}(K)\); see also Fig. 20 in Appendix D.

4.2 Multinomial-TS with more informative priors

Useful insights into the general sensitivity of the algorithm to the choice of priors have been provided in previous literature (see e.g., Liu and Li 2016; Russo et al. 2018). Russo et al. (2018), for example, point out that ignoring any useful knowledge from past experience may increase the time it takes for TS to identify the most effective arms. Therefore, a careful choice of the prior can improve learning performance under \(H_1\), or arm allocations under \(H_0\). While the effects of the prior distribution should wash out as \(T \rightarrow \infty\), they may be decisive at the beginning of the experiment, where TS reacts very sensitively to small variations in arm outcomes, even if these are simply dictated by noise. Thus, selecting a prior that still maintains a certain uniformity over the plausible response categories but is less variable may be helpful for the problem at hand. This is well illustrated in Fig. 7, where an increase in the value \(\alpha _j\), for \(j=1,\dots ,J\), leads to a decreased variability in the empirical allocation of the arms, with values concentrated mainly around \(1/K=1/2\). Note that, under uncertainty on the best arm, priors are chosen to be identical for both arms.
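As a rough numerical illustration of this point, the sketch below (an assumption-laden toy example, not taken from the paper's code) shows how the spread of the sampled arm mean \(\sum _j j p_j\) under a \(Dir(\alpha ,\dots ,\alpha )\) prior shrinks as \(\alpha\) grows, which is the mechanism behind the more concentrated allocations in Fig. 7.

```r
# Variability of the sampled mean sum_j j * p_j under Dir(alpha, ..., alpha) priors (sketch).
set.seed(1)
J <- 7
sampled_mean_sd <- function(alpha, n_draws = 1e4) {
  draws <- replicate(n_draws, {
    g <- rgamma(J, shape = alpha); p <- g / sum(g)   # one Dirichlet draw
    sum((1:J) * p)
  })
  sd(draws)
}
sapply(c(1, 10, 100, 250), sampled_mean_sd)  # the standard deviation decreases as alpha grows
```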

Fig. 7: Empirical allocation (\(T_1/T\), with \(T_1\) being the number of times arm 1 is allocated) under Scenario 2: identical arm means with symmetric distribution. Values refer to Multinomial-TS (see Algorithm 1) with different prior choices, and are obtained by averaging across \(10^4\) independent TS trajectories

Clearly, the higher the value of the hyperparameters \(\alpha _j\), the greater the extent of prior information carried by the algorithm and the slower its ability to adapt to each new observation. Therefore, while beneficial in null scenarios, a natural consequence is a potential reduction in the overall reward efficiency under an alternative scenario, due to the higher weight assigned to the prior. To understand the overall benefit of using a more informative prior, we thus quantify the trade-off between optimal arm allocation under \(H_1\) and arm balancing under \(H_0\). Fig. 8 compares the empirical probability of observing an allocation proportion close to 1/2 when the arms are identical (\(H_0\); Scenario 2) with the empirical probability of allocating the optimal arm in a setting where one exists (\(H_1\); Scenario 1). Note the analogy between the curve in Fig. 8 and a ROC curve: here, the x- and y-axes may be interpreted in a fashion similar to specificity (true negative rate) and sensitivity (true positive rate), respectively. The ideal results are those lying in the top-right corner: both specificity and sensitivity close to 1.

Fig. 8: Trade-off between optimal arm allocation under \(H_1\) (Scenario 1; y-axis) and arm balancing under \(H_0\) (Scenario 2; x-axis). Comparisons are quantified with the empirical probability of observing an empirical allocation in a given range. Values refer to Multinomial-TS (see Algorithm 1) with different prior choices, and are obtained by averaging across \(10^4\) independent TS trajectories

As can be noticed, the effect of the prior values is substantial under \(H_0\), where increasing values of \(\alpha _j\) translate into a higher probability of a more balanced allocation. Under \(H_1\), however, the effect is less relevant: up to a value of \(\alpha = 250\), the algorithm still allocates the optimal arm in more than \(90\%\) of the rounds with probability higher than 0.9. Such a different behaviour of TS between \(H_0\) and \(H_1\) allows us to determine a value of \(\alpha\) that achieves desirable guarantees in both scenarios. For example, with \(\alpha = 250\), we achieve good regret performance, while also ensuring \(80\%\) confidence that a sufficiently balanced allocation in [0.4, 0.6] will be achieved under \(H_0\).

5 Redesigning the MTurk I experiment

In this section, we discuss the potential of redesigning the motivating MTurk I experiment detailed in Sect. 1 (see Fig. 1) with an adaptive design guided by the proposed Multinomial-TS versions. We use a nonparametric simulation-based approach following a resampling strategy (bootstrapping; Efron and Tibshirani 1993) applied to the real data collected in the MTurk I experiment. More specifically, consider the data of the \(T = 110\) participants split into two urns: Urn 1, with the \(T_1 = 58\) data points associated with arm 1, and Urn 2, with the \(T_2 = 52\) data points associated with arm 2. Anytime arm 1 is selected by Multinomial-TS, a data point (rating) from Urn 1 is sampled with replacement, and vice versa. The resampling approach is replicated 10,000 independent times, each based on a horizon of \(T=110\) as in the original MTurk I experiment.
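A minimal R sketch of this resampling scheme is given below; urn1 and urn2 stand for the observed rating vectors of the two arms (not reproduced here), the Dirichlet helper is the one from the Sect. 2.1 sketch, and the whole routine is an illustrative reconstruction rather than the authors' original code.

```r
# Bootstrap redesign of the MTurk I experiment (sketch). urn1 and urn2 are assumed to hold
# the 58 and 52 observed ratings of arm 1 and arm 2, respectively (data not reproduced here).
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

redesign_once <- function(urn1, urn2, T_horizon = 110, J = 7) {
  alpha <- matrix(1, 2, J)                  # uniform Dirichlet priors for both arms
  arms <- integer(T_horizon)
  for (t in seq_len(T_horizon)) {
    mu_tilde <- sapply(1:2, function(k) sum((1:J) * rdirichlet1(alpha[k, ])))
    k <- which.max(mu_tilde)
    y <- if (k == 1) sample(urn1, 1, replace = TRUE) else sample(urn2, 1, replace = TRUE)
    alpha[k, y] <- alpha[k, y] + 1
    arms[t] <- k
  }
  mean(arms == 1)                           # proportion of participants assigned to arm 1
}
# Replicating redesign_once() 10,000 times yields an allocation distribution of the kind summarised in Fig. 9.
```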

A comparison between the actual arm allocations observed in the MTurk I experiment and the expected allocation in a redesigned experiment with Multinomial-TS and Multinomial-TS with augmented support is presented in Fig. 9. We focus on the proportion of allocation of the optimal arm (arm 1: “Today is a new day to start fresh”) over the horizon T. Compared to the balanced allocation of the original design, both Multinomial-TS variants show an increased allocation of arm 1, with \(86\%\) and \(83\%\) of participants assigned to it by the end of the experiment. However, if we evaluate the uncertainty of the two Multinomial-TS variants (assessed in terms of the [0.1, 0.9] percentile confidence intervals), we can notice a higher uncertainty in the standard version compared to the augmented-support one. This is particularly true at the beginning of the experiment (for \(T < 60\)), where the possibility of allocating the inferior arm 2 can be substantial. For example, at \(T=30\), in 10% of the cases, the optimal arm 1 is allocated only 30% and 45% of the times with Multinomial-TS and Multinomial-TS with augmented support, respectively. For \(T > 60\), the benefits of the proposed strategies, particularly Multinomial-TS with augmented support, which shows reduced uncertainty, are relevant compared to the original study. Notice that, in light of the computational complexity assessments made in Appendix D, Multinomial-TS with augmented support comes at a negligible running-time cost: for a horizon \(T=110\), we observe a median time (in milliseconds) of 9.91 (\(s=3\)) vs 9.70 (\(s=0\)) for a single run (see Table 8), translating into a median time (in minutes) of 1.65 (\(s=3\)) vs 1.62 (\(s=0\)) for 10,000 runs.

Fig. 9: Comparison between the actual allocation of optimal arm 1 (\(T_1/T\)) observed within the MTurk I experiment and the expected allocation, with their uncertainty ([0.1, 0.9] percentile confidence intervals; dotted lines), attainable with Multinomial-TS and Multinomial-TS with augmented support (\(s=3\)). Results are obtained by averaging across \(10^4\) TS trajectories

For reproducibility of the results, the R code and data are made available at the following repository: https://github.com/nina-DL/MultinomialTS.

6 Conclusion and future work

In this work, motivated by the MTurk I field experiment, we extended the applicability of TS to rating scale data, introducing Multinomial-TS. We demonstrated that, in scenarios with a unique optimal arm, it can outperform the widely used TS variants based on a Normal or a Bernoulli model, the latter being highly sensitive to the dichotomisation threshold. In scenarios with identical arm means, Multinomial-TS can offer a more balanced solution in terms of arm allocation, but its performance depends on the shape of the underlying reward distribution, more specifically on its skewness. Motivated by this, we introduced an alternative variant of Multinomial-TS, called Multinomial-TS with augmented support, which artificially injects skewness into the prior distribution, and thus the posterior, through a support extension. Additionally, we discussed the role of a more informative Dirichlet prior in calibrating arm uncertainty. Both approaches demonstrated substantial benefits in the identical arm means scenario, highlighting a potential solution for the incomplete learning problem in this setting. The impact of the prior informative content was further evaluated to quantify the trade-offs under an optimal arm scenario, demonstrating satisfactory results. By illustrating the sensitivity of the algorithm to different prior choices, this work also fills an important knowledge gap, often neglected within the bandit literature.

Further work is required to understand how the proposed versions would behave in a contextual MAB setting, especially when different user characteristics are associated with different preferences. To this end, a future line of research may integrate tools from the literature on regression for categorical and ordinal data (Agresti 2019; Hedeker 2008; Tutz 2011) into MABs and adaptive experiments. Among the existing frameworks, it is worth mentioning the vast literature on mixture models that account for the uncertainty of the respondents. These include the CUB (combination of uniform and shifted binomial) class (see Piccolo and Simone 2019, and the related discussion and rejoinder, for a state-of-the-art survey), and other mixture model configurations that also factor in different response attitudes (see e.g., Colombi et al. 2019). Questions about whether the assumption of stationarity is plausible in these contexts and to what extent it influences the decision-making within the experiment represent an additional challenge. Along this line, a flexible approach for incorporating time dependencies is given by hidden Markov models, recently explored in longitudinal rating data with dynamic response styles (Colombi et al. 2023).