1 Introduction

In several contexts, decision-making relies heavily (or exclusively) on expert ratings, especially in situations where a direct quantification of quality of an object or a subject is either impossible or unavailable. Examples include applicant selection procedures, grading of student assignments in education, or risk evaluation in emergencies, all of which rely on observational ratings made by experts. For ease of exposition, throughout this paper we will refer to evaluation of students’ work in an educational context as the primary example. To ensure consistency across different teachers, harmonization of marking criteria is often used to improve inter-rater agreement and homogeneity (Gisev et al. 2013; Gwet 2008); however, discrepancies between grades assigned by different teachers may still persist (Bygren 2020; Barneron et al. 2019; Makransky et al. 2019; Zupanc and Štrumbelj 2018), reflecting each teacher’s approach to evaluation. Therefore, statistical models that can capture inter-rater agreement (or disagreement) can shed light on heterogeneity between teachers and aid the mark moderation process (Bygren 2020; Barneron et al. 2019; Crimmins et al. 2016).

The specific context that we are considering in this work is the observational setting where a set of raters are evaluating different sets of itemsFootnote 1 out of a total population of items; these sets may be completely disjoint (i.e., each item is evaluated by exactly one rater). Each item is represented by a set of covariates, assumed to follow some distribution. Within a hierarchical statistical model, a common assumption is that raters (who may or may not include covariates) may each be characterized through a latent variable capturing e.g. whether an evaluator is generous or how they assess different aspects of the work. In the simplest setting, in an evaluation context where there is no space for subjectivity, these latent variables will be identical for all raters, in the sense that their view of the item is identical and as a result their evaluation style is assumed to be the same. However, it is well-known in many scientific fields, e.g., cognitive neuroscience (Barneron et al. 2019; Makransky et al. 2019; Briesch et al. 2014), statistics (Agresti 2015; Gelman et al. 2013) and psychometrics (Bartoš et al. 2020; Nelson and Edwards 2015; Hsiao et al. 2011), that individual variability in rating tasks (Wirtz 2020) needs to be accounted for when aggregating or interpreting individual raters’ recommendations.

Existing works account for heterogeneity between raters through a latent variable within a mixed-effects model (Martinková et al. 2023; Bartoš et al. 2020; Nelson and Edwards 2015, 2008). In other words, a regression model is used where the rating is modelled conditionally on covariates with a random effect that varies across raters. However, the distribution of the latent variable is typically assumed to be unimodal, and cannot capture eg. polarisation or clustering of rater types. The present work aims to extend these models to account for clustered variability between raters. Through a Bayesian approach, a Dirichlet process mixture prior is placed over the hierarchical rater effects in a linear model. This flexible prior naturally accommodates different clusters among raters (i.e., different distributions for the rater effects). A multiple-level model is specified in which observations (i.e., ratings) are nested within raters, and in turn these are nested within clusters. These clusters reflect distinct groups of raters in terms of their decision-making, and can be used to characterise the level of (dis)agreement. The level of multimodality (i.e., how separated the latent group densities are) quantifies the polarization of the latent groups. For instance, a large variance between teacher scores might be due to both the presence of two main divergent latent trends among them or to a high level of noise in their assessing (Koudenburg and Kashima 2022). It is important to differentiate the two cases and quantify the group polarization both for theoretical and practical purposes. Differentiate systematic differences of opinion against high level of noise might be needed (Koudenburg et al. 2021). They are two very different cases and much attention must be paid in distinguishing one another. The former is a case of high group polarization (Esteban and Ray 1994): two different teachers clusters emerge with a small within-cluster variance and a large variance between different clusters. In the second case only one cluster emerge with a large variance. It might be argue that in the first case, even that the overall agreement might be quite low since there are two main different trends among raters, there might be a high agreement within the same trend (Tang et al. 2022). Assuming the latent agreement among raters as the degree of latent similarity in rating, an index regarding the polarization of the different possible groups of raters might be informative (Koudenburg and Kashima 2022; Tang et al. 2022; Koudenburg et al. 2021). In this work we introduce a novel index to quantify the latent polarization among raters through the posterior distribution of the hierarchical effects (DiMaggio et al. 1996). It naturally derives from the nonparametric model and overcomes some strong assumptions (e.g., the number of latent groups, the ratings distribution) of the previous indices (Koudenburg et al. 2021; Esteban and Ray 1994). This nonparametric index, referred to as \(\lambda\) index, is based on the shape of the posterior distribution of the hierarchical effects. It connects two different research lines: it relates the works on distribution polarization of opinions (Koudenburg and Kashima 2022; Tang et al. 2022; Koudenburg et al. 2021) with those about the inter-rater agreement analysis (Martinková et al. 2023; Bartoš et al. 2020; Nelson and Edwards 2015; Gisev et al. 2013; Gwet 2008; Nelson and Edwards 2008).

The paper proceeds as follows: Sect. 2 is devoted to the general psychometric framework, the key concepts of inter-rater agreement, inter-rater reliability are introduced; the statistical model is specified in Sect. 3 and the adopted Gibbs sampler in Sect. 4; the novel rater similarity index is described in Sect. 5; simulation studies are reported in Sect. 6, as an illustrative example, a real data analysis is described in Sect. 8; it is followed by conclusion and future directions in Sect. 1.

2 Existing work in inter-rater agreement and hierarchical effects models

Several methods and statistical models that aim to account for inter-rater variability have appeared in the literature (Nelson and Edwards 2015; Gwet 2008; Cicchetti 1976). Models such as the Cultural Consensus Theory (Oravecz et al. 2014), which explores individuals’ shared cultural knowledge, have been proposed to capture unobserved agreement and similar trends in groups of raters (Dressler et al. 2015). Two related but different concepts have been introduced: inter-rater agreement and inter-rater reliability. The former refers to the extent to which different raters’ evaluations are concordant (i.e, they assign the same value to the same item), whereas the latter refers to the extent to which their evaluations consistently distinguish different items (Gisev et al. 2013). In other words, while the inter-rater agreement indices quantify the observed concordance, the inter-rater reliability indices aim to quantify the consistency of their evaluations (e.g., despite assigning different values, the distinction among the items is the same). The present work focuses on latent agreement intended as homogeneity in the evaluators’ point of view (Tang et al. 2022; Esteban and Ray 1994).

A number of methods are available to quantify both inter-rater agreement and inter-rater reliability. Indices for pairs (Nelson and Edwards 2008; McHugh 2012) or multiple raters (Jang et al. 2018), for binary (Gwet 2008), polytomous (Nelson and Edwards 2015) or continuous (Liljequist et al. 2019) ratings are commonly used in different contexts. Recent developments using the framework of Hierarchical Linear Models (i.e., HLMs) provide a more accurate estimation of inter-rater reliability accounting for different sources of variability (Martinková et al. 2023).

Despite the popularity of work on this issue, less attention has been paid to possible latent similarities of the raters (Wirtz 2020). From a psychometric point of view, it can be appealing to assess the extent to which different raters might be heterogeneous in their ratings (Martinková et al. 2023; Bartoš et al. 2020; Koudenburg et al. 2021; Casabianca et al. 2015; Nelson and Edwards 2015; Gisev et al. 2013; DeCarlo 2008; Gwet 2008; Nelson and Edwards 2008).

There are certain situations in which the subjective opinion of the raters is very informative; as a simple example, the type of teachers’ training or experience can be thought of as latent states which affect a range of evaluations differently (Childs and Wooten 2023; Barneron et al. 2019; Bonefeld and Dickhäuser 2018; Dee 2005). Sometimes the major interest is not on the mere consistency between raters, but on their actual evaluation. For instance, in a selection process the actual students’ scores are very relevant for their admission (Zupanc and Štrumbelj 2018). Even if a strict standardization of teachers evaluation is not feasible, some statistical methods can tackle these issues. In all these contexts the assessment of uniformity among raters could be useful and would provide further information about the rating process.

To this aim, existing work, e.g. Martinková et al. (2023); Nelson and Edwards (2015); Casabianca et al. (2015); Hsiao et al. (2011); Cao et al. (2010); DeCarlo (2008), adopts an hierarchical approach where correlations between ratings are naturally captured through an hierarchical Bayesian model. Each rater \(i = 1,\ldots ,I\) is assumed to be rating a different set of items \({\mathscr {J}}_i \in {\mathscr {J}}\), \({\mathscr {J}}_i \cap {\mathscr {J}}_{i+1} = \emptyset\).Footnote 2 The rating \(y_{ij}\) of the item \(j \in {\mathscr {J}}_i\) carried out by rater \(i = 1,\ldots ,I\), is modelled as follows:

$$\begin{aligned} y_{ij}= & {} {\textbf{x}}_{ij}'{\beta }+{\textbf{z}}_{ij}'{\textbf{u}}_i+\varepsilon _{ij},\;\; i = 1,\ldots ,I, \; j \in {\mathscr {J}}_i. \end{aligned}$$
(1)

Here \({\textbf{x}}_{ij}\) and \({\textbf{z}}_{ij}\) are, respectively, \(1 \times p\) and \(1 \times q\) vectors of distinct explanatory variables of rating \(y_{ij}\); \({\beta }\) is a \(p \times 1\) vector of non varying effects and \({\textbf{u}}_i\) is a \(q \times 1\) vector the hierarchical effects of rater i.

In the standard HLM formulation, the following distribution is specified for the rater effects:

$$\begin{aligned} {\textbf{u}}_i \sim N_q({\textbf{0}}, \pmb {\varSigma }), \quad i = 1,\ldots ,I. \end{aligned}$$

Where \(N_q(\cdot )\) stands for a q-variate normal distribution; Here \({\textbf{0}}\) is a \(q \times 1\) zero vector and \(\pmb {\varSigma }\) is a \(q \times q\) positive semi-definite covariance matrix. For the hierarchical normal linear model \(\varepsilon _{ij} \sim N(0,\sigma _\varepsilon^2 )\), with \({\textbf{u}}_i\) and \(\varepsilon _{ij}\) typically assumed independent. The distribution of each vector-valued hierarchical effects \({\textbf{u}}_i\) is then assumed to follow some distribution and captures variability across different raters.

In the above mentioned example, \(y_{ij}\) is the score given to student \(j \in {\mathscr {J}}_i\)’s essay by teacher i. Since an observational approach is adopted (i.e., each raters rates a different set of items), the effect of the student is not identifiable (each student is rated only by one rater). Assuming that students effects are i.i.d., their variance is added to that of the residuals. In the univariate case (i.e., when \(z_i=1\), varying intercept model) the relevance of the raters effect \(u_i \sim N(0,\sigma ^2_u)\), where \(\sigma ^2_u>0\) is the variance, might be quantified through the intraclass correlation coefficient (i.e., ICC):

$$\begin{aligned} ICC= \frac{\sigma _u^2}{\sigma _u^2+\sigma _\varepsilon ^2}. \end{aligned}$$

It is the ratio between the variance of the raters effect and the total variability of the model, i.e., the proportion of variance of the score due to the teacher, which reflects the correlation of two ratings given by the same rater. Smaller values of ICC indicate a small effect of the rater on the student’s score.

3 Dirichlet process mixture and hierarchical effects

The HLM assumption regarding the distribution of the hierarchical effects is crucial in characterising different possible clusters or latent patterns of heterogeneity among raters (Dorazio 2009). The common Gaussian assumption for the distribution of the these effects may obscure skewness and multimodality present in the data. A more flexible specification of the hierarchical effects distribution can help capture more complex patterns of variability. Models that account for skew-normal (Lin and Lee 2008), skew-normal-cauchy (Kahrari et al. 2019), multivariate t (Wang and Lin 2014), extreme values (McCulloch and Neuhaus 2021) effects distributions have been proposed (Schielzeth et al. 2020). Nevertheless, they poorly account for the possible presence of multimodality in those distributions. In this regard, a mixture distribution has been proposed as a potential solution (Heinzl and Tutz 2013; Kyung et al. 2011; Kim et al. 2006). Each mode can then correspond to a cluster with a similar pattern (e.g., the same deviation from the population mean). Several works have explored this issue in the past two decades (Villarroel et al. 2009; Tutz and Oelker 2017). For instance, Verbeke and Lesaffre (Verbeke and Lesaffre 1996) proposed a standard normal mixture distributions for the hierarchical effects. James and Sugar (James and Sugar 2003) explored this approach in the context of functional data. De la Cruz-Mesía (De la Cruz-Mesia and Marshall 2006) proposed a mixture distribution for non-linear hierarchical effects in modelling continuous time autoregressive errors. A heteroscedastic normal mixture model in the hierarchical effects distribution was considered in linear (Komárek et al. 2010) and generalized hierarchical linear (Komárek and Komárková 2013) models. Despite the breadth of specifications for the mixture model, in all the aforementioned models, the number of mixture components needs to be specified. Although this may not be a critical assumption in certain contexts, it may be questionable or detrimental in settings with a lack of a priori information on the level of multimodality, especially in cases where the characterisation of the multimodality is of direct interest.

When the number of components of the mixture is unknown, a Dirichlet Process Mixture (hereafter DPM) for the hierarchical effects is a natural extension (Gill and Casella 2009; Navarro et al. 2006; Verbeke and Lesaffre 1996). This nonparametric extension allows the model to capture an unknown marginal distribution of the hierarchical effects through the Dirichlet Process (Antoniak 1974; Ferguson 1973). Modeling the hierarchical effect \({\textbf{u}}_i\) as an infinite mixture of some distribution family (e.g., Normal) enables the model to account for possible multimodality without specifying the number of mixture components. Some existing works adopted this nonparametric approach and pose a DPM prior over the hierarchical effects (e.g., Heinzl and Tutz (2013); Heinzl et al. (2012); Kyung et al. (2011)).

The HLM of Eq. (1) is then specified in the same way as before through:

$$\begin{aligned} y_{ij}= & {} {\textbf{x}}_{ij}'{\beta }+{\textbf{z}}_{i}'{\textbf{u}}_i+\varepsilon _{ij}, \quad \quad i = 1,\ldots ,I, \quad j \in {\mathscr {J}}_i. \;\; \end{aligned}$$

The following hierarchical prior distribution is placed over the raters effects:

$$\begin{aligned}&{\textbf{u}}_i | \mathbf {\mu }_{i}, {\textbf{Q}}_{i} \sim N_q(\mathbf {\mu }_{i}, {\textbf{Q}}_{i}) \\&\mathbf {\mu }_i,{\textbf{Q}}_i | G \,\,{\mathop {\sim }\limits ^{iid}}\, \, G \\&G \sim DP(\alpha , G_0) \end{aligned}$$

where \(\mathbf {\mu }_i\) and \({\textbf{Q}}_i\) are, respectively, the \(q \times 1\) a location parameter vector and the \(q \times q\) positive semi-definite covariance matrix for the hierarchical effects \({\textbf{u}}_i\) of rater \(i=1,\ldots ,I\). Here \(\varepsilon _{ij} \sim N(0,\sigma _\varepsilon^2 )\), \(i = 1,\ldots ,I\), \(j \in {\mathscr {J}}_i\); \({\textbf{u}}_i\) and \(\varepsilon _{ij}\) are assumed independent as before.

3.1 DPM as a generative process for the hierarchical effects

Here, \(DP(\alpha , G_0)\) is a DPM with \(\alpha >0\) precision parameter and base measure \(G_0\). These specify the mixing distribution G (Heinzl and Tutz 2013), so that each realization of G is almost surely a discrete probability measure on the space \((\Omega , {\mathscr {F}})\) (Blackwell 1973). Thus, since the DPM is a discrete generative process with non-zero probability of ties, some of the realizations might be identical to each other with probability determined by the precision parameter \(\alpha\). Therefore, specifying this hierarchical model on the components location parameters \(\mathbf {\mu }\) induces a clustering in the hierarchical effects (i.e., the raters) (Kyung et al. 2011); hierarchical effects belonging to the same c-th cluster with location parameter \(\mathbf {\mu }_c\) are then independent and identically distributed. In other words, in the context of the HLM, the DPM specifies the component-specific location parameter \(\mathbf {\mu }_c\), so that each rater has each has their own unique hierarchical effects value \({\textbf{u}}_i\) (Heinzl and Tutz 2013).

The DPM is a generative process commonly used in conjunction with a parametric family of distributions (e.g., Normal, Poisson), and the base measure parameter \(G_0\) denotes this specified distribution. Thus, for any element \(A_n\), \(n=1,\ldots ,N\), of \({\mathscr {A}}\), a finite measurable partition of \(\Omega\),

$$\begin{aligned} (G(A_1), G(A_2),\ldots ,G(A_N) ) \sim Dir(\alpha G_0(A_1),\alpha G_0(A_2),\ldots ,\alpha G_0(A_N)) \end{aligned}$$

where \(Dir(\cdot )\) stands for the Dirichlet distribution, and \(G_0\) defines the expectation of GFootnote 3, therefore they have the same support. The parameter \(\alpha\), a multiplicative constant of the vector-valued Dirichlet parameter, determines the probability of a new realization of the process to be different of the previous ones (Blackwell and MacQueen 1973). In other words, it governs the probability that the DPM generates a new cluster. Formally, the generative property of the DPM is that, for \(i=1,\ldots ,I\), with I being for instance the total number of raters:

$$\begin{aligned} G \sim DP(\alpha , G_0), \qquad \mathbf {\mu }_i|G \sim G \end{aligned}$$

the probability that the new I-th realization \(\mu _I\) of G assumes a different values than the previous ones is described by the well known Pólya Urn Model:

$$\begin{aligned} \mathbf {\mu }_I|\mathbf {\mu }_1,\mathbf {\mu }_2,\ldots ,\mathbf {\mu }_{I-1},\alpha \sim \frac{\alpha }{\alpha +I-1}G_0+\frac{1}{\alpha +I-1} \displaystyle \sum _{c=1}^{C} r_c \end{aligned}$$

with \(C \in {\mathbb {N}}\) being the number of already observed distinct clusters among the realizations of G (i.e., the number of the different values of \(\mathbf {\mu }\) already observed, in other words the number of clusters) and \(r_c\) counts the elements in the c-th cluster. Basically, since G is a discrete probability measure, the C clusters represent different point masses (or different sets of point masses in the multivariate case) and \(r_c\) is the frequency of each of them. Considering the conditional distribution of \(\mu _I\) as a mixture distribution, the probability that \(\mathbf {\mu }_I\) is a new point mass sampled from \(G_0\) is proportional to \(\alpha\), the probability that it is equal to the already observed c-th point mass is proportional to \(r_c\). In this notation, the role of \(\alpha\) in sampling a new (not already observed) value of \(\mathbf {\mu }_I\) (i.e., a new point mass, a new cluster) is interpretable.

To this regard, Sethuraman (1994) described a stick-breaking construction of the DP.Footnote 4 In this formulation G is equivalent to:

$$\begin{aligned} G=\displaystyle \sum _{c=1}^{\infty } \pi _c \delta _{\mathbf {\mu }_c} \end{aligned}$$

where \(\delta\) is the Dirac measure on \(\mathbf {\mu }_c\) and \(\mathbf {\mu }_c \,{\mathop {\sim }\limits ^{iid}}\, G_0\) is assumed. The weights \(\{\pi _c\}_{c=1}^{\infty }\) of the infinite mixture result from the stick-breaking procedure as follows:

$$\begin{aligned}&\pi _c= \nu _c \prod _{l<c}(1-v_l)\\&v_c \,{\mathop {\sim }\limits ^{iid}}\, Be(1,\alpha ) \end{aligned}$$

with \(Be(\cdot )\) indicating the Beta distribution and \(\{\nu _c\}_{c=1}^{\infty }\) being reparameterized weights. It is even more explicit in this construction that the random measure G is a mixture of point masses. The distribution of the random weights \(\mathbf {\pi }\) (i.e., the probability of different allocation to the clusters) is governed through the stick-breaking process by the precision parameter \(\alpha\). Further details are given in the Appendix.

In practice, one of the established approximations to the stick-breaking process is to truncate the infinite number of components to a large, finite value:

$$\begin{aligned} G=\displaystyle \sum _{c=1}^{R} \pi _c \delta _{\mathbf {\mu }_c} \end{aligned}$$

for large enough value of R (Tutz and Oelker 2017; Gelman et al. 2013).

In summary, the hierarchical effects distribution considering a stick breaking construction of the DPM might be then specified as follow:

$$\begin{aligned}&{\textbf{u}}_i | \mathbf {\mu }, {\textbf{Q}}, \,{\mathop {\sim }\limits ^{iid}}\,\, \displaystyle \sum _{c=1}^{R} \pi _c N_q(\mathbf {\mu }_c, {\textbf{Q}}_c), \;\;\;i = 1,\ldots , I \\&\mathbf {\mu }_c, {\textbf{Q}}_c \,{\mathop {\sim }\limits ^{iid}}\, G_0 \\&\pi _c= \nu _c \prod _{l<c}(1-v_l),\;\;\text {where} \\&v_c \,{\mathop {\sim }\limits ^{iid}}\, Be(1,\alpha ), \;\;c=1\ldots ,R \end{aligned}$$

With this nonparametric model specification, latent common tendencies among raters might emerge through the components of the model (Heinzl and Tutz 2013; Heinzl et al. 2012; Kyung et al. 2011). The Bayesian approach allows us to characterize the shape of the distribution of the rater effects, as well as explore the effect of uncertainty on these (Gelman et al. 2013). For example, in an applied context, strict vs. accommodating are very common latent states that drive students’ essays grading process (Zupanc and Štrumbelj 2018; Briesch et al. 2014; Dee 2005).

4 Prior distributions and estimation procedure

The DPM mixture model has been well studied in the literature in a variety of different settings, especially within Bayesian inference (Canale and Prünster 2017; Müller et al. 2015). Several sampling schemes have been proposed both in the Bayesian context (e.g., Canale and Dunson (2011); Dahlin et al. (2016); Kyung et al. (2011)) and in the frequentist one (e.g., Tutz and Oelker (2017)). Within the Bayesian framework, Gibbs sampling (Dahlin et al. 2016), slice sampler (Kyung et al. 2011; Walker 2007), Sequential Monte Carlo algorithms (Ulker et al. 2010), split-merge algorithms (Bouchard-Côté et al. 2017), have been proposed among others.

In this work, the model specification permits the use of conjugate priors, so that a blocked Gibbs sampling can be used (Heinzl and Tutz 2013; Heinzl et al. 2012; Kyung et al. 2011)., with details shown below.

4.1 Prior specification

Several of the parameters in the model have conjugate prior distributions which allow easier computation.

  • For the effects \({\beta }\) the following hierarchical prior is assigned:

    $$\begin{aligned}&{\beta }|{\textbf{b}}_{\beta }, {\textbf{B}}_{\beta } \sim N_p({\textbf{b}}_{\beta }, {\textbf{B}}_{\beta })\\&{\textbf{b}}_{\beta } \sim N_p({\textbf{b}}_0, {\textbf{S}}_0)\\&{\textbf{B}}_{\beta }=diag(\sigma _{\beta _1}^2,\ldots ,\sigma _{\beta _p}^2)\\&\sigma _{\beta _m}^2 \,{\mathop {\sim }\limits ^{iid}}\, IG(a_{\beta _0}, b_{\beta _0} ) \end{aligned}$$

    for \(m =1,\ldots ,p\), where p is the number of covariates associated to the effects \({\beta }\). Here, \(IG(\cdot )\) stands for inverse-gamma with shape parameters \(a_{\beta _0}>0\) and rate parameters \(b_{\beta _0}>0\). Where \({\textbf{b}}_0\) and \({\textbf{S}}_0\) are, respectively, the \(p \times 1\) vector of location parameters and the \(p \times p\) positive semi-definite covariance matrix of \({\textbf{b}}_\beta\) (i.e, the location parameter vector of the non varying effect \({\beta }\)); \({\textbf{b}}_\beta\) and \({\textbf{S}}_\beta\) are, respectively, the \(p \times 1\) location parameter and the \(p \times p\) positive semi-definite covariance matrix for \({\beta }\) (i.e., the non varying effect). The set \(\{{\textbf{b}}_0, {\textbf{S}}_0,a_{\beta _0}, b_{\beta _0}\}\) of the hyperparameters are specified by the user. A diagonal matrix is suggested for \({\textbf{S}}_0\) as showed by Heinzl et al. (2012).

  • A diagonal structure for the \(q \times q\) prior covariance matrix \({\textbf{Q}}_r\) for the hierarchical effects is specified as follows for each mixture component \(r = 1,\ldots ,R\) and each each related covariate \(d =1,\ldots , q\):

    $$\begin{aligned} {\textbf{Q}}_r= & {} diag(\sigma _{Q_{1r}}^2,\ldots ,\sigma _{Q_{qr}}^2) \end{aligned}$$

    For the base measure \(G_0\)Footnote 5 and the precision parameter \(\alpha\) of the DP mixture model the following priors are specified:

    $$\begin{aligned} G_0= & {} N_q(\mathbf {\mu }_0,{\textbf{D}}_0) \times IG(a_{Q_0},b_{Q_0})^q \\ \mathbf {\mu }_0\sim & {} N_q({\textbf{m}}_0, {\textbf{W}}_0)\\ {\textbf{D}}_0= & {} diag(\sigma _{{D_0}_1}^2,\ldots ,\sigma _{{D_0}_q}^2)\\ \sigma _{{D_0}_d}^2\sim & {} IG(a_{D_0},b_{D_0})\\ \alpha\sim & {} Ga(c_0^{\alpha },C_0^{\alpha }) \end{aligned}$$

    for \(d=1,\ldots ,q\), where q is the number of covariates associated to the hierarchical effects \({\textbf{u}}_i\), and for \(a_{D_0},a_{Q_0}>0\) and \(b_{Q_0}, b_{D_0}>0\). Here \(Ga(\cdot )\) stands for Gamma distribution with \(c_0^{\alpha }>0\) and \(C_0^{\alpha }>0\) respectively the shape and the rate parameters. Where \({\textbf{m}}_0\) and \({\textbf{W}}_0\) are, respectively, the \(q \times 1\) location parameter vector and the \(q \times q\) positive semi-definite covariance matrix of \(\mathbf {\mu }_0\) (i.e. the location parameter of the base measure \(G_0\)); \(\mathbf {\mu }_0\) and \({\textbf{D}}_0\) are, respectively, the \(q \times 1\) location parameter vector and the \(q \times q\) positive semi-definite covariance matrix of the base measure \(G_0\). The set \(\{a_{Q_0},b_{Q_0}, {\textbf{m}}_0, {\textbf{W}}_0,a_{D_0}, b_{D_0}, c_0^{\alpha }, C_0^{\alpha }\}\) of the hyperparameters need to be fixed. A diagonal structure is suggested for \({\textbf{W}}_0\) as above.

  • The following prior is assigned to the noise variance:

    $$\begin{aligned} \sigma _\varepsilon\sim & {} Ga(a_\varepsilon ,b_\varepsilon ) \end{aligned}$$

    with \(a_\varepsilon >0\) and \(b_\varepsilon >0\) hyperparameters fixed by the user as well.

4.2 Posterior sampling

Since most of the parameters in the model have conjugate prior distributions, a blocked Gibbs sampling algorithm was used for the posterior sampling (Ishwaran and James 2001).

The parameter vector for the model is \(\theta =\{ {\beta }, {\textbf{b}}_\beta ,{\textbf{B}}_\beta , \mathbf {\mu }_0,{\textbf{D}}_0, {\textbf{Q}}, \sigma _{\varepsilon }, \alpha , \mathbf {\pi }, {\textbf{c}} \}\) which is updated at each state of the Markov chain of the Gibbs sampling. Here \({\textbf{c}}= (c_1,\ldots ,c_I)\) is the allocation parameter of the raters to the clusters and. Further details on the following sampling are given in the Appendix. The closed-form posteriors are as follows.

  1. 1.

    Update parameters referring to effects \({\beta }\):

    $$\begin{aligned} {\beta }| {\textbf{b}}_\beta , {\textbf{B}}_\beta , {\textbf{u}}, \sigma _\varepsilon , {\textbf{y}}\sim & {} N_p({\textbf{b}}_\beta ^{*},{\textbf{B}}_\beta ^{*}) \end{aligned}$$

    For each covariate \(m=1,\ldots ,p\) associated with a non varying effect \(\beta _m\),

    $$\begin{aligned} b_{\beta _m}|\sigma _{\beta _m}^2, \beta _m\sim & {} N \left( \left( \frac{1}{\sigma _{\beta _m}^2} + \frac{1}{s_{0_m}^2}\right) ^{-1} \left( \frac{\beta _m}{\sigma _{\beta _m}^2} + \frac{m_{\beta _m}}{s_{0_m}^2} \right) , \left( \frac{1}{\sigma _{\beta _m}^2} + \frac{1}{s_{0_m}^2}\right) ^{-1}\right) \\ \sigma _{\beta _m}^2| b_{\beta _m}, \beta _m\sim & {} IG\left( a_{b_0} + \frac{1}{2}, \frac{1}{2} (\beta _m-b_{\beta _m})\right) \end{aligned}$$
  2. 2.

    Update parameters referring to hierarchical effects:

    • For each rater \(i = 1,\ldots ,I\):

      $$\begin{aligned} {\textbf{u}}_i|\mathbf {\mu }_{c_i}, \mathbf {\mu }_0, {\textbf{Q}}_0, {\beta }, \sigma _\varepsilon , {\textbf{y}}_i\sim & {} N_q(\mathbf {\mu }_{c_i}^*,{\textbf{Q}}_{c_i}^*), \end{aligned}$$

      where \(\mathbf {\mu }_{c_i}\) is the location parameter vector of the cluster where the i-th rater is allocated.

    • For each component \(r=1,\ldots ,R\) of the truncated mixture: - If \(\not \exists i: c_i = r\) (if no rater are currently allocated into cluster r), for each covariate \(d =1,\ldots ,q\) associated to an hierarchical effect (independently):

      $$\begin{aligned} \mathbf {\mu }_r | \mathbf {\mu }_0, {\textbf{D}}_0\sim & {} N_q(\mathbf {\mu }_0, {\textbf{D}}_0) \\ \sigma _{Q_{dr}}^2\sim & {} IG(a_{Q_0},b_{Q_0}) \end{aligned}$$

      - If \(\exists i: c_i = r\) (if at least one rater assigned to component r), for each covariate \(d =1,\ldots ,q\) associated to an hierarchical effect (independently):

      $$\begin{aligned} \mu _{r_d} | \sigma _{Q_m}^2, \mu _{0_r}, \sigma _{D_{0_m}}^2, {\textbf{u}}, {\textbf{c}}\sim & {} N(\mu _{0_r}^*, \sigma _{D_{0_m}}^{2*}) \\ \sigma _{Q_{dr}}^2|\mathbf {\mu }, {\textbf{u}}\sim & {} IG \left( a_{Q_0}^*, b_{Q_0}^* \right) \end{aligned}$$

      Essentially, at each iteration t, if the r-th cluster is empty the component location parameters \(\mu _r\) are sampled from the prior as suggested by (Gelman et al. 2013), otherwise they are drawn from the above mentioned closed-form posterior.

    • Each rater \(i =1,\ldots ,I\) is re-allocated into a cluster:

      $$\begin{aligned} c_i| \mathbf {\pi }, \mathbf {\mu }, {\textbf{Q}},{\textbf{u}}_i\sim & {} Cat(\mathbf {\omega }_i^*) \end{aligned}$$

      where \(Cat(\cdot )\) stands for Categorical distribution, and \(\mathbf {\omega }_i^*\) is reported in the Appendix. A truncated approximation for the DPM mixture model was used (Gelman et al. 2013; Heinzl et al. 2012) for a large value of R. The stick-breaking construction was used to generate the mixture weights \(\pi _{1:R}\).

    • For each component \(r = 1,\ldots ,R-1\):

      $$\begin{aligned} v_r|\mathbf {\phi }, \alpha\sim & {} Be \left( 1+c_r,\alpha +\displaystyle \sum _{l=c+1}^{R} r_l \right) \end{aligned}$$

      and \(v_R=1\) for the last cluster. Here \(c_r\) is the number of raters assigned to the cluster r, and \(r_l\) is the number of raters assigned to the cluster l.

    • The precision parameter is updated as follows:

      $$\begin{aligned} \alpha |v_1,\ldots ,v_{R-1}\sim & {} Ga\left( R-1+a_\alpha ,b_\alpha -\displaystyle \sum _{c=1}^{R-1} ln(1-v_r) \right) \end{aligned}$$
    • For each covariate \(d =1,\ldots ,q\) associated with an hierarchical effect the base measure parameters are updated:

      $$\begin{aligned} \mu _{0_d}|\sigma _{D_{0_d}}^2, \mathbf {\mu }\sim & {} N\left( \left( \frac{I}{\sigma _{D_{0_d}}^2} + \frac{1}{\sigma _{W_{0_d}}^2} \right) ^{-1} \left( \frac{I}{\sigma _{D_{0_d}}^2} {\overline{\mu }}_d + \frac{m_{0_d}}{\sigma _{W_{0_d}}^2} \right) , \left( \frac{I}{\sigma _{D_{0_d}}^2} + \frac{1}{\sigma _{W_{0_d}}^2} \right) ^{-1} \right) \end{aligned}$$

      where \({\overline{\mu }}_d\) is the mean of the location parameters related to the d-th covariate over all the clusters.

      $$\begin{aligned} \sigma _{D_{0_m}}^2|\mu _{0_m},\mathbf {\mu }\sim & {} IG \left( a_{D_0} + \frac{I}{2}, b_{D_0}+\frac{1}{2}\displaystyle \sum _{i=1}^{I} (\mu _{i_m}-\mu _{0_m})^2 \right) \end{aligned}$$
  3. 3.

    Update the error variance:

    $$\begin{aligned} \sigma _\varepsilon ^2|{\beta }, {\textbf{u}}, {\textbf{y}}\sim & {} IG \left( a_\varepsilon + \frac{1}{2}I |{\mathscr {J}}|, b_\varepsilon +\frac{1}{2}\displaystyle \sum _{i=1}^{I} \displaystyle \sum _{j \in {\mathscr {J}}_i}\left( y_{ij}-{\textbf{x}}_{ij}' \beta -{\textbf{z}}_{ij}' {\textbf{u}}_i \right) ^2 \right) . \end{aligned}$$

    Here \(|{\mathscr {J}}|\) is the cardinality of the set of all the rated items \({\mathscr {J}}\), it equals the number of observations.

5 The nonparametric \(\lambda\) index

The marginal posterior distribution of the hierarchical effects in the model outlined above captures information about the polarization or disagreement among raters (on the assumption that the model captures the data adequately). The ICC (i.e., intraclass correlation coefficient, (Martinková et al. 2023; Bartoš et al. 2020; Agresti 2015; Gelman et al. 2013)) might adequately quantify inter-rater variability if the normal distributional assumption of the rater hierarchical effect holds. Two assumptions are made computing the standard ICC considering a normal distributed hierarchical effect. Firstly, that the raters are sampled from the same population. Secondly, that possible different latent trends among raters are not interesting or eventually regarded as disagreement ratings. This might be a good first approximation of the rating process. Nevertheless, when more detailed considerations are needed, or subtle heterogeneity among raters is expected, the standard ICC might be less informative and inaccurate. Besides the latter issue, further information about the shape of the posterior might be quantified. For instance, in presence of a bimodal hierarchical effects distribution with two very distant modes (for example, when opinions are polarised), considering the posterior distribution of \(\sigma _u\) as an index of variability among raters might be misleading.

Several indexes have been proposed to quantify group opinion polarization (e.g., (Tang et al. 2022; Koudenburg and Kashima 2022; Koudenburg et al. 2021; Esteban and Ray 1994)) and to measure distribution bimodality (e.g., the Ashman’s D (Forchheimer et al. 2015) or the bimodal separation index (Zhang et al. 2003)). The strong assumptions behind their use limit them to be valid options only in the parametric context or when the number of clusters is known. A model based nonparametric index is here proposed to overcome these limitations.

To this end the full estimated distribution of \({\textbf{u}}\) resulting from the model might be useful. At each iteration t, the density of \({\textbf{u}}\) is given by the corresponding mixture model given the parameters at iteration t. Following the formulation of (Gelman et al. 2013), the set of modes and antimodes (i.e., the lowest frequent value between two modes) is identified. When the distribution of \({\textbf{u}}\) is multimodal, the latent polarization (disagreement) \(\lambda\) is then defined as the log ratio between the mean density of the modes and the that of the anti-modes, it is zero when it is unimodal:

$$\begin{aligned} \lambda = {\left\{ \begin{array}{ll} \log \left( \frac{\frac{1}{M}\sum _{m=1}^{M} f_{u}(\gamma _m)}{\frac{1}{M-1}\sum _{m=1}^{M-1} f_{u}(\zeta _m)}\right) , &{} \quad \text {if} \quad M>1 \\ 0, &{} \quad \text {otherwise.} \\ \end{array}\right. } \end{aligned}$$

Where M is the number of modes \(\gamma _m\), \(m=1,\ldots ,M\) and the number of antimodes \(\zeta _m\), \(m=1,\ldots ,M-1\) of the density of \({\textbf{u}}\); \(f_{u}(\cdot )\) denotes the density at a specific point. Larger values of \(\lambda\) indicate strongly multimodal distribution of the hierarchical effects, whereas smaller values are evidence of weak multimodality, thus the estimated hierarchical effects are less concentrated.

As it is shown in Fig. 1 larger values of \(\lambda\) indicate distribution polarization, whereas smaller values indicate a less concentrated and more spread density distribution. The \(\lambda\) index is strongly affected by both location and scale parameters of the mixture components. For this reason it might be very informative in presence of multimodal distributions. Assuming such a raters’ group polarization as a result of low latent agreement among raters, the \(\lambda\) index might be a useful diagnostic tool.

Fig. 1
figure 1

Different values of \(\lambda\) indicate different polarization levels. Three different values of \(\lambda\) were computed for three different mixture distributions, respectively. The realizations of these distribution are here referred to as u. Black dotted lines indicate the mean mode density, red dotted lines indicate the mean antimode density. a High polarization: the mixture components are highly and clearly separate, the mean density values of the modes is far larger then the mean density value of the antimodes; the log-density ratio between these two quantities is \(\lambda =2.55\). b Medium polarization: the mixture components are clearly separated, but the mean density values of the modes is closer to the mean density value of the antimodes; the log-density ratio between these two quantities is \(\lambda =0.81\). c Low polarization: the mixture components are not clearly separated, the mean density values of the modes is very close to the mean density value of the antimodes; the log-density ratio between these two quantities is \(\lambda =0.19\). d No polarization: the mixture distribution has only one mode (i.e., \(\gamma _1\)) and \(\lambda =0\) since the number of mode is not greater then one

6 Simulation studies

The following simulations aim to evidence how the values of \(\lambda\) varying across different polarization settings. The first simulation investigates the role of the precision parameter \(\alpha\) and the variance of the mixture components in determining the values of \(\lambda\). The second one shows the complementary role of \(\lambda\) in the inter-rater agreement analysis and how this index varies across different settings.

6.1 Simulation 1: DPM and \(\lambda\)

6.1.1 Simulation setting

The first simulation study explores the role that the precision parameter of the Dirichlet Process and the variance of the components have in determining the values of the log-density index \(\lambda\). For simplicity purpose the mixture components are assumed to have the same variance Q in this simulation, so the component subscription will be omitted. The objective is to study the effect of \(\alpha\) and Q, on \(\lambda\) conditional on all the other variables. Since the former has a crucial role in the determination on the point masses of G, and thus the concentration of its realizations, an inverse relation between \(\alpha\) and \(\lambda\) is expected if Q is fixed. Likewise, an inverse relation between Q and \(\lambda\) is expected if \(\alpha\) is fixed. It is interpretable as an index of the sharpness of the modes. For this reason both the precision parameter of the DPM and the variance of its components are expected to have an effect on \(\lambda\). Controlling for Q (i.e., keeping it fixed), the expected relation is: the smaller \(\alpha\), i.e. the precision of the DPM mixture, the larger \(\lambda\), i.e. the relative density around the modes; controlling for \(\alpha\) (i.e., keeping it fixed), the expected relation is: the smaller Q, i.e. the variance of the components of the DPM, the larger \(\lambda\). The parameters of the base measures \(G_0\) have a non-negligible role in determining \(\lambda\), so in this section focus is devoted to the relation between the precision parameter \(\alpha\), the mixture components variance Q and the index \(\lambda\). Indeed, in all the study simulations the values of the other parameters involved in the DPM have been kept fixed across the scenarios.

Data generating process

The experimental design is as follows. For 4 different values of \(\alpha = (0.1, 1, 5, 20)\) and 2 different values of \(Q = (0.1, 1.5)\) a set of independent observations \(u=1,\ldots ,n\) are drawn from the following DPM:

$$\begin{aligned}&u_i |\mu _c, Q, \,{\mathop {\sim }\limits ^{iid}}\, \displaystyle \sum _{c=1}^{R} \pi _c N(\mu _c, Q), \;\;\;i = 1,\ldots , n \\&\mu _c \,{\mathop {\sim }\limits ^{iid}}\, G_0 \\&\pi _c= \nu _c \prod _{l<c}(1-v_l),\;\;\text {where} \\&v_c \,{\mathop {\sim }\limits ^{iid}}\, Be(1,\alpha ), \;\;c=1\ldots ,R. \end{aligned}$$

Where \(\mu _c\) and \(\pi _c\) are the location parameter and the mixing proportion of the component c, respectively; \(G_0\) is the base measure; and \(\nu _c\) is the parameter of the stick-breaking. Following the above mentioned truncated stick-breaking construction, here R is the maximum number of observable cluster. Across the eighth scenarios the following quantities are assigned: the number of observations \(n=500\), the maximum number of clusters \(R=50\), the base measure \(G_0:\) \(U(-6,6)\). Here, \(U(\cdot )\) stands for uniform distribution. The use of these distributions in the present experimental context aims to highlight the effect of different values of \(\alpha\) and Q on \(\lambda\) in a more evident and interpretable manner.

6.1.2 Results

As shown in Tables 1 and 2 as \(\alpha\) increases, and so the number of point masses of G increases as well, \(\lambda\) decreases. The density of the observations \(u=1,\ldots ,n\) is concentrated around few point masses (few modes) for lower value of \(\alpha\) and is spread out the larger. Note also the change of density of the antimodes. As expected, it is proportional to the precision parameter in a positive fashion. As the observations are more spread as \(\alpha\) increases, there are fewer intervals in the support with relative small density: \(\lambda\) index decreases at larger values of \(\alpha\) (column-wise Tables 1 and 2). A similar proportional relation is observed between the variance of the mixture components Q and \(\lambda\) when \(\alpha\) is kept fixed (row-wise Tables 1 and 2). Smaller values of both \(\alpha\) and Q result in a high polarized distribution of \(u=1,\ldots ,n\) and correspond to larger values of \(\lambda\). Whereas larger values of both \(\alpha\) and Q result in a low polarized distribution and correspond to smaller values of \(\lambda\). It is is an index of how spread the density is over the support of the hierarchical effects.

From an interpretative point of view, \(\lambda\) indicates the degree of overlap between the infinitely many clusters. It might be informative of the separation between them. Since this quantification is based on a non-parametric density, \(\lambda\) is not directly related to the number of the group, or to the cluster location. It indicates the degree to which the independent observations drawn from a DPM overlap; the variance of the cluster Q also plays a crucial role. The index thus quantifies the combined effect of the parameters to assess the extent to which possible different opinions (i.e., the modes) might be strongly shared among the raters (i.e., the modes are sharp pick of density). To this regard, \(\lambda\) is a polarization index in presence of heterogeneity. The higher the polarization levels, the larger the values of the index. The practical interpretation and the operational decisions must be guided by the field of application.

Table 1 The eight scenarios correspond to a DPM with different values of the precision parameter \(\alpha\) and the components variance Q. Each scenario correspond to a specific combination of these two parameters. All the other quantities are fixed across the scenarios. The realizations of the DPM are here indicated as \(u= 1,\ldots ,n\). Different combinations of \(\alpha\) and Q result in different values of \(\lambda\). For fixed values of Q (column-wise), a proportional relation is shown between \(\alpha\) and \(\lambda\): when the first increases, the second decreased. Similarly, for fixed values of \(\alpha\) (row-wise), a proportional relation is shown between Q and \(\lambda\): when the first increases, the second decreased. Smaller values of both \(\alpha\) and Q result in a high polarized distribution of \(u=1,\ldots ,n\) and correspond to larger values of \(\lambda\). Whereas larger values of both \(\alpha\) and Q result in a low polarized distribution and correspond to smaller values of \(\lambda\)
Table 2 Parameters values at each scenarios. Each of them correspond to a DPM with different values of the precision parameter \(\alpha\) and the components variance Q. Both \(\alpha\) and Q have an effect on the distribution polarization of the realizations of the DPM. As a result different values of \(\lambda\) are observed

6.2 Simulation 2: inter-rater agreement and \(\lambda\)

6.2.1 Simulation setting

The following simulation study aims to highlight the complementary role of \(\lambda\) as an additional summary metric in inter-rater agreement analysis. The varying intercept parametrization is hereafter adopted as univariate case for the raters effects. To this aim the standard modelling approach (i.e., the normal distributed varying intercept and the resulting ICC) is compared with the nonparametric proposed above (i.e., the DPM prior over the varying intercept and \(\lambda\)). The experiment evaluates the performance of both standard ICC and \(\lambda\) in the presence of heterogeneity between raters’ evaluations due to a multimodal distribution of the hierarchical effects.

Data generating process

Three experimental scenarios were planned, in which a different clustering on the raters’ intercept parameter was specified in the generative model. In each scenario the rater’s intercept \(u_i\) was generated from a bimodal Gaussian mixture. The location parameters of the mixture components were fixed across the scenarios, \(\mu _1=-3\) and \(\mu _2=3\); whereas decreasing values (1, 0.5, 0.1) were assigned to the components scale parameters \(Q_1\) and \(Q_2\) (see Table 3). This resulted in different polarization scenarios. The mixture components were kept equiprobable (\(\pi _c=0.5\)), \(c \in \{1,2\}\) throughout. The number of raters \(I=100\) and the number of items \(J=250\) were fixed across the scenarios. One continuous covariate \(x_{ij}\) with an effect \(\beta =2\) was used and it was the same across the scenarios.

Table 3 True raters’ hierarchical effects distribution across different scenarios. A Gaussian mixture is specified as distribution of the hierarchical effects \(u_i=1,\ldots ,I\). The location parameters of two components of the mixture are kept fixed across the scenarios and decreasing values were assigned to the respective scale parameters

Standard model approach

The following priors were specified for the standard hierarchical effect model (i.e., the varying intercepts are assumed to be i.i.d. normal distributed):

$$\begin{aligned}&\beta \sim N(0, 5), \\&\sigma _\varepsilon \sim Exp(0.2), \\&\sigma _u \sim Exp(0.2), \\&u_i \,{\mathop {\sim }\limits ^{iid}}\, N(0,\sigma _u), \end{aligned}$$

for \(i =1,\ldots ,I\); \(Exp(\cdot )\) stands fro the exponential distribution and \(\beta\) is the non-hierarchical effect, \(\sigma _u\) and \(\sigma _\varepsilon\) are the hierarchical effect and the noise variances parameters, respectively. A logic of complexity penalization was used in the choice of the above mentioned priors distributions (Simpson et al. 2017). The posterior of each standard hierarchical effect model were sampled using NUTS-Hamiltonian MCMC in Stan language (Stan Development Team 2022).

Nonparametric model approach

The set of priors introduce in Sect. 4 were elicited for the DPM models with the following hyperparameters as suggested by (Heinzl et al. 2012): \({\textbf{b}}_0= {\textbf{0}}, {\textbf{S}}_0= 1000{\textbf{I}}_p\), \(a_{\beta _0}=0.005, b_{\beta _0}=0.005, {\textbf{m}}_0= {\textbf{0}}\), \({\textbf{W}}_0=100{\textbf{I}}_q, a_{D_0}=0.5, b_{D_0}=0.5\), \(a_{Q_0}=0.001, b_{Q_0}=0.001, a_\alpha =2, b_\alpha =2,a_\varepsilon =0.005, b_\varepsilon =0.005\). As result of some preliminary analysis, a dense grid of 481 equally-spaced values from − 12 to 12 (i.e., with a fixed interval of 0.05) was used to monitoring the mixture density of the nonparametric varying intercept \(u_i\) at each iteration. The posterior distribution of the nonparametric hierarchical effect is obtained as the set of the mean density of each point of the grid over the iterations (Gelman et al. 201320132013).

In all the computations for both the models 55,000 iteration with 5000 burn-in were used, the Markov chains were thinned the by a factor of 50, resulting in samples of size 1000 (Heinzl et al. 2012).

6.2.2 Results

As shown in Table 4 the standard model (i.e., that in which hierarchical raters intercepts are assumed to be i.i.d. normally distributed) due to the rigid distributional assumption of the hierarchical parameters is not able to capture the possible multimodal distribution and it resulted in a large value of the hierarchical effect variance \(\sigma _u\) (see Table 4 and Fig. 2). As a result, the ICC didn’t capture almost any difference among the three different scenarios (see Table 4 and Fig. 3). On the contrary, the DPM model, due to the flexible nonparametric specification of the intercepts prior, showed a good performance. As evident from Fig. 4 and Table 4 the DPM model was far more able to reproduce the data generating process. The different mixture used to generate the data emerged clearly from the posterior of the grid adopted to monitoring \({\textbf{u}}\) (Fig. 4). Since the DPM model properly learn the multi-modalities of the raters intercepts density, the index \(\lambda\), being based on the ratio between the mean density of the modes and that of the antimodes present in the grid at each iteration, showed to be able to differentiate the three different polarization scenarios. It gives some interesting information regarding the shape of the non-parametric mixture distribution. The 95% credible interval of \(\lambda\) (see Table 5 and Fig. 5) as estimated in the three different scenarios highlighted different degrees of amplitude and separation (i.e., different degrees of polarization) along them. The index \(\lambda\) is computed as a logarithm of the ratio of the average mode density against the density of the antimodes at each iteration t of the posterior sampler. So, in the first scenario, the values of the 95% credible interval are smaller, indicating that in most of the iterations the difference between the mean density at the modes and that of the antimodes was very small. In terms of the third scenario, \(\lambda\) assumed rather larger values along the iterations as evidence that the mode density is far larger than that of the antimodes. In other words, the rater clusters were separated and clearly distinct. The parameters of the DPM are the most influential with regard to \(\lambda\). Specifically, the location parameters \(\mu _c\), \(c=1,\ldots ,R\), and the scale parameters \(Q_c\), \(c=1,\ldots ,R\), showed to have a combined effect of the proposed index. The 95% HDP intervals of the parameters of both the DPM prior model and that with normal distributional assumption are reported in Tables 6 and 7, respectively.

Table 4 95% HPD intervals of the hierarchical effects variance \(\sigma _u\) from the standard models (i.i.d. normal distributed varying intercepts) and of the Grid density of the hierarchical effect in DPM models
Table 5 95% HPD intervals of the ICC from the standard models (i.i.d. normal distributed varying intercepts) and of \(\lambda\) from the DPM models
Table 6 95% crebible intervals of \(\beta\), DPM and residuals related parameters. Here \(\beta\) is the non varying effect, \(b_\beta\) and \(\sigma _\beta\) are, respectively, the related location and scale hyperparameters; \(\mu _0\) and \(\sigma _{D_0}\) are the location and scale parameters of the base measure \(G_0\), respectively. The precision parameter \(\alpha\) and the residuals standard deviation \(\sigma _\varepsilon\) are also reported
Fig. 2
figure 2

95% HPD intervals of \(\sigma _u\) from the standard models (i.i.d. normal distributed varying intercepts): a Scenario 1: lower polarization. b Scenario 2: medium polarization. c Scenario 3: higher polarization. Scenario 3: higher polarization. These models, due to their rigid distributional assumption, poorly differentiate the three different polarization scenarios

Fig. 3
figure 3

95% HPD intervals of ICC from the the standard models (i.i.d. normal distributed varying intercepts): a Scenario 1: lower polarization. b Scenario 2: medium polarization. c Scenario 3: higher polarization

Fig. 4
figure 4

95% HPD intervals of the hierarchical effect from the DPM. The different mixture used to generate the data emerged clearly from the posterior of the grid adopted to monitoring \({\textbf{u}}\). a Scenario 1. b Scenario 2. c Scenario 3

Fig. 5
figure 5

Posterior distribution of \(\lambda\): a Scenario 1: lower polarization. b Scenario 2: medium polarization. c Scenario 3: higher polarization. The black dotted lines stands for 95% credible intervals

Table 7 95% HPD intervals of the other parameters of the standard models (i.i.d. normal distributed varying intercepts)

7 Large scale performance assessment

The evaluation heterogeneity of teachers is a long-standing issue in psychometrics (Uto 2022; Shirazi 2019; Bonefeld and Dickhäuser 2018; Casabianca et al. 2015; DeCarlo 2008). Highly biased scored might have a detrimental effect on students proficiency and education (Chin et al. 2020; Paredes 2014; Cooper 2003). The proposed nonparametric model and the index \(\lambda\) might be valuable tools to address this issue. They might help to shed light on very biased assessment contexts and to provide fairer scores. The estimated hierarchical effect of each teacher (which may be interpreted as the teacher’s bias) might be used to adjust the observed score. The index \(\lambda\) might quantify teachers polarization in their grading.

The Matura data set

As an illustrative real data application, a large scale performance assessment data set was analysed (Zupanc and Štrumbelj 2018). The DPM-model was applied to a large-scale essay assessment data obtained during the nation-wide external examination conducted by the National Examination Centre in upper secondary schools in Slovenia also known as Matura and analyzed in Zupanc and Štrumbelj (2018). These data were related to the spring term argumentative essays for years between 2010 and 2014. Particular attention is devoted to the distinction between two main aspects of essay writing: the language correctness (i.e., the presence of grammatical or syntactic errors) and the the good argumentation of the content (i.e., a good and clear presentation of all the arguments). Regarding the data structure, students are nested within the teachers. So that each student’s essay is evaluated by one trained teacher, who is asked to grade it concerning two different rubrics. An essay can receive a score between 0 and 20 for the language-related rubric and between 0 and 30 for the content-related one. Prior analysis of these data (Zupanc and Štrumbelj 2018) revealed that heterogeneity among teachers was broadly down to two types: strict and lenient. The two different trends might be captured by the model and their polarization quantified by the \(\lambda\) index.

For this reason N=2616 students’ essays, each scored by one of I=18 different teachers, were considered for the analysis.Footnote 6 The objective of this application is to analyze teachers’ individual differences in scoring the essay content, controlling for its language correctness. How lenient or strict they are in scoring the quality of an essay content, without the effect of the language correctness. The content score is commonly ways more susceptible to idiosyncrasies or biases of the teacher than the language-related score, which is generally more objective (Childs and Wooten 2023; Zhu et al. 2021; Shirazi 2019). Accordingly, the content-related score was specified as outcome variable and the language-related score as covariate with a non varying effect. A DPM hierarchical prior was specified over the teachers’ intercepts. All the scores were re-scaled for this analysis to get a easier parameters value interpretationFootnote 7 (Gelman et al. 2013).

7.1 Results

The language-related score showed a posterior mean effect of 0.27 on the content-related score, with a (0.16, 0.38) 95% credible interval. The language correctness of the essay writing had moderate role in predicting the evaluation of the its content. As shown by Fig. 6a the DPM-model learned the presence of two main trends from the data. The bimodal non-parametric distribution over the grid suggested that the teachers were rather heterogeneous in the essay scoring process. More precisely, they seemed to be slightly polarized around two main tendencies. Some teacher showed a slightly more lenient or stricter than the others (i.e., who had a larger or smaller hierarchical effect posterior mean, respectively), see Fig. 6. The \(\lambda\) index showed a posterior mean of 1.87 which suggested a low polarization. The \(\lambda\) 95% HPD interval was (0.0, 6.83) which indicated a non negligible occurrence of quite high values of \(\lambda\). All the other parameters credible intervals are reported in Table 8

Assuming this latent group polarization as a low latent agreement among raters, the \(\lambda\) index might be used in a diagnostic manner. Considering the present application, some solutions might be suggested for a fairer assessment process. Firstly, assuming a very negligible noise term, the teacher’s estimated bias might be removed from the actual score. Another practical solution might be the implementation ad hoc training aimed to a much more shared point of view in essay scoring.

Fig. 6
figure 6

a 95% HPD of the monitoring grid for the teachers’ hierarchical effects \(u=1,\ldots ,I\). b Posterior mean of the hierarchical effect of each teacher. Two different clusters emerged from the analysis, as expected: the more lenient (to the right-hand side) and the stricter (to the left-hand side). c \(\lambda\)’ 95% credible intervals. It indicate a moderate polarized posterior distribution of the posterior hierarchical effects

Table 8 95% credible intervals of \(\beta\), DPM and residuals related parameters. Here \(\beta\) is the non varying effect, \(b_\beta\) and \(\sigma _\beta\) are, respectively, the related location and scale hyperparameters; \(\mu _0\) and \(\sigma _{D_0}\) are the location and scale parameters of the base measure \(G_0\), respectively. The precision parameter \(\alpha\) and the residuals standard deviation \(\sigma _\varepsilon\) are also reported

8 Conclusions

Most of the statistical models commonly used to analyze data from such observational contexts haven’t shown to be very flexible to certain types of heterogeneity among raters. The common HLMs with a normal (or unimodal) distributional assumption for the hierarchical effects cannot capture any possible latent clusters, i.e. any multimodality. Indeed, the residual covariance modelled through the hierarchical effects might be informative about different latent similarities among raters. In this regard, incorporating a DPM in the prior of the hierarchical effects distribution is a flexible choice to address this issue.

Consequently, the estimation of the agreement among the raters should take into account the possible multimodal distribution of the hierarchical effects. Interest might not be exclusively on the proportion of variance attributable to the hierarchical effects over the total variance (i.e., the main interpretation of the ICC); instead, it might be more appealing to explore the entire multimodal density. Since the DPM naturally accommodates clusters among hierarchical effects (i.e., among raters), it is natural to consider the extent to which the mixture components are separated. Since \(\lambda\) is based on the density approximated through the grid approach \(f({\textbf{u}})\), it reflects both the clustering induced by the Dirichlet process and the variance of the mixture components. Due to the particular information carried by \(\lambda\) it might be more informative about the latent agreement among raters than the solely ICC. The latter is very useful when the normal distributional assumption of the hierarchical effects holds. However, in the presence of multimodality the estimate of the variance of the hierarchical effect \(\sigma _u\) is not accurate (it might be over-estimated) and the related ICC might be non-informative.

In contexts in which strong beliefs about the exact number of cluster are present or it is supported by some sort of evidence, an hierarchical model with a prior finite mixture distribution over the hierarchical effects is expected to have comparably good performance as well. The parametric variance of a mixture might be take into account in the ICC formula in these cases. For the above mentioned reasons, added flexibility and the shrinkage property the DPM was here preferred.

Many other studies are needed to fully understand the performance of \(\lambda\) across different combinations of the Dirichlet process parameters. Future works might highlight the role of \(\lambda\) when the rating is either expressed on a dichotomous or on a polytomous scale. Further studies might highlight the computation of \(\lambda\) when multivariate hierarchical effects are specified. Comparisons between this index and the others widely used in these cases (Tang et al. 2022; Forchheimer et al. 2015; Zhang et al. 2003) might be a focus of future studies. Further application of \(\lambda\) in a non-parametric context might be studied (Canale and Prünster 2017).