1 Introduction

Preference-approvals are preference structures for declaring opinions over a set of alternatives. They combine decision makers’ preference orderings and classify the alternatives as either acceptable or unacceptable (see Brams 2008, Chapter 3, Brams and Sanver 2009 and Sanver 2010). Thus, in preference-approval structures, voters should declare which alternatives are acceptable and rank-order them. Additionally, voters may either rank-order unacceptable alternatives or avoid displaying their preferences about them, as in fallback voting (Brams and Sanver 2009), by showing indifference between these alternatives.

Preference-approval structures have been studied from various perspectives, with a significant focus on exploring their basic properties (Dong et al. 2021) and achieving consensus in group decision making (GDM) (Erdamar et al. 2014; Liang et al. 2018; Barokas and Sprumont 2022). In particular, Barokas (2022a) introduced a social choice rule known as majority approval and compared it to other social choice rules. Additionally, Barokas (2022b) developed an axiomatic approach to allocation rules that are mathematically equivalent to preference-approvals but different from the voting rules.

A study conducted by Kruger and Sanver (2021) investigated preference-approvals and identified that there could be issues with reconciling ranking information and approval information in the method. In fact, they demonstrated that aggregating preference-approvals by decomposing the rankings and approvals could be dictatorial, indicating that the preferences of a single individual or a small group of individuals may excessively influence the resulting decision. Furthermore, Liu et al. (2023) proposed a model for Multi-Criteria Group Decision Making problems using the Preference Approval Structure approach, considering the Partial Information of Linguistic Terms to increase consistency between the preference-approvals and multi-criteria assessments.

Nevertheless, little effort has been devoted to developing clustering algorithms that deal with preference-approvals. The clustering task consists of classifying objects into homogeneous clusters, such that objects in the same cluster are more similar to one another than to objects in other clusters (see Jain et al. 1999 and Everitt et al. 2011). To the best of our knowledge, the only proposal applying clustering algorithms to preference-approval structures is found in Albano et al. (2023). They introduced a family of distances between preference-approvals and used a simple hierarchical clustering algorithm to find homogeneous groups of individuals. However, the possibility of clustering alternatives in preference-approvals has not yet been addressed. The goal of this paper is to fill this gap by demonstrating that identifying homogeneous groups of alternatives can be beneficial in reducing the complexity of the preference-approval space and making the data easier to interpret. Indeed, developing a method for clustering alternatives based on preference-approvals has several practical implications that can benefit decision-makers in a variety of settings by enabling them to identify potential trade-offs and conflicts between different policy options. For instance, clustering alternatives could be helpful for identifying groups of politicians that are most similar in terms of voters’ preferences when evaluating candidates in an election. This information can help political campaigns determine which groups of voters to target with their message. Furthermore, clustering alternatives can be used to identify outliers, which are politicians who are not similar to others, and their presence can be interpreted as an indication of heterogeneous opinions among voters.

Another potential application of clustering algorithms for preference-approvals is in the context of online product recommendations. Online retailers often use recommendation algorithms to suggest products to their customers based on their previous purchases and browsing history. However, the complexity of the preference-approval space could be a limit for these algorithms since it can make it difficult to identify meaningful patterns and make accurate recommendations. Therefore, by applying clustering algorithms to the preference-approval data, retailers can more effectively group products based on their similarity, leading to more accurate and relevant customer recommendations.

Although the literature on clustering algorithms applied to preference orderings is rich, it is not straightforward to transfer it directly to the preference-approval framework because preference-approvals are more complex structures. Clustering approaches for preference rankings can be applied to both individuals and alternatives. Most commonly used methods for clustering individuals involve an algorithmic model, such as hierarchical clustering, or an approach that aims to optimize a badness-of-fit function, such as K-means, PCA, MDS, or fuzzy clustering. Further details on these methods can be found in Heiser and D’Ambrosio (2013, pp. 19–31).

Despite being less studied, the task of clustering alternatives rather than individuals in preference rankings is undoubtedly relevant. Marden (1996) defined a distance between two alternatives as the squared Euclidean distance of the ranks assigned to them. Thus, objects will be close if the voters give them similar ranks. Finally, he applied a simple hierarchical clustering to find meaningful groups. Sciandra et al. (2020) proposed a projection pursuit-based clustering method to simultaneously identify clusters of both individuals and alternatives in preference rankings.

Similarly to the task of clustering alternatives in preference rankings, González del Pozo et al. (2017) focused on clustering alternatives in ordered qualitative scales. They designed an agglomerative hierarchical clustering algorithm, relying on the concept of ordinal proximity measure, to cluster nine US presidential candidates. The degree of consensus is measured by the proximity of all pairs of individual appraisals over the evaluated alternatives.

In this work, we introduce a new family of pseudometrics on the set of alternatives taking into account voters’ opinions on these alternatives through preference-approvals. To obtain clusters, we apply an order-invariant partitioning algorithm, known as Ranked k-medoids (RKM), see Zadegan et al. (2013), taking the similarities as input among pairs of alternatives based on the proposed pseudometrics. Finally, clusters are represented in 2-dimensional space using non-metric multidimensional scaling. This paper is an extended version of the paper presented at the 51st Scientific Meeting of the Italian Statistical Society in June, 2022 (Albano et al. 2022).

The paper is organized as follows. Section 2 is devoted to introducing basic notation and concepts we use throughout the article. Section 3 contains our proposal for clustering alternatives. Section 4 includes some case studies in order to emphasize the advantage of reducing the complexity of the preference-approval space. Finally, Sect. 5 concludes the paper with some remarks.

2 Preliminaries

Let \(\,X=\{x_1,\dots ,x_n\}\,\) represent a finite set of alternatives, with \(\,n\ge 2\). A weak order (or complete preorder) on X is a complete and transitive binary relation on X, whereas a linear order on X is an antisymmetric weak order on X.

The sets of weak and linear orders on X are denoted by \(\,W(X)\,\) and \(\,L(X)\), respectively. Given \(\,R\in W(X)\), we denote the asymmetric and symmetric components of R by \(\,\succ \,\) and \(\,\sim \,\), respectively: \(\,x_i \succ x_j\,\) if not \(\,x_j\,R\,x_i\), and \(\,x_i \sim x_j\,\) if \(\,x_i\,R\,x_j\,\) and \(\,x_j\,R\,x_i\).

Given a set Y, with \(\,{\mathcal {P}}(Y)\,\) we denote its power set, i.e., \(\,I\in {\mathcal {P}}(Y) \,\Leftrightarrow \, I\subseteq Y\). In turn, with \(\,\# Y\,\) we denote the cardinality of Y.

2.1 Preference-approvals

Consider a scenario in which a group of voters \(\,V=\{v_1,\dots ,v_m\}\), with \(\,m\ge 2\), have to declare their preferences over a set of alternatives \(\,X=\{x_1,\dots ,x_n\}\), with \(\,n\ge 2\).

We assume that each voter ranks the alternatives in X by means of a weak order and, additionally, declares each alternative to be either acceptable or unacceptable. This splits X into A, the set of acceptable alternatives, and \(\,U=X\setminus A\), the set of unacceptable alternatives, where either A or U may be empty.

We also make the following consistency assumption: given two alternatives \(x_i\) and \(x_j\), if \(x_j\) is acceptable and \(x_i\) is ranked above \(x_j\), then \(x_i\) should be acceptable as well.

Definition 1

A preference-approval on X is a pair \(\,(R,A)\in W(X) \times {\mathcal {P}}(X)\,\) satisfying the following condition:

$$\begin{aligned} \forall x_i,x_j\in X\; \big [(x_i\,R\,x_j \text{ and } x_j\in A) \;\Rightarrow \; x_i\in A \big ]. \end{aligned}$$

With \(\,{\mathcal {R}}(X)\,\) we denote the set of preference-approvals on X.

A profile is a vector of preference-approvals \(\,\big [(R_1,A_1),\dots , (R_m,A_m)\big ] \in {\mathcal {R}}(X)^m\), where \(\,(R_k,A_k)\,\) is the preference-approval of the voter \(\,v_k\in V\).

Remark 1

If \(\,(R,A)\in {\mathcal {R}}(X)\), then the following conditions are satisfied:

  1. \(\forall x_i,x_j\in X\; \big [(x_i\in A \text{ and } x_j\in U) \;\Rightarrow \; x_i \,\succ \,x_j\big ]\).

  2. \(\forall x_i,x_j\in X\; \big [(x_i\,R\,x_j \text{ and } x_i\in U) \;\Rightarrow \; x_j\in U\big ]\).
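The condition in Definition 1 can be checked mechanically. The following is a minimal sketch, assuming a weak order is encoded as a list of indifference classes ordered from best to worst; this encoding and the function name `is_preference_approval` are our own illustrative choices and are not part of the original formulation.

```python
def is_preference_approval(tiers, approved):
    """Check the condition of Definition 1.

    tiers: indifference classes of the weak order R, best class first.
    approved: the candidate set A of acceptable alternatives.
    """
    level = {x: t for t, tier in enumerate(tiers) for x in tier}
    approved = set(approved)
    # x_i R x_j holds when x_i is in the same class as x_j or in a better one.
    return all(
        xi in approved
        for xj in approved
        for xi in level
        if level[xi] <= level[xj]
    )


tiers = [["x1"], ["x2", "x3"], ["x4"]]
print(is_preference_approval(tiers, {"x1"}))        # True
print(is_preference_approval(tiers, {"x2"}))        # False: x1 is ranked above x2
print(is_preference_approval(tiers, {"x1", "x2"}))  # False: x3 is tied with x2
```

Under this encoding, a valid set A is always a union of the top indifference classes, which is the content of Remark 1.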

Example 1

Consider the preference-approval \(\,(R,A)\in {\mathcal {R}}(\{x_1,x_2,x_3,x_4\})\,\) represented by

$$\begin{aligned} \begin{array}{c} x_1\\ \hline x_2\;\;x_3\\ x_4 \end{array} \end{aligned}$$

The alternatives above the line are acceptable, i.e., \(\,A=\{x_1\}\), and those below the line are unacceptable, i.e., \(\,U=\{x_2,x_3,x_4\}\). This means that alternatives in the upper rows are preferred to those in the lower rows, and alternatives in the same row are indifferent.

The numbers of approvals, linear orders, weak orders, and preference-approvals for \(\,n=2,3,\dots ,10\,\) alternatives are listed in Table 1. The total numbers of approvals (subsets of X) and linear orders are well known to be \(\,2^n\) and \(\,n!\), respectively, while, according to Good (1975) and Bailey (1998), there are approximately \(\,n!(\log _2\,e)^{n+1}/2\) weak orders. Finally, the last column of Table 1 shows the exact number of preference-approvals (these data come from Albano et al. 2023).
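The first three counts can be reproduced with a few lines of code. The sketch below is illustrative only: it computes the exact number of weak orders through the standard recurrence for ordered set partitions and compares it with the approximation quoted from Good (1975) and Bailey (1998). The function names are hypothetical, and the number of preference-approvals is deliberately not recomputed here.

```python
from math import comb, e, factorial, log2


def weak_orders(n):
    """Exact number of weak orders on n alternatives (ordered set partitions)."""
    counts = [1]  # one (empty) weak order on zero alternatives
    for m in range(1, n + 1):
        # Choose the k alternatives tied in first place, then order the rest.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]


def approx_weak_orders(n):
    """Approximation n! (log2 e)^(n+1) / 2."""
    return factorial(n) * log2(e) ** (n + 1) / 2


for n in range(2, 11):
    print(n, 2 ** n, factorial(n), weak_orders(n), round(approx_weak_orders(n), 1))
```

The exact values for n = 4 and n = 8 are 75 and 545,835, the totals used later in the discussion of adjacent alternatives.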

Table 1 Number of approvals, linear orders, weak orders and preference-approvals

Table 1 provides a comprehensive overview of the number of possible preferences and rankings that can be generated for a given number of alternatives. It helps to gain a better understanding of the combinatorial explosion that occurs as the number of alternatives increases. Indeed, the complexity of the preference-approval space poses a significant challenge in developing algorithms or models for preference aggregation and prediction.

2.2 A pseudometric on preferences

Positions are easily assigned to alternatives in linear orders: given \(\,R\in L(X)\), the position of each alternative \(\,x_i\in X\,\) in \(R\,\) is defined through the mapping \(\,P_R:X\longrightarrow \{1,\dots ,n\}\,\) that gives the first choice a score of 1, the second alternative a score of 2, and so on.

In weak orders, the positions of the alternatives can be assigned in a variety of ways. One of them, employed by García-Lapresta and Pérez-Román (2011) is based on Smith (1973), Black (1976), and Cook and Seiford (1982). Given \(\,R\in W(X)\), the position of \(\,x_i\in X\,\) in \(R\,\) is determined by the mapping \(\,P_R:X\longrightarrow [1,n]\,\) defined as

$$\begin{aligned} P_R(x_i)=n - \#\left\{ x_k\in X \mid x_i \succ x_k\right\} - \frac{1}{2} \cdot \#\left\{ x_k\in X \setminus \{x_i\} \mid x_i \sim x_k\right\} , \end{aligned}$$
(1)

that is, the position of \(x_i\) in R is determined by subtracting the number of alternatives to which \(x_i\) is strictly preferred (i.e., they appear after \(x_i\) in R) from n, the total number of alternatives. This value is then adjusted by subtracting half the number of alternatives that are tied with \(x_i\) (i.e., are indifferent to \(x_i\)). The resulting positions can be used to compare the rankings of alternatives across different weak orders. From Eq. (1), we introduce a pseudometric on the set of alternatives that measures the difference between the positions of two alternatives in a weak order.
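As an illustration of Eq. (1), the following sketch computes the positions of all alternatives in a weak order. It assumes the weak order is given as a list of indifference classes from best to worst, which is our own encoding; the function name `positions` is hypothetical.

```python
def positions(tiers):
    """Positions P_R of Eq. (1) for a weak order given as classes, best first."""
    n = sum(len(tier) for tier in tiers)
    pos = {}
    below = n  # alternatives not yet placed above the current class
    for tier in tiers:
        below -= len(tier)    # alternatives strictly below this class
        ties = len(tier) - 1  # alternatives tied with x_i, excluding x_i itself
        for x in tier:
            pos[x] = n - below - ties / 2
    return pos


# The weak order of Example 1: x1 first, x2 and x3 tied, x4 last.
print(positions([["x1"], ["x2", "x3"], ["x4"]]))
# {'x1': 1.0, 'x2': 2.5, 'x3': 2.5, 'x4': 4.0}
```

Tied alternatives share the average of the positions they jointly occupy, so the positions always sum to n(n+1)/2.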

Proposition 1

Given \(\,R\in W(X)\), the mapping \(\,d_P:X \times X \longrightarrow \mathbb {R}\,\) defined as

$$\begin{aligned} d_P(x_i,x_j)= \vert P_R(x_i) - P_R(x_j) \vert \end{aligned}$$
(2)

is a pseudometric on X, i.e., it satisfies the following conditions for all \(\,x_i,x_j,x_k \in X\):

  1. \(d_P (x_i,x_j) \ge 0\).

  2. \(d_P (x_i,x_i) = 0\).

  3. \(d_P (x_i,x_j) = d_P (x_j,x_i)\).

  4. \(d_P (x_i,x_j)\le d_P (x_i,x_k) + d_P (x_k,x_j)\).

Additionally, \(\,d_P (x_i,x_j) = 0 \;\Leftrightarrow \; x_i \,\sim \,x_j\,\) holds for all \(\,x_i,x_j\in X\).

Obviously, if \(\,R\in L(X)\), then \(d_P\) is a metric, i.e., \(\,d_P (x_i,x_j) =0\;\Leftrightarrow \; x_i=x_j\), for all \(\,x_i,x_j\in X\). Note that, in this case, \(\,d_P (x_i,x_j) \in \{0,1,\dots , n-1\}\,\) for all \(\,x_i,x_j\in X\).

2.3 A pseudometric on approvals

Given \(\,A \subseteq X\), the indicator function (or characteristic function) of A, \(\,I_A:X\longrightarrow \{0,1\}\), is defined as

$$\begin{aligned} I_A(x_i)=\left\{ \begin{array}{ll} 1, \text { if }x_i \in A,\\ 0, \text { if }x_i \in X\setminus A. \end{array} \right. \end{aligned}$$
(3)

From Eq. (3), we now introduce a pseudometric on the set of alternatives that measures the difference between the membership of two alternatives in a set.

Proposition 2

Given \(\,A \subseteq X\), the mapping \(\,d_A:X \times X \longrightarrow \mathbb {R}\,\) defined as

$$\begin{aligned} d_A(x_i,x_j)= \vert I_A(x_i) - I_A(x_j) \vert \end{aligned}$$
(4)

is a pseudometric on X, i.e., it satisfies the following conditions for all \(\,x_i,x_j,x_k \in X\):

  1. \(d_A (x_i,x_j) \ge 0\).

  2. \(d_A (x_i,x_i) = 0\).

  3. \(d_A (x_i,x_j) = d_A (x_j,x_i)\).

  4. \(d_A (x_i,x_j)\le d_A (x_i,x_k) + d_A (x_k,x_j)\).

Additionally, \(\,d_A (x_i,x_j) = 0 \;\Leftrightarrow \; \big [x_i,x_j\in A \; \text{ or } \; x_i,x_j\notin A\big ]\,\) holds for all \(\,x_i,x_j\in X\).

Note that \(\,d_A (x_i,x_j) \in \{0,1\}\,\) for all \(\,x_i,x_j\in X\).

Remark 2

Every preference-approval \(\,(R,A)\in {\mathcal {R}}(X)\,\) can be codified in terms of \(\,P_R(x_i)\,\) (Eq. 1) and \(\,I_A(x_i)\,\) (Eq. 3) as follows:

$$ \begin{aligned} \big [P_R(x_1),P_R(x_2),\dots ,P_R(x_n)\big ]\, \& \,\big [I_A(x_1),I_A(x_2),\dots ,I_A(x_n)\big ]. \end{aligned}$$
(5)

For instance, in Example 1, \(\,(R,A)\,\) is codified as \( \,(1,\,2.5,\,2.5,\,4)\, \& \,(1,0,0,0)\).

3 The proposal

Given a profile \(\,\big [(R_1,A_1),\dots , (R_m,A_m)\big ] \in {\mathcal {R}}(X)^m\,\) and two alternatives \(\,x_i,x_j\in X\), we now present two indices that quantify the distance between these items in terms of preference and approvals, respectively, for each voter \(\,v_k \in V\). They are based on the pseudometrics introduced in Eqs. (2) and (4).

3.1 Preference discordances

The preference-discordance between \(x_i\) and \(x_j\) for the voter \(\,v_k \in V\,\) is defined as

$$\begin{aligned} p_{ij}^{k} = \frac{1}{n-1} \cdot \vert P_{R_k}(x_i)-P_{R_k}(x_j) \vert . \end{aligned}$$
(6)

Note that \(\,p_{ij}^{k} \in [0,1]\).

Remark 3

Note that if a voter expresses a linear order \(R \in L(X)\), then: (i) there will not be any pair of different alternatives whose preference-discordance is 0 and (ii) there will be only one pair of alternatives whose preference-discordance is maximum, equal to 1:

$$\begin{aligned} R \in L(X) \;\Rightarrow \; {\left\{ \begin{array}{ll} p_{ij}^k\ne 0 \; \text{ for } \text{ all } \; x_i,x_j \in X,\; x_i \ne x_j,\\ \exists !\; x_i,x_j \in X \quad p_{ij}^k=1. \end{array}\right. } \end{aligned}$$

On the contrary, if a voter expresses a weak order that is not a linear order, \(\,R' \in \big [ W(X)\setminus L(X)\big ]\), so that indifference between different alternatives occurs, then: (i) there exists at least one pair of different alternatives whose preference-discordance is 0; (ii) a preference-discordance equal to 1 is no longer guaranteed, since it arises only between an alternative that is alone in the first position and an alternative that is alone in the last position (as happens with \(x_1\) and \(x_4\) in Example 1), and it cannot arise when the indifference involves the top or the bottom alternatives:

$$\begin{aligned} R' \in \big [ W(X)\setminus L(X)\big ] \;\Rightarrow \; {\left\{ \begin{array}{ll} \exists \;x_i,x_j\in X,\; x_i \ne x_j, \quad p_{ij}^k=0,\\ p_{ij}^k = 1 \;\Leftrightarrow \; \{P_{R'}(x_i),P_{R'}(x_j)\}=\{1,n\}. \end{array}\right. } \end{aligned}$$

Remark 4

Note that, for a fixed difference in positions, \(p^k_{ij}\,\) decreases as the total number of alternatives, n, increases. Figure 1 plots the preference-discordance \(\,p_{ij}^k\,\) as a function of n, where \(x_i,x_j\) are two adjacent alternatives for the k-th voter.

Fig. 1 Preference-discordance of two adjacent alternatives by n

The total number of different alternatives, n, determines the expressivity of voters. Two alternatives \(\,x_i, x_j\in X\,\) that are adjacent are considered more similar in a large order than in a small one. For example, consider the universe of weak orders for \(\,n=4\): \(\,x_i\,\) and \(\,x_j\,\) are adjacent in 40 out of the 75 possible scenarios, about \(53\%\). On the contrary, when the number of alternatives doubles, \(n=8\), the number of weak orders in which \(\,x_i\,\) and \(\,x_j\,\) are adjacent drops to 170,440 out of 545,835, approximately \(31\%\). As n increases, the percentage of scenarios in which \(\,x_i\,\) and \(\,x_j\,\) are adjacent decreases, and so does the average distance between them.
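The counts for \(n=4\) can be reproduced by brute force. The sketch below is illustrative and rests on an assumption of ours: two alternatives are taken to be adjacent when they lie in consecutive indifference classes (alternatives tied in the same class are not counted). This reading yields the 40 out of 75 reported above; the function name is hypothetical.

```python
from itertools import product


def adjacency_share(n):
    """Return (adjacent, total): the number of weak orders on n alternatives in
    which the first two alternatives lie in consecutive indifference classes,
    and the total number of weak orders."""
    total = adjacent = 0
    for levels in product(range(n), repeat=n):
        k = max(levels) + 1
        if set(levels) != set(range(k)):
            continue  # class labels must be 0..k-1 without gaps
        total += 1
        adjacent += abs(levels[0] - levels[1]) == 1
    return adjacent, total


print(adjacency_share(4))  # (40, 75)
```

The enumeration visits \(n^n\) label assignments, so it is only practical for small n; larger cases such as \(n=8\) run slowly in pure Python and would benefit from a combinatorial count.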

Finally, the average preference-discordance, \(\,\bar{p}_{ij}\), summarizes the average dissimilarity between two alternatives according to the whole set of voters:

$$\begin{aligned} \bar{p}_{ij}=\frac{1}{m}\sum _{k=1}^m p_{ij}^k. \end{aligned}$$
(7)
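Equations (6) and (7) translate directly into code. The following sketch assumes that each voter's weak order is already codified as a vector of positions, as in Eq. (5); the two-voter profile is hypothetical and serves only as an illustration.

```python
def preference_discordance(pos, i, j):
    """Eq. (6): normalized absolute difference between two positions."""
    n = len(pos)
    return abs(pos[i] - pos[j]) / (n - 1)


def mean_preference_discordance(profile_positions, i, j):
    """Eq. (7): average of the preference-discordances over the voters."""
    values = [preference_discordance(pos, i, j) for pos in profile_positions]
    return sum(values) / len(values)


# Hypothetical profile: voter 1 as in Example 1, voter 2 with the reverse linear order.
profile_positions = [[1, 2.5, 2.5, 4], [4, 3, 2, 1]]
print(mean_preference_discordance(profile_positions, 0, 3))  # 1.0
print(mean_preference_discordance(profile_positions, 0, 1))  # (1.5/3 + 1/3) / 2
```

Indices are zero-based in the code, so the pair (0, 3) corresponds to \(x_1\) and \(x_4\).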

3.2 Approval discordances

The approval-discordance between \(x_i\) and \(x_j\) for the voter \(\,v_k\in V\,\) is defined as

$$\begin{aligned} a_{ij}^{k} = \vert I_{A_k}(x_i) - I_{A_k}(x_j) \vert , \end{aligned}$$
(8)

where \(\,a_{ij}^{k} \in \{0,1\}\).

Unlike \(\,p_{ij}^k\), the approval-discordance is not influenced by the number of alternatives whose acceptability is established. Considering all \(\,2^n\) possible approvals of n alternatives, the share of approval vectors in which \(\,x_i\,\) and \(\,x_j\,\) receive the same rating is constant, namely one half, for every \(\,n\ge 2\).

Finally, the average approval-discordance, \(\,\bar{a}_{ij}\), summarizes the average dissimilarity between two alternatives according to the whole set of approvals:

$$\begin{aligned} \bar{a}_{ij}=\frac{1}{m}\sum _{k=1}^m a_{ij}^k. \end{aligned}$$
(9)
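A minimal sketch of Eqs. (8) and (9) follows, together with a brute-force check of the claim that the share of approval vectors giving \(x_i\) and \(x_j\) the same rating does not depend on n. The approval vectors used are hypothetical, as are the function names.

```python
from itertools import product


def approval_discordance(approvals, i, j):
    """Eq. (8): 1 if exactly one of the two alternatives is approved, else 0."""
    return abs(approvals[i] - approvals[j])


def mean_approval_discordance(profile_approvals, i, j):
    """Eq. (9): average of the approval-discordances over the voters."""
    values = [approval_discordance(a, i, j) for a in profile_approvals]
    return sum(values) / len(values)


def same_rating_share(n):
    """Share of all 0/1 approval vectors of length n rating the first two alike."""
    vectors = list(product((0, 1), repeat=n))
    return sum(v[0] == v[1] for v in vectors) / len(vectors)


profile_approvals = [[1, 0, 0, 0], [0, 0, 1, 1]]  # hypothetical two-voter profile
print(mean_approval_discordance(profile_approvals, 0, 3))  # 1.0
print(mean_approval_discordance(profile_approvals, 0, 1))  # 0.5
print([same_rating_share(n) for n in range(2, 7)])         # 0.5 for every n
```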

3.3 Global discordances

In order to define an overall measure of discordance between each pair of alternatives, we consider the family of weighted means, \(\,h:[0,1] \times [0,1] \longrightarrow [0,1]\), defined as

$$\begin{aligned} h(x,y) = \lambda \cdot x + (1-\lambda )\cdot y, \end{aligned}$$
(10)

where \(\,\lambda \in [0,1]\,\).

Taking into account the preference and approval discordances introduced in Eqs. (6), (7), (8) and (9), and the family of weighted means defined in Eq. (10), we now introduce a global measure of discordance between pairs of alternatives.

Definition 2

Given a profile \(\, \big [(R_1,A_1),\dots , (R_m,A_m)\big ] \in {\mathcal {R}}(X)^m\,\) and \(\,\lambda \in [0,1]\), the mapping \(\,\delta _{\lambda }:X \times X \longrightarrow [0,1]\,\) is defined as

$$\begin{aligned} \delta _{\lambda }(x_i,x_j) = \frac{1}{m} \cdot \sum _{k=1}^m \left( \lambda \cdot p_{ij}^k + (1-\lambda )\cdot a_{ij}^k\right) = \lambda \cdot \bar{p}_{ij} + (1-\lambda ) \cdot \bar{a}_{ij}. \end{aligned}$$
(11)

Proposition 3

Given a profile \(\,\big [(R_1,A_1),\dots , (R_m,A_m)\big ] \in {\mathcal {R}}(X)^m\), the mapping \(\,\delta _{\lambda }\,\) is a pseudometric on X for every \(\,\lambda \in [0,1]\). We say that \(\,\delta _{\lambda }\,\) is the pseudometric associated with \(\,\lambda \).

Proof

Taking into account Propositions 1 and 2, it is obvious that \(\,\delta _{\lambda }\,\) satisfies the following conditions for all \(\,x_i,x_j \in X\): \(\,\delta _{\lambda }(x_i,x_j) \ge 0\), \(\,\delta _{\lambda } (x_i,x_i) = 0\,\) and \(\,\delta _{\lambda } (x_i,x_j) = \delta _{\lambda } (x_j,x_i)\). Finally, \(\,\delta _{\lambda }\,\) satisfies the triangle inequality being a convex combination of pseudometrics. \(\square \)
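The construction of Definition 2 and the properties stated in Proposition 3 can be illustrated numerically. The sketch below assumes that each preference-approval is codified as a pair of vectors (positions, approvals), as in Eq. (5); the profile is hypothetical, and the check is a numerical illustration on one example, not a substitute for the proof.

```python
from itertools import product


def delta_matrix(profile, lam):
    """Eq. (11): matrix of global discordances for a codified profile."""
    m = len(profile)
    n = len(profile[0][0])
    d = [[0.0] * n for _ in range(n)]
    for pos, appr in profile:
        for i, j in product(range(n), repeat=2):
            p = abs(pos[i] - pos[j]) / (n - 1)  # preference-discordance, Eq. (6)
            a = abs(appr[i] - appr[j])          # approval-discordance, Eq. (8)
            d[i][j] += (lam * p + (1 - lam) * a) / m
    return d


def is_pseudometric(d, tol=1e-12):
    """Check zero diagonal, non-negativity, symmetry and the triangle inequality."""
    idx = range(len(d))
    return (
        all(d[i][i] == 0 for i in idx)
        and all(d[i][j] >= 0 and abs(d[i][j] - d[j][i]) <= tol for i in idx for j in idx)
        and all(d[i][j] <= d[i][k] + d[k][j] + tol for i in idx for j in idx for k in idx)
    )


# Hypothetical two-voter profile on four alternatives.
profile = [([1, 2.5, 2.5, 4], [1, 0, 0, 0]), ([4, 3, 2, 1], [0, 0, 1, 1])]
for lam in (0.1, 0.5, 0.9):
    d = delta_matrix(profile, lam)
    print(lam, is_pseudometric(d), round(d[0][3], 3))
```

In this profile \(x_1\) and \(x_4\) are at opposite extremes for both voters and are rated differently by both, so their distance equals 1 for every \(\lambda\).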

Figure 2 shows how \(\,\delta _{\lambda }\,\) varies as a function of \(\,\bar{p}_{ij}\,\) and \(\,\bar{a}_{ij}\,\) for \(\,\lambda =0.1,\, 0.5,\, 0.9\).

Fig. 2 Heatmaps of \(\delta _{\lambda }\)

In Fig. 2b, \(\,\lambda \,\) is set to 0.5. Thus, \(\,\bar{p}_{ij}\,\) and \(\,\bar{a}_{ij}\,\) have the same weight in determining the final distance \(\,\delta _{\lambda }(x_i,x_j)\). As a result, the corresponding heatmap is symmetrical with respect to the secondary diagonal, and \(\delta _{\lambda }\) increases diagonally from bottom to top and from left to right.

On the contrary, \(\,\lambda =0.1\,\) (Fig. 2a) and \(\,\lambda =0.9\,\) (Fig. 2c) correspond to two unbalanced settings. Giving much more importance to approvals (\(\lambda =0.1\), Fig. 2a) causes the bottom area of the graph to contain the lower distances, and \(\,\delta _{\lambda }\,\) grows much more noticeably vertically than horizontally. Finally, in Fig. 2c, \(\,\delta _{\lambda }\,\) is dominated by the preference-discordance: the smaller distances are found on the left side of the graph, and \(\,\delta _{\lambda }\,\) grows significantly more horizontally than vertically.

The choice of \(\lambda \) as a weighting parameter in metrics for preference-approvals has been the subject of debate in other scientific articles (Erdamar et al. 2014; Dong et al. 2021; Albano et al. 2023). No single value of \(\lambda \) is the best choice for every preference-approval problem. Generally, when the relative importance of the two types of information is unknown, a recommended value of \(\lambda \) is 0.5, which assigns equal importance to both the ranking and approval components.

3.4 Clustering procedure and visualization

In this paper, we use the Ranked k-medoids (RKM) algorithm (see Zadegan et al. 2013) to find clusters, but we highlight that our pseudometrics can be used jointly with any distance-based clustering algorithm.

The RKM technique employs a function that assigns a rank to alternatives based on how similar they are to each other, with more similar alternatives receiving a lower rank. In other words, \(\,{{\,\textrm{rank}\,}}(x_i, x_j) = l\,\) indicates that \(x_j\) is the l-th most similar alternative to \(x_i\) among the n alternatives in the dataset. The ranks of the alternatives in relation to a given item \(x_i\) are obtained by sorting the similarity values between \(x_i\) and the other items in the dataset. The rank function is collected in a rank matrix \(\,K=[k_{ij}]\), where \(\,{{\,\textrm{rank}\,}}(x_i,x_j)=k_{ij}\,\) for all \(\,x_i,x_j\in X\).

Note that K is not always symmetric, since the rank of \(x_j\) with respect to \(x_i\) rarely coincides with the rank of \(x_i\) with respect to \(x_j\). Thus, K is an \(n \times n\) matrix that shows the hostility relationship among the alternatives in the dataset.

The hostility value (hv) of a particular alternative, \(x_i\), within a collection of alternatives, G, is introduced in order to identify the medoids. The hostility value, \({{\,\textrm{hv}\,}}_i\), of \(x_i\) within the set G is defined as

$$\begin{aligned} {{\,\textrm{hv}\,}}_i=\sum _{x_j \in G}k_{ij}. \end{aligned}$$
(12)
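The rank matrix and the hostility value can be sketched as follows. This is an illustration and not the reference implementation of Zadegan et al. (2013): the sum in Eq. (12) is read as running over the alternatives \(x_j\) in G, each alternative is assumed to receive rank 1 with respect to itself, and ties are broken by index. The dissimilarity matrix is hypothetical.

```python
def rank_matrix(d):
    """K[i][j] = l means that x_j is the l-th most similar alternative to x_i."""
    n = len(d)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort by dissimilarity to x_i; x_i itself comes first (assumed convention).
        order = sorted(range(n), key=lambda j: (j != i, d[i][j], j))
        for rank, j in enumerate(order, start=1):
            K[i][j] = rank
    return K


def hostility(K, i, group):
    """Eq. (12): sum of the ranks k_ij over the alternatives x_j in the group G."""
    return sum(K[i][j] for j in group)


# Hypothetical dissimilarity matrix on four alternatives.
d = [
    [0.0, 0.2, 0.3, 0.9],
    [0.2, 0.0, 0.1, 0.8],
    [0.3, 0.1, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
K = rank_matrix(d)
for row in K:
    print(row)
print(hostility(K, 3, [0, 1, 2]))  # 9
```

In this example \(k_{12}=2\) while \(k_{21}=3\), which shows why K need not be symmetric even though the dissimilarity matrix is.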

Starting from the similarities among pairs of objects based on \(\,\delta _{\lambda }(x_i,x_j)\), the RKM algorithm first computes the matrix K and selects the medoids randomly. Then, for each medoid, it selects the group of objects most similar to that medoid, using the sorted index matrix, and calculates the hostility value of every object in those groups using Eq. (12). Afterwards, it selects the object with the highest hostility value as the new medoid and moves one of the medoids placed in the same group. Finally, it iterates the process and assigns each object to the most similar medoid.

This algorithm requires the number of clusters to be specified in advance. However, some methods, such as the Silhouette Coefficient, can be used to estimate the optimal number of clusters in the data.
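As an illustration of how the Silhouette Coefficient can be evaluated directly on a dissimilarity matrix, the following sketch computes the average silhouette width of a given partition. It is a plain implementation of the usual definition; the zero score for singleton clusters is a common convention that we assume here, and the matrix and labels are hypothetical.

```python
def silhouette(d, labels):
    """Average silhouette width of a partition (at least two clusters required)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean dissimilarity to the own cluster; b: to the nearest other cluster.
        a = sum(d[i][j] for j in own) / len(own)
        b = min(
            sum(d[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical dissimilarities with two clearly separated pairs of alternatives.
d = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
print(silhouette(d, [0, 0, 1, 1]))  # close to 1: well-separated clusters
print(silhouette(d, [0, 1, 0, 1]))  # negative: a poor partition
```

Running the clustering for several candidate numbers of clusters and keeping the partition with the largest average width is one way to apply the criterion.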

The RKM method is particularly suitable in our case since it analyzes a ranking of dissimilarities, which makes the results order-invariant, meaning that data transformations that preserve the original order of the data have no impact on clusters.

In order to represent the resulting clusters in a 2-dimensional space, multidimensional scaling (MDS) is employed. This class of methods attempts to represent an observed proximity or distance matrix by a simple geometrical model or map, so that the greater the perceived distance between two alternatives, the further apart the points representing them are in the final geometrical model.

Such models estimate q-dimensional coordinate values to represent the n alternatives of a distance matrix. They optimize a chosen goodness-of-fit index, which measures how closely the predicted distances approximate the observed ones. Different optimization strategies, combined with different goodness-of-fit indices, result in a variety of MDS algorithms (Hothorn and Everitt 2006).

In this paper, given the nature of the objects, the Non-metric Multidimensional Scaling is employed. This method constructs fitted distances in the same rank order as the original distance, thus preserving the rank order of the proximities. Algorithms for accomplishing this are described in Kruskal (1964). The required coordinates for a given set of disparities are found by minimizing a function called Stress based on the squared differences between the observed proximities and the derived disparities. The process iterates until a suitably chosen convergence condition is satisfied.
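For reference, a widely used form of this criterion is Kruskal's Stress-1, recalled here in notation of our own, which is not taken from the text above: \(\,d_{ij}\,\) denotes the distance between the points representing \(x_i\) and \(x_j\) in the q-dimensional configuration, and \(\,\hat{d}_{ij}\,\) denotes the corresponding disparity, obtained through a monotone regression on the observed dissimilarities \(\,\delta _{\lambda }(x_i,x_j)\).

$$\begin{aligned} \text{Stress} = \sqrt{\frac{\sum _{i<j} \big (d_{ij}-\hat{d}_{ij}\big )^2}{\sum _{i<j} d_{ij}^2}}. \end{aligned}$$

A value of 0 corresponds to a configuration whose distances reproduce the rank order of the observed dissimilarities exactly, and larger values indicate a poorer fit.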

4 Case studies

This section shows how the proposed pseudometrics can be used to perform cluster analysis on real data.

4.1 Eurobarometer dataset

Eurobarometer is a collection of cross-country public opinion surveys conducted on behalf of the European Commission and other European Union (EU) institutions since 1973. These surveys address a variety of topics pertaining to the EU and its member countries. Specifically, the data used in this paper come from survey question QA7, “Public opinion in the European Union” (see Footnote 1). Voters, grouped by country, were asked to indicate which of the items listed in Table 2 best describe what the EU means to them.

Table 2 Values in the EU

As a result, data are stored in Table 11, which has 15 columns (each indicating an object of \(\,X=\{x_1, \dots , x_{15}\}\)) and 27 rows (one row for each EU member country). The table’s generic cell \(\,ij\) displays the total number of votes that the i-th country gives in favor of the j-th alternative.

The original table is transformed into a set of preference-approvals to perform the analysis. Following Albano et al. (2023), the alternatives are ranked in order of popularity, from the most to the least voted, and approvals are derived by approving the alternatives that obtained a higher number of votes than the national average. For instance, Table 3 displays the votes cast in Italy (Table 11 contains the votes cast in all countries).

Table 3 Votes in Italy

Following Eq. (5) and considering that the average vote in Italy is 18.53, the votes in Italy are converted into a preference-approval as:

$$ \begin{aligned} (4,\, 9.5,\, 6,\, 13,\, 1,\, 8,\, 3,\, 2,\, 9.5,\, 11.5,\, 11.5,\, 14,\, 15,\, 7,\, 5) \, \& \, (1,0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1 ) \end{aligned}$$

that can be visualized as follows

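$$\begin{aligned} \begin{array}{c} x_5\\ x_8\\ x_7\\ x_1\\ x_{15}\\ x_3\\ \hline x_{14}\\ x_6\\ x_2\;\;x_9\\ x_{10}\;\;x_{11}\\ x_4\\ x_{12}\\ x_{13} \end{array} \end{aligned}$$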
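The conversion from vote counts to a codified preference-approval can be sketched as follows. The vote vector is hypothetical (the Italian votes of Table 3 are not reproduced here), and the function name is our own; positions follow Eq. (1), and an alternative is approved when its votes are strictly above the average.

```python
def to_preference_approval(votes):
    """Codify a vector of votes as (positions, approvals), following Eq. (5)."""
    n = len(votes)
    mean = sum(votes) / n
    positions = []
    for v in votes:
        worse = sum(w < v for w in votes)      # alternatives with strictly fewer votes
        tied = sum(w == v for w in votes) - 1  # other alternatives with the same votes
        positions.append(n - worse - tied / 2)  # Eq. (1)
    approvals = [int(v > mean) for v in votes]
    return positions, approvals


votes = [30, 10, 25, 10, 5]  # hypothetical votes for five alternatives (mean = 16)
print(to_preference_approval(votes))
# ([1.0, 3.5, 2.0, 3.5, 5.0], [1, 0, 1, 0, 0])
```

Because approvals are assigned to the alternatives with the most votes, the resulting pair always satisfies the consistency condition of Definition 1.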

In Fig. 3, the 15 alternatives are arranged on the preference-approval plane. The location of each alternative in this 2-dimensional space is identified by its ExpectedRank (i.e., the average rank over the whole set of voters) and by its RelativeApproval, i.e., the relative frequency of voters who considered it acceptable.

Fig. 3 Preference-approval plane, Eurobarometer

The preference-approval plane provides a summary of the voters’ average evaluations. In particular, it reveals that all voters consider “Freedom of movement” the best alternative: it is unanimously approved and always placed first in the preference-approvals, so both its RelativeApproval and its ExpectedRank are equal to 1. The other alternatives tend to lie along a straight line with a negative slope. The further an alternative lies from the point (1, 1), the worse its average ratings.

Note that the preference-approval plane aids the interpretation of clusters once they have been estimated. However, it should not be considered a tool to identify clusters since the distance between points in the preference-approval plane does not necessarily reflect the pseudometric in Eq. (11). Alternatives having similar average ranking positions and approvals may show discordance among the voters.

Example 2

To further clarify this concept, let us consider \(\,(R_1,A_1), \, (R_2,A_2) \in {\mathcal {R}}(\{x_1,x_2,x_3,x_4\})\,\) the following preference-approvals:

figure c

For each alternative \(\,x_i \in X\), the ExpectedRank and the RelativeApproval are reported in Table 4, and the \(\,\delta _{0.5}\,\) distance matrix is reported in Table 5.

Table 4 ExpectedRank and RelativeApproval
Table 5 Distances \(\,\delta _{0.5}\)

Note that \(x_1\) and \(x_4\) have the same RelativeApproval and ExpectedRank, thus identical coordinates in the preference-approval plane, but show maximum discordance over the voters, i.e., \(\,\delta _{0.5}(x_1,x_4)=1\). In fact, they are placed at the opposite extremes in both preference-approvals. Therefore, the preference-approval plane is intended to be an interpretative tool to visualize average judgments and interpret clusters once they have been estimated. At the same time, it is not appropriate to identify clusters since it does not reflect similarities among elements.
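The same phenomenon can be reproduced with a small computation. The profile below is hypothetical and is not necessarily the one of Example 2: it is simply a two-voter profile in which \(x_1\) and \(x_4\) swap the first and last positions and are approved by one voter each. Each preference-approval is codified as in Eq. (5).

```python
def plane_coordinates(profile):
    """ExpectedRank and RelativeApproval of every alternative."""
    m = len(profile)
    n = len(profile[0][0])
    expected_rank = [sum(pos[i] for pos, _ in profile) / m for i in range(n)]
    relative_approval = [sum(appr[i] for _, appr in profile) / m for i in range(n)]
    return expected_rank, relative_approval


def delta(profile, lam, i, j):
    """Eq. (11) for a single pair of alternatives."""
    m = len(profile)
    n = len(profile[0][0])
    total = 0.0
    for pos, appr in profile:
        p = abs(pos[i] - pos[j]) / (n - 1)
        a = abs(appr[i] - appr[j])
        total += lam * p + (1 - lam) * a
    return total / m


# Hypothetical profile: x1 and x4 are at opposite extremes for both voters.
profile = [([1, 2, 3, 4], [1, 0, 0, 0]), ([4, 2, 3, 1], [0, 0, 0, 1])]
expected_rank, relative_approval = plane_coordinates(profile)
print(expected_rank)              # [2.5, 2.0, 3.0, 2.5]
print(relative_approval)          # [0.5, 0.0, 0.0, 0.5]
print(delta(profile, 0.5, 0, 3))  # 1.0
```

The first and last alternatives share the same coordinates in the preference-approval plane, yet their global discordance is maximal.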

Figure 4 shows the clusters estimated by the RKM algorithm, where the medoid of each cluster is highlighted by a larger point. We investigate the effect of the \(\lambda \) parameter on the output by setting \(\,\lambda =0.1, \,0.5,\, 0.9\). In this way, we are able to study three scenarios: \(\lambda =0.5\), which corresponds to giving the same importance to approvals and preferences, and \(\,\lambda =0.1,\, 0.9\), which correspond to the two opposite unbalanced situations.

We also show the Stress values in each scenario to assess the goodness of the graphical representation obtained with the MDS. Note that the position of the points in the new space found by MDS depends on the value of \(\lambda \). If the parameter \(\lambda \) varies, the graphical representation does as well.

Fig. 4 Graphical representation of RKM clusters

In general, the Stress coefficient varies between 6.88 and 6.43, showing a good fit that tends to improve slightly as \(\lambda \) increases. The optimal number of clusters, chosen through the Silhouette criterion, is two regardless of the value of \(\lambda \).

When \(\,\lambda =0.1,\, 0.5\), the clusters found are the same, but the degree of separation between them clearly changes. In fact, the two clusters exhibit a higher separation index (see Footnote 2) under \(\,\lambda =0.1\), i.e., when much more weight is assigned to the approvals, than under \(\,\lambda =0.5\). Thus, a clear division is obtained between frequently accepted and not accepted alternatives. The two clusters become closer as \(\lambda \) reaches 0.5. In this example, there are clearly two different types of alternatives: those referring to negative aspects (“Bureaucracy”, “Unemployment”, “Money waste”, etc.) and those referring to positive aspects (“Freedom”, “Democracy”, etc.). For this reason, a voter with a bad opinion about the EU will prefer the former and vice versa. Indeed, the two clusters are robust and remain unchanged for small and moderate values of \(\lambda \). In this sense, clusters also provide a measure of the consistency of voters’ judgments when alternatives can be divided into natural groups: when \(\,\lambda =0.1,\, 0.5\), the identified clusters separate the options related to negative attributes from those related to positive qualities. If the clusters were a mixture of good and bad options, it would imply low consistency among the voters.

Note that the proximity between points in the two-dimensional space found by the MDS (Fig. 4a–c) reflects the similarities between the alternatives across voters, as measured by \(\delta _{\lambda }\). Thus, the position of the elements in this new space guides the interpretation of the clusters.

Indeed, although “Money waste” is closer to the alternatives belonging to Cluster 1 in the preference-approval plane (see Fig. 3), its position in the MDS space reveals that it is actually part of Cluster 2.

Figure 4c displays the clusters under \(\lambda =0.9\), i.e., with the weight unbalanced towards preferences. In this case, Cluster 1 isolates the three alternatives most frequently placed in the first positions (see the preference-approval plane in Fig. 3), namely “Freedom”, “Peace” and “Euro”.

To better understand how these clusters can be used, consider a policymaker seeking to design a campaign to improve the EU’s public image. By analyzing the clusters and the responses within each cluster, the policymaker can tailor the campaign message to better resonate with the target audience. For instance, the campaign could emphasize the positive aspects of the EU, such as “Freedom of movement”, “Peace” and “Democracy”, to appeal to those with a positive view of the EU. Conversely, the campaign could address negative aspects, such as “Bureaucracy”, “Money waste” and “Loss of our cultural identity”, to appeal to those with a negative view of the EU. Moreover, the clusters can inform policymakers and political parties about the values and concerns that are most important to voters in different countries, which allows a geographical analysis of EU preferences.

Table 6 displays the average ranking positions and the average approval ratings given by each country to the alternatives in the two clusters (identified using \(\lambda =0.5\)). As an example, consider Belgium’s rankings and approval ratings for Cluster 1 and Cluster 2. In Cluster 1, the alternatives \(\{ x_1,x_2,x_3,x_5,x_6,x_7,x_8,x_{15}\}\) received the ranking positions (3, 8, 7, 1, 6, 4, 2, 9) and the approval ratings (1, 0, 0, 1, 1, 1, 1, 0). Therefore, the average ranking for Cluster 1 alternatives is 5 and the average approval rating is 0.62. In Cluster 2, the alternatives \(\{x_{4},x_{9}, x_{10},x_{11}, x_{12},x_{13}, x_{14} \}\) received the ranking positions (12, 15, 11, 5, 13, 14, 10) and the approval ratings (0, 0, 0, 1, 0, 0, 0). The average ranking for Cluster 2 is 11.43, while the average approval rating is only 0.14. This indicates that Belgium prefers the alternatives in Cluster 1.
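The Belgian figures can be checked with a few lines of code. The sketch below simply averages the ranking positions and approval ratings quoted in the text; the variable and function names are illustrative.

```python
from statistics import mean

# Belgium's ranking positions and approvals, copied from the text above.
cluster1_ranks = [3, 8, 7, 1, 6, 4, 2, 9]
cluster1_approvals = [1, 0, 0, 1, 1, 1, 1, 0]
cluster2_ranks = [12, 15, 11, 5, 13, 14, 10]
cluster2_approvals = [0, 0, 0, 1, 0, 0, 0]


def cluster_summary(ranks, approvals):
    """Return (average ranking position, average approval rating)."""
    return mean(ranks), mean(approvals)


c1_rank, c1_approval = cluster_summary(cluster1_ranks, cluster1_approvals)
c2_rank, c2_approval = cluster_summary(cluster2_ranks, cluster2_approvals)

print(f"Cluster 1: rank {c1_rank:.2f}, approval {c1_approval:.3f}")
print(f"Cluster 2: rank {c2_rank:.2f}, approval {c2_approval:.3f}")
```

A lower average ranking position together with a higher average approval rating indicates that the country evaluates that cluster more favourably.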

Table 6 Average ranking position and approval rating for items in Cluster 1 and Cluster 2 for each country

Table 6 shows that Cluster 1 exhibited overall better ranking positions and higher approval ratings compared to Cluster 2, indicating more positive evaluations of its alternatives. However, it should be noted that some countries, such as Austria and Slovakia, assign better ranking positions and higher approval than the other countries to Cluster 2 items, which could indicate that negative aspects of the EU, such as “No border control” and “Money waste”, may be of particular concern to people in these countries. Thus, they may require more targeted and nuanced messaging that addresses specific concerns or criticisms that they have about the EU. Understanding these specific concerns can help policymakers craft messages that resonate with these groups and ultimately improve their overall perception of the EU. In this sense, these two countries can be considered outliers compared to the rest of the EU countries.

On the other hand, countries such as Croatia, Ireland, Portugal, and Slovenia assign particularly high approval ratings and good ranking positions to alternatives in Cluster 1. This could indicate that positive aspects of the EU, such as “Freedom of movement” and “Peace”, may resonate exceptionally well with the people of these countries.

Furthermore, the clusters can identify potential areas of disagreement or conflict among EU member countries. Policymakers can address these differences and find common ground by understanding the values and concerns that are most important to voters in different countries. For example, countries with significantly different ratings between clusters, such as Austria, which gave the highest rating in Cluster 2, and Slovakia, which gave the highest rating in Cluster 1, may have opposing ideologies and concerns that could lead to conflicts. In contrast, countries with comparable ratings across clusters, like Finland and Sweden, might have more common values and issues, making cooperation easier.

Overall, the clusters identified through the Eurobarometer dataset have practical applications for policymakers, political parties, and anyone interested in understanding public opinion within the EU.

4.2 Pew Research Center dataset

The Pew Research Center is a research institute that specializes in data-driven social science research, including public opinion surveys, demographic studies, content analysis, and more.

In this analysis, we consider the survey “American Trends Panel Wave 33” (Footnote 3). The data are drawn from the panel wave conducted from March 27 to April 9, 2018, which collected the opinions of United States citizens regarding the space agency NASA.

We focus specifically on a question in which a total of \(2\,541\) respondents were asked to assess how much priority NASA should give to nine lines of action, listed in Table 7. Respondents expressed their assessments using the linguistic terms of the qualitative scale in Table 8.

Table 7 Lines of action
Table 8 Linguistic terms

In order to remove neutral answers, respondents giving at least one “No answer” response were excluded, i.e., about \(3\%\) of the total sample. Furthermore, for each respondent, the alternatives were arranged into a preference-approval. The two linguistic terms \(l_1\) and \(l_2\) were taken to indicate an acceptable alternative. An example is provided in Table 9.

Table 9 Pew Research Center example

The preference-approval of respondent \(v_{10}\) (see Eq. 5) is

$$ \begin{aligned} (7.5,\, 3.5,\, 1.5,\, 1.5,\, 7.5,\, 5,\, 7.5,\, 7.5,\, 3.5) \, \& \, ( 0, 1, 1, 1, 0, 0, 0, 0, 1) \end{aligned}$$

that can be visualized as follows

figure d

Here, approvals are generated directly by the voters through linguistic terms, so each voter can declare all alternatives acceptable or, conversely, all unacceptable.
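The construction of a preference-approval from linguistic assessments can be sketched as follows. The sketch assumes that terms are coded by their index, with \(l_1\) denoting the highest priority, that tied alternatives share the average of the positions they occupy, and that \(l_1\) and \(l_2\) mark acceptable alternatives. The function name `to_preference_approval` is hypothetical, and the term indices used for \(v_{10}\) are hypothetical values chosen to be consistent with the preference-approval displayed above; they are not copied from Table 9.

```python
def to_preference_approval(terms, acceptable=(1, 2)):
    """Convert term indices (1 = highest priority, an assumption) into
    average ranks and 0/1 approvals."""
    order = sorted(range(len(terms)), key=lambda i: terms[i])
    ranks = [0.0] * len(terms)
    pos = 0
    while pos < len(order):
        # Find the block of alternatives tied with the one at position pos.
        end = pos
        while end + 1 < len(order) and terms[order[end + 1]] == terms[order[pos]]:
            end += 1
        # Tied alternatives share the average of the 1-based positions pos+1..end+1.
        avg = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    approvals = [1 if t in acceptable else 0 for t in terms]
    return ranks, approvals


# Hypothetical term indices for the nine lines of action (x_1, ..., x_9).
v10_terms = [4, 2, 1, 1, 4, 3, 4, 4, 2]
ranks, approvals = to_preference_approval(v10_terms)
print(ranks)
print(approvals)
```

Because the approvals depend only on each respondent's own terms, the number of approved alternatives can vary freely from one respondent to another.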

Figure 5 shows the nine alternatives on the preference-approval plane.

Fig. 5 Preference-approval plane, Pew Research Center

In this example, the RelativeApproval of each item ranges between 55% and 95%, meaning that each alternative has been considered acceptable by more than half of the individuals. Therefore, although all the alternatives are regarded as acceptable by voters on average, it is possible to distinguish less urgent alternatives, such as the exploration of other planets and satellites (Moon and Mars), from more urgent ones, such as Earth monitoring (Climate and Asteroids).

The clusters estimated by the RKM algorithm are shown in Fig. 6. As in the previous example, three values of \(\lambda \) are used: \(0.1\), \(0.5\) and \(0.9\). For each scenario, the figure shows the clusters, the medoids and the Stress value reached by the MDS.

Fig. 6 Graphical representation of RKM clusters

In general, the Stress coefficient varies between 2.43 and 1.8, showing an excellent fit that tends to improve as \(\lambda \) increases. The optimal number of clusters, chosen through the Silhouette criterion, is again two regardless of the value of \(\lambda \).

The effect of the parameter \(\lambda \) on cluster building is visible. Setting \(\,\lambda =0.1\,\) results in Cluster 1 including only the two alternatives related to space exploration (Moon and Mars), which have the lowest RelativeApproval (see Fig. 5). Voters strongly tend to attribute the same approvals to these two alternatives.

Increasing \(\lambda \) to \(0.5\) causes Cluster 1 to enlarge by including the alternative “Searching life”. Finally, giving much more weight to the rankings, i.e., \(\lambda =0.9\), results in Cluster 1 also including the alternative “Searching natural resources”. In this way, Cluster 1 contains the four alternatives that are most frequently placed in the last positions in voters’ preference-approvals.

To use the clusters obtained in the Pew Research Center dataset in practice, one could identify which lines of action have similar approval profiles and use this information to inform policymakers. For instance, in this case study, the fact that the three alternatives \(x_1\) (Search for life and planets that could support life), \(x_2\) (Explore the Moon) and \(x_3\) (Explore Mars) belong to the same cluster suggests that policies focused on one of these areas will also receive the consent of voters who prefer the alternatives linked to it. Therefore, solutions shared by the three lines could significantly reduce the economic resources otherwise necessary for interventions on the individual dimensions. Moreover, by clustering the lines of action based on preference-approval profiles, policymakers can gain insights into the relationships between different policy options and the values and priorities of the electorate. These insights can inform the development of policies that align more closely with public sentiment and are, therefore, more likely to be successful. Table 10 shows the average ExpectedRank, the average RelativeApproval and the average within-cluster distance of the alternatives in the two clusters, identified using \(\lambda =0.5\).

Table 10 Comparison of two Clusters Based on ExpectedRank, RelativeApproval, and average distance

The table shows that Cluster 1 alternatives have a higher average RelativeApproval (0.86) and a better average ExpectedRank (4.46) than Cluster 2. This means that, on average, individuals were more supportive of Cluster 1’s activities than of Cluster 2’s. Furthermore, the average within-cluster distance is the same in both groups, even though only three action lines \(\{x_1, x_7, x_8\}\) are in the second cluster, while six action lines \(\{x_2, x_3, x_4, x_5, x_6, x_9\}\) belong to the first cluster. From a practical point of view, this means that investments in terms of time, money, and human resources aimed at one or more lines of action in Cluster 1 are likely to succeed in attracting the support of all those who have expressed a preference for one or more of these actions, appealing to a wider range of citizens as a result. Therefore, the action lines in Cluster 1 should have a higher priority than those in Cluster 2.

It is worth noting that, in contrast to the first case study, the preference-approvals in this study were obtained using a different method. Instead of ranking the alternatives in order of popularity and approving those that received a higher number of votes than the national average, individuals employed a qualitative scale consisting of five linguistic terms to indicate their level of approval for each alternative. Consequently, due to the qualitative nature of this scale, there were significantly more approved alternatives in this study than in the first case study. As a result, the ranking of alternatives played a more critical role in evaluating the distance between alternatives, and the impact of \(\lambda \) on the clustering process is more pronounced.

5 Concluding remarks

Preference-approval structures are gaining increasing attention in social choice as they allow decision-makers to describe their preferences using more flexible and intuitive ordinal information. In this paper, we propose a new method for clustering alternatives in preference-approvals. First, we introduce a family of pseudometrics, \(\delta _{\lambda }\), that quantify the distance between alternatives based on two main components, the preference-discordance \(p_{ij}\) and the approval-discordance \(a_{ij}\), and on the \(\lambda \) parameter, which regulates the weight given to each component.
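The role of \(\lambda \) can be illustrated with a small sketch. The exact definitions of \(\delta _{\lambda }\), \(p_{ij}\) and \(a_{ij}\) are those given earlier in the paper and are not reproduced here. Purely for illustration, the code assumes a convex combination in which \(\lambda \) weights the preference-discordance and \(1-\lambda \) weights the approval-discordance, which is consistent with small values of \(\lambda \) favouring approvals in the case studies. The function name `delta_lambda` and the two discordance matrices are hypothetical.

```python
def delta_lambda(p, a, lam):
    """Combine preference- and approval-discordance matrices.

    Assumed form (illustrative only): lam * p_ij + (1 - lam) * a_ij.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n = len(p)
    return [
        [lam * p[i][j] + (1.0 - lam) * a[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical discordances among three alternatives (not real data).
p = [[0.0, 0.2, 0.8], [0.2, 0.0, 0.6], [0.8, 0.6, 0.0]]
a = [[0.0, 0.6, 0.4], [0.6, 0.0, 0.2], [0.4, 0.2, 0.0]]

for lam in (0.1, 0.5, 0.9):
    d = delta_lambda(p, a, lam)
    print(lam, [round(d[0][j], 3) for j in range(3)])
```

Under this assumed form, the alternative closest to the first one changes as \(\lambda \) moves from 0.1 to 0.9, which is the mechanism by which \(\lambda \) can alter the clusters.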

To obtain clusters, we apply the Ranked k-medoids partitioning algorithm, taking as input the similarities among pairs of alternatives based on the proposed pseudometrics. Finally, clusters are represented in 2-dimensional space using Non-Metric Multidimensional Scaling.

Through two applications to real data, we demonstrate how our algorithm allows dividing a heterogeneous population of alternatives into homogeneous groups, reducing the complexity of the preference-approval space and providing a more accessible interpretation of data. We also show the effect of the \(\lambda \) parameter on cluster identification and visualization.

Future research should consider using the proposed clustering method to collapse categories in the context of multiple-choice models. Moreover, future research should investigate methods for simultaneously clustering both individuals and alternatives in the preference-approval framework, extracting helpful information in a low-dimensional subspace. Finally, we plan to compare the proposed method with alternative approaches to preference-approval clustering as they become available.