BY 4.0 license Open Access Published by De Gruyter December 12, 2023

When is the allele-sharing dissimilarity between two populations exceeded by the allele-sharing dissimilarity of a population with itself?

  • Xiran Liu, Zarif Ahsan, Tarun K. Martheswaran and Noah A. Rosenberg

Abstract

Allele-sharing statistics for a genetic locus measure the dissimilarity between two populations as a mean of the dissimilarity between random pairs of individuals, one from each population. Owing to within-population variation in genotype, allele-sharing dissimilarities can have the property that they have a nonzero value when computed between a population and itself. We consider the mathematical properties of allele-sharing dissimilarities in a pair of populations, treating the allele frequencies in the two populations parametrically. Examining two formulations of allele-sharing dissimilarity, we obtain the distributions of within-population and between-population dissimilarities for pairs of individuals. We then mathematically explore the scenarios in which, for certain allele-frequency distributions, the within-population dissimilarity – the mean dissimilarity between randomly chosen members of a population – can exceed the dissimilarity between two populations. Such scenarios assist in explaining observations in population-genetic data that members of a population can be empirically more genetically dissimilar from each other on average than they are from members of another population. For a population pair, however, the mathematical analysis finds that at least one of the two populations always possesses smaller within-population dissimilarity than the value of the between-population dissimilarity. We illustrate the mathematical results with an application to human population-genetic data.

1 Introduction

Statistics that measure the genetic dissimilarity between pairs of populations are widely used for interpreting population-genetic data (Bowcock et al. 1994; Chakraborty and Jin 1993; Gao and Martin 2009; Mountain and Cavalli-Sforza 1997; Mountain and Ramakrishnan 2005; Rosenberg 2011; Tal 2013; Witherspoon et al. 2007). Patterns in numerical values of the statistics appear in calculations of the relative similarity and dissimilarity of different human groups (Mountain and Ramakrishnan 2005; Rosenberg 2011; Witherspoon et al. 2007). Further, genetic dissimilarity statistics, often termed “genetic distances,” underlie frequently applied tools for data analysis and visualization, including methods such as evolutionary tree construction (Bowcock et al. 1994) and multidimensional scaling (Gao and Martin 2009).

Population-level genetic dissimilarity statistics computed at a single genetic locus often proceed by considering pairs of vectors, $\mathbf{p}$ and $\mathbf{q}$, representing the allele frequencies of two populations. Each vector consists of nonnegative entries that sum to 1. Hence, for a locus with $I$ distinct alleles, such a genetic dissimilarity statistic has domain $\Delta^{I-1} \times \Delta^{I-1}$, where $\Delta^{I-1}$ is the simplex $\{(p_1, p_2, \ldots, p_I) : \sum_{i=1}^{I} p_i = 1 \text{ and } p_i \geq 0 \text{ for all } i\}$.

Among the many genetic dissimilarity statistics that are available (Jorde 1985; Nei 1987), those known as allele-sharing dissimilarities form a distinctive subset. Such statistics view a dissimilarity between two populations as the mean of a dissimilarity between pairs of individuals, one from one population and one from the other. With this perspective, they have a simple interpretation as a population-level generalization of an individual-level statistic. They also have a natural connection to a fundamental computation in human population genetics – the apportionment of genetic diversity among different levels of genetic structure (Edge et al. 2022; Lewontin 1972) – which can be viewed in terms of various mean pairwise dissimilarities across certain subsets of individuals (Rosenberg 2011).

Most dissimilarity statistics, such as those based on the Euclidean distance between functions of allele-frequency vectors (Cavalli-Sforza and Edwards 1967) or on the dot product of these vectors (Nei 1972), equal zero when a population is compared with itself. Allele-sharing dissimilarities differ: because they emerge from inter-individual computations among non-identical individuals, they can produce nonzero values for the dissimilarity between a polymorphic population and itself. This feature assists in understanding a property of genetic variation in structured populations: whether, and to what extent, the genetic dissimilarity of individuals from the same population exceeds the genetic dissimilarity of individuals from different populations.

Because individuals in a population generally possess a larger number of recent shared ancestors than individuals from different populations, a perspective focused on population-genetic descent predicts that individuals from the same population will be genetically more similar than individuals from different populations. Indeed, in human population genetics, studies of allele-sharing dissimilarity find that the mean dissimilarity across pairs of individuals from different populations does exceed the mean dissimilarity for pairs from the same populations (Mountain and Ramakrishnan 2005; Rosenberg 2011; Tal 2013; Witherspoon et al. 2007). However, such studies also find a perhaps unexpected result that the allele-sharing dissimilarity for some pairs of individuals from the same population can exceed the dissimilarity for some pairs from different populations.

Here, we seek to explain the properties of allele-sharing dissimilarities within and between populations. We study mathematical properties of population-level allele-sharing dissimilarities under the assumption that individuals in a population represent random draws from the vector of allele frequencies in the population. We consider mean allele-sharing dissimilarities for pairs of individuals from the same population and for pairs of individuals from different populations, evaluating the conditions on allele-frequency vectors under which the allele-sharing dissimilarity for a population to itself can exceed the allele-sharing dissimilarity between two populations. We interpret the results in relation to ongoing efforts to understand human genetic similarity and difference.

2 Methods

2.1 Allele-sharing dissimilarities

An allele-sharing dissimilarity (ASD) is a type of dissimilarity that is based on counting the number of alleles shared at a locus between two diploid individuals. We consider two different versions of the ASD concept.

In one ASD variant, which we denote by $\mathcal{D}_1$, “allele-sharing” for two diploid individuals is interpreted as the number of shared elements in their multisets of alleles. Consider a locus with four distinct alleles, the minimum number required so that all possible cases exist. Call these alleles A, B, C, and D. For $\mathcal{D}_1$, two individuals both with genotype AB have 2 alleles shared, as the multisets {A, B} and {A, B} have 2 identical elements. An individual with genotype AB and an individual with genotype AC have 1 allele shared, as the multisets {A, B} and {A, C} have 1 element shared between them, namely A. Two individuals with genotype AA have 2 alleles shared, as the multisets {A, A} and {A, A} have 2 shared elements, A and A. The dissimilarity $\mathcal{D}_1$ is then defined as 1 minus half the number of shared alleles; the normalization ensures that $\mathcal{D}_1$ lies in [0,1] (Gao and Martin 2009; Mountain and Cavalli-Sforza 1997). With 0, 1, and 2 shared alleles, the dissimilarity equals 1, $\frac{1}{2}$, and 0, respectively.

Another variant of ASD, which we denote by $\mathcal{D}_2$, instead considers alleles individually, evaluating the fraction of pairs of alleles, one from the first individual and one from the second, that are distinct (Mountain and Ramakrishnan 2005). For two individuals with genotype AB, $\mathcal{D}_2$ is equal to $\frac{1}{2}$, because among the four possible pairs of alleles – (A, A), (A, B), (B, A), and (B, B), where the first entry in the pair represents an allele from the first individual and the second entry is an allele from the second individual – two of four contain distinct alleles.

Table 1 shows all seven possible pairs of unordered diploid genotypes for two individuals and their corresponding dissimilarities measured by $\mathcal{D}_1$ and $\mathcal{D}_2$. In only two of seven cases do the two dissimilarities differ.

Table 1:

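Two variants of allele-sharing dissimilarity. All possible pairs of unordered genotypes are shown, along with their values of $\mathcal{D}_1$ and $\mathcal{D}_2$.

| Case | Genotypes | $\mathcal{D}_1$ | $\mathcal{D}_2$ |
|---|---|---|---|
| 1 | AA, AA | 0 | 0 |
| 2 | AA, AB | $\frac{1}{2}$ | $\frac{1}{2}$ |
| 3 | AA, BB | 1 | 1 |
| 4 | AA, BC | 1 | 1 |
| 5 | AB, AB | 0 | $\frac{1}{2}$ |
| 6 | AB, AC | $\frac{1}{2}$ | $\frac{3}{4}$ |
| 7 | AB, CD | 1 | 1 |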
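The two definitions can be sketched in a few lines of Python. This is an illustration only, not the authors' code; the function names `d1` and `d2` and the string encoding of genotypes are assumptions made here. The loop reproduces the seven rows of Table 1.

```python
from collections import Counter
from fractions import Fraction

def d1(g1, g2):
    """D1: 1 minus half the number of alleles shared by two genotype multisets."""
    shared = sum((Counter(g1) & Counter(g2)).values())  # multiset intersection size
    return 1 - Fraction(shared, 2)

def d2(g1, g2):
    """D2: fraction of the four cross-individual allele pairs that are distinct."""
    distinct = sum(a != b for a in g1 for b in g2)
    return Fraction(distinct, 4)

# The seven unordered genotype pairs of Table 1.
pairs = [("AA", "AA"), ("AA", "AB"), ("AA", "BB"), ("AA", "BC"),
         ("AB", "AB"), ("AB", "AC"), ("AB", "CD")]
for case, (g1, g2) in enumerate(pairs, start=1):
    print(case, g1, g2, d1(g1, g2), d2(g1, g2))
```

Only cases 5 and 6 give different values under the two definitions, in agreement with the table.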

2.2 Notation

Consider a locus with $I$ distinct alleles. We consider allele-frequency vectors in each of two populations. In Population 1, the allele frequencies are $\mathbf{p} = (p_1, p_2, \ldots, p_I)$, where $p_i$ represents the frequency of allele $i$. In Population 2, they are $\mathbf{q} = (q_1, q_2, \ldots, q_I)$. The frequencies satisfy $0 \leq p_i, q_i \leq 1$ for all $i$, and $\sum_{i=1}^{I} p_i = \sum_{i=1}^{I} q_i = 1$.

We are interested in mathematical properties of the distribution of an ASD measure $\mathcal{D}$ for pairs of populations – possibly the same population – where $\mathcal{D}$ can refer to $\mathcal{D}_1$ or $\mathcal{D}_2$. We denote the dissimilarity $\mathcal{D}$ between two randomly chosen individuals within the same population with allele-frequency vector $\mathbf{p}$ by $\mathcal{D}^w(\mathbf{p})$, and the corresponding dissimilarity between two randomly chosen individuals from different populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$ by $\mathcal{D}^b(\mathbf{p}, \mathbf{q})$. We often drop the arguments for convenience.

We will have occasion to use various symmetric sums involving allele frequencies. For t = 1, 2, 3, 4, for expressions in the separate populations, we use the notation

(1) $\sigma_t = \sum_{i=1}^{I} p_i^t, \quad \tau_t = \sum_{i=1}^{I} q_i^t,$

where $\sigma_1 = \tau_1 = 1$.

For expressions involving both populations, we use

(2) $\rho_{tu} = \sum_{i=1}^{I} p_i^t q_i^u,$

where (t, u) is equal to (1,1), (1,2), (2,1), or (2,2). Note that each of these sums can be viewed as an inner product.

2.3 Assumptions

We seek to perform ASD computations under the assumption that individuals are sampled at random from allele-frequency distributions. With this perspective, for a random pair of individuals, an ASD measure is a random variable that depends on the allele-frequency vectors of two populations of interest, treated as parameters.

At a given locus, we assume that the two alleles of an individual are sampled independently, so that diploid genotypes in a population are assumed to follow Hardy–Weinberg proportions. In other words, the probabilities of diploid genotypes in a population with allele-frequency vector $\mathbf{p}$ equal $p_i^2$ for homozygous genotypes and $2p_ip_j$ for heterozygous unordered genotypes, with $i \neq j$.

3 Distribution of $\mathcal{D}^w$

We first compute allele-sharing dissimilarities between random pairs of individuals sampled from the same population, evaluating the properties of random variables $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$.

3.1 Distribution of $\mathcal{D}_1^w$

$\mathcal{D}_1^w$ is a random variable that takes on values 0, $\frac{1}{2}$, and 1. We compute its probability distribution, and we then evaluate its mean and variance.

$\mathbb{P}(\mathcal{D}_1^w = d)$. We obtain the probability for each possible genotype combination in Table 1. These probabilities appear in Table 2, both as sums and as simplified polynomials.

Table 2:

Probabilities of genotype combinations for pairs of individuals sampled from the same population. For each case, the probability is written as a sum, which is then simplified using Eq. (1).

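| Case | Genotypes | Probability | Simplified probability |
|---|---|---|---|
| 1 | AA, AA | $\sum_{i=1}^{I} p_i^4$ | $\sigma_4$ |
| 2 | AA, AB | $4\sum_{i=1}^{I} p_i^3 \sum_{j=1, j \neq i}^{I} p_j$ | $4\sigma_3 - 4\sigma_4$ |
| 3 | AA, BB | $\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} p_j^2$ | $\sigma_2^2 - \sigma_4$ |
| 4 | AA, BC | $2\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} p_k$ | $2\sigma_2 - 4\sigma_3 - 2\sigma_2^2 + 4\sigma_4$ |
| 5 | AB, AB | $2\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} p_j^2$ | $2\sigma_2^2 - 2\sigma_4$ |
| 6 | AB, AC | $4\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} p_k$ | $4\sigma_2 - 8\sigma_3 - 4\sigma_2^2 + 8\sigma_4$ |
| 7 | AB, CD | $\sum_{i=1}^{I} p_i \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} p_k \sum_{\ell=1, \ell \neq i,j,k}^{I} p_\ell$ | $1 - 6\sigma_2 + 8\sigma_3 + 3\sigma_2^2 - 6\sigma_4$ |

With the probabilities of all genotype combinations obtained, we can sum across genotype combinations to compute probabilities for $\mathcal{D}_1^w(\mathbf{p})$ to equal 0, $\frac{1}{2}$, and 1. The resulting probabilities appear in Table 3.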

Table 3:

Probability distribution of $\mathcal{D}_1^w(\mathbf{p})$, the allele-sharing dissimilarity $\mathcal{D}_1^w$ for a pair of individuals sampled at random from a population with allele-frequency vector $\mathbf{p}$. The table is obtained by summing entries in Table 2.

| Value of the dissimilarity ($d$) | $\mathbb{P}(\mathcal{D}_1^w(\mathbf{p}) = d)$ |
|---|---|
| 0 | $2\sigma_2^2 - \sigma_4$ |
| $\frac{1}{2}$ | $4\sigma_2 - 4\sigma_3 - 4\sigma_2^2 + 4\sigma_4$ |
| 1 | $1 - 4\sigma_2 + 4\sigma_3 + 2\sigma_2^2 - 3\sigma_4$ |

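$\mathbb{E}[\mathcal{D}_1^w]$. The expected value of $\mathcal{D}_1^w(\mathbf{p})$ can be computed from the full probability distribution, via

$\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d \, \mathbb{P}(\mathcal{D}_1^w(\mathbf{p}) = d).$

Using the probabilities in Table 3, the result is

(3) $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = 1 - 2\sigma_2 + 2\sigma_3 - \sigma_4.$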
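The distribution in Table 3 and the mean in Eq. (3) can be checked by brute force. The following Python sketch is an illustration under the Hardy–Weinberg assumption of Section 2.3, not the authors' code: it enumerates all ordered choices of four alleles (two per individual), accumulates the exact distribution of $\mathcal{D}_1^w$ with rational arithmetic, and compares it with the closed forms. The helper names and the example frequency vector are assumptions chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def d1(g1, g2):
    """D1 for two genotypes given as pairs of allele labels."""
    shared = sum((Counter(g1) & Counter(g2)).values())
    return 1 - Fraction(shared, 2)

def within_distribution(p, dissimilarity):
    """Exact distribution of a dissimilarity for two individuals from one population."""
    dist = Counter()
    alleles = range(len(p))
    for a, b, c, d in product(alleles, repeat=4):  # ordered alleles of both individuals
        dist[dissimilarity((a, b), (c, d))] += p[a] * p[b] * p[c] * p[d]
    return dict(dist)

def sigma(p, t):
    """Power sum of Eq. (1)."""
    return sum(x ** t for x in p)

p = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]  # illustrative frequencies
dist = within_distribution(p, d1)
s2, s3, s4 = (sigma(p, t) for t in (2, 3, 4))
table3 = {Fraction(0): 2 * s2**2 - s4,
          Fraction(1, 2): 4 * s2 - 4 * s3 - 4 * s2**2 + 4 * s4,
          Fraction(1): 1 - 4 * s2 + 4 * s3 + 2 * s2**2 - 3 * s4}
mean = sum(d * pr for d, pr in dist.items())
print(dist == table3, mean == 1 - 2 * s2 + 2 * s3 - s4)
```

Because the arithmetic is exact, the comparison does not depend on a numerical tolerance.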

In the $I = 2$ case, using $p_2 = 1 - p_1$ so that $\sigma_t = p_1^t + (1 - p_1)^t$, Eq. (3) becomes:

(4) $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] = 2p_1 - 4p_1^2 + 4p_1^3 - 2p_1^4.$

Figure 1A plots Eq. (4) as a function of $p_1$. In the figure, we can observe that the mean value of the dissimilarity increases from a value of 0 at $p_1 = 0$, when the population is monomorphic, to a peak of $\frac{3}{8}$ at $p_1 = \frac{1}{2}$. It then decreases symmetrically to 0 at $p_1 = 1$.

Figure 1:

Mean and variance of the within-population dissimilarities $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$ for $I = 2$ alleles as functions of the frequency $p_1$ of one of the alleles. (A) Mean, Eqs. (4) and (10). (B) Variance, Eqs. (8) and (14).

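$\mathrm{Var}[\mathcal{D}_1^w]$. To obtain the variance of the distribution of $\mathcal{D}_1^w(\mathbf{p})$, we first calculate

(5) $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})^2] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d^2 \, \mathbb{P}(\mathcal{D}_1^w(\mathbf{p}) = d) = 1 - 3\sigma_2 + 3\sigma_3 + \sigma_2^2 - 2\sigma_4.$

The variance can then be obtained from Eqs. (3) and (5) by $\mathrm{Var}[\mathcal{D}_1^w(\mathbf{p})] = \mathbb{E}[\mathcal{D}_1^w(\mathbf{p})^2] - \mathbb{E}[\mathcal{D}_1^w(\mathbf{p})]^2$:

(6) $\mathrm{Var}[\mathcal{D}_1^w(\mathbf{p})] = \sigma_2 - \sigma_3 - 3\sigma_2^2 + 8\sigma_2\sigma_3 - 4\sigma_2\sigma_4 - 4\sigma_3^2 + 4\sigma_3\sigma_4 - \sigma_4^2.$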

For the $I = 2$ case, we once again use that $p_2 = 1 - p_1$:

(7) $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})^2] = p_1 - p_1^2$

(8) $\mathrm{Var}[\mathcal{D}_1^w(\mathbf{p})] = p_1 - 5p_1^2 + 16p_1^3 - 32p_1^4 + 40p_1^5 - 32p_1^6 + 16p_1^7 - 4p_1^8.$

Figure 1B plots Eq. (8) as a function of $p_1$. Like the mean, the variance of the dissimilarity increases from 0 at $p_1 = 0$ to a peak at $p_1 = \frac{1}{2}$, decreasing symmetrically to 0 at $p_1 = 1$. The maximal variance is $\frac{7}{64}$.
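As a consistency check on the biallelic formulas, the sketch below evaluates the general variance in Eq. (6) at $\mathbf{p} = (p_1, 1 - p_1)$ and compares it with the polynomial in Eq. (8) on a rational grid, then locates the maximum on that grid. It is an illustration only; the grid spacing of 1/100 and the function names are assumptions made here.

```python
from fractions import Fraction

def sigma(p, t):
    """Power sum of Eq. (1)."""
    return sum(x ** t for x in p)

def mean_d1w(p):
    """Eq. (3)."""
    s2, s3, s4 = (sigma(p, t) for t in (2, 3, 4))
    return 1 - 2 * s2 + 2 * s3 - s4

def var_d1w(p):
    """Eq. (6)."""
    s2, s3, s4 = (sigma(p, t) for t in (2, 3, 4))
    return (s2 - s3 - 3 * s2**2 + 8 * s2 * s3 - 4 * s2 * s4
            - 4 * s3**2 + 4 * s3 * s4 - s4**2)

def var_d1w_biallelic(x):
    """Eq. (8), with coefficients listed from degree 0 to degree 8."""
    coeffs = [0, 1, -5, 16, -32, 40, -32, 16, -4]
    return sum(c * x**k for k, c in enumerate(coeffs))

grid = [Fraction(k, 100) for k in range(101)]
agree = all(var_d1w([x, 1 - x]) == var_d1w_biallelic(x) for x in grid)
peak = max(grid, key=var_d1w_biallelic)
print(agree, peak, var_d1w_biallelic(peak), mean_d1w([peak, 1 - peak]))
```

On this grid the maximum sits at $p_1 = \frac{1}{2}$, where the variance is $\frac{7}{64}$ and the mean is $\frac{3}{8}$.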

3.2 Distribution of $\mathcal{D}_2^w$

We compute the distribution of random variable $\mathcal{D}_2^w$. This computation uses the same probabilities for genotype pairs as those used for $\mathcal{D}_1^w$ in Table 2.

$\mathbb{P}(\mathcal{D}_2^w = d)$. We compute the probability for each of the possible values of $\mathcal{D}_2^w$ by summing probabilities in Table 2. The resulting probabilities appear in Table 4.

Table 4:

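Probability distribution of $\mathcal{D}_2^w(\mathbf{p})$, the allele-sharing dissimilarity $\mathcal{D}_2^w$ for a pair of individuals sampled at random from a population with allele-frequency vector $\mathbf{p}$. The table is obtained by summing entries in Table 2.

| Value of the dissimilarity ($d$) | $\mathbb{P}(\mathcal{D}_2^w(\mathbf{p}) = d)$ |
|---|---|
| 0 | $\sigma_4$ |
| $\frac{1}{2}$ | $4\sigma_3 + 2\sigma_2^2 - 6\sigma_4$ |
| $\frac{3}{4}$ | $4\sigma_2 - 8\sigma_3 - 4\sigma_2^2 + 8\sigma_4$ |
| 1 | $1 - 4\sigma_2 + 4\sigma_3 + 2\sigma_2^2 - 3\sigma_4$ |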

$\mathbb{E}[\mathcal{D}_2^w]$. Summing across the possible values for the dissimilarity,

$\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] = \sum_{d \in \{0, \frac{1}{2}, \frac{3}{4}, 1\}} d \, \mathbb{P}(\mathcal{D}_2^w(\mathbf{p}) = d),$

yielding the result

(9) $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] = 1 - \sigma_2.$

Note that Eq. (9) gives the “expected heterozygosity,” the probability that two draws from the allele-frequency distribution produce distinct alleles.

For the $I = 2$ case, we have $\sigma_2 = p_1^2 + (1 - p_1)^2 = 1 - 2p_1 + 2p_1^2$, so Eq. (9) simplifies to

(10) $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] = 2p_1 - 2p_1^2 = 2p_1(1 - p_1).$

Figure 1A plots Eq. (10) as a function of $p_1$. The mean value of the dissimilarity is symmetric around a peak at $(\frac{1}{2}, \frac{1}{2})$, equaling 0 at $p_1 = 0$ and $p_1 = 1$.

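$\mathrm{Var}[\mathcal{D}_2^w]$. The variance of the distribution of $\mathcal{D}_2^w$ is obtained using $\mathrm{Var}[\mathcal{D}_2^w] = \mathbb{E}[\mathcal{D}_2^w(\mathbf{p})^2] - \mathbb{E}[\mathcal{D}_2^w(\mathbf{p})]^2$. We first find

(11) $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})^2] = \sum_{d \in \{0, \frac{1}{2}, \frac{3}{4}, 1\}} d^2 \, \mathbb{P}(\mathcal{D}_2^w(\mathbf{p}) = d) = 1 - \frac{7}{4}\sigma_2 + \frac{1}{2}\sigma_3 + \frac{1}{4}\sigma_2^2.$

Therefore,

(12) $\mathrm{Var}[\mathcal{D}_2^w(\mathbf{p})] = \frac{1}{4}\sigma_2 + \frac{1}{2}\sigma_3 - \frac{3}{4}\sigma_2^2.$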

For the $I = 2$ case, we use $p_2 = 1 - p_1$ to obtain

(13) $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})^2] = p_1 - 2p_1^3 + p_1^4$

(14) $\mathrm{Var}[\mathcal{D}_2^w(\mathbf{p})] = p_1 - 4p_1^2 + 6p_1^3 - 3p_1^4.$

Figure 1B plots Eq. (14). The variance has peaks at $(\frac{3 - \sqrt{3}}{6}, \frac{1}{12})$ and $(\frac{3 + \sqrt{3}}{6}, \frac{1}{12})$, between which it has a local minimum at $(\frac{1}{2}, \frac{1}{16})$. It equals 0 at $p_1 = 0$ and $p_1 = 1$.
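The locations of the two peaks and the local minimum can be confirmed numerically. The sketch below is an illustration only: it evaluates Eq. (14) and its derivative, computed by hand here as $1 - 8p_1 + 18p_1^2 - 12p_1^3$, at the three stated critical points. Floating-point arithmetic is used, so the derivative is compared with zero up to a small tolerance chosen arbitrarily.

```python
import math

def var_d2w_biallelic(x):
    """Eq. (14): variance of D2 within a biallelic population."""
    return x - 4 * x**2 + 6 * x**3 - 3 * x**4

def slope(x):
    """Derivative of Eq. (14) with respect to p1."""
    return 1 - 8 * x + 18 * x**2 - 12 * x**3

critical = [(3 - math.sqrt(3)) / 6, 0.5, (3 + math.sqrt(3)) / 6]
for x in critical:
    # Each row: location, variance there, and whether the slope vanishes.
    print(round(x, 4), round(var_d2w_biallelic(x), 6), abs(slope(x)) < 1e-9)
```

The derivative factors as $-(2p_1 - 1)(6p_1^2 - 6p_1 + 1)$, which is where the roots $\frac{1}{2}$ and $\frac{3 \pm \sqrt{3}}{6}$ come from.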

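3.3 Comparison of $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$

Comparing $\mathbb{E}[\mathcal{D}_1^w]$ (Eq. (3)) and $\mathbb{E}[\mathcal{D}_2^w]$ (Eq. (9)), we quickly observe that if $p_i \neq 1$ for all $i$, then

(15) $\mathbb{E}[\mathcal{D}_1^w] < \mathbb{E}[\mathcal{D}_2^w].$

The result follows by noting $(1 - p_i)^2 > 0$ for all $i$, so that $\sum_{i=1}^{I} p_i^2 (2p_i) < \sum_{i=1}^{I} p_i^2 (1 + p_i^2)$ and $2\sigma_3 < \sigma_2 + \sigma_4$, from which we obtain Eq. (15). Equivalently, $\mathbb{E}[\mathcal{D}_2^w] - \mathbb{E}[\mathcal{D}_1^w] = \sigma_2 - 2\sigma_3 + \sigma_4 = \sum_{i=1}^{I} p_i^2 (1 - p_i)^2$. In fact, Eq. (15) follows from Table 1: for all possible genotype combinations, $\mathcal{D}_1^w \leq \mathcal{D}_2^w$, and the inequality is strict in two of seven cases, at least one of which must have nonzero probability if $p_i \neq 1$ for all $i$.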
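The gap between the two means can be checked numerically against the sum $\sum_{i} p_i^2 (1 - p_i)^2$, which follows directly from Eqs. (3) and (9). The sketch below is an illustration only: it draws random frequency vectors by normalizing exponential variates (a standard way to sample uniformly from the simplex) and confirms that the gap is positive and equals that sum. The seed, the number of draws, and the function names are arbitrary choices made here.

```python
import random

def sigma(p, t):
    """Power sum of Eq. (1)."""
    return sum(x ** t for x in p)

def mean_gap(p):
    """Return (E[D2w] - E[D1w], sum of p_i^2 (1 - p_i)^2) for frequency vector p."""
    mean_d1 = 1 - 2 * sigma(p, 2) + 2 * sigma(p, 3) - sigma(p, 4)  # Eq. (3)
    mean_d2 = 1 - sigma(p, 2)                                      # Eq. (9)
    return mean_d2 - mean_d1, sum(x**2 * (1 - x)**2 for x in p)

def random_simplex_point(n_alleles, rng):
    """Uniform draw from the simplex via normalized Exponential(1) variates."""
    draws = [rng.expovariate(1.0) for _ in range(n_alleles)]
    total = sum(draws)
    return [x / total for x in draws]

rng = random.Random(0)
ok = True
for _ in range(1000):
    p = random_simplex_point(rng.randint(2, 6), rng)
    diff, gap = mean_gap(p)
    ok = ok and abs(diff - gap) < 1e-12 and gap > 0
print(ok)
```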

For $I = 2$, Eq. (15) can be observed in Figure 1A, as it can be seen that the curve for $\mathbb{E}[\mathcal{D}_2^w]$ exceeds that for $\mathbb{E}[\mathcal{D}_1^w]$. The largest excess occurs at $p_1 = p_2 = \frac{1}{2}$. Figure 2C plots the difference $\mathbb{E}[\mathcal{D}_2^w] - \mathbb{E}[\mathcal{D}_1^w]$ for the case of $I = 3$, and the maximal difference in the figure also occurs when alleles have the same frequency, $(p_1, p_2, p_3) = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.

Figure 2:

Mean and variance of the within-population dissimilarities $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$ for $I = 3$ alleles as functions of the frequencies $p_1$ and $p_2$ of two of the alleles. (A) Mean of $\mathcal{D}_1^w$, Eq. (3). (B) Mean of $\mathcal{D}_2^w$, Eq. (9). (C) $\mathbb{E}[\mathcal{D}_2^w] - \mathbb{E}[\mathcal{D}_1^w]$. (D) Variance of $\mathcal{D}_1^w$, Eq. (6). (E) Variance of $\mathcal{D}_2^w$, Eq. (12). (F) $\mathrm{Var}[\mathcal{D}_2^w] - \mathrm{Var}[\mathcal{D}_1^w]$.

For the variances, Figure 1B shows that for $I = 2$, $\mathrm{Var}[\mathcal{D}_1^w] > \mathrm{Var}[\mathcal{D}_2^w]$ for intermediate $p_1$, and that the two variances are comparable for $p_1$ near 0 or 1, with some $p_1$ values producing $\mathrm{Var}[\mathcal{D}_1^w] < \mathrm{Var}[\mathcal{D}_2^w]$. Figure 2F illustrates a similar result for $I = 3$: at intermediate allele frequencies, $\mathrm{Var}[\mathcal{D}_1^w] > \mathrm{Var}[\mathcal{D}_2^w]$, whereas at extreme allele frequencies, the two variances are comparable, sometimes with $\mathrm{Var}[\mathcal{D}_1^w] < \mathrm{Var}[\mathcal{D}_2^w]$.

4 Distribution of $\mathcal{D}^b$

We now examine allele-sharing dissimilarities between pairs of individuals from different populations. Let $\mathbf{p}$ be the allele frequency vector for the population from which the first individual is sampled, and let $\mathbf{q}$ be the corresponding vector for the population of the second individual; the special case of $\mathbf{q} = \mathbf{p}$ follows Section 3. We evaluate the properties of the random variables $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$.

4.1 Distribution of $\mathcal{D}_1^b$

$\mathbb{P}(\mathcal{D}_1^b = d)$. We obtain the probability for each possible genotype combination for a pair of individuals from different populations. For this computation, we use the polynomials in Eqs. (1) and (2). The resulting probabilities appear in Table 5.

Table 5:

Probability of genotype combinations for pairs of individuals sampled from two populations. For each case, the probability is written as a sum, which is then simplified using Eqs. (1) and (2).

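| Case | Genotypes | Probability | Simplified probability |
|---|---|---|---|
| 1 | AA, AA | $\sum_{i=1}^{I} p_i^2 q_i^2$ | $\rho_{22}$ |
| 2 | AA, AB | $2\sum_{i=1}^{I} p_i^2 q_i \sum_{j=1, j \neq i}^{I} q_j + 2\sum_{i=1}^{I} p_i q_i^2 \sum_{j=1, j \neq i}^{I} p_j$ | $2\rho_{21} + 2\rho_{12} - 4\rho_{22}$ |
| 3 | AA, BB | $\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} q_j^2$ | $\sigma_2\tau_2 - \rho_{22}$ |
| 4 | AA, BC | $\sum_{i=1}^{I} p_i^2 \sum_{j=1, j \neq i}^{I} q_j \sum_{k=1, k \neq i,j}^{I} q_k + \sum_{i=1}^{I} q_i^2 \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} p_k$ | $\sigma_2 + \tau_2 - 2\sigma_2\tau_2 - 2\rho_{21} - 2\rho_{12} + 4\rho_{22}$ |
| 5 | AB, AB | $2\sum_{i=1}^{I} p_i q_i \sum_{j=1, j \neq i}^{I} p_j q_j$ | $2\rho_{11}^2 - 2\rho_{22}$ |
| 6 | AB, AC | $4\sum_{i=1}^{I} p_i q_i \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} q_k$ | $4\rho_{11} - 4\rho_{21} - 4\rho_{12} - 4\rho_{11}^2 + 8\rho_{22}$ |
| 7 | AB, CD | $\sum_{i=1}^{I} p_i \sum_{j=1, j \neq i}^{I} p_j \sum_{k=1, k \neq i,j}^{I} q_k \sum_{\ell=1, \ell \neq i,j,k}^{I} q_\ell$ | $1 - \sigma_2 - \tau_2 + \sigma_2\tau_2 - 4\rho_{11} + 4\rho_{21} + 4\rho_{12} - 6\rho_{22} + 2\rho_{11}^2$ |

We sum across genotype combinations to obtain probabilities for $\mathcal{D}_1^b$ to equal particular values. Table 6 provides these probabilities.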

Table 6:

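Probability distribution of $\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})$, the allele-sharing dissimilarity $\mathcal{D}_1^b$ for a pair of individuals sampled at random from two populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$. The table is obtained by summing entries in Table 5.

| Value of the dissimilarity ($d$) | $\mathbb{P}(\mathcal{D}_1^b(\mathbf{p}, \mathbf{q}) = d)$ |
|---|---|
| 0 | $2\rho_{11}^2 - \rho_{22}$ |
| $\frac{1}{2}$ | $4\rho_{11} - 2\rho_{21} - 2\rho_{12} - 4\rho_{11}^2 + 4\rho_{22}$ |
| 1 | $1 - 4\rho_{11} + 2\rho_{21} + 2\rho_{12} + 2\rho_{11}^2 - 3\rho_{22}$ |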

$\mathbb{E}[\mathcal{D}_1^b]$. As we did for the within-population dissimilarity $\mathcal{D}_1^w(\mathbf{p})$, we compute the expected value of the distribution of the between-population dissimilarity $\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})$ as

$\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d \, \mathbb{P}(\mathcal{D}_1^b(\mathbf{p}, \mathbf{q}) = d).$

Using the values in Table 6, we obtain

(16) $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = 1 - 2\rho_{11} + \rho_{21} + \rho_{12} - \rho_{22}.$

For the $I = 2$ case, with $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$, Eq. (16) simplifies to

(17) $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = p_1 + q_1 - 4p_1q_1 + 2p_1^2q_1 + 2p_1q_1^2 - 2p_1^2q_1^2.$

Figure 3A plots Eq. (17). The surface has maxima of 1 at $(p_1, q_1) = (1, 0)$ and $(0, 1)$, when the two populations have the greatest difference in allele frequency, and equals 0 at $(0, 0)$ and $(1, 1)$. It has a saddle point at $(p_1, q_1) = (\frac{1}{2}, \frac{1}{2})$, where its value is $\frac{3}{8}$.
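Eq. (16) can be checked in the same brute-force manner as the within-population mean: enumerate the two alleles of an individual from each population, weight each combination by its probability, and average $\mathcal{D}_1$. The sketch below is an illustration only, with helper names and example frequency vectors chosen here; it uses exact rational arithmetic and also evaluates the formula at the saddle point $(\frac{1}{2}, \frac{1}{2})$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def d1(g1, g2):
    """D1 for two genotypes given as pairs of allele labels."""
    shared = sum((Counter(g1) & Counter(g2)).values())
    return 1 - Fraction(shared, 2)

def mean_d1b_enumerated(p, q):
    """Mean of D1 for one individual from p and one from q, by enumeration."""
    alleles = range(len(p))
    return sum(d1((a, b), (c, d)) * p[a] * p[b] * q[c] * q[d]
               for a, b, c, d in product(alleles, repeat=4))

def rho(p, q, t, u):
    """Cross power sum of Eq. (2)."""
    return sum(x**t * y**u for x, y in zip(p, q))

def mean_d1b(p, q):
    """Eq. (16)."""
    return (1 - 2 * rho(p, q, 1, 1) + rho(p, q, 2, 1)
            + rho(p, q, 1, 2) - rho(p, q, 2, 2))

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative frequencies
q = [Fraction(1, 10), Fraction(3, 10), Fraction(3, 5)]
print(mean_d1b_enumerated(p, q) == mean_d1b(p, q))
half = [Fraction(1, 2), Fraction(1, 2)]
print(mean_d1b(half, half))  # value at the saddle point for I = 2
```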

Figure 3:

Mean and variance of the between-population dissimilarities $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$ for $I = 2$ alleles as functions of the frequencies $(p_1, q_1)$ of one of the alleles. (A) Mean of $\mathcal{D}_1^b$, Eq. (17). (B) Mean of $\mathcal{D}_2^b$, Eq. (23). (C) $\mathbb{E}[\mathcal{D}_2^b] - \mathbb{E}[\mathcal{D}_1^b]$. (D) Variance of $\mathcal{D}_1^b$, Eq. (21). (E) Variance of $\mathcal{D}_2^b$, Eq. (27). (F) $\mathrm{Var}[\mathcal{D}_2^b] - \mathrm{Var}[\mathcal{D}_1^b]$.

$\mathrm{Var}[\mathcal{D}_1^b]$. We first compute

(18) $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})^2] = \sum_{d \in \{0, \frac{1}{2}, 1\}} d^2 \, \mathbb{P}(\mathcal{D}_1^b(\mathbf{p}, \mathbf{q}) = d) = 1 - 3\rho_{11} + \frac{3}{2}\rho_{21} + \frac{3}{2}\rho_{12} - 2\rho_{22} + \rho_{11}^2.$

Using $\mathrm{Var}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})^2] - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]^2$, the variance is thus

(19) $\mathrm{Var}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \rho_{11} - \frac{1}{2}\rho_{21} - \frac{1}{2}\rho_{12} - 3\rho_{11}^2 + 4\rho_{11}\rho_{21} + 4\rho_{11}\rho_{12} - 4\rho_{11}\rho_{22} - \rho_{21}^2 - \rho_{12}^2 - 2\rho_{12}\rho_{21} + 2\rho_{12}\rho_{22} + 2\rho_{21}\rho_{22} - \rho_{22}^2.$

For the $I = 2$ case, we have $p_1 = 1 - p_2$ and $q_1 = 1 - q_2$. Equations (18) and (19) simplify to

(20) $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})^2] = \frac{1}{2}p_1 + \frac{1}{2}q_1 - 2p_1q_1 + \frac{1}{2}p_1^2 + \frac{1}{2}q_1^2$

(21) $\mathrm{Var}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \frac{1}{2}p_1 + \frac{1}{2}q_1 - 4p_1q_1 - \frac{1}{2}p_1^2 - \frac{1}{2}q_1^2 + 8p_1^2q_1 + 8p_1q_1^2 - 4p_1^3q_1 - 24p_1^2q_1^2 - 4p_1q_1^3 + 20p_1^3q_1^2 + 20p_1^2q_1^3 - 4p_1^4q_1^2 - 24p_1^3q_1^3 - 4p_1^2q_1^4 + 8p_1^4q_1^3 + 8p_1^3q_1^4 - 4p_1^4q_1^4.$

Figure 3D shows that the variance has higher values away from the four corners $(0, 0)$, $(1, 0)$, $(0, 1)$, and $(1, 1)$ for $(p_1, q_1)$, equaling 0 in each of these corners.

4.2 Distribution of $\mathcal{D}_2^b$

$\mathbb{P}(\mathcal{D}_2^b = d)$. We use Table 5 to obtain the probabilities of particular values of $\mathcal{D}_2^b$. The resulting probabilities appear in Table 7.

Table 7:

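Probability distribution of $\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})$, the allele-sharing dissimilarity $\mathcal{D}_2^b$ for a pair of individuals sampled at random from two populations with allele-frequency vectors $\mathbf{p}$ and $\mathbf{q}$. The table is obtained by summing entries in Table 5.

| Value of the dissimilarity ($d$) | $\mathbb{P}(\mathcal{D}_2^b(\mathbf{p}, \mathbf{q}) = d)$ |
|---|---|
| 0 | $\rho_{22}$ |
| $\frac{1}{2}$ | $2\rho_{21} + 2\rho_{12} + 2\rho_{11}^2 - 6\rho_{22}$ |
| $\frac{3}{4}$ | $4\rho_{11} - 4\rho_{21} - 4\rho_{12} - 4\rho_{11}^2 + 8\rho_{22}$ |
| 1 | $1 - 4\rho_{11} + 2\rho_{21} + 2\rho_{12} + 2\rho_{11}^2 - 3\rho_{22}$ |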

$\mathbb{E}[\mathcal{D}_2^b]$. For $\mathcal{D}_2^b$, we substitute the values from Table 7 into

$\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = \sum_{d \in \{0, \frac{1}{2}, \frac{3}{4}, 1\}} d \, \mathbb{P}(\mathcal{D}_2^b(\mathbf{p}, \mathbf{q}) = d).$

We obtain

(22) $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = 1 - \rho_{11}.$

This quantity is the between-population analogue of expected heterozygosity, the probability that two random draws, one from the allele-frequency distribution of a locus in one population and one from the corresponding distribution in a second population, represent distinct alleles.

For the $I = 2$ case, Eq. (22) simplifies to

(23) $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = p_1 + q_1 - 2p_1q_1.$

Figure 3B plots Eq. (23). The surface has maxima of 1 at $(p_1, q_1) = (1, 0)$ and $(0, 1)$ and equals 0 at $(0, 0)$ and $(1, 1)$. It has a saddle point at $(p_1, q_1) = (\frac{1}{2}, \frac{1}{2})$, where its value is $\frac{1}{2}$.

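$\mathrm{Var}[\mathcal{D}_2^b]$. We find that

(24) $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})^2] = \sum_{d \in \{0, \frac{1}{2}, \frac{3}{4}, 1\}} d^2 \, \mathbb{P}(\mathcal{D}_2^b(\mathbf{p}, \mathbf{q}) = d) = 1 - \frac{7}{4}\rho_{11} + \frac{1}{4}\rho_{21} + \frac{1}{4}\rho_{12} + \frac{1}{4}\rho_{11}^2.$

Therefore, by $\mathrm{Var}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})^2] - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]^2$,

(25) $\mathrm{Var}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = \frac{1}{4}\rho_{11} + \frac{1}{4}\rho_{21} + \frac{1}{4}\rho_{12} - \frac{3}{4}\rho_{11}^2.$

For the $I = 2$ case, Eqs. (24) and (25) simplify to

(26) $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})^2] = \frac{1}{2}p_1 + \frac{1}{2}q_1 + \frac{1}{2}p_1^2 + \frac{1}{2}q_1^2 - p_1q_1 - p_1^2q_1 - p_1q_1^2 + p_1^2q_1^2$

(27) $\mathrm{Var}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = \frac{1}{2}p_1 + \frac{1}{2}q_1 - 3p_1q_1 - \frac{1}{2}p_1^2 - \frac{1}{2}q_1^2 + 3p_1^2q_1 + 3p_1q_1^2 - 3p_1^2q_1^2.$

Figure 3E plots Eq. (27). The variance is greatest at $(p_1, q_1) = (\frac{1}{2}, 0)$, $(\frac{1}{2}, 1)$, $(0, \frac{1}{2})$, and $(1, \frac{1}{2})$ and equals 0 at $(0, 0)$, $(1, 0)$, $(0, 1)$, and $(1, 1)$. It has a local minimum at $(p_1, q_1) = (\frac{1}{2}, \frac{1}{2})$.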
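The mean in Eq. (22) and the variance in Eq. (25) can both be verified by exact enumeration. The sketch below is an illustration only, with helper names and example frequency vectors chosen here: it computes the first two moments of $\mathcal{D}_2^b$ directly from the definition and compares them with the closed forms.

```python
from fractions import Fraction
from itertools import product

def d2(g1, g2):
    """D2: fraction of cross-individual allele pairs that are distinct."""
    return Fraction(sum(a != b for a in g1 for b in g2), 4)

def rho(p, q, t, u):
    """Cross power sum of Eq. (2)."""
    return sum(x**t * y**u for x, y in zip(p, q))

def moments_d2b(p, q):
    """Exact mean and variance of D2 between two populations, by enumeration."""
    alleles = range(len(p))
    first = second = 0
    for a, b, c, d in product(alleles, repeat=4):
        weight = p[a] * p[b] * q[c] * q[d]   # independent allele draws
        value = d2((a, b), (c, d))
        first += value * weight
        second += value**2 * weight
    return first, second - first**2

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative frequencies
q = [Fraction(1, 10), Fraction(3, 10), Fraction(3, 5)]
mean, var = moments_d2b(p, q)
r11, r21, r12 = rho(p, q, 1, 1), rho(p, q, 2, 1), rho(p, q, 1, 2)
print(mean == 1 - r11)                               # Eq. (22)
print(var == (r11 + r21 + r12 - 3 * r11**2) / 4)     # Eq. (25)
```

The mean $1 - \rho_{11}$ is the probability that one allele drawn from each population differs, which is why it depends only on the inner product of $\mathbf{p}$ and $\mathbf{q}$.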

4.3 Comparison of $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$

The two measures for the between-population dissimilarity have the same expected value, $\mathbb{E}[\mathcal{D}_1^b] = \mathbb{E}[\mathcal{D}_2^b]$, if for all $i$, at least one of $p_i$, $1 - p_i$, $q_i$, and $1 - q_i$ is zero. The condition for equality can be seen from $\mathbb{E}[\mathcal{D}_2^b] - \mathbb{E}[\mathcal{D}_1^b] = \rho_{11} - \rho_{21} - \rho_{12} + \rho_{22} = \sum_{i=1}^{I} p_i(1 - p_i)q_i(1 - q_i)$. Excluding these equality cases, we have

(28) $\mathbb{E}[\mathcal{D}_1^b] < \mathbb{E}[\mathcal{D}_2^b].$

Note that $\mathcal{D}_1^b \leq \mathcal{D}_2^b$ for all possible genotype combinations in Table 1.

The inequality in Eq. (28) can be observed for the $I = 2$ case in Figure 3C, where the surface plot of $\mathbb{E}[\mathcal{D}_2^b] - \mathbb{E}[\mathcal{D}_1^b]$ remains greater than or equal to 0, with equality only on the boundary. The largest difference occurs at $p_1 = q_1 = \frac{1}{2}$.

Figure 3F compares the variances of $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$ for the case of $I = 2$. Across most of the parameter space, $\mathrm{Var}[\mathcal{D}_1^b] > \mathrm{Var}[\mathcal{D}_2^b]$. The excess is greatest at points $(p_1, q_1) = (\frac{1}{3}, \frac{2}{3})$ and $(\frac{2}{3}, \frac{1}{3})$.

5 The relative magnitudes of $\mathbb{E}[\mathcal{D}^w]$ and $\mathbb{E}[\mathcal{D}^b]$

We now examine the relative magnitudes of the expectations $\mathbb{E}[\mathcal{D}^w]$ and $\mathbb{E}[\mathcal{D}^b]$. We determine the conditions under which the expectation of a within-population dissimilarity exceeds that of a between-population dissimilarity.

5.1 Inequality relationship between $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})]$ and $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$

For arbitrary $I$, using Eqs. (3) and (16), the expression $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$ is equivalent to

(29) $-2\sigma_2 + 2\sigma_3 - \sigma_4 + 2\rho_{11} - \rho_{21} - \rho_{12} + \rho_{22} > 0.$

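This condition can be written with vector notation. Let $\tilde{\mathbf{p}} = (p_1^2, p_2^2, \ldots, p_I^2)$ and $\tilde{\mathbf{q}} = (q_1^2, q_2^2, \ldots, q_I^2)$, treating $\mathbf{p}$, $\mathbf{q}$, $\tilde{\mathbf{p}}$, and $\tilde{\mathbf{q}}$ as row vectors. We have the identities $\sigma_2 = \mathbf{p}\mathbf{p}^T$, $\sigma_3 = \mathbf{p}\tilde{\mathbf{p}}^T = \tilde{\mathbf{p}}\mathbf{p}^T$, $\sigma_4 = \tilde{\mathbf{p}}\tilde{\mathbf{p}}^T$, $\rho_{11} = \mathbf{p}\mathbf{q}^T$, $\rho_{12} = \mathbf{p}\tilde{\mathbf{q}}^T$, $\rho_{21} = \tilde{\mathbf{p}}\mathbf{q}^T$, and $\rho_{22} = \tilde{\mathbf{p}}\tilde{\mathbf{q}}^T$.

Equation (29) thus becomes

(30) $-2\mathbf{p}\mathbf{p}^T + 2\mathbf{p}\tilde{\mathbf{p}}^T - \tilde{\mathbf{p}}\tilde{\mathbf{p}}^T + 2\mathbf{p}\mathbf{q}^T - \tilde{\mathbf{p}}\mathbf{q}^T - \mathbf{p}\tilde{\mathbf{q}}^T + \tilde{\mathbf{p}}\tilde{\mathbf{q}}^T > 0,$

which simplifies to

(31) $\begin{pmatrix} \mathbf{p} & \mathbf{p} - \tilde{\mathbf{p}} \end{pmatrix} \begin{pmatrix} (\mathbf{p} - \mathbf{q})^T \\ [(\mathbf{p} - \tilde{\mathbf{p}}) - (\mathbf{q} - \tilde{\mathbf{q}})]^T \end{pmatrix} < 0.$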
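One way to group the terms of Eq. (30) is as $\mathbf{p}(\mathbf{p} - \mathbf{q})^T + (\mathbf{p} - \tilde{\mathbf{p}})[(\mathbf{p} - \tilde{\mathbf{p}}) - (\mathbf{q} - \tilde{\mathbf{q}})]^T < 0$, which is the negative of the left-hand side of Eq. (29). The sketch below is an illustration only, with function names and example vectors chosen here; it confirms this regrouping with exact rational arithmetic, under the assumption that this is the intended reading of the vector form.

```python
from fractions import Fraction

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def lhs_eq29(p, q):
    """Left side of Eq. (29), written with the inner products of Eq. (30)."""
    ps, qs = [x * x for x in p], [y * y for y in q]   # squared-frequency vectors
    return (-2 * dot(p, p) + 2 * dot(p, ps) - dot(ps, ps)
            + 2 * dot(p, q) - dot(ps, q) - dot(p, qs) + dot(ps, qs))

def lhs_grouped(p, q):
    """p (p - q)^T + (p - p~) [(p - p~) - (q - q~)]^T."""
    ps, qs = [x * x for x in p], [y * y for y in q]
    u = [a - b for a, b in zip(p, ps)]                # p - p~
    v = [a - b for a, b in zip(q, qs)]                # q - q~
    p_minus_q = [a - b for a, b in zip(p, q)]
    u_minus_v = [a - b for a, b in zip(u, v)]
    return dot(p, p_minus_q) + dot(u, u_minus_v)

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative frequencies
q = [Fraction(1, 10), Fraction(3, 10), Fraction(3, 5)]
print(lhs_eq29(p, q) == -lhs_grouped(p, q))
```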

For $I = 2$, we can further simplify this condition on $p_1$ and $q_1$, noting $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$.

Theorem 1

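Consider a locus with $I = 2$ distinct alleles. For individuals sampled from two populations with allele frequency vectors $\mathbf{p} = (p_1, 1 - p_1)$ and $\mathbf{q} = (q_1, 1 - q_1)$, $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$ holds if and only if

(32) $\begin{cases} 0 < q_1 < p_1 & \text{if } 0 < p_1 \leq a, \\ g(p_1) < q_1 < p_1 & \text{if } a \leq p_1 < \frac{1}{2}, \\ p_1 < q_1 < g(p_1) & \text{if } \frac{1}{2} < p_1 \leq 1 - a, \\ p_1 < q_1 < 1 & \text{if } 1 - a \leq p_1 < 1, \end{cases}$

where

$g(x) = \frac{2x^3 - 4x^2 + 4x - 1}{2x(1 - x)},$

and

$a = \frac{1}{3}\left(\frac{\sqrt[3]{3\sqrt{33} - 13}}{2^{2/3}} - \frac{2^{5/3}}{\sqrt[3]{3\sqrt{33} - 13}} + 2\right) \approx 0.3522$

is the unique real root of $2x^3 - 4x^2 + 4x - 1$.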

Proof

We simplify Eq. (29) noting $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$. To find the region where $\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$, we solve the polynomial inequality

(33) $p_1 - q_1 - 4p_1^2 + 4p_1q_1 + 4p_1^3 - 2p_1^2q_1 - 2p_1q_1^2 - 2p_1^4 + 2p_1^2q_1^2 > 0,$

with $0 \leq p_1 \leq 1$ and $0 \leq q_1 \leq 1$. Solving for $q_1$ in terms of $p_1$, we find that the expression in Eq. (33) is 0 at $q_1 = p_1$ and at $q_1 = g(p_1)$. Because the coefficient of $q_1^2$ is $-2p_1(1 - p_1) < 0$ for $0 < p_1 < 1$, for fixed $p_1$ the expression is positive when $q_1$ lies between the two roots. The unique real root for $g(x) = x$ is at $x = \frac{1}{2}$, so that $g(p_1) < p_1$ for $p_1 < \frac{1}{2}$ and $g(p_1) > p_1$ for $p_1 > \frac{1}{2}$.

For $0 \leq p_1 < \frac{1}{2}$, $g(p_1) < 0$ for $p_1 < a$, so that for $0 \leq p_1 \leq a$, the region where the expression in Eq. (33) is positive includes the full interval $(0, p_1)$ for $q_1$. For $a \leq p_1 \leq \frac{1}{2}$, it is positive only in interval $(g(p_1), p_1)$ for $q_1$.

For $\frac{1}{2} < p_1 < 1$, $g(p_1) = 1$ for $p_1 = 1 - a$, with $g(p_1) < 1$ for $p_1$ in $(\frac{1}{2}, 1 - a)$ and $g(p_1) > 1$ for $p_1$ in $(1 - a, 1)$. Hence, for $p_1$ in $[\frac{1}{2}, 1 - a]$, the expression in Eq. (33) is positive for $q_1$ in $(p_1, g(p_1))$, and for $p_1$ in $[1 - a, 1]$, it is positive for $q_1$ in $(p_1, 1)$. □

Figure 4A plots the region identified in Theorem 1. That a nonempty region exists indicates that sometimes, allele frequencies for a biallelic locus produce a within-population dissimilarity that exceeds the between-population dissimilarity. Note that because the choice of which allele is labeled 1 and which is labeled 2 is arbitrary, $(p_1, q_1)$ is included in the region if and only if $(1 - p_1, 1 - q_1)$ is also included.
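The case analysis of Theorem 1 can be compared with the sign of the polynomial in Eq. (33) on a grid of interior points. The sketch below is an illustration only: it evaluates $g$ and the polynomial with exact rational arithmetic, uses a floating-point value of $a$ from the closed form only to select the case, and counts disagreements. The grid spacing of 1/60 and the function names are arbitrary choices made here.

```python
from fractions import Fraction

# Closed form for a, the real root of 2x^3 - 4x^2 + 4x - 1 (about 0.3522).
CUBE = (3 * 33**0.5 - 13) ** (1 / 3)
A = (CUBE / 2 ** (2 / 3) - 2 ** (5 / 3) / CUBE + 2) / 3

def g(x):
    """The second root in q1 of the polynomial in Eq. (33)."""
    return (2 * x**3 - 4 * x**2 + 4 * x - 1) / (2 * x * (1 - x))

def in_theorem1_region(p1, q1):
    """Case analysis of Eq. (32) for interior points 0 < p1, q1 < 1."""
    half = Fraction(1, 2)
    if p1 <= A:
        return 0 < q1 < p1
    if p1 < half:
        return g(p1) < q1 < p1
    if half < p1 <= 1 - A:
        return p1 < q1 < g(p1)
    if p1 >= 1 - A:
        return p1 < q1 < 1
    return False  # p1 == 1/2: the two roots coincide, so the region is empty

def within_exceeds_between(p1, q1):
    """True when the polynomial in Eq. (33) is positive."""
    poly = (p1 - q1 - 4 * p1**2 + 4 * p1 * q1 + 4 * p1**3 - 2 * p1**2 * q1
            - 2 * p1 * q1**2 - 2 * p1**4 + 2 * p1**2 * q1**2)
    return poly > 0

grid = [Fraction(k, 60) for k in range(1, 60)]
mismatches = sum(in_theorem1_region(x, y) != within_exceeds_between(x, y)
                 for x in grid for y in grid)
print(mismatches)
```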

Figure 4:

Values of $(p_1, q_1)$ for which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ in the case of $I = 2$ alleles, shaded in color. (A) $\mathcal{D}_1$, Theorem 1. (B) $\mathcal{D}_2$, Theorem 2.

We can calculate the area of the region in the unit square representing the probability $\mathbb{P}(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b])$ under the assumption that $p_1$ and $q_1$ are independently and identically distributed with uniform-[0,1] distribution:

(34) $\mathbb{P}(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]) = \int_{0}^{a}\int_{0}^{p_1} 1 \, dq_1 \, dp_1 + \int_{a}^{\frac{1}{2}}\int_{g(p_1)}^{p_1} 1 \, dq_1 \, dp_1 + \int_{\frac{1}{2}}^{1-a}\int_{p_1}^{g(p_1)} 1 \, dq_1 \, dp_1 + \int_{1-a}^{1}\int_{p_1}^{1} 1 \, dq_1 \, dp_1 = 2\left[\int_{0}^{a} p_1 \, dp_1 + \int_{a}^{\frac{1}{2}} \frac{-4p_1^3 + 6p_1^2 - 4p_1 + 1}{2p_1(1 - p_1)} \, dp_1\right] = -a^2 + 2a - \frac{1}{2} - 2\log 2 - \log a - \log(1 - a) \approx 0.17179.$
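The closed form in Eq. (34) can be evaluated numerically and compared with a direct area estimate. The sketch below is an illustration only: it finds $a$ by bisection rather than from the closed-form root, evaluates $-a^2 + 2a - \frac{1}{2} - 2\log 2 - \log a - \log(1 - a)$, and estimates the same area by counting midpoints of a regular grid at which the polynomial in Eq. (33) is positive. The grid resolution and function names are arbitrary choices made here, and the grid estimate is only approximate.

```python
import math

def root_a(lo=0.0, hi=0.5, tol=1e-14):
    """Bisection for the real root of 2x^3 - 4x^2 + 4x - 1, increasing on [0, 1/2]."""
    def cubic(x):
        return 2 * x**3 - 4 * x**2 + 4 * x - 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cubic(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poly33(p, q):
    """Left side of Eq. (33)."""
    return (p - q - 4 * p**2 + 4 * p * q + 4 * p**3 - 2 * p**2 * q
            - 2 * p * q**2 - 2 * p**4 + 2 * p**2 * q**2)

a = root_a()
closed_form = (-a**2 + 2 * a - 0.5 - 2 * math.log(2)
               - math.log(a) - math.log(1 - a))

n = 400  # midpoint grid resolution (arbitrary)
hits = sum(poly33((i + 0.5) / n, (j + 0.5) / n) > 0
           for i in range(n) for j in range(n))
grid_estimate = hits / n**2
print(round(a, 4), round(closed_form, 5), round(grid_estimate, 3))
```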

To evaluate $\mathbb{P}\left(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]\right)$ more generally, for each $I$ from 2 to 20, we perform a simulation. In particular, for each $I$, we consider independently and identically distributed vectors $\mathbf{p}$ and $\mathbf{q}$ from the uniform distribution over the simplex $\Delta^{I-1}$ (the Dirichlet-(1, 1, …, 1) distribution, where the vector of 1's has length $I$). We sample 100,000 replicate pairs $(\mathbf{p}, \mathbf{q})$, and for each pair we evaluate if $\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]$.

Figure 5A plots the resulting probability. We can observe that for $I = 2$, the simulated $\mathbb{P}\left(\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]\right)$ accords with the analytical value in Eq. (34). The probability then decreases with increasing $I$.
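A minimal sketch of such a simulation is given below. It is not the authors' code. It assumes that Eqs. (3) and (16), which are not restated in this section, take the forms $\mathbb{E}[\mathcal{D}_1^w] = 1 - 2\sigma_2 + 2\sigma_3 - \sigma_4$ and $\mathbb{E}[\mathcal{D}_1^b] = 1 - 2\rho_{11} + \rho_{21} + \rho_{12} - \rho_{22}$, with $\rho_{ij} = \sum_k p_k^i q_k^j$; these forms are consistent with the expression in the proof of Theorem 3. It also uses fewer replicates than the 100,000 described above, and all function names are hypothetical.

```python
import random

def uniform_simplex(I, rng):
    # Dirichlet(1, ..., 1) draw: normalize I independent Exponential(1) variates.
    x = [rng.expovariate(1.0) for _ in range(I)]
    total = sum(x)
    return [v / total for v in x]

def rho(p, q, i, j):
    # Power sum rho_{ij} = sum_k p_k^i q_k^j; sigma_i equals rho(p, p, i, 0).
    return sum(a**i * b**j for a, b in zip(p, q))

def mean_d1_within(p):
    # Assumed form of Eq. (3): 1 - 2*sigma_2 + 2*sigma_3 - sigma_4.
    return 1 - 2 * rho(p, p, 2, 0) + 2 * rho(p, p, 3, 0) - rho(p, p, 4, 0)

def mean_d1_between(p, q):
    # Assumed form of Eq. (16): 1 - 2*rho_11 + rho_21 + rho_12 - rho_22.
    return (1 - 2 * rho(p, q, 1, 1) + rho(p, q, 2, 1)
            + rho(p, q, 1, 2) - rho(p, q, 2, 2))

def estimate_probability(I, reps, seed=0):
    # Fraction of replicate pairs (p, q) with E[D1^w(p)] > E[D1^b(p, q)].
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p = uniform_simplex(I, rng)
        q = uniform_simplex(I, rng)
        if mean_d1_within(p) > mean_d1_between(p, q):
            hits += 1
    return hits / reps

d1_estimates = {I: estimate_probability(I, reps=20_000) for I in (2, 5, 10)}
print(d1_estimates)
```

For $I = 2$, the estimate can be compared with the analytical value in Eq. (34); increasing `reps` tightens the Monte Carlo error.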

Figure 5: The probability $\mathbb{P}\left(\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]\right)$ for simulated pairs of allele frequency vectors $(\mathbf{p}, \mathbf{q})$ with $I$ distinct alleles. (A) $\mathcal{D}_1$. (B) $\mathcal{D}_2$. Independent and identical uniform distributions are simulated for each $I$, $2 \leq I \leq 20$, by drawing uniformly from the simplex $\Delta^{I-1}$ (100,000 replicates).

5.2 Inequality relationship between $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})]$ and $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$

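For arbitrary $I$, via Eqs. (9) and (22), the expression $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$ is equivalent to

$$
\rho_{11} - \sigma_2 > 0. \tag{35}
$$

With $\sigma_2 = \mathbf{p}\mathbf{p}^T$ and $\rho_{11} = \mathbf{p}\mathbf{q}^T$, Eq. (35) thus becomes

$$
\mathbf{p}(\mathbf{p} - \mathbf{q})^T < 0. \tag{36}
$$

For $I = 2$, Eq. (36) can be simplified to a condition on $p_1$ and $q_1$, again noting $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$.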

Theorem 2

Consider a locus with $I = 2$ distinct alleles. For individuals sampled from two populations with allele frequency vectors $\mathbf{p} = (p_1, 1 - p_1)$ and $\mathbf{q} = (q_1, 1 - q_1)$, $\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] > \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$ holds if and only if

$$
\begin{cases}
0 < q_1 < p_1 & \text{if } 0 < p_1 < \frac{1}{2}, \\
p_1 < q_1 < 1 & \text{if } \frac{1}{2} < p_1 < 1.
\end{cases}
\tag{37}
$$

Proof

With $p_2 = 1 - p_1$ and $q_2 = 1 - q_1$, Eq. (35) simplifies to

$$
p_1 - q_1 - 2p_1^2 + 2p_1 q_1 > 0.
$$

Solving this inequality, we arrive at the result. □
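One way to carry out this last step, sketched here as a worked factorization rather than as the authors' own presentation, is to group the terms of the left-hand side around $p_1 - q_1$:

```latex
\begin{align*}
p_1 - q_1 - 2p_1^2 + 2p_1 q_1
  &= (p_1 - q_1) - 2p_1 (p_1 - q_1) \\
  &= (p_1 - q_1)(1 - 2p_1).
\end{align*}
```

The product is positive exactly when the two factors share a sign: either $q_1 < p_1$ with $p_1 < \frac{1}{2}$, or $q_1 > p_1$ with $p_1 > \frac{1}{2}$, which are the two cases of Eq. (37).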

Figure 4B plots the region identified in Theorem 2. This region describes the locations in which allele frequencies for a biallelic locus produce a within-population dissimilarity that exceeds the between-population dissimilarity. As is true for $\mathcal{D}_1$, $(p_1, q_1)$ is included in the region if and only if $(1 - p_1, 1 - q_1)$ is also included.

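The area of the region in the unit square, representing $\mathbb{P}\left(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]\right)$ under the assumption that $p_1$ and $q_1$ are independently and identically distributed with uniform-[0,1] distribution, is straightforward:

$$
\mathbb{P}\left(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]\right) = \int_{p_1=0}^{\frac{1}{2}} \int_{q_1=0}^{p_1} 1 \, dq_1 \, dp_1 + \int_{p_1=\frac{1}{2}}^{1} \int_{q_1=p_1}^{1} 1 \, dq_1 \, dp_1 = \frac{1}{4}. \tag{38}
$$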

We evaluate $\mathbb{P}\left(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]\right)$ for each $I$ from 2 to 20 by simulation. For each $I$, we consider independently and identically distributed vectors $\mathbf{p}$ and $\mathbf{q}$ from the uniform distribution over the simplex $\Delta^{I-1}$ (the Dirichlet-(1, 1, …, 1) distribution), sampling 100,000 replicate pairs $(\mathbf{p}, \mathbf{q})$, and evaluating the fraction of pairs for which $\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]$.

Figure 5B plots the resulting probability, illustrating the agreement between the simulated $\mathbb{P}\left(\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]\right)$ and the analytical value in Eq. (38) for $I = 2$. The probability then decreases as $I$ increases.
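The simulation for $\mathcal{D}_2$ can be sketched compactly, because Eq. (35) reduces the comparison to the sign of $\rho_{11} - \sigma_2$. The code below is an illustrative sketch, not the authors' implementation; it uses a reduced replicate count, and the helper names are hypothetical.

```python
import random

def uniform_simplex(I, rng):
    # Dirichlet(1, ..., 1) draw: normalize I independent Exponential(1) variates.
    x = [rng.expovariate(1.0) for _ in range(I)]
    total = sum(x)
    return [v / total for v in x]

def d2_within_exceeds_between(p, q):
    # Eq. (35): E[D2^w(p)] > E[D2^b(p, q)] if and only if rho_11 - sigma_2 > 0.
    rho_11 = sum(a * b for a, b in zip(p, q))
    sigma_2 = sum(a * a for a in p)
    return rho_11 - sigma_2 > 0

def estimate_probability(I, reps, seed=1):
    # Fraction of replicate pairs (p, q) satisfying Eq. (35).
    rng = random.Random(seed)
    hits = sum(
        d2_within_exceeds_between(uniform_simplex(I, rng), uniform_simplex(I, rng))
        for _ in range(reps)
    )
    return hits / reps

d2_estimates = {I: estimate_probability(I, reps=20_000) for I in (2, 5, 20)}
print(d2_estimates)
```

For $I = 2$, the estimate can be compared with the value $\frac{1}{4}$ in Eq. (38).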

5.3 Comparison of the $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ inequalities for $\mathcal{D}_1$ and $\mathcal{D}_2$

The inequality $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$, where the mean dissimilarity between individuals from the same population exceeds that between individuals from different populations, holds under different scenarios for $\mathcal{D}_1$ and $\mathcal{D}_2$. Comparing Eqs. (34) and (38), we see that for the case of $I = 2$, $\mathbb{E}[\mathcal{D}_1^w] > \mathbb{E}[\mathcal{D}_1^b]$ holds over a smaller fraction of the parameter space than the corresponding inequality $\mathbb{E}[\mathcal{D}_2^w] > \mathbb{E}[\mathcal{D}_2^b]$ (Figure 4). Further, if the former inequality holds, then the latter always holds as well.
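The nesting of the two regions for $I = 2$ can be checked on a grid with exact rational arithmetic, as in the sketch below. It assumes the forms $\mathbb{E}[\mathcal{D}_1^w] = 1 - 2\sigma_2 + 2\sigma_3 - \sigma_4$ and $\mathbb{E}[\mathcal{D}_1^b] = 1 - 2\rho_{11} + \rho_{21} + \rho_{12} - \rho_{22}$ for Eqs. (3) and (16), which are not restated in this section, and it uses Eq. (35) for $\mathcal{D}_2$. The grid resolution `n` is an arbitrary illustrative choice.

```python
from fractions import Fraction

def rho(p, q, i, j):
    # Power sum rho_{ij} = sum_k p_k^i q_k^j; sigma_i equals rho(p, p, i, 0).
    return sum(a**i * b**j for a, b in zip(p, q))

def d1_gap(p, q):
    # E[D1^w(p)] - E[D1^b(p, q)] under the assumed forms of Eqs. (3) and (16).
    within = 1 - 2 * rho(p, p, 2, 0) + 2 * rho(p, p, 3, 0) - rho(p, p, 4, 0)
    between = (1 - 2 * rho(p, q, 1, 1) + rho(p, q, 2, 1)
               + rho(p, q, 1, 2) - rho(p, q, 2, 2))
    return within - between

def d2_gap(p, q):
    # E[D2^w(p)] - E[D2^b(p, q)] = rho_11 - sigma_2, as in Eq. (35).
    return rho(p, q, 1, 1) - rho(p, p, 2, 0)

n = 40  # grid points are (i/n, j/n) for i, j = 0, ..., n
d1_count = d2_count = violations = 0
for i in range(n + 1):
    for j in range(n + 1):
        p1, q1 = Fraction(i, n), Fraction(j, n)
        p, q = (p1, 1 - p1), (q1, 1 - q1)
        in_d1 = d1_gap(p, q) > 0
        in_d2 = d2_gap(p, q) > 0
        d1_count += in_d1
        d2_count += in_d2
        violations += in_d1 and not in_d2  # D1 inequality without D2 inequality

print(d1_count, d2_count, violations)
```

A `violations` count of zero on the grid is consistent with the statement that the $\mathcal{D}_1$ inequality implies the $\mathcal{D}_2$ inequality; exact fractions avoid spurious sign flips on the boundary $q_1 = p_1$.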

In Figure 5, we also observe that the probabilities $\mathbb{P}\left(\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]\right)$ are higher for $\mathcal{D}_2$ than for $\mathcal{D}_1$ in simulations with different numbers of alleles. Hence, use of $\mathcal{D}_2$ rather than $\mathcal{D}_1$ produces a greater probability that the within-population genetic dissimilarity exceeds the between-population dissimilarity.

6 The relative magnitudes of $\overline{\mathbb{E}[\mathcal{D}^w]}$ and $\mathbb{E}[\mathcal{D}^b]$

We have seen that both for $\mathcal{D}_1$ and for $\mathcal{D}_2$, it is possible for the expected dissimilarity $\mathbb{E}[\mathcal{D}^w]$ of random pairs of individuals within a population to exceed the expected dissimilarity $\mathbb{E}[\mathcal{D}^b]$ of random pairs between that population and a second population. However, we will see that for a pair of populations, the mean of their two within-population dissimilarities never exceeds their between-population dissimilarity.

For a pair of populations with allele frequency vectors $\mathbf{p}$ and $\mathbf{q}$, let $\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} = \frac{1}{2}\left(\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_1^w(\mathbf{q})]\right)$, and let $\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} = \frac{1}{2}\left(\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_2^w(\mathbf{q})]\right)$.

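6.1 Inequality relationship between $\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]}$ and $\mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$

Theorem 3

$\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} \leq \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$, with equality if and only if $\mathbf{p} = \mathbf{q}$.

Proof

We use Eqs. (3) and (16) to rewrite $\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$, obtaining

$$
\frac{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_1^w(\mathbf{q})]}{2} - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = -\sigma_2 + \sigma_3 - \frac{1}{2}\sigma_4 - \tau_2 + \tau_3 - \frac{1}{2}\tau_4 - \rho_{21} - \rho_{12} + \rho_{22} + 2\rho_{11}.
$$

Rewriting in terms of the vectors $\mathbf{p}$, $\mathbf{q}$, $\tilde{\mathbf{p}}$, and $\tilde{\mathbf{q}}$, we have

$$
\begin{aligned}
\frac{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_1^w(\mathbf{q})]}{2} - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]
&= -(\mathbf{p} - \mathbf{q})(\mathbf{p} - \mathbf{q})^T + (\mathbf{p} - \mathbf{q})(\tilde{\mathbf{p}} - \tilde{\mathbf{q}})^T - \frac{1}{2}(\tilde{\mathbf{p}} - \tilde{\mathbf{q}})(\tilde{\mathbf{p}} - \tilde{\mathbf{q}})^T \\
&= -\frac{1}{2}\lVert \mathbf{p} - \mathbf{q} \rVert^2 - \frac{1}{2}\lVert (\mathbf{p} - \mathbf{q}) - (\tilde{\mathbf{p}} - \tilde{\mathbf{q}}) \rVert^2 \leq 0.
\end{aligned}
$$

Equality is reached in the last step if and only if $\mathbf{p} = \mathbf{q}$. □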
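The passage from the first to the second line of the vector expression is a completion of the square. As a worked step, with the shorthand $\mathbf{u} = \mathbf{p} - \mathbf{q}$ and $\mathbf{v} = \tilde{\mathbf{p}} - \tilde{\mathbf{q}}$ introduced here only for illustration:

```latex
\begin{align*}
-\mathbf{u}\mathbf{u}^T + \mathbf{u}\mathbf{v}^T - \tfrac{1}{2}\mathbf{v}\mathbf{v}^T
  &= -\tfrac{1}{2}\mathbf{u}\mathbf{u}^T
     - \tfrac{1}{2}\left(\mathbf{u}\mathbf{u}^T - 2\mathbf{u}\mathbf{v}^T + \mathbf{v}\mathbf{v}^T\right) \\
  &= -\tfrac{1}{2}\lVert \mathbf{u} \rVert^2 - \tfrac{1}{2}\lVert \mathbf{u} - \mathbf{v} \rVert^2 .
\end{align*}
```

Both squared norms are nonnegative, and the first vanishes only when $\mathbf{u} = \mathbf{0}$, that is, when $\mathbf{p} = \mathbf{q}$.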

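6.2 Inequality relationship between $\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]}$ and $\mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$

Theorem 4

$\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} \leq \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$, with equality if and only if $\mathbf{p} = \mathbf{q}$.

Proof

We rewrite $\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$ using Eqs. (9) and (22):

$$
\frac{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_2^w(\mathbf{q})]}{2} - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = -\frac{\sigma_2}{2} - \frac{\tau_2}{2} + \rho_{11}.
$$

In terms of the vectors $\mathbf{p}$ and $\mathbf{q}$, we have

$$
\frac{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p})] + \mathbb{E}[\mathcal{D}_2^w(\mathbf{q})]}{2} - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] = -\frac{1}{2}\mathbf{p}\mathbf{p}^T - \frac{1}{2}\mathbf{q}\mathbf{q}^T + \mathbf{p}\mathbf{q}^T = -\frac{1}{2}\lVert \mathbf{p} - \mathbf{q} \rVert^2 \leq 0,
$$

with equality if and only if $\mathbf{p} = \mathbf{q}$. □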

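6.3 Comparison of the $\overline{\mathbb{E}[\mathcal{D}^w]} \leq \mathbb{E}[\mathcal{D}^b]$ inequalities for $\mathcal{D}_1$ and $\mathcal{D}_2$

The inequality $\overline{\mathbb{E}[\mathcal{D}^w(\mathbf{p}, \mathbf{q})]} \leq \mathbb{E}[\mathcal{D}^b(\mathbf{p}, \mathbf{q})]$, with equality if and only if $\mathbf{p} = \mathbf{q}$, holds for both $\mathcal{D}_1$ and $\mathcal{D}_2$. Comparing the proofs of Theorems 3 and 4, we see that

$$
\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})] = \overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})] - \frac{1}{2}\lVert (\mathbf{p} - \mathbf{q}) - (\tilde{\mathbf{p}} - \tilde{\mathbf{q}}) \rVert^2. \tag{39}
$$

The extent to which $\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} < \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$ for $\mathbf{p} \neq \mathbf{q}$, or $\overline{\mathbb{E}[\mathcal{D}_1^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_1^b(\mathbf{p}, \mathbf{q})]$, has a greater absolute value than the corresponding extent to which $\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} < \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$ for $\mathbf{p} \neq \mathbf{q}$, or $\overline{\mathbb{E}[\mathcal{D}_2^w(\mathbf{p}, \mathbf{q})]} - \mathbb{E}[\mathcal{D}_2^b(\mathbf{p}, \mathbf{q})]$.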
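Theorems 3 and 4 and Eq. (39) can be spot-checked numerically on random allele frequency vectors, as in the sketch below. It rests on several assumptions that are not restated in this section: that $\tilde{\mathbf{p}}$ denotes the vector of squared frequencies $(p_1^2, \ldots, p_I^2)$, that Eqs. (3) and (16) have the forms $1 - 2\sigma_2 + 2\sigma_3 - \sigma_4$ and $1 - 2\rho_{11} + \rho_{21} + \rho_{12} - \rho_{22}$, and that Eqs. (9) and (22) have the forms $1 - \sigma_2$ and $1 - \rho_{11}$. Function names are hypothetical.

```python
import random

def uniform_simplex(I, rng):
    # Dirichlet(1, ..., 1) draw: normalize I independent Exponential(1) variates.
    x = [rng.expovariate(1.0) for _ in range(I)]
    total = sum(x)
    return [v / total for v in x]

def rho(p, q, i, j):
    # Power sum rho_{ij} = sum_k p_k^i q_k^j; sigma_i equals rho(p, p, i, 0).
    return sum(a**i * b**j for a, b in zip(p, q))

def w1(v):  # assumed Eq. (3)
    return 1 - 2 * rho(v, v, 2, 0) + 2 * rho(v, v, 3, 0) - rho(v, v, 4, 0)

def b1(p, q):  # assumed Eq. (16)
    return 1 - 2 * rho(p, q, 1, 1) + rho(p, q, 2, 1) + rho(p, q, 1, 2) - rho(p, q, 2, 2)

def w2(v):  # assumed Eq. (9)
    return 1 - rho(v, v, 2, 0)

def b2(p, q):  # assumed Eq. (22)
    return 1 - rho(p, q, 1, 1)

def correction(p, q):
    # (1/2) * ||(p - q) - (p~ - q~)||^2, with p~ assumed to hold squared frequencies.
    return 0.5 * sum(((a - b) - (a * a - b * b)) ** 2 for a, b in zip(p, q))

rng = random.Random(2023)
worst_gap1 = worst_gap2 = float("-inf")
worst_identity_error = 0.0
for _ in range(5000):
    I = rng.randint(2, 20)
    p, q = uniform_simplex(I, rng), uniform_simplex(I, rng)
    gap1 = (w1(p) + w1(q)) / 2 - b1(p, q)  # Theorem 3: should be <= 0
    gap2 = (w2(p) + w2(q)) / 2 - b2(p, q)  # Theorem 4: should be <= 0
    worst_gap1 = max(worst_gap1, gap1)
    worst_gap2 = max(worst_gap2, gap2)
    # Eq. (39): gap1 = gap2 - correction
    worst_identity_error = max(worst_identity_error,
                               abs(gap1 - (gap2 - correction(p, q))))

print(worst_gap1, worst_gap2, worst_identity_error)
```

Under the stated assumptions, the largest observed gaps should be nonpositive and the discrepancy in Eq. (39) should be at the level of floating-point rounding.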

7 Data analysis

7.1 Data

Our theoretical analysis predicts features of dissimilarities $\mathcal{D}_1$ and $\mathcal{D}_2$ in within-population and between-population computations. To compare to empirical observations, we examine multiallelic microsatellite data from the Human Genome Diversity Project (HGDP-CEPH panel). We consider the 1048 individuals and 783 microsatellite loci from Rosenberg et al. (2005), employing the H1048 subset of the HGDP-CEPH panel (Rosenberg 2006). We follow previous uses of the HGDP-CEPH panel in considering 53 populations and 7 geographic regions. We focus on 30 populations for which the number of sampled individuals is greater than 15. Across these 30 populations, the total number of individuals considered is 813.

7.2 Theoretical computations

For our theoretical calculations, given a population in the data set and a locus, we compute allele frequencies. We then apply our theoretical formulas to the allele frequency vectors. Note that if a locus is missing genotypes in an individual, then we omit that individual from the calculation of population allele frequencies at the locus, so that we maintain the property that allele frequencies at a locus in a population sum to 1.
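The allele frequency step can be sketched as follows. The input layout (a list of two-allele tuples per population and locus, with `None` marking missing data) is a hypothetical choice for illustration, not the format of the HGDP-CEPH files, and treating a partially missing genotype as fully missing is an assumption.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Allele frequencies at one locus in one population.

    genotypes: list of (allele, allele) tuples; None (or a tuple containing
    None) marks an individual with a missing genotype at this locus.
    """
    counts = Counter()
    for genotype in genotypes:
        if genotype is None or None in genotype:
            continue  # omit individuals with missing data at this locus
        counts.update(genotype)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in counts.items()}

# Hypothetical microsatellite genotypes (allele sizes) for five individuals.
sample = [(120, 124), (124, 124), None, (120, 128), (None, 124)]
freqs = allele_frequencies(sample)
print(freqs)
```

Because missing individuals are dropped before counting, the returned frequencies sum to 1 over the observed alleles.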

7.3 Empirical computations

For empirical calculations, we consider the actual diploid individuals in the HGDP-CEPH data, for within-population computations comparing all pairs of individuals within a population. For between-population computations, we compare all pairs of individuals, one each from two populations. Pairwise dissimilarities between diploid genotypes are obtained according to Table 1. We compute within-population and between-population dissimilarities as the means across relevant pairs, and we compute variances of dissimilarity distributions across pairs of individuals.
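The empirical computation can be sketched as below. Table 1 is not reproduced in this section, so the two dissimilarity functions are assumptions: $\mathcal{D}_1$ is taken as one minus half the largest number of alleles that can be matched between two genotypes, and $\mathcal{D}_2$ as the fraction of the four between-individual allele pairs that differ. These choices are consistent with the value sets $\{0, \frac{1}{2}, 1\}$ and $\{0, \frac{1}{2}, \frac{3}{4}, 1\}$ mentioned in the Discussion. Genotypes and population lists are hypothetical.

```python
from itertools import combinations, product

def d1(g, h):
    # Assumed D1: 1 minus half the largest number of alleles that can be matched.
    direct = (g[0] == h[0]) + (g[1] == h[1])
    crossed = (g[0] == h[1]) + (g[1] == h[0])
    return 1 - max(direct, crossed) / 2

def d2(g, h):
    # Assumed D2: fraction of the four between-individual allele pairs that differ.
    return sum(a != b for a in g for b in h) / 4

def mean_within(genotypes, d):
    # Mean dissimilarity over all unordered pairs of distinct individuals.
    pairs = list(combinations(genotypes, 2))
    return sum(d(g, h) for g, h in pairs) / len(pairs)

def mean_between(genotypes_1, genotypes_2, d):
    # Mean dissimilarity over all pairs with one individual from each population.
    pairs = list(product(genotypes_1, genotypes_2))
    return sum(d(g, h) for g, h in pairs) / len(pairs)

# Hypothetical populations at a biallelic locus.
pop1 = [("A", "A"), ("A", "B"), ("B", "B")]
pop2 = [("A", "A"), ("A", "A")]

within_d1 = mean_within(pop1, d1)
between_d1 = mean_between(pop1, pop2, d1)
print(within_d1, between_d1)
```

In this toy example the within-population mean for `pop1` exceeds its between-population mean with `pop2`, which illustrates the phenomenon studied here on a very small sample.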

For this analysis, we omit individuals with missing data prior to computation of empirical ASD values. In between-population comparisons, all allelic types present in one but not the other population are assigned a frequency of 0 in the population in which they are absent.

We perform the theoretical and empirical calculations for all 783 loci.

7.4 Results of data analysis

Figure 6 compares empirical and theoretical means and variances of within-population dissimilarities across pairs of individuals, considering 100 randomly sampled loci in 30 populations. Figure 6A compares the empirical value of $\mathbb{E}[\mathcal{D}_1^w]$ computed by averaging $\mathcal{D}_1^w$ values for all pairs of sampled individuals with the theoretical value predicted from the allele frequencies and Eq. (3). The theoretical calculation generally predicts the empirical dissimilarity, with most points clustering along the diagonal ($r = 0.962$). In Figure 6B, a similar plot for $\mathbb{E}[\mathcal{D}_2^w]$ using Eq. (9) for the theoretical computation produces closer agreement between the empirical and theoretical values ($r = 0.999$).

Figure 6: Empirical and theoretical mean and variance of within-population allele-sharing dissimilarities. Each panel considers 100 randomly sampled loci (among 783) in 30 populations with sample size greater than 15 ($100 \times 30 = 3000$ data points in each panel). (A) $\mathbb{E}[\mathcal{D}_1^w]$. (B) $\mathbb{E}[\mathcal{D}_2^w]$. (C) $\mathrm{Var}[\mathcal{D}_1^w]$. (D) $\mathrm{Var}[\mathcal{D}_2^w]$. Empirical values rely on dissimilarity calculations according to Table 1 from pairs of diploid individuals, and theoretical values are calculated from allele frequencies according to Eqs. (3), (6), (9) and (12).

Figure 6C and D compare empirical and theoretical variances across pairs of individuals for within-population dissimilarities, using Eqs. (6) and (12) for the theoretical computation. The theoretical variance predicts the empirical variance, but the agreement is not as close as for the mean ($r = 0.676$ for $\mathrm{Var}[\mathcal{D}_1^w]$, $r = 0.732$ for $\mathrm{Var}[\mathcal{D}_2^w]$).

Figure 7 plots analogous comparisons for between-population dissimilarities, considering a subset of loci from Figure 6. In Figure 7A, we see a close relationship between empirical $\mathbb{E}[\mathcal{D}_1^b]$ and theoretical $\mathbb{E}[\mathcal{D}_1^b]$ similar to the relationship observed in Figure 6A ($r = 0.943$). As was seen in Figure 6B, in Figure 7B, we see a stronger relationship between the empirical value of $\mathbb{E}[\mathcal{D}_2^b]$ and the theoretical value ($r = 1.000$).

Figure 7: Empirical and theoretical mean and variance of between-population allele-sharing dissimilarities. Each panel considers 10 randomly sampled loci in pairs among the 30 populations with sample size greater than 15 ($10 \times \binom{30}{2} = 4350$ data points in each panel). The 10 loci are taken from among those used in Figure 6. (A) $\mathbb{E}[\mathcal{D}_1^b]$. (B) $\mathbb{E}[\mathcal{D}_2^b]$. (C) $\mathrm{Var}[\mathcal{D}_1^b]$. (D) $\mathrm{Var}[\mathcal{D}_2^b]$. Empirical values rely on dissimilarity calculations according to Table 1 from pairs of diploid individuals, and theoretical values are calculated from allele frequencies according to Eqs. (16), (19), (22) and (25).

Figure 7C and D consider relationships between empirical and theoretical between-population variances for $\mathcal{D}_1$ and $\mathcal{D}_2$. As was observed in Figure 6C and D, empirical and theoretical variance are correlated ($r = 0.676$ for $\mathrm{Var}[\mathcal{D}_1^b]$, $r = 0.731$ for $\mathrm{Var}[\mathcal{D}_2^b]$), but the agreement for variances is not as close as for the mean.

Figure 8 empirically examines the inequalities in Theorems 3 and 4, which state that when computed from allele frequencies, the mean of the within-population dissimilarities for two populations never exceeds the dissimilarity between them, with equality only if the two allele frequency vectors are identical. It shows all 435 pairs among the 30 populations of Figures 6 and 7 at a single randomly chosen locus.

Figure 8: Empirical and theoretical $\mathbb{E}[\mathcal{D}^b]$ and $\overline{\mathbb{E}[\mathcal{D}^w]}$. Each panel considers a random locus, D1S1677, in 435 pairs of populations with sample size greater than 15. The locus is among those used in Figures 6 and 7. The upper left triangle is the region in which the between-population dissimilarity of two populations exceeds the mean of the within-population dissimilarities of the two populations, $\mathbb{E}[\mathcal{D}^b] > \overline{\mathbb{E}[\mathcal{D}^w]}$, as proven for theoretical dissimilarities (Theorems 3 and 4). The two ends of a horizontal gray line indicate the $\mathbb{E}[\mathcal{D}^w]$ values for two populations whose mean within-population dissimilarity is plotted at the midpoint of the line. (A) Theoretical values of $\mathcal{D}_1$. (B) Theoretical values of $\mathcal{D}_2$. (C) Empirical values of $\mathcal{D}_1$. (D) Empirical values of $\mathcal{D}_2$.

In Figure 8A, we find that the theoretical values of $\mathbb{E}[\mathcal{D}_1^b]$ and $\overline{\mathbb{E}[\mathcal{D}_1^w]}$, computed from allele frequencies alone, follow the predicted inequality, with $\mathbb{E}[\mathcal{D}_1^b] > \overline{\mathbb{E}[\mathcal{D}_1^w]}$. However, the theorem does not necessarily apply to dissimilarities computed from actual diploid individuals, and indeed, some exceptions are observed in which the empirical $\mathbb{E}[\mathcal{D}_1^b]$ is smaller than $\overline{\mathbb{E}[\mathcal{D}_1^w]}$ (Figure 8C). Similar results hold for $\mathbb{E}[\mathcal{D}_2^b]$ and $\overline{\mathbb{E}[\mathcal{D}_2^w]}$ in Figure 8B and D.

Figure 9 tabulates the fraction of loci for which the empirical within-population dissimilarity of a population (denoted Population 1) exceeds the population's empirical between-population dissimilarity with a second population (Population 2), or $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$. The populations are arranged geographically, following a general decrease in within-population genetic diversity with migration distance from Africa, as measured by expected heterozygosity $1 - \sigma_2$ (Prugnolle et al. 2005; Ramachandran et al. 2005). In Figure 9A, for $\mathcal{D}_1$, if Population 1 is a population with relatively low within-population heterozygosity, such as a Native American population, then its within-population dissimilarity rarely exceeds its between-population dissimilarity with a second population (rightmost columns). The fraction of loci for which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ is greatest for intermediate-heterozygosity South Asian populations (central columns). If Population 2 is a high-heterozygosity African population, then for all non-African choices of Population 1, the within-population dissimilarity of Population 1 rarely exceeds the between-population dissimilarity with an African Population 2 (bottom rows). Similar patterns are seen in Figure 9B for $\mathcal{D}_2$, with the additional observation that the within-population dissimilarity of Population 1 often exceeds the between-population dissimilarity when low-heterozygosity Native American populations are placed in the role of Population 2 (top rows).

Figure 9: Fraction of loci for which $\mathbb{E}[\mathcal{D}^b] < \mathbb{E}[\mathcal{D}^w]$. Each panel considers all 783 loci in pairs among the 30 populations with sample size greater than 15. Each cell denotes a pair of populations, with Population 1 considered for the within-population dissimilarity. Geographical regions are separated by bold black lines. (A) $\mathcal{D}_1$. (B) $\mathcal{D}_2$.

8 Discussion

Allele-sharing statistics are often used to quantify genetic dissimilarity within and between populations. Because they typically share a larger number of recent ancestors, individuals from the same population might be predicted to possess a lower genetic dissimilarity than those from different populations. We have mathematically explored the circumstances under which this prediction fails, when the genetic dissimilarity within a population exceeds the genetic dissimilarity between two populations. The analysis characterizes the properties of allele frequency vectors that give rise to this counterintuitive scenario, illustrating its occurrence in human population-genetic data.

When does within-population dissimilarity for a population exceed between-population dissimilarity with a second population? The conditions that permit this inequality in the case of I = 2 alleles are instructive (Theorems 1 and 2 and Figure 4). In this case, two populations have unbalanced allele frequencies, with Population 2 more unbalanced than Population 1, but the two populations are similar in their frequencies. In Population 1, dissimilarity is generated from comparisons of homozygotes for one allele and homozygotes for the other allele. However, because Population 2 has allele frequencies that are more unbalanced than those of Population 1, fewer comparisons of distinct homozygotes occur in the between-population comparison. This phenomenon results in a within-population dissimilarity in Population 1 that exceeds the between-population dissimilarity. Beyond I = 2, such an excess is observed in empirical calculations with I ≥ 2 alleles (Figure 9), as well as in simulations, though with decreasing probability as I increases (Figure 5).

Although a population can possess greater within-population dissimilarity than its between-population dissimilarity to a second population, we find that for arbitrary numbers of alleles I, it is not possible for both populations in a pair to possess greater within-population dissimilarity than the between-population dissimilarity (Theorems 3 and 4). In data, “theoretical” dissimilarities obtained by treating allele frequencies in the data as parametric frequencies of two populations follow this inequality strictly, with greater between-population dissimilarity than at least one of the two within-population dissimilarities (Figure 8A and B). Similarly, the mean of the two within-population dissimilarities is strictly less than the between-population dissimilarity in theoretical calculations (Figure 8A and B); while “empirical” dissimilarities calculated from individual genotypes can violate the inequality, we find that these violations are generally mild (Figure 8C and D).

The results can contribute to understanding unexpected phenomena involving allele-sharing dissimilarities in human populations. We have seen that within-population dissimilarities in Population 1 sometimes exceed between-population dissimilarities, often in comparisons that involve a lower-diversity Population 2 and a higher-diversity Population 1 (Figure 9); in essence, a high-diversity population can possess enough variation that its inter-individual dissimilarity can exceed the dissimilarity between populations. Our theoretical calculations provide a basis for this scenario, and in fact, we saw for I = 2 that it is not unlikely in certain parts of the allele frequency space (Figure 4).

Our theoretical analysis deepens a line of inquiry on mathematical effects on allele-sharing. For each of two dissimilarity functions, we have obtained probability distributions of within- and between-population allele-sharing dissimilarities across pairs of individuals as functions of allele frequencies (Tables 3, 4, 6, 7), focusing on the mean and variance of the dissimilarity statistics (Eqs. (3), (6), (9), (12), (16), (19), (22) and (25)). The expressions for these quantities, and inequalities concerning their relationships (Theorems 1–4), augment previous efforts on the mathematics of allele-sharing dissimilarities in terms of allele frequencies (Chakraborty and Jin 1993; Tal 2013).

The two variants of allele-sharing dissimilarity that we studied, $\mathcal{D}_1$ and $\mathcal{D}_2$, share many features. For $I = 2$ and $I = 3$ alleles, the expected values of $\mathcal{D}_1^w$ and $\mathcal{D}_2^w$ are maximal when all alleles have the same frequency (Figures 1A and 2A, B). Trends in expectations of $\mathcal{D}_1^b$ and $\mathcal{D}_2^b$ at $I = 2$ are also similar (Figure 3A and B), as are the regions in which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ for $I = 2$ (Figure 4), and the simulated probabilities $\mathbb{P}\left(\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]\right)$ for $I \geq 2$ (Figure 5).

However, some consistent differences between the two dissimilarities are also observed. $\mathcal{D}_2 \geq \mathcal{D}_1$ for all genotypes (Table 1), and hence, $\mathbb{E}[\mathcal{D}_2^w] \geq \mathbb{E}[\mathcal{D}_1^w]$ (Figures 1 and 2C and Eq. (15)) and $\mathbb{E}[\mathcal{D}_2^b] \geq \mathbb{E}[\mathcal{D}_1^b]$ (Figure 3C and Eq. (28)). Although both dissimilarities have $\overline{\mathbb{E}[\mathcal{D}^w]} \leq \mathbb{E}[\mathcal{D}^b]$ (Theorems 3 and 4), $\overline{\mathbb{E}[\mathcal{D}_1^w]} - \mathbb{E}[\mathcal{D}_1^b] \leq \overline{\mathbb{E}[\mathcal{D}_2^w]} - \mathbb{E}[\mathcal{D}_2^b]$ (Eq. (39)), so that the extent to which $\overline{\mathbb{E}[\mathcal{D}^w]}$ lies below $\mathbb{E}[\mathcal{D}^b]$ has greater magnitude for $\mathcal{D}_1$.

The within-population variance across pairs of individuals is not uniformly higher for either dissimilarity (Figures 1B and 2F); at $I = 2$, it has different shapes, as $\mathrm{Var}[\mathcal{D}_2^w]$ has two maxima, whereas $\mathrm{Var}[\mathcal{D}_1^w]$ has only one (Figure 1B). $\mathcal{D}_2$ has larger regions in which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ for $I = 2$ (Figure 4) and for $I \geq 2$ (Figure 5). In the empirical analysis, $\mathcal{D}_2$ has a closer match between empirical and theoretical mean values of the dissimilarity (Figures 6B and 7B). Its patterns in the fraction of loci for which $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ align more closely with the heterozygosity values of the populations, with the probability of $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$ larger when Population 1 is a higher-diversity population and Population 2 is a lower-diversity population (Figure 9B). Notably, expressions for $\mathbb{E}[\mathcal{D}_2]$ are closely tied to heterozygosity (Eq. (9)) and its between-population analogue (Eq. (22)), potentially explaining the tighter connection of heterozygosity to its associated observations. Thus, the lesser-used $\mathcal{D}_2$, which, unlike $\mathcal{D}_1$, allows the dissimilarity of an individual and itself to be nonzero (Table 1), does possess a more easily interpreted pattern in the probability that $\mathbb{E}[\mathcal{D}^w] > \mathbb{E}[\mathcal{D}^b]$.

Does our analysis suggest a preference for $\mathcal{D}_1$ over $\mathcal{D}_2$, or vice versa? To summarize, $\mathcal{D}_1$ has been used more frequently than $\mathcal{D}_2$, and it also has the property that the dissimilarity of an individual and itself is zero. The less frequently used $\mathcal{D}_2$ does not have this property, but it produces simpler expressions for its within-population and between-population expectations, with more natural interpretations of those expectations and their consequences. We conclude that although $\mathcal{D}_1$ has a number of desirable properties, $\mathcal{D}_2$ does as well, and it perhaps merits attention commensurate with that given to $\mathcal{D}_1$.

This work has several possible extensions. We have focused on the first and second moments of allele-sharing dissimilarities across pairs of individuals; the full distributions (Tables 3, 4, 6, 7) could also be further investigated. We examined $I = 2$ in the greatest detail, but special cases that fix a maximal value of $I$ could also be considered. We chose the two most frequently used ASD variants, $\mathcal{D}_1$ and $\mathcal{D}_2$, but a variant designed for genotypes obtained by observation of band patterns (Chakraborty and Jin 1993) could also be studied.

We have only considered allele-sharing dissimilarity between population pairs at a single locus, and it will be of interest to investigate dissimilarities that average across many loci. Our theoretical calculations focus on dissimilarities between two random individuals chosen from specified allele-frequency distributions at a locus. Although such distributions have nonzero probability only on the discrete values $\left\{0, \frac{1}{2}, 1\right\}$ for $\mathcal{D}_1$ and $\left\{0, \frac{1}{2}, \frac{3}{4}, 1\right\}$ for $\mathcal{D}_2$, when an allele-sharing dissimilarity is calculated as an average across $L$ loci, the $2L + 1$ values $\left\{0, \frac{1}{2L}, \frac{1}{L}, \frac{3}{2L}, \ldots, \frac{L-1}{L}, \frac{2L-1}{2L}, 1\right\}$ become possible values for $\mathcal{D}_1$ (all multiples of $\frac{1}{2L}$ in [0,1]), and the $4L$ values $\left\{0, \frac{1}{2L}, \frac{3}{4L}, \ldots, \frac{4L-3}{4L}, \frac{2L-1}{2L}, \frac{4L-1}{4L}, 1\right\}$ for $\mathcal{D}_2$ (all multiples of $\frac{1}{4L}$ in [0,1], other than $\frac{1}{4L}$ itself). Thus, the mean allele-sharing dissimilarity of a random pair of individuals across many loci (computed either theoretically or empirically) has many possible numerical values, potentially giving rise to continuous approximations for associated probability distributions.
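The counts of attainable values can be confirmed by enumeration, as in the sketch below. It assumes that every per-locus value is attainable at every locus (for $\mathcal{D}_2$, the value $\frac{3}{4}$ requires at least three alleles at the locus); the function name is illustrative.

```python
from fractions import Fraction

def attainable_means(per_locus_values, L):
    # All values of the mean across L loci, built up one locus at a time.
    sums = {Fraction(0)}
    for _ in range(L):
        sums = {s + v for s in sums for v in per_locus_values}
    return {s / L for s in sums}

D1_VALUES = [Fraction(0), Fraction(1, 2), Fraction(1)]
D2_VALUES = [Fraction(0), Fraction(1, 2), Fraction(3, 4), Fraction(1)]

possible = {}
for L in range(1, 11):
    possible[L] = (attainable_means(D1_VALUES, L), attainable_means(D2_VALUES, L))
    print(L, len(possible[L][0]), len(possible[L][1]))
```

For each $L$, the enumeration yields $2L + 1$ values for $\mathcal{D}_1$ and $4L$ values for $\mathcal{D}_2$, with $\frac{1}{4L}$ the only multiple of $\frac{1}{4L}$ in [0,1] that is missing for $\mathcal{D}_2$.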

We note significant caveats in interpreting our empirical analysis in relation to our theoretical computations. The empirical computations make use of all pairs of individuals drawn from specified samples; each sampled individual appears in many pairs, so that the empirical analysis does not follow the assumption of the theoretical analysis that pairs represent independent draws from allele frequency distributions. A second difference of the empirical and theoretical analyses is that the theoretical analysis assumes that pairs of alleles within an individual are independent draws from the allele-frequency distribution, whereas inbreeding can induce dependence of these alleles empirically. Such deviations from the assumptions of the theoretical analysis in conducting the empirical analysis could be explored in simulations that do and do not permit inbreeding and reuse of pairs of individuals and in empirical samples large enough to avoid such reuses.

Allele-sharing dissimilarities have long been used in population genetics. The mathematical relationships we have obtained assist both in predicting their properties in relation to allele frequencies and in understanding empirical aspects of their values. When counterintuitive phenomena are obtained with such dissimilarities – such as a greater within-population dissimilarity than the between-population dissimilarity – the mathematical results can potentially provide insight into the unexpected observations.


Corresponding author: Xiran Liu, Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA, E-mail:

  1. Research ethics: Not applicable.

  2. Author contributions: Study design: XL, NAR; Mathematical analysis: XL, ZA, TKM, NAR; Data analysis: XL, ZA; Manuscript preparation: XL, ZA, NAR. The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: The authors state no conflict of interest.

  4. Research funding: We acknowledge NIH grant R01 HG005855 and NSF grant BCS-2116322 for support.

  5. Data availability: Not applicable.

References

Bowcock, A.M., Ruiz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J.R., and Cavalli-Sforza, L.L. (1994). High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368: 455–457. https://doi.org/10.1038/368455a0.Search in Google Scholar PubMed

Cavalli-Sforza, L.L. and Edwards, A.W.F. (1967). Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19: 233–257.Search in Google Scholar

Chakraborty, R. and Jin, L. (1993). A unified approach to study hypervariable polymorphisms: statistical considerations of determining relatedness and population distances. In: Pena, S.D.J., Chakraborty, R., Epplen, J.T., and Jeffreys, A.J. (Eds.), DNA fingerprinting: state of the science. Birkhäuser Verlag, Basel, pp. 153–175.10.1007/978-3-0348-8583-6_14Search in Google Scholar PubMed

Edge, M.D., Ramachandran, S., and Rosenberg, N.A. (2022). Celebrating 50 years since Lewontin’s apportionment of human diversity. Phil. Trans. Roy. Soc. Lond. B Biol. Sci. 377: 20200405. https://doi.org/10.1098/rstb.2020.0405.Search in Google Scholar PubMed PubMed Central

Gao, X. and Martin, E.R. (2009). Using allele sharing distance for detecting human population stratification. Hum. Hered. 68: 182–191. https://doi.org/10.1159/000224638.Search in Google Scholar PubMed PubMed Central

Jorde, L.B. (1985). Human genetic distance studies: present status and future prospects. Annu. Rev. Anthropol. 14: 343–373. https://doi.org/10.1146/annurev.an.14.100185.002015.Search in Google Scholar

Lewontin, R.C. (1972). The apportionment of human diversity. Evol. Biol. 6: 381–398. https://doi.org/10.1007/978-1-4684-9063-3_14.Search in Google Scholar

Mountain, J.L. and Cavalli-Sforza, L.L. (1997). Multilocus genotypes, a tree of individuals, and human evolutionary history. Am. J. Hum. Genet. 61: 705–718. https://doi.org/10.1086/515510.

Mountain, J.L. and Ramakrishnan, U. (2005). Impact of human population history on distributions of individual-level genetic distance. Hum. Genom. 2: 4–19. https://doi.org/10.1186/1479-7364-2-1-4.

Nei, M. (1972). Genetic distance between populations. Am. Nat. 106: 283–292. https://doi.org/10.1086/282771.

Nei, M. (1987). Molecular evolutionary genetics. Columbia University Press, New York. https://doi.org/10.7312/nei-92038.

Prugnolle, F., Manica, A., and Balloux, F. (2005). Geography predicts neutral genetic diversity of human populations. Curr. Biol. 15: R159–R160. https://doi.org/10.1016/j.cub.2005.02.038.

Ramachandran, S., Deshpande, O., Roseman, C.C., Rosenberg, N.A., Feldman, M.W., and Cavalli-Sforza, L.L. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc. Natl. Acad. Sci. USA 102: 15942–15947. https://doi.org/10.1073/pnas.0507611102.

Rosenberg, N.A. (2006). Standardized subsets of the HGDP-CEPH human genome diversity cell line panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann. Hum. Genet. 70: 841–847. https://doi.org/10.1111/j.1469-1809.2006.00285.x.

Rosenberg, N.A. (2011). A population-genetic perspective on the similarities and differences among worldwide human populations. Hum. Biol. 83: 659–684. https://doi.org/10.1353/hub.2011.a465110.

Rosenberg, N.A., Mahajan, S., Ramachandran, S., Zhao, C., Pritchard, J.K., and Feldman, M.W. (2005). Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genet. 1: e70. https://doi.org/10.1371/journal.pgen.0010070.

Tal, O. (2013). Two complementary perspectives on inter-individual genetic distance. Biosystems 111: 18–36. https://doi.org/10.1016/j.biosystems.2012.07.005.

Witherspoon, D.J., Wooding, S., Rogers, A.R., Marchani, E.E., Watkins, W.S., Batzer, M.A., and Jorde, L.B. (2007). Genetic similarities within and between human populations. Genetics 176: 351–359. https://doi.org/10.1534/genetics.106.067355.

Received: 2023-01-05
Accepted: 2023-11-10
Published Online: 2023-12-12

© 2023 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 7.5.2024 from https://www.degruyter.com/document/doi/10.1515/sagmb-2023-0004/html