Skip to content
BY 4.0 license Open Access Published by De Gruyter February 19, 2024

A global test of hybrid ancestry from genome-scale data

  • Md Rejuan Haque EMAIL logo and Laura Kubatko

Abstract

Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation using genome-scale DNA sequence data to understand the underlying evolutionary relationship between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events which require an internal node of degree three. Hence, it is crucial to decide whether a tree or network analysis should be performed given a DNA sequence data set, a decision that is based on the presence of hybrid species in the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.

1 Introduction

Advances in modern sequencing technologies have led to the availability of a substantial amount of DNA sequence data from hundreds, if not thousands, of species. These data are often used to understand the evolutionary relationships between species as well as the process of evolution more generally. One way of representing the evolutionary relationships between a collection of species is with a species-level phylogeny, also known as a phylogenetic tree. A species-level phylogeny is an acyclic graph in which all edges are considered to be directed from the root node to the leaves. An internal node at which a lineage splits into two lineages that each subsequently evolve independently represents a speciation event. In a phylogeny, all internal nodes except the root node are connected to three edges, making the degree of all internal nodes three. The root node, which represents the common ancestor of all the species in the data, is connected with two edges leading the root node to have degree two, while the leaves (also known as taxa) have degree one and represent the present-day populations from which the DNA sequence data were collected (see right panel of Figure 1).

Figure 1: 
Model for the species-level relationships among four taxa under the coalescent model with hybridization. Here taxon H is a hybrid of taxa P
1 and P
2, τ
1, τ
2, and τ
3 are the speciation times for the internal nodes from the present. The circles on the leftmost figure represent internal nodes. The red circle (n
2) represents the hybrid node as there are three descendent from that node. The network on the left can be decomposed into two species-level phylogenies; one with H and P
2 as sister taxa with probability γ, and one with H and P
1 as sister taxa with probability 1 − γ.
Figure 1:

Model for the species-level relationships among four taxa under the coalescent model with hybridization. Here taxon H is a hybrid of taxa P 1 and P 2, τ 1, τ 2, and τ 3 are the speciation times for the internal nodes from the present. The circles on the leftmost figure represent internal nodes. The red circle (n 2) represents the hybrid node as there are three descendent from that node. The network on the left can be decomposed into two species-level phylogenies; one with H and P 2 as sister taxa with probability γ, and one with H and P 1 as sister taxa with probability 1 − γ.

One major drawback of using a phylogenetic tree to represent evolutionary history is that it assumes speciation occurs through bifurcation, meaning that an ancestor can have precisely two descendants. However, this might not always be the case since there are other evolutionary mechanisms that generate novel species. One such mechanism is hybridization, which occurs when two distinct populations interbreed and produce a new species that shares genetic material with both parents. However, hybridization does not always result in creation of a new species, in which case the mechanism is known as introgression (Dowling and DeMarais 1993; Good et al. 2003; Grant et al. 2004; Mallet 2005, 2007; Roques et al. 2001; Salzburger et al. 2002; Thórsson et al. 2001; Weigel et al. 2002). Recent studies have shown numerous instances of hybridization in plants and animals, despite the belief that hybridization is rare (Mallet 2007; Mavárez et al. 2006; Rieseberg 1997). If a collection of species is known to include species that have arisen via hybridization, one should infer a species network rather than a phylogenetic tree. Since the internal nodes in a network are allowed to have degree more than three, networks can capture evolutionary processes such as hybridization (see left panel of Figure 1). However, inferring a network from data can be computationally intensive, especially when the number of taxa and/or the length of the DNA sequence alignment are large (Rabier et al. 2021). In addition, network analysis in the absence of hybridization can lead to incorrect conclusions concerning the ancestry of the taxa identified to be of hybrid origin. Thus, an important first step before deciding whether to perform a network or a tree analysis is to gather concrete evidence about the possibility of hybridization among the species on which inference is to be made.

In the literature, methods for detecting hybridization are ubiquitous. However, several of these methods ignore a crucial evolutionary process called incomplete lineage sorting (ILS). In practice, the DNA sequence data come from genes, and each of these genes has its own evolutionary history, which may or may not match the evolutionary history of the species. When two gene copies from the current generation fail to coalesce in the immediately previous generation, incomplete lineage sorting (ILS) results, leading to gene-tree species-tree discordance (Maddison 1997). Coalescent theory links the gene trees to a species tree by providing the probability distribution of all possible rooted gene trees arising within a rooted species tree and thus can be used to model ILS (Pamilo and Nei 1988; Tajima 1983; Takahata and Nei 1985). To distinguish between hybridization and incomplete lineage sorting, Joly et al. (2009) proposed a statistic based on the observation that the genetic distance between two DNA sequences should be smaller in the presence of hybridization than the distance that would be observed if only ILS is acting on the genes. One can simulate this distance under the null hypothesis that ILS alone explains the discordance using coalescent theory. Once the distribution of the distance under the null hypothesis is estimated, one can compare the observed distance to the null distribution to assess the evidence for hybridization (Joly et al. 2009).

Tests based on the specific site pattern frequencies (a particular position at an alignment is called a site, and the combination of possible nucleotides at a particular site for the sampled individuals is called a site pattern) have also been developed to test ancient admixture utilizing Patterson’s D-Statistic (Durand et al. 2011; Green et al. 2010; Patterson et al. 2012). In addition, several likelihood-based methods have been proposed, including Meng and Kubatko (2009), Kubatko (2009), and Yu et al. (2014). In 2019, Kubatko and Chifman proposed a method to detect hybridization based on “phylogenetic invariants” (polynomials in the site pattern probabilities that evaluate to zero on any probability distribution that is consistent with the tree topology and associated model). Their method considers models that can generate data that allows the possibility of hybridization as well as incomplete lineage sorting. Based on a comparison of popular hybridization detection methods, Kong and Kubatko (2021) concluded that Kubatko and Chifman’s (2019) invariant-based method detects hybrid species quickly and accurately, generally outperforming the other methods tested in terms of power.

One common characteristic of several of these available methods is that they test whether a suspected species is a hybrid given knowledge of its two parents. However, to decide whether or not a network analysis is necessary, we need to perform a global test to assess evidence for the presence of one or more hybrid taxa in the entire data set. Kubatko and Chifman (2019) extended their method to deal with phylogenies containing more than four taxa by testing every species as a hybrid and applying a Bonferroni correction to adjust for multiple testing. However, the number of tests one needs to consider can be massive, especially when the number of taxa is large, making the Bonferroni correction less powerful in detecting hybridization events. For instance, from a data set containing m species, one can pick three species and test three hypotheses considering one of the three species as a hybrid and the other two as parents. Hence, one can test a total of 3 × m 3 hypothesis using a data set containing m species. If, for instance, a data set contains eight species, we need to perform 168 tests, and the number of required tests grows significantly as the number of taxa increases. In addition, the individual tests are correlated because the test statistic for each individual test is calculated using sequences from closely related species, making the Bonferroni method conservative. Thus, to perform a global test for detecting hybridization, we need a method that can handle arbitrary correlation structures. One possibility is to use a combination test.

A combination test combines the individual p-values to aggregate multiple small effects rather than adjusting individual p-values to maintain the false rejection rate. Fisher’s combination test is the oldest method of combining individual p-values to test an overall null hypothesis (Fisher 1932). Consider a scenario where we carry out l individual tests. Fisher’s combination test statistic to perform this test is as follows:

(1) χ 2 l 2 = 2 i = 1 l log ( p i ) ,

where p i , i = 1, 2, …, l, is the p-value from the ith individual test. The test statistic above converges to a Chi-squared distribution with 2l degrees of freedom under the assumption that all individual nulls are true and the p i are independent. When the individual p-values are small, the value of the test statistic in Equation (1) tends to be larger, leading to evidence suggesting that not all of the null hypotheses are true. The test has substantial power to reject the global null hypothesis when a large fraction of the individual p-values are small. This will not be the case, however, when only a small fraction of the individual tests are expected to be significant (Arias-Castro et al. 2011; Koziol and Perlman 1978), a situation that we might often encounter in many applications of bioinformatics. For instance, while assessing genome-scale data to detect hybridization events for 20 taxa, one needs to perform a total of 2280 tests, only a few of which are expected to be significant. The application of Fisher’s combination test, in this case, may lead to failure to detect any hybrid species.

Multiple attempts have been made to improve the power of the test in situations like this. The most popular combination tests that attempt to improve power are Tippett’s minimum p-value (Tippett et al. 1931) and higher criticism tests (Donoho and Jin 2004). In Tippett’s minimum p-value approach, the test statistic is determined by the minimum of the individual p-values. That is, for the l individual p-values (p 1, p 2, …, p l ), the test statistic is defined as: S T = min(p 1, p 2, …, p l ). Assuming all the individual p-values are independent and all the individual nulls are true, Tippett’s minimum p-value statistic (S T ) converges to a Beta(1, l) distribution. For l individual tests with α = 0.05, the higher criticism test statistic proposed by Tukey is

H C 0.05 , l = l [ ( Fraction significant at 0.05 ) 0.05 ] / 0.05 × 0.95 .

In Tukey’s interpretation, the null hypothesis should be rejected when HC is greater than 2, assuming that the individual p-values are independent. When the assumption of independence of individual p-values holds, Tippett’s minimum p-value and the higher criticism tests both have high power. However, both these tests, as well as Fisher’s combination test, fail to perform well when there is dependence among the individual tests.

In reality, it is often the case that the individual tests have a dependence structure and the p-values are correlated. For example, in a phylogenetic tree, species are related by their common ancestor. There are methods available based on permutation or numerical simulation to incorporate the correlation structure (Liu and Xie 2019). However, these approaches are computationally involved and time-consuming when performing a large number of individual tests involving massive data sets. Recently, an analytical approximation method has been proposed using the higher criticism test to incorporate the dependence structure (Barnett et al. 2017). However, this approximation is computationally involved and does not perform well when the p-values are extremely small.

To incorporate the arbitrary dependence structure of individual tests and efficiently handle a large number of tests, Liu and Xie (2020) proposed a combination test called the Cauchy combination test (CCT). Following Fisher’s combination test, the test statistic of the Cauchy combination test is calculated using the weighted sum of transformed p-values. Let p i be the individual p-value, for i = 1, 2, …, l. Then, the Cauchy combination test statistic is defined as

(2) T = i = 1 l ω i tan { ( 0.5 p i ) π } ,

where the weights ω i ’s are nonnegative and i = 1 l ω i = 1 . The authors showed that the statistic in Equation (2) converges to a standard Cauchy distribution under the global null hypothesis even when there is an arbitrary correlation structure between the individual p-values. The only assumption made by the authors is that any two of the individual test statistics should have a bivariate normal distribution. That is, considering X m to be the mth test statistic, m = 1, 2, …, l, for any 1 i < j l , ( X i , X j ) T follows a bivariate normal distribution. However, Liu and Xie (2020) argued that the bivariate normality assumption is mild, and the test is robust even when the assumption is not satisfied.

In 2022, Zhongxue Chen showed situations where the Cauchy combination test failed to reject the global null hypothesis even if some of the individual tests were rejected correctly. The author proposed a robust test combining the MinP (Tippett et al. 1931) and the Cauchy combination (Liu and Xie 2020) test that performs well under a wide range of situations. Consider combining l individual tests with p-values p 1, p 2, …, p l . Let the p-value for the global null hypothesis using the Cauchy combination test be p c , and the p-value using the MinP test be p m with p m = l × min(p 1, p 2, …, p l ). Using the resulting p-values from the Cauchy combination and MinP test, Chen (2022) proposed a test statistic called MinP-CCT-MinP (MCM) with which the p-value for the global test is calculated as follows:

(3) p MCM = 2 × min { p c , p m , 0.5 } .

Chen (2022) showed that the proposed test is robust and performs well in situations where the Cauchy combination test might fail while maintaining the correct type I error rate. In addition, the author proved that, unlike the Cauchy combination test, the MCM test need not assume that the joint distribution of pairs of individual tests is bivariate normal.

Here, we consider the test statistic proposed by Kubatko and Chifman (2019) to test whether a particular species is a hybrid of two specified parents using both the Cauchy combination and MCM test frameworks proposed by Liu and Xie (2020) and Chen (2022), respectively, to suggest a global test statistic for detecting the presence of hybridization within a set of species. We perform an extensive simulation study to investigate the performance of the proposed test. We also apply our method to empirical data sets and compare results with existing studies of those data sets. We focus our work on the test of Kubatko and Chifman (2019), rather than on other available methods e.g., Patterson’s D-Statistics (Durand et al. 2011; Green et al. 2010; Patterson et al. 2012), because previous work (Kong and Kubatko 2021) has shown that the method of Kubatko and Chifman (2019) is more powerful than methods based on D-Statistics. Recalling Equations (2) and (3), it is clear that if we compare the power of a global test using CCT or MCM for two different individual tests, the individual test that is more powerful will always result in a more powerful global test, and it is thus unnecessary to further consider tests based on D-statistics here. Nonetheless, we note that the method could equally well be applied to the case where D-Statistics are used for the individual tests. Finally, we note that a key feature of our approach is that dependence in the individual tests is allowed; i.e., there is no assumption that the p-values from the individual tests are uniformly distributed when the null hypotheses are true. Indeed, an empirical data set for which no hybrids are believed to exist displays a non-uniform distribution of p-values.

2 Methods

We begin this section by providing the mathematical details of both the individual test of hybridization on quartets of species and the global test of hybridization for the entire set of species. Readers who are most interested in the empirical aspects of the proposed method of hybrid detection can skip the Methods and move straight to the Results.

As discussed in the previous section, ILS is a crucial evolutionary mechanism that can lead to gene-tree species-tree discordance and thus should be incorporated into phylogenetic applications. Hence, we will consider a data generation mechanism that allows both hybridization and ILS, following the model proposed by Meng and Kubatko (2009). As already discussed, a bifurcating species-level phylogeny is inadequate to represent a hybridization event. Instead, a species network that connects two edges of a phylogenetic tree using a horizontal line is commonly used to depict a hybrid node. Consider the phylogenetic network in the leftmost panel of Figure 1. Here H is a hybrid species that inherits a portion (γ) of its genetic material from P 2 and the rest (1 − γ) from parent P 1; τ i is the speciation time for node i measured from the present.

We assume that the aligned DNA sequences arise along the network. Let Y L be the observed nucleotide for a specific site in the DNA sequence for taxon L. Then a site pattern arising from the species network in Figure 1 can be expressed as Y O Y P 1 Y H Y P 2 . For a four-taxon network, the number of possible site patterns is 44 = 256 since Y L ∈ {A, G, C, T}. Let p ijkl = Pr ( Y O = i , Y P 1 = j , Y H = k , Y P 2 = l | ( S γ , τ ) ) be the probability of site pattern ijkl with i, j, k, l ∈ {A, C, G, T}, and let P = (P AAAA |(S γ , τ), …, P TTTT |(S γ , τ)) be the set of all possible site pattern probabilities on network (S γ , τ). If we let N ijkl be the observed number of sites with site pattern ijkl and N be the total number of sites in the aligned DNA sequence data, one can estimate p ijkl as p ̂ ijkl = N ijkl N , and the set of observed site pattern probabilities can be written as P ̂ = ( P ̂ AAAA | ( S γ , τ ) , , P ̂ TTTT | ( S γ , τ ) ) .

As the vector N P ̂ represents the observed count of each of the 256 possible site patterns, it is easy to see that

N p ̂ Multinomial ( N ; P ) .

With ij ∈ {A, G, C, T} consider the following linear relationships:

f 1 = p iijj | ( S γ , τ ) p ijij | ( S γ , τ ) , f 2 = p ijji | ( S γ , τ ) p ijij | ( S γ , τ ) , f 3 = p ijii | ( S γ , τ ) p iiji | ( S γ , τ ) , f 4 = p iiij | ( S γ , τ ) p iiji | ( S γ , τ ) ,

Kubatko and Chifman (2019) showed that f 2 and f 4 evaluate to zero assuming the site patterns arise from tree S 1 (middle panel of Figure 1), while f 1 and f 3 are non-zero, and the opposite is true considering f 1 and f 3 with species tree S 2 (rightmost panel of Figure 1). The authors also showed that f 1 f 2 = γ 1 γ , and using this relationship, they defined a test statistic called the Hils statistic to test H 0: γ = 0:

(4) H = f ̂ 2 f ̂ 1 f ̂ 2 μ f 1 μ f 2 σ ̂ f 2 2 f ̂ 1 f ̂ 2 2 2 σ ̂ f 1 f 2 f ̂ 1 f ̂ 2 + σ ̂ f 1 2

Here, μ f 1 and μ f 2 are the mean of f 1 and f 2, respectively, σ f 1 and σ f 2 are the standard deviation of f 1 and f 2, respectively, and σ f 1 f 2 is the covariance between f 1 and f 2. It has been shown that under H 0, the statistic H follows a standard normal distribution and the null is rejected for large values of the observed statistic (Kubatko and Chifman 2019).

Now, consider testing the following hypothesis given aligned DNA sequence data for m + 1 taxa (m ingroup species and one outgroup):

  • H 0: None of the species are hybrid versus H A : At least one of the species is hybrid

Consider choosing three ingroup taxa and the outgroup making a four-taxon subset from the set of taxa. Using the selected subset, we can consider a quartet network assuming that one of three ingroup taxa is a hybrid and the other two are the parental species. Note that we can build three such networks from one four-taxon subset by designating each ingroup taxon as the hypothesized hybrid species. For instance, if the selected four-taxon subset contains outgroup O and ingroup taxa P 1, P 2, and H, we can build three networks assuming H as a hybrid of parents P 1 and P 2, P 1 as a hybrid of parents H and P 2, and P 2 as a hybrid of parents H and P 1. Using the whole collection of species, one can build 3 × m 3 quartet networks with the outgroup and three taxa (triplet) selected from the m ingroup taxa. For instance, one can build a total of 3 × 5 3 = 30 quartet networks from a collection of species containing one outgroup and five taxa.

Note that for a network with m taxa, 4 m site patterns are possible. To get the site patterns for a quartet network built by selecting a four-taxon subset from the collection of taxa, one can marginalize over the missing species. For instance, consider creating quartet networks from the species network (S γ , τ) in Figure 2a containing six species. Let the kth quartet network ( S γ k ) contains species O as outgroup, h as hybrid, and p 1, p 2 as the parental species. Then the site pattern probability P AAGT S γ k | ( S γ , τ ) can be found by:

P AAGT S γ k | ( S γ , τ ) = i j P AAGTij S γ | ( S γ , τ )

here, i, j ∈ {A, G, C, T} represents the possible site patterns of species a and b from network (S γ , τ) in Figure 2a that were not selected in the kth quartet network ( S γ k ) , respectively. It is clear that after marginalization, the possible number of site patterns for a quartet network is 44 = 256.

Figure 2: 
Five hybridization scenarios were considered in this study: (a) one recent hybridization event within six species, (b) one ancient hybridization event within six species, (c) one recent and one ancient hybridization event within seven species, (d) one recent hybridization event within twenty species, and (e) one ancient hybridization event within twenty species. The outgroup species is denoted by o. For panels a, b, and d, the hybrid species is denoted by h, and parents of the hybrid species are denoted by p
1 and p
2. For panel c, the two hybrid species are denoted by h
1 and h
2; the parents of h
2 are p
1 and p
2, and the parents of h
1 are the ancestor of a and b, and the ancestor of p
1, h
2, and p
2. For panel e, the hybrid species is denoted by h, and its parents are the ancestor of a, b, …, l, and the ancestor of m, n, …, u. γ is the proportion of genetic material shared by the leftmost parent to the hybrid. The time between speciation event i and i + 1 is denoted by τ

i
. Note that only the first sub-figure (panel a) is fully labeled.
Figure 2:

Five hybridization scenarios were considered in this study: (a) one recent hybridization event within six species, (b) one ancient hybridization event within six species, (c) one recent and one ancient hybridization event within seven species, (d) one recent hybridization event within twenty species, and (e) one ancient hybridization event within twenty species. The outgroup species is denoted by o. For panels a, b, and d, the hybrid species is denoted by h, and parents of the hybrid species are denoted by p 1 and p 2. For panel c, the two hybrid species are denoted by h 1 and h 2; the parents of h 2 are p 1 and p 2, and the parents of h 1 are the ancestor of a and b, and the ancestor of p 1, h 2, and p 2. For panel e, the hybrid species is denoted by h, and its parents are the ancestor of a, b, …, l, and the ancestor of m, n, …, u. γ is the proportion of genetic material shared by the leftmost parent to the hybrid. The time between speciation event i and i + 1 is denoted by τ i . Note that only the first sub-figure (panel a) is fully labeled.

Once the set of four taxa for the kth (k = 1, 2, …, l = 3 × m 3 ) quartet network is selected, one can decompose the network ( S γ k ) into two species trees (S k1 and S k2) following Figure 1. Also, following the idea of Kubatko and Chifman (2019), one can define f 1 S γ k , f 2 S γ k , f 3 S γ k , and f 4 S γ k using the quartet network S γ k and show that f 1 S γ k and f 3 S γ k are zero when evaluated on the site pattern probabilities arising from one of the species trees (e.g., S k1) and not zero when evaluated on the site pattern probabilities arising from the other species tree (e.g., S k2). On the other hand, for f 2 S γ k and f 4 S γ k , the opposite is true. However, none of the linear relationships are zero when evaluated on the site pattern probabilities from the hybrid species network S γ k with γ k ∈ (0, 1). One can obtain a function of γ k using these linear relationships as follows:

f 1 S γ k f 2 S γ k = γ k 1 γ k .

Now, one can easily estimate these functions by substituting the observed site pattern probabilities and then use the properties of the multinomial distribution and easily estimate the means μ f 1 S k γ , μ f 2 S k γ and variances σ f 1 S k γ 2 , σ f 2 S k γ 2 of these estimated functions. It is also possible to estimate the covariance σ f 1 S k γ , f 2 S k γ between these functions. Once the means, variances, and covariances are computed, one can use the H statistic in Equation (4) to test H 0:γ k = 0 for the kth quartet network ( S γ k ) :

(5) H k = f ̂ 2 S γ k f ̂ 1 S γ k f ̂ 2 S γ k μ f 1 S γ k μ f 2 S γ k σ ̂ f 2 S γ k 2 f ̂ 1 S γ k f ̂ 2 S γ k 2 2 σ ̂ f 1 S γ k f 2 S γ k f ̂ 1 S γ k f ̂ 2 S γ k + σ ̂ f 1 S γ k 2

Under the null hypothesis that γ k = 0, the term μ f 1 S γ k μ f 2 S γ k is 0. Also, note that under the null hypothesis, H k follows the standard normal distribution. Thus, the p-value of this test can be computed easily as follows:

(6) p k = 1 Φ ( h k ) ,

where h k is the observed value of the test statistic and Φ(h k ) = Pr(Z < h k ) with Z ∼ normal(0, 1). Note that one can repeat this process and compute the p-values for all possible tests. For instance, for a species network with five ingroup and one outgroup species, one can compute a total of l = 3 × 5 3 = 30 p-values. These tests will not be independent as species in a tree are related through their common ancestors. Let p 1, p 2, …, p l be the corresponding p-values of the l tests. To test the global null hypothesis: H 0: None of the species are hybrid, we propose the following statistic:

(7) G = k = 1 l ω i tan { ( 0.5 p k ) π } = k = 1 l ω i tan { ( Φ ( h k ) 0.5 ) π } ;

here, the ω i ’s are non-negative weights such that k = 1 n ω i = 1 . Under the global null hypothesis H 0, the test statistic G follows a standard Cauchy distribution (Liu and Xie 2020). Note that a large value of the test statistic G provides evidence against the global null hypothesis. Thus the c.d.f. of the standard Cauchy distribution can be used to compute the p-value for the global null as follows:

(8) p c = 0.5 ( arctan g ) / π .

Here, g is the realized value of G. The global null hypothesis is rejected if the p-value is smaller than α given that the chosen significance level is α. Note that in applying the results of the Cauchy combination test (Liu and Xie 2020), we have assumed that for any 1 ≤ i < jl, ( H i , H j ) T follows a bivariate normal distribution. The bivariate normality assumption is reasonable as the site pattern probabilities are well approximated by a normal distribution when the number of sites is large. Also, the summation of any two test statistics (e.g., H 1 + H 3) is just the sum of two approximate normal random variables, and thus the sum is also normal. Hence, it is safe to assume bivariate normality here.

Finally, to apply the MCM proposed by Chen (2022), we do the following:

  1. Step 1: Using the individual p-values (p 1, p 2, …, p l ), calculate the p-value (p m ) for the global test using the MinP test with p m = l × min{p 1, p 2, …, p l }.

  2. Step 2: Calculate the p-value (p mcm ) for the global test of no hybridization event using the MCM test with p mcm = 2 × min{p c , p m , 0.5}, where p c is the p-value for the global test from the Cauchy combination test.

2.1 Simulation study

To investigate the performance of the proposed global test for detecting the presence of hybridization within a collection of species, we designed an extensive simulation study with five different species networks covering several evolutionary scenarios (Figure 2). We varied the number of taxa, the number of hybridization events, the time of hybridization, the branch lengths (τ i ), the hybridization parameter (γ), the length of the simulated sequences, and the effective population size (θ) to generate DNA sequence data. To observe how the proposed method behaves when the data consist of a small number of taxa, we generated data using two networks with six taxa (Figure 2a and b).

In one of the six-taxon networks (Figure 2a), there is a recent hybridization event close to the leaf nodes, whereas the hybridization event occurs close to the root node in the other six-taxon network (Figure 2b) (often called ancient hybridization). Similarly, we have considered two twenty-taxon networks, one with recent hybridization (Figure 2d) and one with ancient hybridization (Figure 2e). Finally, we have considered a seven-taxon network with one recent and one ancient hybridization event (Figure 2c). For all five hybridization scenarios, the value of each τ i is set to be 0.5 or 1.0. The speciation time between the outgroup and the ingroup (τ r ) is set at 5.0 to ensure that the genetic material between the two groups is sufficiently different. For each scenario, the effective population size θ has been fixed to 0.001, 0.005, 0.01, or 0.03. Eleven different values of γ are chosen (γ ∈ {0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50}) for each setup. Combining all these parameter values, we have 88 scenarios in total for each of the networks presented in Figure 2. We note that the scenarios we have chosen were selected to be representative of the kinds of situations commonly encountered in practice. Our simulations encompass a range of settings (small vs. large trees, recent vs. ancient hybridization, etc.), indicating that our method should perform well in a range of real-world scenarios that extend beyond those explicitly considered here.

We simulated a total of 1000 gene trees from each of the 88 different scenarios described above using ms (Hudson 2002). Note that ms only accepts input in a tree format. Consequently, generating gene trees directly from a species network is impossible using ms. Species networks with hybridization, however, can be viewed as a combination of species trees since a network with l hybrid species can be decomposed into 2 l species trees by making the hybrid a sister taxon to one of its parents (Figure 1). Once the desired network is decomposed into species trees, one can simulate gene trees from each of the species trees accordingly to achieve the desired γ. For instance, to achieve a γ = 0.5 from Figure 1, one can generate 500 gene trees using species tree S 1 (middle panel of Figure 1) and 500 gene trees using species tree S 2 (rightmost panel of Figure 1). For a tree with two hybridization scenarios, one can use a total of four trees and can generate 250 gene trees from each of those four species trees to get 0.5 as the value of one of the γ (e.g., γ 1) while fixing the other γ (e.g., γ 2) at a fixed value 0.5. An example ms command to generate 500 gene trees and save it to a file named tre.1 from the species tree S 1 (middle panel of Figure 1) is:

ms4500 − t5 − TI41111 − ej0.543 − ej1.032 − ej512

|tail − n + 4|grep − v > tre.1.

To simulate DNA sequence data from the network, we used the gene trees generated using ms as input to seq-gen (Rambaut and Grass 1997). The HKY model was used for DNA substitution with the ratio of transitions and transversions equal to 1 (κ = 1), meaning our model assigns equal rates to transitions and transversions. Also, we set frequencies of nucleotide A, C, G, and T at 0.300414, 0.191363, 0.196748, and 0.311475, respectively, following the work of Yu et al. (2014). We considered sequence lengths of 50, 100, 150, and 200, which led to a total sequence length of 50 × 103, 100 × 103, 150 × 103, and 200 × 103, respectively, when the sequences were concatenated using goalign (https://github.com/fredericlemoine/goalign). Finally, the aligned DNA sequence data were converted to Phylip format, and a map file mapping individuals to the corresponding population was created. These two files were used to calculate the individual H statistic and the corresponding p-value.

The resulting data for each of the simulation setups, as well as a Python package to carry out the proposed test, can be found at https://github.com/rhaque62/pyghdet.

2.2 Empirical data

We applied the proposed method to two empirical datasets. The rattlesnake dataset (Gerard et al. 2011; Gibbs et al. 2011) contains six subspecies of rattlesnakes and two outgroup species and thought not to contain hybrid species. To examine the distribution of p-values in an empirical setting in which all null hypotheses are thought to be true, Figure 9 in the Appendix shows a histogram of observed p-values for this dataset. The yeast data set (Rokas et al. 2003) contains seven species of yeast and one outgroup and has been found to contain one or more hybrid species (Yu et al. 2011). The results are described in the following sections.

3 Results

3.1 Simulation study

To evaluate the performance of the proposed statistic using the Cauchy Combination test, we plot the power curves for each of the scenarios (see Figures 58 in the Appendix for results under all of the simulation settings). As a representative example, panel (a) of Figure 3 shows power plots of the proposed test for the twenty-taxon tree with one ancient hybridization event (Figure 2e) with an α = 0.05. When γ = 0, these plots show that the power of the test is less than α = 0.05 across all the scenarios; i.e., under the null hypothesis of no hybridization, the power of the test is less than the selected level of significance. This result shows that the proposed test has the correct type-I error rate. Also, it is noticeable from the results that regardless of other parameters, as the length of the sequence increased, the power of the test also increased (setups with the length of sequence equal to 200k are the most powerful, and the power gradually decreases as the length decreases to 50k).

Figure 3: 
Results of the power simulations for the twenty-taxon hybrid species network with one ancient hybridization event in (Figure 2e) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), and two different branch lengths (τ = 0.50 and 1.0) have been considered. Here, γ represents the proportion of genetic material contributed by each parent to the hybrid taxon.
Figure 3:

Results of the power simulations for the twenty-taxon hybrid species network with one ancient hybridization event in (Figure 2e) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), and two different branch lengths (τ = 0.50 and 1.0) have been considered. Here, γ represents the proportion of genetic material contributed by each parent to the hybrid taxon.

The effect of branch length is also as expected. Holding everything else constant, tests for data with branch lengths equal to 1.0 are more powerful than those with 0.50. This result is expected because the prevalence of ILS increases in gene trees as branch length decreases, making hybridization harder to detect. We see a similar effect of the parameter θ, which controls the mutation rate (higher values correspond to more mutations). As expected, decreased value of θ results in decreased power to detect hybridization when other parameters are held constant. In addition, the power of the test is also increasing as the value of γ increases, regardless of other parameters. This result is also expected since a bigger value of γ is associated with a greater contribution from each parent, making it easier to detect hybridization. Overall, the test performs well. We can see that in almost all of the cases, the test reaches a high power (0.9–1.0) given a reasonably large sequence length (100 k or more) when the value of γ is 0.25 or greater.

It is also worth noting that the test adheres to all theoretical expectations. The only case when the proposed test does not perform well is when both the branch length and parameter θ are extremely small. As can be seen in the first panel of Figure 3, the test has extremely low power to detect hybridization when the branch lengths are 0.50 and θ = 0.001 even when the sequence length is high and the value of γ is high. However, this is expected because, with branch lengths of 0.50, one should expect a large amount of ILS and an extremely low mutation rate as θ is extremely small. The combination of these two extreme situations makes it extremely hard to detect hybridization.

The performance of the global test using the MCM statistic proposed by Chen (2022) is similar to that of the test using the Cauchy combination test statistic proposed by Liu and Xie (2020). As a representative example, panel (b) of Figure 3 shows power plots of the proposed test for the twenty-taxon tree with one ancient hybridization event (Figure 2e) with an α = 0.05. Figure 3 shows that the global test satisfies the theoretical expectations. We observed that the power of the test increases as the branch length (τ), length of DNA sequences (N), hybridization parameter (γ), and effective population size (θ) increase when the effect of each of these parameters are observed fixing all the other parameters.

We noticed that the power of the global test using the MCM test is slightly higher than the Cauchy combination test. For instance, the power of the test reaches 1.0 for a slightly lower value of γ when compared with the Cauchy combination test. Also, the power of the test using the MCM statistic is slightly higher in some cases when the length of the DNA sequence is small (e.g., N = 50k). However, the type-I error rate under the MCM test framework was greater than the chosen level of significance (α) in a few cases (7 out of 128 cases) (Figure 4), while the global test using the Cauchy combination framework maintained the correct type-I error rate in all the scenarios.

Figure 4: 
Type-I error rate under the Cauchy combination test (panel a) and the MCM test (panel b) for all the scenarios (T20_deep: twenty species with one ancient hybridization, T20_recent: twenty species with one recent hybridization, T6_deep: six species with one ancient hybridization, and T6_recent: six species with one recent hybridization). The red dots are situations where the type-I error rates were greater than the chosen level of significance (α = 0.05). For each setup, there are 32 simulated type-I error rates considering lengths of the DNA sequence alignment (50k, 100k, 150k, and, 200k), branch lengths (τ = 0.50 and τ = 1.0), and effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03). Each of these box plots consists of these 32 type-I error rates.
Figure 4:

Type-I error rate under the Cauchy combination test (panel a) and the MCM test (panel b) for all the scenarios (T20_deep: twenty species with one ancient hybridization, T20_recent: twenty species with one recent hybridization, T6_deep: six species with one ancient hybridization, and T6_recent: six species with one recent hybridization). The red dots are situations where the type-I error rates were greater than the chosen level of significance (α = 0.05). For each setup, there are 32 simulated type-I error rates considering lengths of the DNA sequence alignment (50k, 100k, 150k, and, 200k), branch lengths (τ = 0.50 and τ = 1.0), and effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03). Each of these box plots consists of these 32 type-I error rates.

3.2 Empirical data: Sistrurus Rattlesnakes

The dataset consists of six subspecies from two distinct rattlesnake species and two outgroup species, leading to a phylogeny with eight tips. Eighteen nuclear loci and one mitochondrial locus have been sequenced for a total of N = 8466 sites in the DNA sequence alignment. Our interest is to test the following null hypothesis:

H 0 : None of the rattlesnake species are hybrid

vs.

H A : At least one of the rattlesnake species is hybrid.

To perform this test at α = 0.05, we first created all possible subsets ( 2 × 6 3 = 40 ) of the data with three ingroup species and one outgroup species. Once we have all the subsets, using the H statistic in Equation (4), we performed three hypothesis tests by considering one species as the hybrid and the other two as parental species using each of these subsets, totaling 3 × 40 = 120 individual tests. Finally, we used the resulting p-values from these individual tests in Equation (8) to compute the p-value of the global null hypothesis using the Cauchy combination test. We also calculated the global p-value for the MCM test using the p-values from the individual tests and the p-value from the Cauchy combination test in Equation (3). The p-values of both the Cauchy combination and MCM test were substantially larger than the chosen level of significance (p c > 0.99 and p mcm > 0.99). Hence, we failed to reject the null hypothesis and concluded that there were no hybrid species within the six subspecies of rattlesnake. This finding is consistent with other analyses using this dataset (Gerard et al. 2011; Gibbs et al. 2011).

3.3 Empirical data: yeast

The yeast data set of Rokas et al. (2003) contains seven Saccharomyces species, S. cerevisiae (Scer), S. paradoxus (Spar), S. mikatae (Smik), S. kudriavzevii (Skud), S. bayanus (Sbay), S. castellii (Scas), S. kluyveri (Sklu), and the outgroup fungus Candida albicans (Calb). The sequences come from 106 genes, totaling 127,026 sites in the DNA sequence alignment. Using the dataset, we have 3 × 7 3 = 105 quartet trees, each with two parents and one potential hybrid from the seven species and the outgroup. We aim to test the following null hypothesis:

H 0 : None of the yeast species are hybrid

vs.

H A : At least one of the yeast species is hybrid.

To perform this test, we calculated the H statistic (Equation (4)) for all of the 105 possible quartet trees, along with p-values for all the individual tests. Finally, using the p-values from all the individual tests, we computed the p-value (p c ) of the Cauchy combination test for the global null hypothesis using Equation (8). We also computed the global p-value (p mcm ) of the MCM test using p-values from all the individual tests and the p-value from the Cauchy combination test in Equation (3). The p-value for the global test was found to be less than 0.001 for both the Cauchy combination and the MCM test. Hence, we have enough evidence to reject the null hypothesis and conclude that the yeast dataset contains at least one hybrid species. A network analysis of the yeast dataset by Yu et al. (2011) found several hybrid species among the seven species. Hence, our result is consistent with Yu et al. (2011).

4 Discussion

We have proposed two methods for assessing whether a data set contains one or more hybrid species that take incomplete lineage sorting into account. Based on DNA sequence data on a set of taxa, we propose our test as the first step before inferring a phylogenetic tree or network. One of the key benefits of the proposed method is its speed. The computation time of the proposed test is almost negligible since we used observed site pattern frequencies to calculate the test statistic. The speed makes the test feasible for genome-scale data sets with a large number of taxa. On one hand, individual tests are based on observed site patterns derived from a multinomial distribution, which allows a normal distribution to be derived from the asymptotic distribution of the site patterns for the test statistic. On the other hand, the null distribution of the global test was standard Cauchy, leading to a straightforward method for calculating the p-value.

Our simulation studies show that the methods are powerful to detect any presence of hybridization events even when the hybridization event is close to the root. Results show that deep or ancient hybridization events are harder to detect compared to recent hybridization events. However, if the branch length and length of the DNA sequence are large enough, the proposed method can still detect the presence of hybridization with high power. The proposed methods are extremely efficient: for six taxon networks with 250,000 sites, the whole process can be completed within 25 s, while for a data set with 20 taxa and one million sites, the whole process can be completed within 10 min. This shows that the proposed methods are extremely scalable for large data sets with a large number of sites. It is also possible to use parallel computing for the individual tests and then combine the results at the end if more time efficiency is needed.

The methods are based on the Hils statistic proposed by Kubatko and Chifman (2019) in combination with two combination tests (the Cauchy combination test and the MCM test). We notice that the type I error rate under the MCM test is sometimes slightly larger than the chosen level of significance, and the Cauchy combination test is more conservative. Thus, there is a possibility of improvement by making the type I error rate smaller for the MCM test while maintaining the same power. There are also opportunities to try to reduce the number of individual tests needed to perform the global test. As we are only interested to know whether one or more hybrid taxa are present in a data set, it is reasonable to perform one test per quartet tree to determine if there are any hybrid species in that set and finally combine the results.

We note that the proposed framework can be considered as an example of solving problems where it is difficult to perform a global test but easy to perform individual tests that have an unknown dependence structure. The framework can be followed in any field of application where a global test is desired, but only p-values from individual tests are available. Example applications can include meta-analysis, where p-values from different studies are combined to test a global hypothesis, finding rare genes, and testing more general phylogenetic hypotheses. Also, as we have mentioned in the Introduction, any individual test (e.g., Patterson’s D-Statistics) can be used instead of the test proposed by Kubatko and Chifman (2019) to test the global null of the absence of hybrid species in a dataset. However, because of the dependent nature of a phylogenetic tree, these individual tests will always have a non-negligible dependence structure. In this paper, we solve the problem of using a single hypothesis test for the entire collection of species, removing the need to test subsets separately and aggregate the results in an ad-hoc manner.

5 Conclusions

In many areas of biological research, it is central to understand the relationships between species. There are two primary ways to represent such relationships: a species-level phylogeny and a species network. Therefore, depending on the data, it is necessary to decide which analysis method to use. In doing so, we face several challenges, such as tests hypothesis in large genomic data sets in a computational efficient manner, modeling the data at the gene level as well as species level, and developing methods to account for potential evolutionary mechanisms (e.g., incomplete lineage sorting, gene duplication, hybridization) that generate incongruence between individual gene histories. The multispecies coalescent model is commonly used to model incomplete lineage sorting and has also been used to develop a framework for testing for hybridization. We use existing individual tests for hybridization together with combination tests to develop a novel approach for testing for the presence of hybridization within a given collection of species. We demonstrate that the proposed method is powerful in detecting hybridization events in several possible situations via simulation. We also demonstrate the method is capable of detecting hybridization events in empirical data. It is also notable that our method is extremely efficient and thus can handle genome-scale data with a large number of taxa. Thus, the proposed method provides researchers with an efficient way of detecting hybridization events in a data set and formally deciding whether a tree or a network analysis is more appropriate for a given data set.


Corresponding author: Md Rejuan Haque, Division of Biostatistics, College of Public Health, and Department of Statistics, The Ohio State University, Columbus, OH 43210, USA, E-mail:

  1. Research ethics: Not applicable.

  2. Author contributions: The authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Competing interests: The authors state no conflict of interest.

  4. Research funding: None declared.

  5. Data availability: The raw data can be obtained on request from the corresponding author.

Appendix
Figure 5: 
Results of the power simulations for the six-taxon hybrid species network with one recent hybridization event in (Figure 2a) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.
Figure 5:

Results of the power simulations for the six-taxon hybrid species network with one recent hybridization event in (Figure 2a) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.

Figure 6: 
Results of the power simulations for the six-taxon hybrid species network with one ancient hybridization event in (Figure 2b) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.
Figure 6:

Results of the power simulations for the six-taxon hybrid species network with one ancient hybridization event in (Figure 2b) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.

Figure 7: 
Results of the power simulations for the seven-taxon hybrid species network with one ancient and one recent hybridization events in (Figure 2c) under the Cauchy combination test framework (panel a) and MCM test framework (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon h
1 while fixing γ
2 at 0.50 for hybrid taxon h
2.
Figure 7:

Results of the power simulations for the seven-taxon hybrid species network with one ancient and one recent hybridization events in (Figure 2c) under the Cauchy combination test framework (panel a) and MCM test framework (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon h 1 while fixing γ 2 at 0.50 for hybrid taxon h 2.

Figure 8: 
Results of the power simulations for the twenty-taxon hybrid species network with one current hybridization event in (Figure 2d) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.
Figure 8:

Results of the power simulations for the twenty-taxon hybrid species network with one current hybridization event in (Figure 2d) using the Cauchy combination test (panel a) and the MCM test (panel b). Four different DNA sequence lengths (50k, 100k, 150k, and, 200k), four different effective population sizes (θ = 0.001, 0.005, 0.01, and, 0.03), two different branch lengths (τ = 0.50 and 1.0) has been considered. Here, γ represents the amount of genetic material contributed by each parent to the hybrid taxon.

Figure 9: 
Distribution of the p-values obtained from all the individual tests using the Sistrurus Rattlesnakes dataset where no hybrid has been found.
Figure 9:

Distribution of the p-values obtained from all the individual tests using the Sistrurus Rattlesnakes dataset where no hybrid has been found.

References

Arias-Castro, E., Candès, E.J., and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Stat. 39: 2533–2556, https://doi.org/10.1214/11-aos910.Search in Google Scholar

Barnett, I., Mukherjee, R., and Lin, X. (2017). The generalized higher criticism for testing SNP-set effects in genetic association studies. J. Am. Stat. Assoc. 112: 64–76. https://doi.org/10.1080/01621459.2016.1192039.Search in Google Scholar PubMed PubMed Central

Chen, Z. (2022). Robust tests for combining p-values under arbitrary dependency structures. Sci. Rep. 12: 1–8. https://doi.org/10.1038/s41598-022-07094-7.Search in Google Scholar PubMed PubMed Central

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 32: 962–994. https://doi.org/10.1214/009053604000000265.Search in Google Scholar

Dowling, T.E. and DeMarais, B.D. (1993). Evolutionary significance of introgressive hybridization in cyprinid fishes. Nature 362: 444–446. https://doi.org/10.1038/362444a0.Search in Google Scholar

Durand, E.Y., Patterson, N., Reich, D., and Slatkin, M. (2011). Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28: 2239–2252. https://doi.org/10.1093/molbev/msr048.Search in Google Scholar PubMed PubMed Central

Fisher, R. (1932). Statistical methods for research workers, 4th ed. Oliver & Boyd, London.Search in Google Scholar

Gerard, D., Gibbs, H.L., and Kubatko, L. (2011). Estimating hybridization in the presence of coalescence using phylogenetic intraspecific sampling. BMC Evol. Biol. 11: 1–12. https://doi.org/10.1186/1471-2148-11-291.Search in Google Scholar PubMed PubMed Central

Gibbs, H.L., Murphy, M., and Chiucchi, J.E. (2011). Genetic identity of endangered Massasauga rattlesnakes (Sistrurus sp.) in Missouri. Conserv. Genet. 12: 433–439. https://doi.org/10.1007/s10592-010-0151-3.Search in Google Scholar

Good, J.M., Demboski, J.R., Nagorsen, D.W., and Sullivan, J. (2003). Phylogeography and introgressive hybridization: chipmunks (genus Tamias) in the northern Rocky Mountains. Evolution 57: 1900–1916. https://doi.org/10.1554/02-352.Search in Google Scholar

Grant, P.R., Grant, B.R., Markert, J.A., Keller, L.F., and Petren, K. (2004). Convergent evolution of Darwin’s finches caused by introgressive hybridization and selection. Evolution 58: 1588–1599. https://doi.org/10.1111/j.0014-3820.2004.tb01738.x.Search in Google Scholar PubMed

Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U., Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H.-Y., et al.. (2010). A draft sequence of the Neandertal genome. Science 328: 710–722. https://doi.org/10.1126/science.1188021.Search in Google Scholar PubMed PubMed Central

Hudson, R.R. (2002). Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18: 337–338. https://doi.org/10.1093/bioinformatics/18.2.337.Search in Google Scholar PubMed

Joly, S., McLenachan, P.A., and Lockhart, P.J. (2009). A statistical approach for distinguishing hybridization and incomplete lineage sorting. Am. Nat. 174: E54–E70. https://doi.org/10.1086/600082.Search in Google Scholar PubMed

Kong, S. and Kubatko, L.S. (2021). Comparative performance of popular methods for hybrid detection using genomic data. Syst. Biol. 70: 891–907. https://doi.org/10.1093/sysbio/syaa092.Search in Google Scholar PubMed

Koziol, J.A. and Perlman, M.D. (1978). Combining independent chi-squared tests. J. Am. Stat. Assoc. 73: 753–763. https://doi.org/10.1080/01621459.1978.10480095.Search in Google Scholar

Kubatko, L.S. (2009). Identifying hybridization events in the presence of coalescence via model selection. Syst. Biol. 58: 478–488. https://doi.org/10.1093/sysbio/syp055.Search in Google Scholar PubMed

Kubatko, L.S. and Chifman, J. (2019). An invariants-based method for efficient identification of hybrid species from large-scale genomic data. BMC Evol. Biol. 19: 1–13. https://doi.org/10.1186/s12862-019-1439-7.Search in Google Scholar PubMed PubMed Central

Liu, Y. and Xie, J. (2019). Accurate and efficient p-value calculation via Gaussian approximation: a novel Monte-Carlo method. J. Am. Stat. Assoc. 114: 384–392. https://doi.org/10.1080/01621459.2017.1407776.Search in Google Scholar PubMed PubMed Central

Liu, Y. and Xie, J. (2020). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J. Am. Stat. Assoc. 115: 393–402. https://doi.org/10.1080/01621459.2018.1554485.Search in Google Scholar PubMed PubMed Central

Maddison, W.P. (1997). Gene trees in species trees. Syst. Biol. 46: 523–536. https://doi.org/10.1093/sysbio/46.3.523.Search in Google Scholar

Mallet, J. (2005). Hybridization as an invasion of the genome. Trends Ecol. Evol. 20: 229–237. https://doi.org/10.1016/j.tree.2005.02.010.Search in Google Scholar PubMed

Mallet, J. (2007). Hybrid speciation. Nature 446: 279–283. https://doi.org/10.1038/nature05706.Search in Google Scholar PubMed

Mavárez, J., Salazar, C.A., Bermingham, E., Salcedo, C., Jiggins, C.D., and Linares, M. (2006). Speciation by hybridization in Heliconius butterflies. Nature 441: 868–871. https://doi.org/10.1038/nature04738.Search in Google Scholar PubMed

Meng, C. and Kubatko, L.S. (2009). Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor. Popul. Biol. 75: 35–45. https://doi.org/10.1016/j.tpb.2008.10.004.Search in Google Scholar PubMed

Pamilo, P. and Nei, M. (1988). Relationships between gene trees and species trees. Mol. Biol. Evol. 5: 568–583. https://doi.org/10.1093/oxfordjournals.molbev.a040517.Search in Google Scholar PubMed

Patterson, N., Moorjani, P., Luo, Y., Mallick, S., Rohland, N., Zhan, Y., Genschoreck, T., Webster, T., and Reich, D. (2012). Ancient admixture in human history. Genetics 192: 1065–1093. https://doi.org/10.1534/genetics.112.145037.Search in Google Scholar PubMed PubMed Central

Rabier, C.-E., Berry, V., Stoltz, M., Santos, J.D., Wang, W., Glaszmann, J.-C., Pardi, F., and Scornavacca, C. (2021). On the inference of complex phylogenetic networks by Markov chain Monte-Carlo. PLoS Comput. Biol. 17: e1008380. https://doi.org/10.1371/journal.pcbi.1008380.Search in Google Scholar PubMed PubMed Central

Rambaut, A. and Grass, N.C. (1997). Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics 13: 235–238. https://doi.org/10.1093/bioinformatics/13.3.235.Search in Google Scholar PubMed

Rieseberg, L.H. (1997). Hybrid origins of plant species. Annu. Rev. Ecol. Systemat. 28: 359–389. https://doi.org/10.1146/annurev.ecolsys.28.1.359.Search in Google Scholar

Rokas, A., Williams, B.L., King, N., and Carroll, S.B. (2003). Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425: 798–804. https://doi.org/10.1038/nature02053.Search in Google Scholar PubMed

Roques, S., Sévigny, J.M., and Bernatchez, L. (2001). Evidence for broadscale introgressive hybridization between two redfish (genus Sebastes) in the North-West Atlantic: a rare marine example. Mol. Ecol. 10: 149–165. https://doi.org/10.1046/j.1365-294x.2001.01195.x.Search in Google Scholar PubMed

Salzburger, W., Baric, S., and Sturmbauer, C. (2002). Speciation via introgressive hybridization in East African cichlids? Mol. Ecol. 11: 619–625. https://doi.org/10.1046/j.0962-1083.2001.01438.x.Search in Google Scholar PubMed

Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. https://doi.org/10.1093/genetics/105.2.437.Search in Google Scholar PubMed PubMed Central

Takahata, N. and Nei, M. (1985). Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110: 325–344. https://doi.org/10.1093/genetics/110.2.325.Search in Google Scholar PubMed PubMed Central

Thórsson, Æ.T., Salmela, E., and Anamthawat-Jónsson, K. (2001). Morphological, cytogenetic, and molecular evidence for introgressive hybridization in birch. J. Hered. 92: 404–408. https://doi.org/10.1093/jhered/92.5.404.Search in Google Scholar PubMed

Tippett, L.H.C. (1931). The methods of statistics. Williams and Norgate, London.Search in Google Scholar

Weigel, D.E., Peterson, J.T., and Spruell, P. (2002). A model using phenotypic characteristics to detect introgressive hybridization in wild Westslope Cutthroat Trout and Rainbow Trout. Trans. Am. Fish. Soc. 131: 389–403. https://doi.org/10.1577/1548-8659(2002)131<0389:amupct>2.0.co;2 10.1577/1548-8659(2002)131<0389:AMUPCT>2.0.CO;2Search in Google Scholar

Yu, Y., Dong, J., Liu, K.J., and Nakhleh, L. (2014). Maximum likelihood inference of reticulate evolutionary histories. Proc. Natl. Acad. Sci. U. S. A. 111: 16448–16453. https://doi.org/10.1073/pnas.1407950111.Search in Google Scholar

Yu, Y., Than, C., Degnan, J.H., and Nakhleh, L. (2011). Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst. Biol. 60: 138–149. https://doi.org/10.1093/sysbio/syq084.Search in Google Scholar

Received: 2022-12-07
Accepted: 2024-01-27
Published Online: 2024-02-19

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 7.5.2024 from https://www.degruyter.com/document/doi/10.1515/sagmb-2022-0061/html
Scroll to top button