Disentangling the effects of geographic and ecological isolation on genetic differentiation (original) (raw)
. Author manuscript; available in PMC: 2014 Nov 1.
Published in final edited form as: Evolution. 2013 Jul 24;67(11):10.1111/evo.12193. doi: 10.1111/evo.12193
Abstract
Populations can be genetically isolated by both geographic distance and by differences in their ecology or environment that decrease the rate of successful migration. Empirical studies often seek to investigate the relationship between genetic differentiation and some ecological variable(s) while accounting for geographic distance, but common approaches to this problem (such as the partial Mantel test) have a number of drawbacks. In this article, we present a Bayesian method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. We model the allele frequencies in a set of populations at a set of unlinked loci as spatially correlated Gaussian processes, in which the covariance structure is a decreasing function of both geographic and ecological distance. Parameters of the model are estimated using a Markov chain Monte Carlo algorithm. We call this method Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE), and have implemented it in a user-friendly format in the statistical platform R. We demonstrate its utility with a simulation study and empirical applications to human and teosinte datasets.
Keywords: isolation by distance, isolation by ecology, partial Mantel test, landscape genetics
Introduction
The level of genetic differentiation between populations is determined by the homogenizing action of gene flow balanced against differentiating processes such as local adaptation, different adaptive responses to shared environments, and random genetic drift. Geography often limits dispersal, so that the rate of migration is higher between nearby populations and lower between more distant populations. The combination of local genetic drift and distance-limited migration results in local differences in allele frequencies whose magnitude increases with geographic distance, resulting in a pattern of isolation by distance (Wright, 1943). Extensive theoretical work has described expected patterns of isolation by distance under a variety of models of genetic drift and migration (Charlesworth et al., 2003), in both equilibrium populations in which migration and drift reach a balance, and under non-equilibrium demographic models, such as population expansion or various scenarios of colonization (Slatkin, 1993). A range of theoretical approaches have been applied, with authors variously computing probabilities of identity of gene lineages (e.g. Malécot, 1975; Rousset, 1997) or correlations in allele frequencies (e.g.Slatkin and Maruyama, 1975; Weir and Cockerham, 1984), or working with the structured coalescent (e.g. Hey, 1991; Nordborg and Krone, 2002). Although these approaches differ somewhat in detail, their expectations can all be described by a pattern in which allele frequencies are more similar between nearby populations than between distant ones.
In addition to geographic distance, populations can also be isolated by ecological and environmental differences if processes such as dispersal limitations (Wright, 1943), biased dispersal (e.g.Edelaar and Bolnick, 2012), or selection against migrants due to local adaptation (Wright, 1943; Hendry, 2004) decrease the rate of successful migration. Thus, in an environmentally heterogeneous landscape, genome-wide differentiation may increase between populations as either geographic distance or ecological distance increase. The relevant ecological distance may be distance along a single environmental axis, such as difference in average annual rainfall, or distance along a discrete axis describing some landscape or ecological feature not captured by pairwise geographic distance, such as being on serpentine versus non-serpentine soil, or being on different host plants.
Isolation by distance has been observed in many species (Vekemans and Hardy, 2004; Meirmans, 2012), with a large literature focusing on identifying other ecological and environmental correlates of genomic differentiation. The goals of these empirical studies are generally 1) to determine whether an ecological factor is playing a role in generating the observed pattern of genetic differentiation between populations and, 2) if it is, to determine the strength of that factor relative to that of geographic distance. The vast majority of this work makes use of the partial Mantel test to assess the association between pairwise genetic distance and ecological distance while accounting for geographic distance (Smouse et al., 1986).
A number of valid objections have been raised to the reliability and inter-pretability of the partial Mantel (e.g. Legendre and Fortin, 2010; Guillot and Rousset, 2013). First, because the test statistic of the Mantel test is a matrix correlation, it assumes a linear dependence between the distance variables, and will therefore behave poorly if there is a nonlinear relationship (Legendre and Fortin, 2010). Second, the Mantel and partial Mantel tests can exhibit high false positive rates when the variables measured are spatially autocorrelated (e.g., when an environmental attribute, such as serpentine soil, is patchily distributed on a landscape), since this structure is not accommodated by the permutation procedure used to assess significance (Guillot and Rousset, 2013). Finally, in our view the greatest limitation of the partial Mantel test in its application to landscape genetics may be that it is only able to answer the first question posed above — whether an ecological factor plays a role in generating a pattern of genetic differentiation between populations — rather than the first_and_ the second — the strength of that factor relative to that of geographic distance. By attempting to control for the effect of geographic distance with matrix regressions, the partial Mantel test makes it hard to simultaneously infer the effect sizes of geography and ecology on genetic differentiation, and because the correlation coefficients are inferred for the matrices of post-regression residuals, the inferred effects of both variables are not comparable — they are not in a common currency. We perceive this to be a crucial lacuna in the populations genetics methods toolbox, as studies quantifying the effects of local adaptation (e.g. Rosenblum and Harmon, 2011), host-associated differentiation (e.g. Drès and Mallet, 2002; Gómez-Díaz et al., 2010), or isolation over ecological distance (e.g. Andrew et al., 2012; Mosca et al., 2012) all require rigorous comparisons to the effect of isolation by geographic distance.
In this article, we present a method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. To do this, we borrow tools from geostatistics (Diggle et al., 1998) and model the allele frequencies at a set of unlinked loci as spatial Gaussian processes. We use statistical machinery similar to that employed by the Smooth and Continuous AssignmenTs (SCAT) program designed by (Wasser et al., 2004) and the BayEnv and BayEnv2 programs designed by (Coop et al., 2010) and (Günther and Coop, 2013). Under this model, the allele frequency of a local population deviates away from a global mean allele frequency specific to that locus, and populations covary, to varying extent, in their deviation from this global mean. We model the strength of the covariance between two populations as a decreasing function of both geographic and ecological distance between them, so that populations that are closer in space or more similar in ecology tend to have more similar allele frequencies. We note that this model is not an explicit population genetics model, but a statistical model – we fit the observed spatial pattern of genetic variation, rather than modeling the processes that generated it. Informally, we can think of this model as representing the simplistic scenario of a set of spatially homogeneous populations at migration-drift equilibrium under isolation by distance.
The parameters of this model are estimated in a Bayesian framework using a Markov chain Monte Carlo algorithm (Metropolis et al., 1953; Hastings, 1970). We demonstrate the utility of this method with two previously published datasets. The first is a dataset from several subspecies of Zea mays, known collectively as teosinte (Fang et al., 2012), in which we examine the contribution of difference in elevation to genetic differentiation between populations. The second is a subset of the Human Genome Diversity Panel (HGDP, (Conrad et al., 2006;Li et al., 2008)), for which we quantify the effect size of the Himalaya mountain range on genetic differentiation between human populations. We have coded this method — Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE) — in a user-friendly format in the statistical platform R (R Development Core Team, 2007), and have made the code available for download at_genescape.org_.
Methods
Data
Our data consist of L unlinked biallelic single nucleotide polymorphisms (SNPs) in K populations; a matrix of pairwise geographic distance between the sampled populations (D); and one or more environmental distance matrices (E). The elements of our environmental distance matrix may be binary (e.g., same or opposite side of a hypothesized barrier to gene flow) or continuous (e.g., difference in elevation or average annual rainfall between two sampled populations). The matrices D and _E_can be arbitrary, so long as they are nonnegative definite, a constraint satisfied if they are each matrices of distances with respect to some metric. We summarize the genetic data as a set of allele counts (C) and sample sizes (S). We use_C_ℓ,k to denote the number of observations of one of the two alleles at biallelic locus ℓ in population k out of a total sample size of_S_ℓ,k alleles. The designation of which allele is counted (for convenience, we denote the counted allele as allele ‘1’), is arbitrary, but must be consistent among populations at the same locus.
Likelihood Function
We model the data as follows. The_C_ℓ,k observed ‘1’ alleles in population k at locus ℓ result from randomly sampling a number_S_ℓ,k of alleles from an underlying population in which allele 1 is at frequency_f_ℓ,k.These population frequencies _f_ℓ,k are themselves random variables, independent between loci but correlated between populations in a way that depends on pairwise geographic and ecological distance. A flexible way to model these correlations is to assume that the allele frequen-cies _f_ℓ,_k_are multivariate normal random variables, inverse logit-transformed to lie between 0 and 1. In other words, we assume that_f_ℓ,k is obtained by adding a deviation θℓ,k to the global value µℓ, and transforming:
fℓ,k=f(θℓ,k+μℓ)=11+exp(−(θℓ,k+μℓ)). | (1) |
---|
Under this notation, µℓ is the transformed mean allele frequency at locus ℓ and θℓ,k is the population- and locus-specific deviation from that transformed mean. We can then write the binomial probability of seeing_C_ℓ,k of allele ‘1’ at locus ℓ in population k as
P(Cℓ,k|Sℓ,k,fℓ,k)=(Sℓ,kCℓ,k)fℓ,kCℓ,k(1−fℓ,k)Sℓ,k−Cℓ,k. | (2) |
---|
In doing so, we are assuming that the individuals are outbred, so that the _S_ℓ,k alleles represent independent draws from this population frequency. We will return to relax this assumption later.
To model the covariance of the allele frequencies across populations, we assume that θℓ,k are multivariate normally distributed, with mean zero and a covariance matrix Ω that is a function of the pairwise geographic and ecological distances between the sampled populations. We model the covariance between populations i and_j_ as
Ωi,j=1α0exp(−(αDDi,j+αEEi,j)α2), | (3) |
---|
where D i,j and_E_ i,j are the pairwise geographic and ecological distances between populations i and_j_, respectively, and α_D_and α_E_ are the effect sizes of geographic distance and ecological distance, respectively. The parameter α0 controls the variance of population specific deviate θ (i.e. at D i,j +E i,j = 0), and α2 controls the shape of the decay of the covariance with distance. As alluded to above, as many separate ecological distance variables may be included as desired, each with its own α_Ex_ effect size parameter, but here we restrict discussion to a model with one.
With this model, writing α = (α0,α_D_, α_E_, α2), the likelihood of the SNP counts observed at locus ℓ in all sampled populations can now be expressed as
| P(Cℓ,θℓ|Sℓ,μℓ,α)=P(θℓ|Ω(α))∏k=1KP(Cℓ,k|Sℓ,k,f(θℓ,μℓ)) | (4) | | -------------------------------------------------------- | --- |
where we drop subscripts to indicate a vector (e.g._C_ℓ = (C_ℓ1, …,C_ℓ_K)), and_P(θℓ|Ω) is the multivariate normal density with mean zero and covariance matrix Ω.
The joint likelihood of the SNP counts C and the transformed population allele frequencies θ across all_L_ unlinked loci in the sampled populations is just the product across loci:
| P(C,θ|S,μ,α)=∏ℓ=1ℓP(θℓ|Ω(α))∏k=1KP(Cℓ,k|Sℓ,k,f(θℓ,μℓ)). | (5) | | ----------------------------------------------------------- | --- |
Posterior Probability
We take a Bayesian approach to inference on this problem, and specify priors on each of our parameters. We place exponential priors on α_D_ and α_E_,each with mean 1; and a gamma prior on α_0_,with shape and rate parameters both equal to 1. We took the prior on α2 to be uniform between 0.1 and 2. Finally, we chose a Gaussian prior for each µℓ, with mean 0, variance 1/β, and a gamma distributed hyper-prior on β with shape and rate both equal to 0.001. For a discussion of the rationale for these priors, please see the Appendix
The full expression for the joint posterior density, including all priors, is therefore given by
| P(θ,μ,α0,αD,αE,α2,β|C,S)∝(∏ℓ=1LP(θℓ,k|Ω)P(μℓ|β)∏k=1KP(Cℓ,k|Sℓ,k,fℓ,k))×P(β)P(α0)P(αD)P(αE)P(α2) | (6) | | -------------------------------------------------------------------------------------------------- | --- |
where the various P denote the appropriate marginal densities, and the proportionality is up to the normalization constant given by the right-hand side integrated over all parameters.
Markov chain Monte Carlo
We wish to estimate the posterior distribution of our parameters, particularly α_D_ and α_E_ (or at least, their ratio). As the integral of the posterior density given above cannot be solved analytically, we use Markov chain Monte Carlo (MCMC) to sample from the distribution. We wrote a custom MCMC sampler in the statistical platform R (R Development Core Team, 2007). The details of our MCMC procedure are given in the Appendix.
Model Adequacy
Our model is a simplification of the potentially complex relationships present in the data, and there are likely other correlates of differentiation not included in the model. Therefore, it is important to test the model’s fit to the data, and to highlight features of the data that the model fails to capture. To do this, we use posterior predictive sampling, using the set of pairwise population FST values as a summary statistic (Weir and Hill, 2002), as we are primarily interested in the fit to the differentiation between pairs of populations. In posterior predictive sampling, draws of parameters are taken from the posterior and used to simulate new datasets, summaries of which can be compared to those observed in the original datasets (Gelman et al., 1996).
Our posterior predictive sampling scheme proceeds as follows. For each replicate of the simulations we
- Take a set of values of β and all α parameters from their joint posterior, i.e. our MCMC output.
- Compute a covariance matrix Ω from this set of α and the pairwise geographic and ecological distance matrices from the observed data.
- Use Ω to generate L multivariate normally distributed θ, and use β to generate a set of normally distributed µ. These θ and µ are transformed using equation (1) into allele frequencies for each population-locus combination, and binomially distributed allele counts are sampled using those frequencies and the per-population sample sizes from the observed data.
- Calculate FST between each pair of populations across all loci using the count data. Specifically we use the FST estimator defined by the equation given on the top of page 730 in Weir and Hill (2002).
We then use various visualizations of_FST_(i,j), e.g. plotted against distance between i and j, to compare the patterns in the observed dataset to the patterns in the simulated datasets. This functions as a powerful and informative visual summary of the ability of the model to describe the observed data. Since FST is a good measure of genetic differentiation, users can assess how well the method is able to pick up general trends in the data (e.g., increasing genetic differentiation with ecological or geographic distance) and how well those general trends in the model match the slope of their observed counterparts, and also identify specific pairwise population comparisons that the model is doing a poor job describing. These latter may help reveal other important processes that are generating genetic differentiation between populations, such as unmeasured ecological variables, or heterogeneity in population demography.
Accounting for overdispersion
A consequence of the form of the covariance given in equation (3) is that all populations have the same variance of allele frequencies about the global mean (and this is Ω_ii_ = 1/α0). This will be the case in a homogeneous landscape, but is not expected under many scenarios, such as those characterized by local differences in population size, inbreeding rate, historical bottlenecks, or population substructure. In practice, this leads to overdispersion – particular populations deviating more from global means than others. Indeed, in both empirical datasets examined in this paper, there are clearly populations with much greater deviation in allele frequencies from the global mean than predicted from their geographical and ecological distances.
To account for this, we will explicitly model the within-population correlations in allelic identity due to varying histories. In so doing, we simultaneously keep outlier populations from having an undue influence on our estimates of α_D_ and α_E_, the effect sizes of the distance variables measured, and highlight those populations that the model is describing poorly. Introducing correlations accounts for overdispersion because a population whose allele frequencies differ more from its predicted frequencies across loci has individuals whose allelic identities are more correlated (and the converse is also true). To see this, observe that, for instance, if one completely selfing population and one outbred population each have a given allele at frequency p, then the variance in sampled allele frequency will be twice as high in the selfing population, since the number of effective independent draws from the pool of alleles is half as large.
To introduce within-population correlations we assume that the allele frequencies from which the allele counts_C_ℓ,k are drawn are not fixed at f_ℓ,k, but rather randomly distributed, with mean given by_f_ℓ,k and variance controlled by another parameter. Specifically, given µℓ and θℓ,k, we suppose that the allele frequency at locus ℓ in population k is beta-distributed with parameters Φ_k f_ℓ,k_and Φ_k(1 –_f_ℓ,k), where_f_ℓ,k =f(µℓ, θℓ,k) as before, and Φ_k is a population-specific parameter, estimated separately in each population, that controls the extent of allelic correlations between draws from individuals in population_k_. To see why this introduces allelic correlations, consider the following equivalent description of the distribution of_C_ℓ,k We sample the alleles one at a time; if we have drawn n alleles; then the (n + 1)st allele is either: a new draw with probability Φ_k_/(Φ_k_+ n) (in which case it is of type ‘1’ with probability f_ℓ,k and of type ‘0’ with probability 1 –_f_ℓ,k); otherwise, it is of the same type as a previously sampled allele, randomly chosen from the_n sampled so far. Conceptually, each allele is either a “close relative” of an allele already sampled, or else a “new draw” from the “ancestral population” with allele frequency f_ℓ,k. Smaller values of Φ_k lead to increased allelic correlations, which in turn increase the variance of population allele frequencies.
Conveniently, the random frequency integrates out, so that the likelihood of the count data becomes
P(Cℓ,k|Sℓ,k,fℓ,k=f(θℓ,k,μℓ))=(Sℓ,kCℓ,k)B(Cℓ,k+Φkfℓ,k,Sℓ,k−Cℓ,k+Φk(1−fℓ,k))B(Φkfℓ,k,Φk(1−fℓ,k)), | (7) |
---|
where B(x,y) is the beta function. This is known as the “beta-binomial” model (Williams, 1975), and is used in a population genetics context by Balding and Nichols (1995, 1997); see Balding (2003) for a review.
The parameter Φ_k_ can be related to one of Wright’s F_-statistics (Wright, 1943). As derived in previous work (Balding and Nichols, 1995, 1997), if we define_Fk by Φ_k_ =Fk/(1 –Fk) (0 ≤ Fk< 1), then Fk is analogous to the inbreeding coefficient for population k relative to its set of the spatially predicted population frequencies (Cockerham and Weir, 1986; Balding, 2003), with higher Fk corresponding to higher allelic correlation in population k, as one would expect given increased drift (inbreeding) in that population. However, it is important to note that Fk cannot solely be taken as an estimate of the past strength of drift, since higher_Fk_ would also be expected in populations that simply fit the model less well. We report values of_Fk_ in the output and results, and discuss the interpretation of this parameter further in the discussion.
We have coded this beta-binomial approach as an alternative to the basic model (see Results for a comparison of both approaches on empirical data). To combine estimation of this overdispersion model into our inference framework, we place an inverse exponential prior on Φ_k_(that is, 1/Φ_k_ ~ Exp(5)). This prior and the beta-binomial probability density function are incorporated into the posterior.
Simulation Study
We conducted two simulation studies to evaluate the performance of the method. In the first, we simulated data under the inference model, and in the second, we simulated under a spatially explicit coalescent model.
For the datasets simulated under the model, each simulated dataset consisted of 30 populations each with 10 diploid individuals sequenced at 1000 polymorphic bi-allelic loci. Separately for each dataset, the geographic locations of the populations were sampled uniformly from the unit square, and geographic distances (D i,j) were calculated as the Euclidean distance between them. We also simulated geographically autocorrelated environmental variables, some continuous, some discrete (see Figure 1_a_ and_c_). For both discrete and continuous variables we simulated datasets in which ecological distance had no effect on genetic differentiation between populations; these simulations tested whether our method avoids the false positive issues of the partial Mantel test. We also simulated datasets with an effect of both geographic and ecological distance on genetic distance across a range of relative effect sizes (varying the ratio α_E_/α_D_) to test our power to detect their relative effects. The study thus consisted of four sections, each comprised of 50 datasets: discrete and continuous ecological variables, with or without an effect of ecology.
Figure 1.
a) Populations simulated in the unit square, colored by their value of a continuous ecological variable.
b) Pairwise FST between simulated populations from (a), colored by difference in their values of the continuous ecological variable.
c) Populations simulated in the unit square, colored by their value of a binary ecological variable.
d) Pairwise FST between simulated populations from (c), colored by difference in their values of the binary ecological variable.
For each dataset, we set α = 0.5, and sampled α_D_ and α2 from uniform distributions (U(0.2,4) and U(0.1,2) respectively); the choice of α_E_ varied, depending on the specific scenario (described below). These parameters were chosen to give a range of pairwise population _FST_spanning an order of magnitude between approximately 0.02 and 0.2, and a realistic allele frequency spectrum. The covariance matrix Ω was calculated using these α and the pairwise geographic and ecological distance matrices (normalized by their standard deviations), and Ω was used to generate the multivariate, normally distributed θ. Values of µ were drawn from a normal distribution with variance β = 0.09. Allele frequencies at each locus were calculated for each population from the θ and µ using equation (1), and SNP counts at each locus in each population were drawn from binomial distributions parameterized by that allele frequency with the requirement that all loci be polymorphic. We simulated under the following ecological scenarios.
1. Continuous, Autocorrelated Ecological Variable
For the continuous case, we simulated the values of an ecological variable across populations by sampling from a multivariate normal distribution with mean zero and covariance between population_i_ and population j equal to Cov(E(i),E(j)) = exp(–Di,j/ac), where ac determines the scale of the autocorrelation (following Guillot and Rousset, 2013). For all simulations, we set_ac_ = 0.7, to represent a reasonably distributed ecological variable on a landscape.
2. Binary Ecological Variable
A binary variable was produced by declaring that the latitudinal equator in the unit square was a barrier to dispersal, so that all populations on the same side of the barrier were separated by an ecological distance of zero, and all population pairs that spanned the equator were separated by an ecological distance of 1.
A. Zero Effect Size
For each type of ecological variable, we produced 50 simulated datasets with α_E_ = 0, so that ecological distance had no effect on the covariance of θ, and hence on genetic differentiation between populations. For each of these simulated datasets, we performed a partial Mantel test in R using the package_ecodist_ (Goslee and Urban, 2007) with 1,000,000 permutations.
B. Varying Effect Size
We also produced 50 simulated datasets for each type of ecological variable by simulating ten datasets for each value of α_E_/α_D_from 0.2 to 1.0 in intervals of 0.2 (see Figure 1_b_ and d). (As above, values of α_D_ were drawn from a uniform distribution (U(0.2, 4)), so this determines α_E_).
For the datasets simulated using a spatially explicit coalescent process, allelic count data were simulated on a fixed lattice using the program ms (Hudson, 2002). A total of 49 populations were simulated, evenly spaced in a seven-by-seven grid, of which a subset of 25 populations were sampled to make the final dataset; these 25 sampled populations were arranged in a five-by-five grid, as shown in Figure 2. Each population consisted of 10 chromosomes sampled at 1,000 polymorphic, unlinked, biallelic loci. Migration occurred between neighboring populations (with no diagonal migration) at a rate of 4_Nmi,j_ = 4. In all simulations, a longitudinal potential barrier to gene flow was included just to the east of the central line (see Figure 2). Migration rate between populations that were separated by this barrier was diminished by dividing by some barrier effect size, which varied between simulation sets. For 40 datasets, the barrier effect size was set to 1, so that the barrier had no effect on genetic differentiation across it. The barrier effect size was set to 5, 10, and 15, for 20 datasets each, for a total of 100 datasets simulated under the spatial coalescent. For all datasets, geographic distance was measured as the pairwise Euclidean distance between populations on the lattice, and ecological distance was defined as zero between populations on the same side of the barrier, and 1 between populations on opposite sides.
Figure 2.
Populations simulated using a spatially explicit coalescent model in the unit square. All simulated populations are indicated with black dots, while populations that were sampled for inclusion in each dataset are indicated by large black dots. All pairwise migration is indicated with gray arrows. The barrier to dispersal is given by the red dotted line, across which the standard migration rate was divided by a barrier effect size, which we varied.
All analyses on the simulated datasets were run for 1,000,000 MCMC iterations, which appeared sufficient in most cases for convergence on the stationary distribution. The chain was sampled every 1,000 generations, and all summary statistics from the simulation study were calculated after a burn-in of 20%. The metrics of method performance used on the datasets simulated under the inference model were precision, accuracy, and coverage of the α_E_ : α_D_ ratio. We defined_precision_ as breadth of the 95% credible set of the marginal posterior distribution; accuracy as the absolute value of the difference between the median value of the marginal posterior distributions and the values used to simulate the data in each dataset; and coverage as the proportion of analyses for which the value used to simulate the data fell within the 95% credible set of the marginal posterior distribution for that parameter. For the datasets simulated under the spatial coalescent process, we wished to assess the ability of the method to accurately recover the relative strength of the barrier to gene flow.
For approximately 30% of all analyses, the MCMC runs displayed obvious difficulty with convergence within the first 1,000,000 generations. The signs of potentially poor single-chain MCMC behavior that we looked for included: acceptance rates that are too low or too high (generally 20–70% acceptance rates are thought to be optimal); parameter trace plots that exhibit high autocorrelation times; acceptance rates that have not plateaued by the end of the analysis; and marginal distributions that are multimodal, or not approximately normal (for a more complete discussion on MCMC diagnosis, please see Gilks et al. (1996); for plots of example MCMC output, see Figures S5, S6, and S7). In some cases, this was because the naive scales of the various tuning parameters of the random-walk proposal mechanisms were inappropriate for the particular dataset, and mixing was too slow over the number of generations initially specified (as diagnosed by visualizing the parameter acceptance rates of MCMC generations). This was addressed by re-running analyses on those datasets using different random-walk tuning parameters, or by increasing the number of generations over which the MCMC ran. In the other cases, failure to converge was due to poor performance of the MCMC in regions of parameter space too near the prior boundaries. Specifically, when the chain was randomly started at values of some α parameters too close to zero, it was unable to mix out of that region of parameter space. This problem was addressed by re-running the analyses using different, randomly chosen initial values for the α parameters. In our R package release of the code we provide simple diagnostic tools for the MCMC output, and further guidance for their use.
Empirical Data
To demonstrate the utility of this method, we applied it to two empirical datasets: one consisting of populations of teosinte (Zea mays), the wild progenitor of maize, and one consisting of human populations from the HGDP panel. Both processed datasets are available for download at genescape.org. See Tables S1 and S2 in the Supplementary Materials for names and metadata of populations used.
The teosinte dataset consisted of 63 populations of between 2 and 30 diploid individuals genotyped at 978 biallelic, variant SNP loci (Fang et al., 2012). Each population was associated with a latitude, longitude, and elevation at the point of sampling (see Figure S2 and Table S1). Pairwise geographic great-circle distances and ecological distances were calculated for all pairs of populations, where ecological distance was defined as the difference in elevation between populations. Both pairwise distance variables were normalized by their standard deviations.
The human dataset was the Eurasian subset of that available from the HGDP (Conrad et al., 2006; Li et al., 2008), consisting of 33 populations of between 6 and 45 individuals genotyped at 1000 biallelic, variant SNP loci (see Figure S3 and Table S2). Pairwise geographic great-circle distances and ecological distances were calculated for all pairs of populations, where ecological distance was defined as 0 or 1 if the populations were on the same or opposite side of the Himalaya mountain range, respectively. For the purposes of our analysis the western edge of the Himalaya was defined at 75° East.
For comparison, the method was run on each of the two datasets both with and without the beta-binomial overdispersion model. MCMC marginal traces were examined visually to assess convergence on a stationary distribution. The chain was thinned by sampling every 1000 generations, and the median and 95% credible sets were reported on the marginal distribution after a burn-in of 20%. The MCMC analysis for the teosinte dataset without the overdispersion model was run for 10 million generations; the analysis with the overdispersion model was run for 15 million generations. For the HGDP dataset, the numbers of generations were 25 million and 35 million, for the analyses without and with the overdispersion model, respectively.
Results
Simulation Results
As described above, we conducted two simulation studies. The performance of the method in inference of the parameters of greatest interest is given below.
First we note that, consistent with the results of (Guillot and Rousset, 2013), the spatial autocorrelation in our ecological variable caused the partial Mantel to have a high false positive rate when α_E_ = 0, which suggests that the partial Mantel test is not well calibrated to assess the significance of ecological distance on patterns of genetic differentiation. At a significance level of p = 0.05, the false positive rate for the datasets simulated under the inference model with a binary ecological distance variable was 8%, and for the continuous ecological variable, the false positive error rate was 24%. For the datasets simulated under the spatial coalescent process with a barrier effect size of 1 (meaning that the barrier had no effect on genetic differentiation across it), the false positive error rate was 37.5% (see Figure S4).
The precision and accuracy results for the datasets simulated under the model with a continuous and discrete ecological variable are visualized in Figure 3, panels a and_b_, respectively, across the six simulated values of the ratio α_E_/α_D_. Median precision, accuracy, and coverage are reported in Table 1.
Figure 3.
a) Performance of the method for the 100 datasets simulated with a continuous ecological distance variable.
b) Performance of the method for the 100 datasets simulated with a binary ecological distance variable.
In each, the left panel depicts performance on the 50 datasets for which α_E_ was fixed at 0, and the right panel depicts performance on the 50 datasets for which α_E_ varied.
Table 1.
Simulation Studies 1A and 1B were conducted with a continuous ecological variable and α_E_ = 0 and α_E_ > 0, respectively. Simulation Studies 2A and 2B were conducted with a binary ecological variable and α_E_ = 0 and α_E_ > 0, respectively.Precision is breadth of the 95% credible set of the marginal posterior distribution (smaller values indicate better method performance). Accuracy is the absolute value of the difference between the median value of the marginal posterior distributions and the values used to simulate the data (smaller values indicate better method performance).Coverage is the proportion of analyses for which the value used to simulate the data fell within the 95% credible set of the marginal posterior distribution for that parameter (higher values indicate better method performance). Coverage is not reported for the simulations in which the effect size of the ecological distance variable was fixed to zero (α_E_ = 0), as the parameter value used to generate the data is on the prior bound on α_E_ and coverage was therefore zero.
Sim Study 1A | Sim Study 1B | Sim Study 2A | Sim Study 2B | |
---|---|---|---|---|
Precision | 0.041 | 0.30 | 0.15 | 0.96 |
Accuracy | 0.013 | 0.0066 | 0.031 | 0.033 |
Coverage | NA | 94% | NA | 94% |
The performance of the method on the datasets simulated using the spatial coalescent model is given in Figure 4, which shows the posterior distributions of α_E_: α_D_ ratio from each analyzed dataset over the four barrier effect sizes.
Figure 4.
The marginal distributions on the α_E_/α_D_ratio from the analyses performed on the datasets simulated using a spatially explicit coalescent process. The migration rate between populations separated by the barrier was divided by a barrier effect size, which varied among simulations.
Inset: Pairwise FST colored by whether populations were on the same or opposite sides of a barrier to dispersal, plotted against pairwise geographic distance for example datasets for each of the 4 barrier effect sizes.
a) Barrier effect size of 1 (n=40);
**b)**Barrier effect size of 5 (n=20);
**c)**Barrier effect size of 10 (n=20);
d) Barrier effect size of 15 (n=20).
Empirical Results
Teosinte Results
For the Zea mays SNP dataset analysis, the mean and median of the posterior ratio of the effect size of pairwise difference in elevation to the effect size of pairwise geographic distance (i.e.- the α_E_ : α_D_ ratio) was 0.153, and the 95% credible set was 0.137 to 0.171 (see Figure S10a). The interpretation of this ratio is that one thousand meters of elevation difference between two populations has a similar impact on genetic differentiation as around 150 (137–171) kilometers of lateral distance.
Accounting for overdispersion (using the beta-binomial model) we obtain slightly different results, with a mean and median α_E_ : α_D_ ratio of 0.205, and a 95% credible set from 0.180 to 0.233 (1,000 meters difference in elevation ≈ 205 kilometers lateral distance, see Figure S10_b_). Values of our _F_statistics Fk estimated across populations ranged from 2 × 10−4 to 0.53, and are shown inSupplemental Figure S2.
Posterior predictive sampling indicates incorporating overdispersion with the beta-binomial extension results in a better fit to the data (seeFigure 5_a_ and_b_): the mean Pearson’s product moment correlation between the posterior predictive datasets and the observed data without the beta-binomial extension was 0.64, while the mean correlation with the beta-binomial model was 0.86 (see Figure S1_a_). The ability of the model to predict specific pairwise population FST is shown Figure S8.
Figure 5.
Posterior predictive sampling with 1,000 simulated datasets, using pairwise FST as a summary statistic of the allelic count data for:
a) the Teosinte dataset, using the standard model;
b) the Teosinte dataset, using the overdispersion model;
c) HGDP dataset, standard model.
d) HGDP dataset, overdispersion model.
HGDP results
For the human (HGDP) SNP dataset analysis, the mean posterior α_E_ : α_D_ ratio was 5.13 × 104, the median was 5.00 × 104, and the 95% credible set was 3.09 × 104 to 7.85 × 104 (see Figure S11_a_). However, this result seems to be sensitive to outlier populations, as the beta-binomial extension of this method on the same dataset yields significantly different results, with a mean α_E_ : α_D_ ratio of 1.35 × 104, a median of 1.34 × 104, and a 95% credible set from 1.09 × 104 to 1.65 × 104 (see Figure S10_b_). This latter result is broadly consistent with that of Rosenberg (2011), who found an effect size ratio of 9.52 × 103 in a linear regression analysis that treated pairwise population comparisons as independent observations. The interpretation of our result is that being on the opposite side of the Himalaya mountain range has the impact of between approximately 11 and 16 thousand kilometers of extra pairwise geographic distance on genetic differentiation.
Under our beta-binomial extension values of_Fk_ estimated across populations ranged from 3.2 × 10−4 to 0.06. Population values of_Fk_ are shown on the map in Figure S3.
Posterior predictive sampling again indicates a better fit to the data including overdispersion (see Figure 5_c_ and d): the mean Pearson’s product moment correlation between the posterior predictive datasets and the observed data without the beta-binomial extension was 0.88, while the mean correlation with the beta-binomial model was 0.91 (see Figure S1_b_). The ability of the model to predict specific pairwise population FST is shown inFigure S9.
Discussion
In this paper, we have presented a method that uses raw allelic count data to infer the relative contribution of geographic and ecological distance to genetic distance between sampled populations. The method performs quite well: we have shown that it reliably and accurately estimates correct parameter values using simulations, and produces sensible models that produce a good fit to observed patterns of differentiation in real datasets. We feel that our method has broad utility to the field of landscape genetics and to studies of local adaptation, and holds a number of advantages over existing methods. (although see Wang et al. (2012) for another recent approach.) It allows users to simultaneously quantify effect sizes of geographic distance and ecological distance (rather than assessing the significance of a correlation once the effect of geography has been removed, as in the partial Mantel test). Explicitly modeling the covariance in allele frequencies allows users to accommodate non-independence in the data, and the method’s Bayesian framework naturally accommodates uncertainty and provides a means of evaluating model adequacy. The inclusion of overdispersion allows fit to a set of populations with heterogeneous demographic histories. In addition, the basic model presented here – a parametric model of spatial covariance in allele frequencies – is extremely versatile, allowing for the inclusion of multiple ecological or geographic distance variables, as well as great flexibility in the function used to model the covariance.
Simulation Study
Our method performed well in both simulation studies (see Figure 3, Table 1, and Figure 4), and was able to effectively recognize and indicate when a ecological variable contributes significantly to genetic differentiation. This is in contrast to the partial Mantel, which has a high false positive rate in the presence of spatial autocorrelation of environmental variables (see Figure S4).
For datasets simulated under the inference model, coverage, accuracy, and precision were all satisfactory (see Table 1). The precision of our estimator of α_E_ was generally lower for our discrete ecological variable, likely due to the strong spatial structure of the discrete ecological variable.
For datasets simulated using the spatial coalescent, there were no true values for the α_E_ : α_D_ ratio to compare with those inferred by the method. However, we note that the α_E_ : α_D_ ratios estimated across analyzed simulated datasets tracked the barrier effect sizes used to simulate them, and that when the barrier had no effect on migration, the marginal distributions on the α_E_ : α_D_ ratio estimated were stacked up against the prior bound at zero and had very low median values. The width of the 95% credible set of the marginal posteriors grew with the barrier effect size as a result of the flattening of the posterior probability surface as true parameter value increased. Overall, the method performed well on the datasets simulated under a model different from that used for inference (and presumably closer to reality).
An issue we observed in practice is that at some parameter values, different combinations of α are essentially nonidentifiable — the form of the covariance given in equation (3) sometimes allows equally reasonable fits at different values of α2, or at different combinations of α0, α_D_, and α_E_. (In other cases, all four parameters can be well-estimated.) Even when this is the case, the α_E_ : α_D_ ratio, which is the real parameter of interest, remains constant across the credible region, even as α_E_ and α_D_ change together to compensate for changes in α2 and α0. Such ‘ridges’ in the likelihood surface are readily diagnosed by viewing the trace plots and joint marginals of the α parameters (seeFigures S5 and S6).
Empirical Results
Teosinte
The application of our method to the teosinte SNP dataset indicated that difference in elevation has a potentially substantial contribution to genetic differentiation between teosinte populations. Difference in elevation could be correlated with another, as yet unmeasured ecological variable, so we cannot claim to report a causal link, but these results are certainly suggestive, especially in the light of the research on morphological adaptations in teosinte to high altitude (Eagles and Lothrop, 1994).
The analysis of the teosinte SNP data with the beta-binomial extension of our method shows a much better model fit, and highlights a number of populations with particularly high Fk_values. These populations (highlighted in Figure S2) all belong to the subspecies Zea mays mexicana, which primarily occurs at higher altitudes and is hypothesized to have undergone significant drift due to small effective population sizes or bottlenecks (Fukunaga et al., 2005). In addition, a number of these populations occur in putative hybrid zones between_Zea mays mexicana and Zea mays parviglumis, a separate, co-occuring subspecies (Heerwaarden et al., 2011). Like drift, admixture would have the effect of increasing the variance in observed allele frequencies around the expectation derived from the strict geographic/ecological distance model, and would drive up the inferred_Fk_ parameters for admixed populations.
HGDP
In the Human Genome Diversity Panel data we find a strong effect of separation by the Himalayas on genetic differentiation, confirming previous results (e.g. Rosenberg et al., 2005). To obtain a good fit to the data it is necessary to model overdispersion (with the beta-binomial extension). This lack of model fit of the basic model can be seen in the posterior predictive sampling in Figure 5 c and_d_, which highlights the importance of assessing model adequacy during analysis. Under the beta-binomial extension the α_E_/α_D_ratio estimates an effect of the Himalayas far greater than the distance simply to circumnavigate around the Himalayas. We think this likely reflects the fact that Eurasian populations are away from migration-selection equilibrium, reflecting past large-scale population expansions (Keinan et al., 2007).
With overdispersion included, the model appears to describe the data reasonably well, suggesting substantial heterogeneity beyond that dictated by geographic distance and separation by the Himalayas between the sampled populations. A number of populations stand out in their_Fk_ values, in particular the Kalash, the Lahu, the Mozabites, the Hazara, and the Uygur (highlighted in Figure S3). This is consistent with the known history of these populations and previous work on these samples (Rosenberg et al., 2002), which suggests that these populations are unusual for their geographic position (that is, they depart from expectations of their covariance in allele frequencies with their neighbors). The Hazara and Uygur populations are known to be recently admixed populations between central Asian and East ancestry populations. The Mozabite population has substantial recent admixture from Sub-Saharan African populations (Rosenberg et al., 2002; Rosenberg, 2011). The Kalash, who live in northwest Pakistan, are an isolated population with low heterozygosity, suggesting a historically small effective population size. Finally the Lahu have unusually low heterozygosity compared to the other East Asian populations, suggesting that they too may have had an unusually low effective population size. Thus our beta-binomial model, in addition to improving the fit to the data, is successfully highlighting populations that are outliers from simple patterns of isolation by distance.
Population-specific variance
As noted above, in both empirical datasets analyzed, the beta-binomial extension to the basic model offers substantially better model fit. This could in part reflect ecological variables not included in the analyses, in addition to heterogeneity in demographic processes, both of which could shape genetic variation in these populations by pushing population allele frequencies away from their expectations under our simple isolation by distance and ecology model. Our _Fk_statistic provides a useful way to highlight populations that show the strongest deviations away from our model, and to prevent these deviations from obscuring environmental correlations or causing spurious correlations. Therefore, we recommend that the extended model be used as the default model for analyses.
Limitations
The flexibility of this statistical model is accompanied by computational expense. Depending on the number of loci and populations in a dataset, as well as the number of MCMC generation required to accurately describe the stationary distribution, analyses can take anywhere from hours to days. Speedups could be obtained by parallelization or porting code to C. In addition, as with any method that employs an MCMC algorithm, users should take care to assess MCMC performance to ensure that the chain is mixing well, has been run for a sufficient number of generations, and has converged on a stationary distribution (Gilks et al., 1996). Users are well advised to run multiple independent chains from random initial locations in parameter space, and to compare the output of those analyses to confirm that all are describing the same stationary distributions.
Our model rests on a number of assumptions, principal among which is that population allele frequencies are well-represented by a spatially homogeneous process, such as are obtained under mutation-migration equilibrium. That is, we assume that current patterns of gene flow between populations are solely responsible for observed patterns of genetic differentiation. Some examples of biological situations that may violate the assumptions of our model include: two populations that have higher genetic differentiation than expected based on their pairwise geographic distance because they arrived in nearby locations as part of separate waves of colonization; or two populations that have been recently founded on either side of some landscape element that truly does act as a barrier to gene flow, but that do not exhibit strong genetic differentiation yet, because the system is not in equilibrium. In reality, we expect that very few natural populations will conform perfectly to the assumptions of our model; however, we feel that the method will provide valid approximations of the patterns for many systems, and that it will be a useful tool for teasing apart patterns of genetic variation in populations across heterogeneous landscapes.
Extensions
The flexibility of this method translates well into extendability. Among a number of natural extensions the community might be interested in implementing, we highlight a few here.
One natural extension is to incorporate different definitions of the ecological distance between our populations. Just because two populations have no difference in their ecological variable state does not guarantee that there is not great heterogeneity in the distance between them. For example, a pair of populations separated by the Grand Canyon might have nearly identical elevations, but the cost to migrants between them incurred by elevation may well be significant. One solution to this would be to enter a simple binary barrier variable, or to calculate least-cost paths between populations, and use those distances in lieu of geographic distance. A more elegant solution would be to use “isolation by resistance” distances, obtained by rasterizing landscapes and employing results relating mean passage rates of random walks in a heterogeneous environment to quantities from circuit theory in order to calculate the conductance (ease of migration) between nodes on that landscape (McRae and Beier, 2007). This method has the advantage of integrating over all possible pathways between populations. Currently, users must specify the resistance of landscape elements a priori, but those resistance parameters could be incorporated into our parametric covariance function, and estimated along with the other parameters of our model in the same MCMC. This approach carries great appeal, as it combines the conceptual rigor of accommodating multiple migration paths with the methodological rigor of statistically estimated, rather than user-specified, parameter values.
Another extension is the further relaxation of the assumption of process homogeneity in decay of allelic covariance over geographic and ecological distance. Specifically, the method currently requires that a single unit of pairwise ecological distance translate into the same extent of pairwise genetic differentiation between all population pairs. This assumption is unlikely to be realistic in most empirical examples, especially if populations are locally adapted. For example, individuals from populations adapted to high elevation may be able to migrate more easily over topography than individuals from populations adapted to low elevations. Such heterogeneity could be accommodated by using different covariance functions for different, pre-specified population pairings.
A final extension that could be integrated into this method is a model selection framework, in which models with and without an ecological distance variable, or with different combinations of ecological distance variables, can be rigorously compared. Because our method is implemented in a Bayesian framework, we could select between models by calculating Bayes factors (the ratio of the marginal likelihoods of the data under two competing hypotheses) (Dickey, 1971; Verdinelli and Wasserman, 1995). This approach would seem to offer the best of both worlds: robust parameter inference that accommodates uncertainty in addition to output that could be interpreted as definitive evidence for or against the association of an ecological variable of interest with genetic differentiation between populations.
Conclusion
In closing, we present a tool that can be useful in a wide variety of contexts, allowing a description of the landscape as viewed by the movements of genetic material between populations. We urge users to be cautious in their interpretation of results generated with this model. A correlation between genetic differentiation and an ecological distance variable does not guarantee a causal relationship, especially because unmeasured ecological variables may be highly correlated with those included in an analysis. In addition, evidence of a correlation between genetic differentiation and an ecological variable may not be evidence of local adaptation or selection against migrants, as both neutral and selective forces can give rise to an association between genetic divergence and ecological distance.
Finally, we are making this method available online at_genescape.org_, and we hope that users elaborate on the framework we present here to derive new models that are better able to describe empirical patterns of isolation by distance — both geographic and ecological.
Supplementary Material
01
Acknowledgements
We thank Yaniv Brandvain, Marjorie Weber, Luke Mahler, Will Wetzel, and the Coop lab for their counsel, Jeff Ross-Ibarra and Torsten Günther for their help with empirical datasets, and Jon Wilkins and two anonymous reviewers for their comments on previous drafts. This material is based upon work supported by the National Science Foundation under Grant No. 1262645 (PR and GC), NSF GRFP No. 1148897 (GB), a NIH Ruth L. Kirschstein NRSA fellowship F32GM096686 (PR), and a Sloan Foundation fellowship (GC).
Appendix
Priors
We denote a gamma distribution with given shape and rate parameters as Γ(shape, rate), a normal distribution with given mean and variance parameters as N(mean, variance), an exponential distribution with given rate parameter Exp(rate), and a uniform distribution between given upper and lower boundaries as U(lower, upper). The priors specified on the parameters of this model are: α0 ~ Γ(0.001,0.001); α_D_ ~ Exp(1); α_E_ ~ Exp(1); α2 ~ U(0.1,2); and µℓ ~N(0,1/β), with a hyper-prior β ~ Γ(1,1).
The priors on α_D_ and α_E_ were chosen to reflect the assumption that there is some, and potentially very great, effect of isolation by geography and ecology. The priors on α2, α0, and β were the same as those used by (Wasser et al., 2004), and, in the case of the latter two (on β and α0), were chosen because they were conjugate to the likelihood, so their parameters could therefore be updated by a Gibbs sampling step.
In early implementations of our method, we experimented with uniform priors on α_D_ and α_E_ (U(0,4)), as used by Wasser et al. (2004) (although they did not have a parameter analogous to α_E_). We replaced these uniform priors with exponentials to reflect the fact that we have no prior belief that there should be any upper bound to the effects geographic or ecological distance may have on genetic differentiation. In practice, we found that for all simulated and empirical datasets tested, there was sufficient information in the data for the likelihood function to swamp the effect of the priors — whether uniform or exponential — on α_D_ and α_E_.
However, in all analyses, we encourage users to visualize the marginal distri-butions of each parameter at the end of a run and compare it to its prior. If the marginal distribution looks exactly like the prior, there may be insufficient information in the data to parameterize the model effectively, and the prior may be having an unduly large impact on analysis. If the marginal distribution for a parameter shows that it is “piling up” against its prior’s hard bound (e.g., the marginal distribution on α_E_ has a median of 1e-3, close to its hard bound at 0), that may suggest that the current form of the prior is not describing the natural distribution of the parameter for that particular dataset well (e.g., α_E_ “wants” to be zero, but the prior is constraining it). In both cases (the marginal posterior and the prior have significant overlap; the prior is exhibiting an edge effect), we suggest that the user experiment with different priors and/or model parameterizations to see what effect they are having on inference.
MCMC
Our MCMC scheme proceeds as follows. The chain is initiated at maximum likelihood estimates (MLEs) for θ and µ, and, for α0, α_D_, α_E_, and α2, at values drawn randomly from their priors. The empirical standard deviation of the MLEs of µ is used as the initial value of β.
In each generation one of {µ, β, θ, α0, α_D_, α_E_, α2} is selected at random to be updated.
The priors on β and α0 are conjugate to their marginal posteriors, and each is updated via a Gibbs sampling step. The updated value of β given the current µ is drawn from
β|μ1,⋯,μL~Γ(0.001+L2,0.001+12∑ℓ=1Lμℓ2), | (8) |
---|
and the updated value of α0 conditional on the current set of θ is drawn from
α0|θ1,⋯,θL~Γ(1+Lk2,1+12∑ℓ=1Lθℓ,kχ−1θℓ,kT), | (9) |
---|
where k is the number of populations sampled, L is the number of loci sequenced, and χ = α0Ω = exp(–(α_D_ D i,j + α_E_ E i,j)α2).
The remaining parameters are updated by a Metropolis-Hastings step; here we describe the proposal mechanisms. The proposed updates to θ do not affect each other, and so are accepted or rejected independently. Following Wasser et al. (2004) (derived from (Christensen and Waagepetersen, 2002; Møller et al., 1998)), the proposal is chosen as θ′ℓ = θℓ +R_ℓ_Z, where_R_ is a vector of normally distributed random variables with mean zero and small variance (controlled by the scale of the tuning parameter on θ) and Z is the Cholesky decomposition of Ω (so that ZZT = Ω). Under this proposal mechanism, proposed updates to θℓtend to stay within the region of high posterior probability, so that more updates are accepted and mixing is improved relative to a scheme in which the θ in each population were updated individually.
Updates to α_D_, α_E_, and α2are accomplished via a random-walk sampler (adding a normally distributed random variable with mean zero and small variance to the current value) (Gilks et al., 1996). Updates to elements of µℓ are also accomplished via a random-walk sampler, and again the updates to each locus are accepted or rejected independently.
In the overdispersion model, initial values of Φ_k_ are drawn from the prior for each population. Updates are proposed one population at a time via a random-walk step, and are accepted or rejected independently.
Well-suited values of tuning parameters (variances in the proposal distributions for µ, θ, α_D_, α_E_ and α2) and the number of generations required to accurately describe the joint posterior will vary from dataset to dataset, and so may require tuning.
References
- Andrew RL, Ostevik KL, Ebert DP, Rieseberg LH. Adaptation with gene flow across the landscape in a dune sunflower. Molecular ecology. 2012;21:2078–2091. doi: 10.1111/j.1365-294X.2012.05454.x. [DOI] [PubMed] [Google Scholar]
- Balding DJ. Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology. 2003;63:221–230. doi: 10.1016/s0040-5809(03)00007-8. [DOI] [PubMed] [Google Scholar]
- Balding DJ, Nichols RA. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. doi: 10.1007/BF01441146. [DOI] [PubMed] [Google Scholar]
- Balding DJ, Nichols Ra. Significant genetic correlations among Caucasians at forensic DNA loci. Heredity. 1997;108:583–589. doi: 10.1038/hdy.1997.97. [DOI] [PubMed] [Google Scholar]
- Charlesworth B, Charlesworth D, Barton NH. The effects of genetic and geographic structure on neutral variation. Annual Review of Ecology, Evolution, and Systematics. 2003;34:99–125. [Google Scholar]
- Christensen OF, Waagepetersen R. Bayesian prediction of spatial count data using generalized linear mixed models. Biometrics. 2002;58:280–286. doi: 10.1111/j.0006-341x.2002.00280.x. [DOI] [PubMed] [Google Scholar]
- Cockerham CC, Weir BS. Estimation of inbreeding parameters in stratified populations. Annals of Human Genetics. 1986;50:271–281. doi: 10.1111/j.1469-1809.1986.tb01048.x. [DOI] [PubMed] [Google Scholar]
- Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg Na, Pritchard JK. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics. 2006;38:1251–1260. doi: 10.1038/ng1911. [DOI] [PubMed] [Google Scholar]
- Coop G, Witonsky D, Di Rienzo A, Pritchard JK. Using environmental correlations to identify loci underlying local adaptation. Genetics. 2010;185:1411–1423. doi: 10.1534/genetics.110.114819. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dickey J. The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics. 1971;42:204–223. [Google Scholar]
- Diggle PJ, Tawn JA, Moyeed RA. Model-based geostatistics. Jounal of the Royal Statistical Society. Series C (Applied Statistics) 1998;47:299–350. [Google Scholar]
- Drès M, Mallet J. Host races in plant-feeding insects and their importance in sympatric speciation. Philosophical transactions of the Royal Society of London. Series B, Biological sciences. 2002;357:471–492. doi: 10.1098/rstb.2002.1059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Eagles HA, Lothrop JE. Highland maize from central Mexico-Its origin, characteristics, and use in breeding programs. Crop Science. 1994;34:11–19. [Google Scholar]
- Edelaar P, Bolnick DI. Non-random gene flow: an underappreciated force in evolution and ecology. Trends in Ecology & Evolution. 2012;27:659–665. doi: 10.1016/j.tree.2012.07.009. [DOI] [PubMed] [Google Scholar]
- Fang Z, Pyh¨aj¨arvi T, Weber AL, Dawe RK, Glaubitz JC, Jesus JD, Gonz´alez S, Ross-ibarra C, Doebley J, Morrell PL. Megabase-scale inversion polymorphism in the wild ancestor of maize. Genetics. 2012;191:883–894. doi: 10.1534/genetics.112.138578. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fukunaga K, Hill J, Vigouroux Y, Matsuoka Y, G JS, Liu K, Buckler ES, Doebley J. Genetic diversity and population structure of teosinte. Genetics. 2005;169:2241–2254. doi: 10.1534/genetics.104.031393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gelman A, Meng X-l, Stern H. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica. 1996;6:733–807. [Google Scholar]
- Gilks W, Richardson S, Spiegelhalter D. Interdisciplinary Statistics. Chapman & Hall; 1996. Markov Chain Monte Carlo in Practice. [Google Scholar]
- Gó mez-Díaz E, Doherty FP, Jr, Duneau D, McCoy KD. Cryptic vector divergence masks vector-specific patterns of infection: an example from the marine cycle of Lyme borreliosis. Evolutionary Applications. 2010;3:391–401. doi: 10.1111/j.1752-4571.2010.00127.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goslee SC, Urban DL. The ecodist package for dissimilarity-based analysis of ecological data. Journal of Statistical Software. 2007;22:1–19. [Google Scholar]
- Guillot G, Rousset F. Dismantling the Mantel tests. Methods in Ecology and Evolution. 2013;4:336–344. [Google Scholar]
- Gü nther T, Coop G. Robust identification of local adaptation from allele frequencies. arXiv. 2013;1209:3029v1. doi: 10.1534/genetics.113.152462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hastings W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109. [Google Scholar]
- Heerwaarden JV, Doebley J, Briggs WH, Glaubitz JC, Goodman MM, Gonzalez JdJS, Ross-Ibarra J. Genetic signals of origin, spread, and introgression in a large sample of maize landraces. PNAS. 2011;108:1088–1092. doi: 10.1073/pnas.1013011108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hendry AP. Selection against migrants contributes to the rapid evolution of ecologically dependent reproductive isolation. Evolutionary Ecology Research. 2004;6:1219–1236. [Google Scholar]
- Hey J. A multi-dimensional coalescent process applied to multi-allelic selection models and migration models. Theoretical Population Biology. 1991;39:30–48. doi: 10.1016/0040-5809(91)90039-i. [DOI] [PubMed] [Google Scholar]
- Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337. [DOI] [PubMed] [Google Scholar]
- Keinan A, Mullikin JC, Patterson N, Reich D. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nature genetics. 2007;39:1251–1255. doi: 10.1038/ng2116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Legendre P, Fortin M-J. Comparison of the Mantel test and alternative approaches for detecting complex multivariate relationships in the spatial analysis of genetic data. Molecular Ecology Resources. 2010;10:831–844. doi: 10.1111/j.1755-0998.2010.02866.x. [DOI] [PubMed] [Google Scholar]
- Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–1104. doi: 10.1126/science.1153717. [DOI] [PubMed] [Google Scholar]
- Malé cot G. Heterozygosity and relationship in regularly subdivided populations. Theoretical Population Biology. 1975;8:212–241. doi: 10.1016/0040-5809(75)90033-7. [DOI] [PubMed] [Google Scholar]
- McRae BH, Beier P. Circuit theory predicts gene flow in plant and animal populations. PNAS. 2007;104:19885–19890. doi: 10.1073/pnas.0706568104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meirmans PG. The trouble with isolation by distance. Moleculary Ecology. 2012;21:2839–2846. doi: 10.1111/j.1365-294X.2012.05578.x. [DOI] [PubMed] [Google Scholar]
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of State Calculations. Journal of Chemical Physics. 1953;21:p1087–p1092. [Google Scholar]
- Møller J, Syversveen AR, Waagepetersen RP. Log Gaussian Cox Processes. Scandinavian Journal of Statistics. 1998;25:451–482. [Google Scholar]
- Mosca E, Eckert AJ, Pierro EADI, Rocchini D, Porta NLA. The geographical and environmental determinants of genetic diversity for four alpine conifers of the European Alps. Molecular Ecology. 2012;21:5530–5545. doi: 10.1111/mec.12043. [DOI] [PubMed] [Google Scholar]
- Nordborg M, Krone SM. Separation of time scales and convergence to the coalescent in structured populations. In: Slatkin M, Veuille M, editors. Modern Developments in Theoretical Populations Genetics. Oxford: Oxford University Press; 2002. pp. 130–164. [Google Scholar]
- R Development Core Team. R: A Language and Environment for Statistical Computing. URL: R Foundation for Statistical Computing, Vienna, Austria; 2007. http://www.R-project.org ISBN 3-900051-07-0. [Google Scholar]
- Rosenberg NA. A Population-Genetic Perspective on the Similarities and Differences among Worldwide Human Populations. Human Biology. 2011;83:659–684. doi: 10.3378/027.083.0601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosenberg NA, Mahajan S, Ramachandran S, Zhao C, Pritchard JK, Feldman MW. Clines , clusters , and the effect of study design on the inference of human population structure. PLoS Genetics. 2005;1:660–671. doi: 10.1371/journal.pgen.0010070. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosenberg Na, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky La, Feldman MW. Genetic structure of human populations. Science. 2002;298:2381–2385. doi: 10.1126/science.1078311. [DOI] [PubMed] [Google Scholar]
- Rosenblum EB, Harmon LJ. “Same same but different”: replicated ecological speciation at White Sands. Evolution. 2011;65:946–960. doi: 10.1111/j.1558-5646.2010.01190.x. [DOI] [PubMed] [Google Scholar]
- Rousset F. Genetic differentiation and estimation of gene flow from F-statistics under isolation by distance. Genetics. 1997;145:1219–1228. doi: 10.1093/genetics/145.4.1219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Slatkin AM, Maruyama T. The influence of gene flow on genetic distance. The American Naturalist. 1975;109:597–601. [Google Scholar]
- Slatkin M. Isolation by distance in equilibrium and non-equilibrium populations. Evolution. 1993;47:264–279. doi: 10.1111/j.1558-5646.1993.tb01215.x. [DOI] [PubMed] [Google Scholar]
- Smouse PE, Long JC, Sokal RR. Multiple regression and correlation extensions of the Mantel test of matrix correspondence extensions of the multiple regression and correlation Mantel test of matrix correspondence. Systematic Zoology. 1986;35:627–632. [Google Scholar]
- Vekemans X, Hardy O. New insights from fine-scale spatial genetic structure analyses in plant populations. Molecular Ecology. 2004;13:921–935. doi: 10.1046/j.1365-294x.2004.02076.x. [DOI] [PubMed] [Google Scholar]
- Verdinelli I, Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. Journal of the American Statistical Association. 1995;90:614–618. [Google Scholar]
- Wang IJ, Glor RE, Losos J. Quantifying the roles of ecology and geography in spatial genetic divergence. Ecology Letters. 2012;16:175–182. doi: 10.1111/ele.12025. [DOI] [PubMed] [Google Scholar]
- Wasser SK, Shedlock AM, Comstock K, Ostrander Ea, Mutayoba B, Stephens M. Assigning African elephant DNA to geographic region of origin: Applications to the ivory trade. PNAS. 2004;101:14847–14852. doi: 10.1073/pnas.0403170101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38:1358–1370. doi: 10.1111/j.1558-5646.1984.tb05657.x. [DOI] [PubMed] [Google Scholar]
- Weir BS, Hill WG. Estimating F-statistics. Annual Review of Genetics. 2002:721–750. doi: 10.1146/annurev.genet.36.050802.093940. [DOI] [PubMed] [Google Scholar]
- Williams DA. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics. 1975;31:949–952. 394. [PubMed] [Google Scholar]
- Wright S. Isolation by distance. Genetics. 1943;28:114–138. doi: 10.1093/genetics/28.2.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
01