MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus (original) (raw)

Bioinformatics. 2010 Aug 15; 26(16): 2064–2065.

Gregory Ewing

1Department of Mathematics, University of Vienna, Nordbergstrasse 15, A-1090 Vienna, Austria and 2Max F. Perutz Laboratories, Dr. Bohrgasse 9, A-1030 Vienna, Austria

Joachim Hermisson

1Department of Mathematics, University of Vienna, Nordbergstrasse 15, A-1090 Vienna, Austria and 2Max F. Perutz Laboratories, Dr. Bohrgasse 9, A-1030 Vienna, Austria

1Department of Mathematics, University of Vienna, Nordbergstrasse 15, A-1090 Vienna, Austria and 2Max F. Perutz Laboratories, Dr. Bohrgasse 9, A-1030 Vienna, Austria

* To whom correspondence should be addressed.

Associate Editor: Jeffrey Barrett

Received 2010 Apr 9; Revised 2010 Jun 4; Accepted 2010 Jun 10.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary Materials

[Supplementary Data]

GUID: B6801396-1AE3-4E4E-8625-A13A0AFBFC37

GUID: 25C3B16D-0C64-4432-A94D-70DC088D037F

GUID: CAA96376-D6A1-4B8F-A807-404AB225D7FA

Abstract

Motivation: We have implemented a coalescent simulation program for a structured population with selection at a single diploid locus. The program includes the functionality of the simulator ms to model population structure and demography, but adds a model for deme- and time-dependent selection using forward simulations. The program can be used, e.g. to study hard and soft selective sweeps in structured populations or the genetic footprint of local adaptation. The implementation is designed to be easily extendable and widely deployable. The interface and output format are compatible with ms. Performance is comparable even with selection included.

Availability: The program is freely available from http://www.mabs.at/ewing/msms/ along with manuals and examples. The source is freely available under a GPL type license.

Contact: ta.ca.eivinu@gniwe.yrogerg

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Coalescent simulations are a standard method to generate population samples under various models of evolution. The widely used simulator ms (Hudson, 2002) is a powerful tool for genealogies in general structured populations under neutrality, with high performance and an easy to use interface. However, it does not allow for selection. On the other hand, several programs are available to simulate polymorphism data for a neutral locus linked to a selected site (e.g. Kim and Stephan, 2002; Pennings and Hermisson, 2006b; Spencer and Coop, 2004; Teshima and Innan, 2009), but they do not allow for population structure. Only the specific scenario of selection in a colony population that has split from a large founder population without subsequent migration has been considered (Li and Stephan, 2006; Thornton and Jensen, 2006).

In msms, we implement the functionality of Hudson's ms in a simulator that allows for selection at a single diploid locus with two alleles. For neutral genealogies, the program's usage and assumptions are identical to ms. In particular, msms is compatible with ms in both output format and command-line options. This permits the wide range of tools that work with ms to also work with msms. Complicated models can now have selection added with only small adjustments and by swapping ms with msms. Applications include the power of various tests to detect selective sweeps in structured populations, including adaptation from standing genetic variation and from recurrent mutation (soft sweeps, Hermisson and Pennings, 2005; Pennings and Hermisson, 2006a, b), the genetic footprint of local adaptation and adaptive gene flow after population splits. Extensions beyond previous programs also include the output of joint site-frequency spectra and the possibility of specifying multiple neutral loci. The performance of msms is generally comparable to ms (see the manual for a detailed runtime comparison). Complex population demographies can therefore be studied with selection added, without imposing additional computation time limitations. The name msms refers to the German ‘ “mach” Stichprobe mit Selektion’ (i.e. make sample with selection).

2 METHODS

The coalescent (Hudson, 1983; Kingman, 1982) is a stochastic process to generate genealogies from a population by tracing randomly sampled alleles backwards in time. Population structure, as well as demography and recombination, are readily incorporated into a coalescent framework (for review, Hein et al., 2005; Wakeley, 2008).

To include selection into a simulator based on the coalescent, we extend the approach of (Kaplan et al., 1989) and (Barton et al., 2004) to the case of structured populations. Conditioned on the frequency paths of the beneficial allele in all demes of the metapopulation, ancestral lines of one or several neutral markers linked to the beneficial allele can be followed backward in time. The structured coalescent as described by (Kaplan et al., 1989) is thus structured by both geography and genetic background. It is implemented by a three-step procedure. (i) The first step consists of generating the frequency paths (trajectories) of the selected allele in all demes by multinomial sampling in a Wright–Fisher model for an arbitrary geographical structure. (ii) The second step is the construction of the genealogy at one or several neutral loci using a structured ancestral recombination graph conditioned on the frequency paths. (iii) Finally, neutral mutations are added to the branches of the genealogical tree according to a Poisson process.

Forward simulations: in the current version of the program, the time-forward simulations to generate the trajectory of the selected allele assume selection at a single locus with two alleles A and a. Individuals can be haploid or diploid. For diploids, the fitness values for the three genotypes in deme i are 1+s aa i, 1+s aA i and 1+s AA i, respectively. Selection can differ among demes. Migration from deme j to deme i is defined as the proportion m ij of deme i that is made up of migrants from deme j. We let m ii = 1−∑j m ij, which is the proportion of non-migrants in deme i. Let N i and n i be the population size and the number of A copies in deme i, respectively, and x i = n i/(2_N_ i) its frequency. The simulator allows for recurrent mutation at the selected site with rates μ from aA and ν from Aa. All model parameters (for mutation, selection and migration) can change with time in a step-wise fashion. The simulator further allows for changes in population size and in the number of demes (splits and mergers) in analogy to ms.

Consider a single deme i with proportion x i of the A allele. Selection, mutation and migration occur according to the deterministic recurrence equation

equation image

(1)

where

equation image

(2)

equation image

(3)

Drift is included by binomial sampling of the infinite population. The number n _i_′ of A copies in the next generation is given by

equation image

(4)

Simulation runs start with an arbitrary starting frequency x(0) and continue until some stopping condition (e.g. loss or fixation of the A allele) is met.

Coalescent simulations: simulations with recombination according to the ancestral recombination graph (ARG) are carried out conditional on the frequency of the A allele in all demes. In the time before the first origin of A, and after its fixation in all demes (if applicable), the ARG is the standard neutral one (in a structured population). During the selection phase, events of different types (coalescence, mutation at the selected locus, migration and recombination) occur according to a competing Poisson process scheme with rates that depend on the stochastic trajectory of the selected allele. For example, coalescence in the A and a background in deme i occurs at rates proportional to 1/x i(t) and 1/(1−x i(t)), respectively, and migration of the selected allele from j to i (forward in time) at a rate proportional to x j(t)/x i(t). Rates for other events and all details are given in the documentation (Supplementary Material).

The coalescent simulations assume continuous time and events are not linked to discrete generation steps. For high recombination and migration rates, and for low frequencies of the A allele, the time between consecutive events can get smaller than a generation time. To reconcile this with the discrete time-forward simulations, the trajectory of the A allele is treated as a piecewise constant function with (potential) jumps after every generation.

Validation: the simulation program was validated using ms for complex neutral genealogies and for selection in a panmictic population using the simulator by (Pennings and Hermisson, 2006a), which builds on (Kim and Stephan, 2002). No significant deviations were found when comparing marginal allele frequency spectra or the distribution of Tajima's D statistic with a χ2 or Kolmogrorov–Smirnov test. Appropriate analytical results have also been checked and matched.

3 EXAMPLE

To illustrate the capabilities of the program and its performance, we use the model for human demography with parameters inferred in (Gutenkunst et al., 2009). Example parameters and the program command line for this example are included in the manual. Figure 1 shows pairwise spectra for Europeans and Asians under neutrality and with selection only in the European population. The initial frequency of the beneficial allele is zero, adaptation occurs from recurrent mutation at the selected locus. The run times for 10 000 replicates were 142 s and 156 s for the neutral and the selected case, respectively (using a single core of a 2.2 GHz quad core AMD Opteron processor). Thus, considering selection does not adversely degrade running times permitting a wide range of applications.

An external file that holds a picture, illustration, etc. Object name is btq322f1.jpg

Demographic model and pairwise joint frequency spectra. The model includes four populations (YRI, CHB, CEU and MXL for African, Asian, European and Mexican, respectively), splits, admixture, migration, growth and bottlenecks. The left and right spectra show the neutral and a selected case, respectively. The scale is counts for each bin.

Supplementary Material

ACKNOWLEDGEMENTS

We thank Peter Pfaffelhuber, Cornelia Borck, Ines Hellmann, Pleuni Pennings and Pavlos Pavlidis for discussions and β-testing the program, and Jayne Ewing for support and proof reading. We thank CIBIV for the use of the computer cluster and other infrastructure support.

Funding: Deutsche Forschungsgemeinschaft (DFG); the Vienna Science and Technology Fund (WWTF).

Conflict of Interest: none declared.

REFERENCES


Articles from Bioinformatics are provided here courtesy of Oxford University Press