msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots (original) (raw)

Abstract

Summary: We have incorporated both crossover and gene conversion hotspots into an existing coalescent-based program for simulating genetic variation data for a sample of chromosomes from a population.

Availability: The source code for msHOT is available at Author Webpage, along with accompanying instructions.

Contact: hellenth@stats.ox.ac.uk

1 INTRODUCTION

Richard R. Hudson's ‘program for generating samples under neutral models’ (ms simulator) (2002) is a widely-used program for simulating genetic variation data, in particular single nucleotide polymorphism (SNP) data, for randomly-sampled haplotypes from a population. The program allows the user to specify various aspects of population demography (e.g. population sizes and migration patterns) and factors governing evolution [e.g. mutation, crossover and gene conversion (gc) rates]. However, it presently does not allow for variation in recombination rates. In particular, ‘hotspots,’ or areas of the genome in which crossover and/or (allelic) gc occur at higher rates than the genome-wide average, appear to be common in humans (Jeffreys et al., 2001; Jeffreys and May, 2004; Myers et al., 2005). We have incorporated both crossover and gc hotspots into a freely available, updated simulator called msHOT. The output and usage is the same as in the ms program of Hudson (2002), but includes additional arguments for specifying hotspot features. Though other coalescent-based simulators have been written to incorporate variable crossover rates (Schaffner et al., 2005), ours is the first to our knowledge to include the option for gc hotspots as well.

2 MODEL

The current implementation of ms allows the user to specify the (population-scaled) rate of crossing-over, ρ, and the relative rate of gc to crossover, f, for the genetic region to be simulated. Here ρ = 4_N_0_r_, where _N_0 is the current diploid population size and r is the probability of a crossover occuring in the region in a single transmission from parent to offspring, and f = g/r, where g denotes the probability of a gc initiating in the region of interest in a single transmission from parent to offspring (Wiuf and Hein, 2000). Since r and g are typically small, dividing these parameters by the sequence length of the genetic region gives, respectively, the crossover probability per base pair, _r_bp, and the gc probability per base pair, _g_bp.

Our modification msHOT allows the user to insert as many (non-overlapping) crossover hotspots and (non-overlapping) gene conversion hotspots into the genetic region as they wish by specifying the locations and intensities for each. Specifically, incorporating H crossover hotspots requires the user to specify a left endpoint (ah), right endpoint (bh), and intensity (λh) for each, h = 1, … , H. Inside hotspot h, the probability of a crossover occuring between two adjacent base pairs in a single transmission from parent to offspring is _λhr_bp. Outside any hotspot, this probability is _r_bp. Similarly, incorporating formula gc hotspots requires the user to specify a left endpoint (⁠formula⁠), right endpoint (⁠formula⁠) and intensity (⁠formula⁠) for each, formula⁠. Inside gc hotspot h, the probability of a gc intitation between two adjacent base pairs in a single transmission from parent to offspring is formula_g_bp. Outside any hotspot, this probability is _g_bp (see Figure 1a). Gene conversion hotspots may overlap with crossover hotspots.

(a) Illustration of varying crossover and/or gc intensities in a genetic region [S, E]. Here the crossover, (respectively gc) probability rbp, (respectively gbp) is increased by a multiple λ1 in [a1,b1] and by a multiple λ2 in [a2,b2]. (b) Illustration of the three distinct gc types that can influence variation in the genetic region [S, E]. The grey vertical lines represent the initiation point of each gc event, and the black horizontal bars represent the tract length of each of the gc events.

Fig. 1

(a) Illustration of varying crossover and/or gc intensities in a genetic region [_S, E_]. Here the crossover, (respectively gc) probability _r_bp, (respectively _g_bp) is increased by a multiple λ1 in [_a_1,_b_1] and by a multiple λ2 in [_a_2,_b_2]. (b) Illustration of the three distinct gc types that can influence variation in the genetic region [_S, E_]. The grey vertical lines represent the initiation point of each gc event, and the black horizontal bars represent the tract length of each of the gc events.

In Hudson's ms, gc events initiate at some base pair, which is assumed to form the left-point of the region affected by the gc. The right-point is then determined by the length (in physical distance) of the region affected by the gc (i.e. the tract length), which is assumed to have a geometric distribution with user-specified mean. This difference in the treatment of the left and right endpoints causes some bothersome asymmetry when the rate of gc initiation is allowed to vary along the region. To deal with this, we changed the model to assume that gc events initiate at some point and then spread both right and left independently according to geometric distributions with user-specified mean _t_∗. Thus, in our model, the tract length is the sum of two independent geometric distributions. Incidentally, this may also better represent current knowledge of the biology underlying gc events (Szostak et al., 1983).

3 IMPLEMENTATION

The basic algorithm of msHOT is as described in Hudson (1983). In brief, ms generates ancestral recombination graphs for a sample of chromsomes by stochastically determining ‘events’ to occur on the ancestral material of the chromosomes going back in time, until all the material has coalesced into a common ancestor. We refer to any individual segment of this ancestral material as an ‘ancestral segment.’ Potential ‘events’ include the coalesence of two such segments or a recombination event (crossover or gc) occuring in a single segment. Incorporating hotspots involves changing the rates at which these recombination events occur, as described below. (The consequences of these events, which involve splitting ancestral segments, are not changed by the introduction of hotspots and are already dealt with in Hudson's code.)

The rate of each possible recombination event, backwards in time, is determined by computing the probability of the event occurring in a single generation forwards in time, and multiplying this by 4_N_0. We therefore focus on computing the relevant probabilities forwards in time. In the following we use [_S, E_] to denote an ancestral segment beginning at S and ending at E.

Crossover. Assume the entire simulated region contains H crossover hotspots, each with left endpoint ah, right endpoint bh, and intensity λh, h = 1, … , H. Under the model described above, the probability of a crossover initiating at any particular location z ∈ [_S, E_] is:

formula

(1)

Here Iz∈[ah,bh] is an indicator function, taking the value 1 if z is in crossover hotspot h and 0 otherwise. The total probability of a crossover occurring in [_S, E_] is found by summing over z in Equation (1). If a crossover is to occur in [_S, E_], the location z is selected with probability proportional to Equation (1).

Gene conversion. Each gc event can be thought of as having an ‘initiation point’ and ‘right’ and ‘left’ endpoints. We distinguish three types of gc event that can influence patterns of genetic variation in [_S, E_] (see Figure 1b):

  1. Type i: a gc event initiates within [_S, E_] and has endpoints that may be either inside or outside this region.
  2. Type ii: a gc event initiates to the left of S and has a right endpoint within [_S, E_].
  3. Type iii: a gc event initiates to the right of E and has a left endpoint within [_S, E_].

The following subsections give the relative probabilities, and describe how to determine the endpoints, for each of these types of event. Assume the entire simulated region contains formula gene conversion hotspots, each with left endpoint formula⁠, right endpoint formula⁠, and intensity formula⁠, formula⁠.

‘Type i’: The probability of a type i event initiating at z ∈ [_S, E_]:

formula

(2)

where formula is an indicator denoting whether location z is in gc hotspot h. If a type i event occurs, its endpoints are determined by first selecting the initiation point, z, with probabilities proportional to Equation (2), and then simulating the left and right endpoints as _z − T_1 and _z + T_2, where Ti are randomly sampled from a geometric(_t_∗). (These endpoints may fall outside [_S, E_].)

‘Type ii’: The probability that a type ii event initiates at a location y to the left of S (thus y is outside [_S, E_], in contrast to z above) and has a right endpoint at x ∈ [_S, E_] is given by:

formula

(3)

where formula⁠, y ranges from -∞ to S (for simplicity, we have assumed, as in ms, that the chromosome has infinite length), and formulais an indicator for whether y is in hotspot h. The total probability of a type ii event occurring in [_S, E_] is obtained by summing Equation (3) over possible values of x and y. (We deal with the infinite sum over y by use of standard geometric series results. All locations outside the simulated region are assumed to have the background probability of a gene conversion, _g_bp.) If a type ii event occurs, its right endpoint, _x_∗, is chosen via a truncated geometric distribution [i.e. Pr(_X_∗ = x_∗) ∝ q_x∗−S(1 − q),for _x_∗ = S + 1, … , _E_].

‘Type iii’: The type iii gc events are similar to the type ii events above, but with locations starting from the end of an ancestral segment and counting from right to left.

The authors thank E.C. Anderson for sharing an annotated version of Hudson's code edited to incorporate crossover hotspots and R.R. Hudson for kindly agreeing to distribute our modified version of his code.

Conflict of Interest: none declared.

REFERENCES

Properties of a neutral allele model with intragenic recombination

,

Theor. Popul. Biol.

,

1983

, vol.

23

(pg.

183

-

201

)

Generating samples under a Wright-Fisher neutral model of genetic variation

,

Bioinformatics

,

2002

, vol.

18

(pg.

337

-

338

)

Intense and highly localized gene conversion activity in human meiotic crossover hot spots

,

Nat. Genet.

,

2004

, vol.

36

(pg.

151

-

156

)

et al.

Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex

,

Nat. Genet.

,

2001

, vol.

29

(pg.

217

-

222

)

et al.

A fine-scale map of recombination rates and hotspots across the human genome

,

Science

,

2005

, vol.

310

(pg.

321

-

324

)

et al.

Calibrating a coalescent simulation of human genome sequence variation

,

Genome Res.

,

2005

, vol.

15

(pg.

1576

-

1583

)

et al.

The double-strand-break repair model for recombination

,

Cell

,

1983

, vol.

33

(pg.

25

-

35

)

The coalescent with gene conversion

,

Genetics

,

2000

, vol.

155

(pg.

451

-

462

)

Author notes

Associate Editor: Keith A Crandall

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org