Genome-wide inference of ancestral recombination graphs - PubMed (original) (raw)

Genome-wide inference of ancestral recombination graphs

Matthew D Rasmussen et al. PLoS Genet. 2014.

Abstract

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of [Formula: see text] chromosomes conditional on an ARG of [Formula: see text] chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1

Figure 1. An ancestral recombination graph (ARG) for four sequences.

(A) Going backwards in time (from bottom to top), the graph shows how lineages that lead to modern-day chromosomes (bottom) either “coalesce” into common ancestral lineages (dark blue circles), or split into the distinct parental chromosomes that were joined (in forward time) by recombination events (light blue circles). Each coalescence and recombination event is associated with a specific time (dashed lines), and each recombination event is also associated with a specific breakpoint along the chromosomes (here, formula image and formula image). Each non-recombining interval of the sequences (shown in red, green, and purple) corresponds to a “local tree” embedded in the ARG (shown in matching colors). Recombinations cause these trees to change along the length of the sequences, making the correlation structure of the data set highly complex. The ARG for four sequences is denoted formula image in our notation. (B) Representation of formula image in terms of a sequence of local trees formula image and recombination events formula image. A local tree formula image is shown for each nonrecombining segment in colors matching those in (A). Each tree, formula image, can be viewed as being constructed from the previous tree, formula image, by placing a recombination event along the branches of formula image (light blue circles), breaking the branch at this location, and then allowing the broken lineage to re-coalesce to the rest of the tree (dashed lines in matching colors; new coalescence points are shown in gray). Together, the local trees and recombinations provide a complete description of the ARG. The Sequentially Markov Coalescent (SMC) approximate the full coalescent-with-recombination by assuming that formula image is statistically independent of all previous trees given formula image. (C) An alignment of four sequences, formula image, corresponding to the linearized ARG shown in (B). For simplicity, only the derived alleles at polymorphic sites are shown. The sequences are assumed to be generated by a process that samples an ancestral sequences from a suitable background distribution, then allows each nonrecombining segment of this sequence to mutate stochastically along the branches of the corresponding local tree. Notice that the correlation structure of the sequences is fully determined by the local trees; that is, formula image is conditionally independent of the recombinations formula image given the local trees formula image.

Figure 2

Figure 2. The “threading” operation.

The threading operation adds an formula imageth sequence to an ARG of formula image sequences under a discretized version of the SMC (the DSMC) that requires all coalescence and recombination events to occur precisely at pre-defined time points, formula image (horizontal dashed lines). In this example, the fourth sequence has been removed from ARG formula image from Figure 1, leaving a tree with formula image leaves at each position formula image (formula image; shown in black). The fourth sequence (shown in red) is re-threaded through the remaining portion of the ARG by a two-step process that first samples a coalescence point formula image for this sequence at each formula image (dark blue points), thereby defining a new tree formula image, and second, samples a recombination point formula image to reconcile each adjacent pair of trees, formula image (light blue points). For simplicity, only the distinct local trees for the four nonrecombining segments (after threading) are shown. The gray box highlights the pair of trees immediately flanking the breakpoint formula image. Notice that the first recombination from Figure 1 is retained (dark gray nodes and dashed line in left-most tree). In general, new recombinations are prohibited at the locations of “given” recombinations formula image (see text). Note that it is possible for the attachment point of the formula imageth sequence in the local trees to move due to old recombinations as well as new ones (not shown in this example).

Figure 3

Figure 3. Graphical models for Discretized Sequentially Markov Coalescent (DSMC) models.

(A) Full DSMC model for formula image samples with local trees, formula image, recombinations, formula image, and alignment columns, formula image. Together, formula image and formula image define an ancestral recombination graph, formula image. Solid circles indicate observed variables and empty circles indicate latent variables. Arrows indicate direct dependencies between variables and correspond to conditional probability distributions described in the text. Notice that the formula image variables can be integrated out of this model, leading to the conventional graph topology for a hidden Markov model. (B) The same model as in (A), but now partitioning the latent variables into components that describe the history of the first formula image sequences (formula image and formula image) and components specific to the formula imageth sequence (formula image and formula image). The formula image and formula image variables are represented by solid circles because they are now “clamped” at specific values. A sample of formula image represents a threading of the formula imageth sequence through the ARG. (C) Reduced model after elimination of formula image by integration, enabling efficient sampling of coalescent threadings formula image. This is the model used by the first step in our two-step sampling approach. In the second step, the formula image variables are sampled conditional on formula image, separately for each formula image. In this model, the grouped nodes have complex joint dependencies, leading to a heterogeneous state space and normalization structure, but the linear conditional independence structure of an HMM is retained.

Figure 4

Figure 4. Simulation results.

(A) Recovery of global features of simulated ARGs from sequence data. This plot is based on sets of 20 1-Mb sequences generated under our standard simulation parameters (see Methods) with formula image (see Supplementary Figure 10 for additional results). From left to right are shown true (formula image-axis) versus inferred (formula image-axis) values of the log joint probability (the logarithm of equation 2), the total number of recombinations, and the total branch length of the ARG. Each data point in each plot represents one of 100 simulated data sets. In the vertical dimension, circles represent averages across 100 sampled ARGs based on the corresponding data sets, sampled at intervals of 10 after a burn-in of 200 iterations, and error bars represent the interval between the 2.5 and 97.5 percentiles. In the second and third plots, circles are interpretable as posterior expected values and error bars as 95% Bayesian credible intervals. (B) Posterior mean TMRCA (dark red line, with 95% credible intervals in light red) versus true TMRCA (black line) along a simulated genomic segment of 1 Mb. This plot is based on a single representative data set of 20 1-Mb sequences generated under our standard simulation parameters with formula image (see Supplementary Figure S5 for additional results).

Figure 5

Figure 5. Measures of genetic variation near protein-coding genes and partial selective sweeps.

Shown (from top to bottom) are nucleotide diversity (formula image), time to most recent common ancestry (TMRCA), and relative TMRCA halflife (RTH) for the 13 individuals (26 haploid genomes) of European descent (CEU and TSI populations) in the Complete Genomics data set (similar plots for African population are shown in Supplementary Figure S11). Nucleotide diversity formula image was computed as the average rate of nucleotide differences per site across all pairs of chromosomes, whereas sitewise values of the TMRCA and RTH were computed by averaging over local trees sampled by ARGweaver. (A) Estimates for 17,845 protein-coding genes from the Consensus Coding Sequence (CCDS) track in the UCSC Genome Browser (hg19). Estimates for noncoding regions were computed by averaging in a sliding window of size 300 bp then averaging across genes. Estimates for coding exons were computed by first averaging over fourfold degenerate (4d) sites of each exonic type (first, middle, last), then averaging across genes (see Methods). Only 4d sites were considered to focus on the influence of selection from linked sites rather than direct selection. Nevertheless, the decreased values for the exons suggest some influence from direct selection. The differences between exons and flanking sites may also be influenced by windowing in the noncoding regions. “First exon” is taken to begin at the annotated start codon and “last exon” to end at the stop codon, so that both exclude untranslated regions. The TMRCA is measured in thousands of generations. RTH is ratio of the time required for the first 50% of lineages to find a most recent common ancestor to the full TMRCA (see Supplementary Figure S10). Error bars (dashed lines for noncoding regions) indicate 95% confidence intervals as estimated by bootstrapping over regions. (B) Similar plots for 255 100-kb regions predicted to have undergone partial selective sweeps in the CEU population based on the iHS statistic . In this case, all measures are computed in a sliding window of 10,000 bases. Notice that both protein-coding genes and putative selective sweeps display substantial reductions in nucleotide diversity, but the genes show a much more prominent reduction in TMRCA, whereas the sweeps show a much more prominent reduction in RTH. These signatures are consistent with a dominant influence from background selection rather than hitchhiking in protein-coding genes (see text).

Figure 6

Figure 6. Mean allele age as a function of annotation class and derived allele frequency.

(A) Estimated age of derived allele in generations, averaged across polymorphic sites of various annotation classes. Estimates were derived from ARGs sampled by ARGweaver based on the Complete Genomics data set (see Methods). Error bars represent one standard deviation above and below the mean. Neut = putatively neutral sites; 4d = fourfold degenerate sites in coding regions; CNS = conserved noncoding sequences identified by phastCons; PPh:{Benign,PosDam,ProbDam} = missense mutations identified by PolyPhen-2 as “benign”, “possibly damaging”, or “probably damaging”, respectively; CV:{NonPath,Path} = mutations in “nonpathogenic” (categories 1–3) or “pathogenic” (categories 4 & 5) classes in the ClinVar database, respectively. (B) Similar plot with categories further divided by derived allele frequencies (DAF) in numbers of chromosomes out of 108. Error bars represent 95% confidence intervals, as assessed by bootstrapping. In categories that combine multiple frequencies (e.g., 4–5, 6–8), a subsampling strategy was used to ensure that the relative contributions of the different frequencies matched those of the Neut class. Estimates for DAF>20 were excluded due to sparse data. Notice that ages generally increase with DAF, as expected (see Supplementary Figure S7), but at a considerably reduced rate in categories under strong selection.

Similar articles

Cited by

References

    1. Hein J, Schierup M, Wiuf C (2005) Gene genealogies, variation and evolution: a primer in coalescent theory. Oxford: Oxford University Press.
    1. Wakeley J (2009) Coalescent theory: an introduction. Greenwood Village: Roberts & Co. Publishers.
    1. Fisher RA (1930) The Genetical Theory of Natural Selection. Oxford: Oxford University Press.
    1. Wright S (1931) Evolution in Mendelian Populations. Genetics 16: 97–159. - PMC - PubMed
    1. Kimura M (1962) On the probability of fixation of mutant genes in a population. Genetics 47: 713–719. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources