A dirichlet process prior for estimating lineage-specific substitution rates - PubMed (original) (raw)

A dirichlet process prior for estimating lineage-specific substitution rates

Tracy A Heath et al. Mol Biol Evol. 2012 Mar.

Abstract

We introduce a new model for relaxing the assumption of a strict molecular clock for use as a prior in Bayesian methods for divergence time estimation. Lineage-specific rates of substitution are modeled using a Dirichlet process prior (DPP), a type of stochastic process that assumes lineages of a phylogenetic tree are distributed into distinct rate classes. Under the Dirichlet process, the number of rate classes, assignment of branches to rate classes, and the rate value associated with each class are treated as random variables. The performance of this model was evaluated by conducting analyses on data sets simulated under a range of different models. We compared the Dirichlet process model with two alternative models for rate variation: the strict molecular clock and the independent rates model. Our results show that divergence time estimation under the DPP provides robust estimates of node ages and branch rates without significantly reducing power. Further analyses were conducted on a biological data set, and we provide examples of ways to summarize Markov chain Monte Carlo samples under this model.

PubMed Disclaimer

Figures

F<sc>IG</sc> 1.

FIG 1.

An example of substitution rate models for a single simulation replicate. A birth–death tree was used to generate branch lengths in units of substitutions/site under six different models: the global molecular clock (GMC), local molecular clock (LMC), compound Poisson process model (CPP), log-normally distributed autocorrelated rates (AR-LN), gamma-distributed uncorrelated rates (IR-G), and the Dirichlet process model (DPPR). Note that the global clock tree (GMC) is proportional to the simulation tree, although the branches are scaled by the clock rate

F<sc>IG</sc> 2.

FIG 2.

The posterior mean lineage-specific rates estimated under the DPP model compared with the true branch rates for data sets with substitution rates generated under different models of among-lineage rate variation: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. Each point represents a single mean branch rate estimate across all simulation replicates. The solid line indicates the line of equality.

F<sc>IG</sc> 3.

FIG 3.

The percentage error calculated for branch rate estimates under the DPP (gray), the GMC (white), and the independent rates model (cross-hatched). Box plots indicate each sample minimum, lower quartile, median, upper quartile, and sample maximum of the percentage error in branch rate estimates across all simulation replicates for each model of substitution rate variation: the GMC, LMC, CPP, AR-LN, IR-G, and DPP-R.

F<sc>IG</sc> 4.

FIG 4.

The sizes of branch rate 95% CIs from analyses under the DPP (•), GMC (□), and independent rates models (×) plotted against the true branch rates. Analyses were performed on data sets with substitution rate variation generated under six different models: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true branch-specific rates were binned, so that each bin contained 100 rate values and the average 95% CI range was calculated for each bin.

F<sc>IG</sc> 5.

FIG 5.

The proportion of node height estimates where the true value was sampled within the 95% CI (coverage probability) for analyses assuming the DPP (•), global clock (□), and IR-G (×) models compared with the true relative node heights. Coverage probabilities are presented for data generated under each rate variation model: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true node heights were binned, so that each bin contained 100 nodes and the coverage probability was calculated for each bin.

F<sc>IG</sc> 6.

FIG 6.

The sizes of node height 95% CIs produced by analyses under the DPP (•), global clock (□), and IR-G (×) models compared with the true relative node heights. Analyses were performed on data sets with substitution rate variation generated under six different models: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true node heights were binned, so that each bin contained 100 nodes and the average 95% CI width was calculated for each bin.

F<sc>IG</sc> 7.

FIG 7.

An example of the results yielded from divergence time analysis under the DPP. (a) The true tree topology and branch lengths used to simulate the data set, with branch rates generated under a local clock model. The branches are colored according to their rate (black: 0.7; blue: 0.02; and red:1.2 substitutions/site/time). Terminal branches/taxa are labeled with letters (A–B) and internal branches/nodes are labeled with numbers (1–8). (b) The estimates of lineage-specific substitution rate (in units of substitutions/site/time) under the Dirichlet process model. Each estimate is labeled according to its corresponding branch in the tree (a) and colored according to the mean partition estimated under the DPP model. True rates for each branch are indicated with inverted triangles, mean rates sampled under the DPP model are represented with open circles, and 95% CIs are shown with lines. (c) The average relative node ages estimated under the DPP model. Gray bars indicate 95% CIs of node heights and each branch is colored according to the mean partition estimated under the DPP model. Yellow bars represent the true divergence time for each internal node.

F<sc>IG</sc> 8.

FIG 8.

The posterior and prior probabilities of the number of rate categories (k) for analyses on a primate data set with different expected values of the DPP concentration parameter (α). The histograms show the probability of values of k sampled by the MCMC algorithm when sampling from the posterior distribution (top, dark bars) or from the prior distribution (bottom, light bars). Four separate analyses were conducted, each with different parameterizations of the gamma-distributed hyperprior on α. The expected values of α are: (a) 0.476, (b) 1.396, (c) 9.184, and (d) 240.67. The median values of k and the 95% CIs are indicated for each E(α).

F<sc>IG</sc> 9.

FIG 9.

Branch rate estimates from an analysis of primate mitochondrial sequences. (a) The topology with branch lengths proportional to the number of expected substitutions per site with branches colored according to the mean partition estimated under the DPP. (b) The topology with branch lengths proportional to the mean rate estimated for each branch and colored according to a gradient where blue indicates the lowest rate and red indicates the highest. The subclades designated with specific rates by Yang and Yoder (2003) are highlighted with gray boxes. In their study, the Simiiformes had the highest rate and Microcebus had the next highest rate compared with the remaining lineages.

F<sc>IG</sc> 10.

FIG 10.

A comparison of divergence times estimated under different methods. The branch lengths are divergence times estimated by the previous study using a maximum likelihood local clock method (Yang and Yoder 2003). The gray bars show the node age 95% CIs obtained from the divergence time analysis using the DPP on rate variation presented in this study. White circles indicate nodes calibrated by the fossil age estimates presented in the original study.

Similar articles

Cited by

References

    1. Ané C, Larget B, Baum DA, Smith SD, Rokas A. Bayesian estimation of concordance among gene trees. Mol Biol Evol. 2007;24:412–426. - PubMed
    1. Antoniak CE. Mixtures of Dirichlet processes with applications to non-parametric problems. Ann Stat. 1974;2:1152–1174.
    1. Dorazio RM. On selecting a prior for the precision parameter of the Dirichlet process mixture models. J Stat Plan Inference. 2009;139:3384–3390.
    1. Dornburg A, Brandley MC, McGowen MR, Near TJ. Relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (Mammalia: Cetacea) Mol Biol Evol. 2011;29:721–736. - PubMed
    1. Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006:4, e88. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources