A Dirichlet process prior for estimating lineage-specific substitution rates
Tracy A Heath et al. Mol Biol Evol. 2012 Mar.
Abstract
We introduce a new model for relaxing the assumption of a strict molecular clock for use as a prior in Bayesian methods for divergence time estimation. Lineage-specific rates of substitution are modeled using a Dirichlet process prior (DPP), a type of stochastic process that assumes lineages of a phylogenetic tree are distributed into distinct rate classes. Under the Dirichlet process, the number of rate classes, assignment of branches to rate classes, and the rate value associated with each class are treated as random variables. The performance of this model was evaluated by conducting analyses on data sets simulated under a range of different models. We compared the Dirichlet process model with two alternative models for rate variation: the strict molecular clock and the independent rates model. Our results show that divergence time estimation under the DPP provides robust estimates of node ages and branch rates without significantly reducing power. Further analyses were conducted on a biological data set, and we provide examples of ways to summarize Markov chain Monte Carlo samples under this model.
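The abstract describes the number of rate classes, the assignment of branches to classes, and the rate of each class as random variables. One common way to draw from such a prior is the Chinese restaurant process construction of the Dirichlet process, sketched below. This is a minimal illustration and not the authors' implementation: the function name, the gamma base distribution and its shape and scale values, the concentration value, and the branch count are all assumptions chosen for the example.

```python
import random


def sample_dpp_branch_rates(n_branches, alpha, rng, shape=2.0, scale=0.5):
    """Draw one prior sample of branch rate classes via a Chinese restaurant process.

    alpha is the concentration parameter. The gamma(shape, scale) base
    distribution for class rates is an assumption made for this sketch.
    """
    assignments = []  # rate-class index for each branch
    class_sizes = []  # number of branches currently in each class
    class_rates = []  # substitution rate attached to each class
    for _ in range(n_branches):
        # An existing class is chosen in proportion to its size;
        # a new class is opened in proportion to alpha.
        weights = class_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(class_sizes):
            class_sizes.append(1)
            class_rates.append(rng.gammavariate(shape, scale))
        else:
            class_sizes[choice] += 1
        assignments.append(choice)
    return assignments, class_rates


rng = random.Random(1)  # fixed seed so the sketch is repeatable
assignments, class_rates = sample_dpp_branch_rates(18, 1.0, rng)
branch_rates = [class_rates[c] for c in assignments]
print(len(class_rates), "rate classes across", len(branch_rates), "branches")
```

Small values of alpha tend to place most branches in a few classes, which approaches a strict clock, while large values tend to give nearly every branch its own rate.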
Figures
FIG 1.
An example of substitution rate models for a single simulation replicate. A birth–death tree was used to generate branch lengths in units of substitutions/site under six different models: the global molecular clock (GMC), local molecular clock (LMC), compound Poisson process model (CPP), log-normally distributed autocorrelated rates (AR-LN), gamma-distributed uncorrelated rates (IR-G), and the Dirichlet process model (DPPR). Note that the global clock tree (GMC) is proportional to the simulation tree, although the branches are scaled by the clock rate.
FIG 2.
The posterior mean lineage-specific rates estimated under the DPP model compared with the true branch rates for data sets with substitution rates generated under different models of among-lineage rate variation: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. Each point represents a single mean branch rate estimate across all simulation replicates. The solid line indicates the line of equality.
FIG 3.
The percentage error calculated for branch rate estimates under the DPP (gray), the GMC (white), and the independent rates model (cross-hatched). Box plots indicate each sample minimum, lower quartile, median, upper quartile, and sample maximum of the percentage error in branch rate estimates across all simulation replicates for each model of substitution rate variation: the GMC, LMC, CPP, AR-LN, IR-G, and DPP-R.
FIG 4.
The sizes of branch rate 95% CIs from analyses under the DPP (•), GMC (□), and independent rates models (×) plotted against the true branch rates. Analyses were performed on data sets with substitution rate variation generated under six different models: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true branch-specific rates were binned, so that each bin contained 100 rate values and the average 95% CI range was calculated for each bin.
FIG 5.
The proportion of node height estimates where the true value was sampled within the 95% CI (coverage probability) for analyses assuming the DPP (•), global clock (□), and IR-G (×) models compared with the true relative node heights. Coverage probabilities are presented for data generated under each rate variation model: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true node heights were binned, so that each bin contained 100 nodes and the coverage probability was calculated for each bin.
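The binned coverage probability described in this caption can be sketched as follows. This is an illustrative sketch only: the function name and the toy numbers are hypothetical, and sorting the nodes by true height before cutting bins is an assumption about how the bins were formed.

```python
def binned_coverage(true_values, lower, upper, bin_size=100):
    """Return (mean true value, coverage probability) for each bin.

    Records are sorted by true value and cut into consecutive bins of
    bin_size items; the final bin may be smaller if the counts do not
    divide evenly.
    """
    records = sorted(zip(true_values, lower, upper))
    results = []
    for start in range(0, len(records), bin_size):
        chunk = records[start:start + bin_size]
        # A hit means the true value lies inside the 95% interval.
        hits = sum(1 for t, lo, hi in chunk if lo <= t <= hi)
        mean_true = sum(t for t, _, _ in chunk) / len(chunk)
        results.append((mean_true, hits / len(chunk)))
    return results


# Toy data (hypothetical): four nodes, two per bin.
true_heights = [0.1, 0.2, 0.3, 0.4]
ci_lower = [0.0, 0.25, 0.2, 0.5]
ci_upper = [0.2, 0.30, 0.4, 0.6]
coverage = binned_coverage(true_heights, ci_lower, ci_upper, bin_size=2)
print(coverage)
```

The same binning could be reused for the interval widths in the other figures by averaging `hi - lo` per bin instead of counting hits.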
FIG 6.
The sizes of node height 95% CIs produced by analyses under the DPP (•), global clock (□), and IR-G (×) models compared with the true relative node heights. Analyses were performed on data sets with substitution rate variation generated under six different models: (a) the GMC, (b) LMCs, (c) the CPP, (d) AR-LN, (e) IR-G, and (f) uncorrelated rates generated under the Dirichlet process. For each comparison, the true node heights were binned, so that each bin contained 100 nodes and the average 95% CI width was calculated for each bin.
FIG 7.
An example of the results yielded from divergence time analysis under the DPP. (a) The true tree topology and branch lengths used to simulate the data set, with branch rates generated under a local clock model. The branches are colored according to their rate (black: 0.7; blue: 0.02; and red: 1.2 substitutions/site/time). Terminal branches/taxa are labeled with letters (A–B) and internal branches/nodes are labeled with numbers (1–8). (b) The estimates of lineage-specific substitution rate (in units of substitutions/site/time) under the Dirichlet process model. Each estimate is labeled according to its corresponding branch in the tree (a) and colored according to the mean partition estimated under the DPP model. True rates for each branch are indicated with inverted triangles, mean rates sampled under the DPP model are represented with open circles, and 95% CIs are shown with lines. (c) The average relative node ages estimated under the DPP model. Gray bars indicate 95% CIs of node heights and each branch is colored according to the mean partition estimated under the DPP model. Yellow bars represent the true divergence time for each internal node.
FIG 8.
The posterior and prior probabilities of the number of rate categories (k) for analyses on a primate data set with different expected values of the DPP concentration parameter (α). The histograms show the probability of values of k sampled by the MCMC algorithm when sampling from the posterior distribution (top, dark bars) or from the prior distribution (bottom, light bars). Four separate analyses were conducted, each with different parameterizations of the gamma-distributed hyperprior on α. The expected values of α are: (a) 0.476, (b) 1.396, (c) 9.184, and (d) 240.67. The median values of k and the 95% CIs are indicated for each E(α).
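For a fixed concentration parameter, the prior distribution of the number of rate categories k follows directly from the Chinese restaurant process: the i-th branch opens a new category with probability α/(α + i − 1). The sketch below computes that distribution and its mean. It treats α as fixed at each of the expected values listed in the caption, which is a simplification of the gamma hyperprior used in the study, and the branch count `n_branches` is a hypothetical value, not the size of the primate tree.

```python
def prior_num_classes(n, alpha):
    """Return [P(k=0), P(k=1), ..., P(k=n)] for n elements under a CRP."""
    probs = [0.0, 1.0]  # the first element always opens the first class
    for i in range(2, n + 1):
        p_new = alpha / (alpha + i - 1)  # chance that element i opens a new class
        nxt = [0.0] * (i + 1)
        for k in range(1, i):
            nxt[k] += probs[k] * (1.0 - p_new)  # joins an existing class
            nxt[k + 1] += probs[k] * p_new      # opens a new class
        probs = nxt
    return probs


def expected_num_classes(n, alpha):
    """Closed-form prior mean of k: sum of alpha / (alpha + i) for i = 0..n-1."""
    return sum(alpha / (alpha + i) for i in range(n))


n_branches = 20  # hypothetical branch count, for illustration only
for alpha in (0.476, 1.396, 9.184, 240.67):
    dist = prior_num_classes(n_branches, alpha)
    mode = max(range(len(dist)), key=dist.__getitem__)
    print(alpha, round(expected_num_classes(n_branches, alpha), 2), mode)
```

Comparing this prior with the k values sampled from the posterior shows how strongly the data, rather than the choice of α, determine the number of rate categories.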
FIG 9.
Branch rate estimates from an analysis of primate mitochondrial sequences. (a) The topology with branch lengths proportional to the number of expected substitutions per site with branches colored according to the mean partition estimated under the DPP. (b) The topology with branch lengths proportional to the mean rate estimated for each branch and colored according to a gradient where blue indicates the lowest rate and red indicates the highest. The subclades designated with specific rates by Yang and Yoder (2003) are highlighted with gray boxes. In their study, the Simiiformes had the highest rate and Microcebus had the next highest rate compared with the remaining lineages.
FIG 10.
A comparison of divergence times estimated under different methods. The branch lengths are divergence times estimated by the previous study using a maximum likelihood local clock method (Yang and Yoder 2003). The gray bars show the node age 95% CIs obtained from the divergence time analysis using the DPP on rate variation presented in this study. White circles indicate nodes calibrated by the fossil age estimates presented in the original study.