STEM: species tree estimation using maximum likelihood for gene trees under coalescence
Journal Article
1Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, 2Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803 and 3Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
*To whom correspondence should be addressed.
Received: 28 November 2008
Revision received: 04 February 2009
Accepted: 04 February 2009
Published: 10 February 2009
Abstract
Summary: STEM is a software package written in the C language to obtain maximum likelihood (ML) estimates for phylogenetic species trees given a sample of gene trees under the coalescent model. It includes options to compute the ML species tree, search the space of all species trees for the k trees of highest likelihood and compute ML branch lengths for a user-input species tree.
Availability: The STEM package, including source code, is freely available at http://www.stat.osu.edu/~lkubatko/software/STEM/.
Contact: lkubatko@stat.osu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
The increasing availability of sequence data from multiple loci for inferring phylogenetic trees has led to a growing awareness that the evolutionary histories of individual genes may differ substantially from the underlying species tree. This incongruence can result from numerous processes, including horizontal transfer, gene duplication and incomplete lineage sorting (deep coalescence) (Maddison, 1997). When phylogenetic trees representing the species history are of primary interest, it is therefore necessary either to modify standard phylogenetic methods to handle multi-locus data, or to develop new methods that explicitly model the source of discord (Ane et al., 2007; Liu, 2008; Liu and Pearl, 2007). Although several recent studies have claimed that the commonly used procedure of concatenating multi-gene data prior to phylogenetic analysis performs well (Chen and Li, 2001; Rokas et al., 2003), others have highlighted situations in which such procedures fail (Carstens and Knowles, 2007; Kolaczkowski and Thornton, 2004; Kubatko and Degnan, 2007; Mossel and Vigoda, 2005).
Here, we describe a new software package called STEM that estimates the maximum likelihood (ML) species tree from a sample of gene trees, assuming that discord between the observed gene trees and the species tree arises solely from the coalescent process (Kingman, 1982). As is the case with other available programs for estimating species phylogenies from multi-locus data [e.g. BEST (Liu, 2008)], STEM assumes no recombination within loci, free recombination between loci and no gene flow following speciation. STEM provides the analytically derived ML estimate of the species tree when only a single estimate is desired. In addition, STEM provides a capability for searching the space of species trees for a collection of k species trees with high likelihood, where k is set by the user. Finally, STEM can compute ML branch lengths on any given species tree, which reduces the search for high-likelihood trees to a discrete (topology only) space and allows evaluation of any species tree of interest.
As noted above, the programs BEST (Liu, 2008; Liu and Pearl, 2007) and BUCKy (Ane et al., 2007) are related to STEM in that they also seek to provide a species-level phylogenetic estimate. However, STEM is distinct from these in that (i) it uses a maximum likelihood, rather than Bayesian, framework to obtain an estimate; and (ii) the availability of analytic results in the ML case using gene trees as the data allows computations to be carried out more rapidly than the Markov chain Monte Carlo (MCMC)-based analyses utilized by these programs.
2 DESCRIPTION
2.1 Phylogenetic model
Let $g_j$ denote the gene tree topology and branch lengths for the tree representing locus j (j = 1, 2, …, N) in a sample of N loci. Assuming that the N loci are sampled independently throughout the genome, the likelihood function is
$$L(S, \tau \mid g_1, \ldots, g_N) = \prod_{j=1}^{N} f(g_j \mid S, \tau) \qquad (1)$$
where S represents the species tree and τ is the set of branch lengths on that tree. The function f(.|.) is the gene tree density under the coalescent model given by Rannala and Yang (2003). We note that this density is general enough to allow samples of multiple lineages per species-level taxon. Membership of alleles to species-level taxa is specified as input to STEM.
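As a minimal sketch of how a likelihood of the form (1) can be evaluated, the log-likelihood is the sum over loci of the log gene tree density. The code below assumes the simplest possible setting, a single population with one θ and times measured in mutation units, and uses the per-population factor of the Rannala and Yang (2003) density. The full density multiplies such factors over all branches of S. The function names and input layout are illustrative assumptions and are not taken from the STEM source code.

```python
import math

def log_population_density(waits, n_out, tau, theta):
    """Log density of the coalescent events within one population.

    waits : waiting times between successive coalescent events, starting
            when the lineages enter the population (mutation units)
    n_out : number of lineages leaving the population without coalescing
    tau   : length of the population's branch (math.inf for the root)
    theta : 4 * N_e * mu for this population
    """
    j = n_out + len(waits)  # number of lineages entering the population
    logp = 0.0
    elapsed = 0.0
    for t in waits:
        # j lineages coalesce at total rate j(j-1)/theta;
        # one specific pair coalesces at rate 2/theta
        logp += math.log(2.0 / theta) - j * (j - 1) / theta * t
        elapsed += t
        j -= 1
    if n_out > 1:
        # the remaining lineages do not coalesce on the rest of the branch
        logp -= n_out * (n_out - 1) / theta * (tau - elapsed)
    return logp

def log_likelihood(loci, theta):
    """Sum of per-locus log densities, i.e. the log of the product in (1)."""
    return sum(log_population_density(w, n_out, tau, theta)
               for w, n_out, tau in loci)

# Two hypothetical loci sampled from a single (root) population
loci = [([0.01], 1, math.inf), ([0.004, 0.012], 1, math.inf)]
print(log_likelihood(loci, theta=0.02))
```

Working on the log scale avoids numerical underflow when the product in (1) runs over many loci.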
The likelihood in (1) is a function of the parameter θ = $4N_e\mu$, where $N_e$ is the effective population size and μ is the per-site mutation rate. In the most general case, θ may vary along species tree branches. However, it is not uncommon to assume a single θ for the entire tree. For example, Liu (2006) showed that when it can be assumed that there is a single θ for the entire tree, it is possible to analytically derive the joint ML estimate of θ and of the species tree topology and branch lengths. He calls the estimator of the tree obtained in this way the Maximum Tree (MT), and shows that it is a consistent estimator of the species tree when the gene trees and branch lengths are known without error.
Mossel and Roch (2009) also consider a sample of gene trees with branch lengths known without error and derive a consistent estimator of the species tree in the case in which θ is known (but not necessarily equal) for all branches of the species tree, which they call the GLASS tree (an acronym for Global LAteSt Split, which is derived from the method used to compute it). The GLASS tree coincides with MT whenever it can be assumed that the θ along all branches of the species tree are the same and take their value from the MLE for θ. The relationship of the ML tree returned by STEM to these methods is noted below.
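The text above does not spell out how the GLASS tree or the MT is built. The following sketch follows the usual description of these estimators, stated here as an assumption: the divergence time of two species is taken to be the smallest interspecific coalescence time observed across all loci, and clusters are then merged by single linkage. The input layout (one dictionary of pairwise times per locus) and the function names are hypothetical, and this is not the STEM implementation.

```python
from itertools import combinations

def min_divergence(loci):
    """Smallest interspecific coalescence time per species pair over all loci."""
    best = {}
    for times in loci:
        for pair, t in times.items():
            key = frozenset(pair)
            best[key] = min(t, best.get(key, float("inf")))
    return best

def single_linkage_tree(species, dist):
    """Repeatedly merge the two clusters with the smallest pairwise time.

    Returns a nested tuple (left, right, height); leaves are species names.
    """
    def d(a, b):
        # cluster-to-cluster distance is the minimum over member pairs
        return min(dist[frozenset((x, y))] for x in a for y in b)

    clusters = {frozenset([s]): s for s in species}  # members -> subtree
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: d(*ab))
        height = d(a, b)
        clusters[a | b] = (clusters.pop(a), clusters.pop(b), height)
    return next(iter(clusters.values()))

# Two hypothetical gene trees on species A, B, C: ((A,B),C) and ((A,C),B)
loci = [
    {("A", "B"): 0.010, ("A", "C"): 0.030, ("B", "C"): 0.030},
    {("A", "B"): 0.025, ("A", "C"): 0.020, ("B", "C"): 0.025},
]
tree = single_linkage_tree(["A", "B", "C"], min_divergence(loci))
print(tree)  # ('C', ('A', 'B', 0.01), 0.02)
```

In this toy input the two gene trees disagree in topology, yet the minimum times still group A with B, since their earliest coalescence across loci (0.010) precedes every coalescence involving C.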
Input to the STEM program requires a sample of gene trees with branch lengths in units of expected number of nucleotide substitutions per site, along with an overall value of θ to be applied to all loci. The value of θ is used to convert gene tree branch lengths into coalescent units (number of $2N_e$ generations) by multiplying all gene tree branch lengths by 1/θ. Further, because evolutionary rates may vary across sampled loci, the user may also provide rates to be applied to each locus separately. For example, if rate $r_i$ is specified for locus i, then all branch lengths in gene tree i will be additionally multiplied by $1/r_i$. In addition to adjusting for variation in the mutation rate of each locus, the $r_i$ values allow the user to adjust for ploidy in the individual genes (e.g. the rate provided for an mtDNA locus should be divided by 2 to incorporate the haploid status of this marker). While selection of the θ and $r_i$ values is completely at the discretion of the user, reasonable settings for these parameters can be straightforwardly obtained. For example, the θ parameter could be estimated by some available method, such as Watterson's estimator (Watterson, 1975). The $r_i$ values could be estimated by examining average divergence from an outgroup, as suggested by Yang (2002).
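The rescaling described above, together with one way of obtaining θ, can be sketched as follows. The first function applies the 1/θ and 1/r_i factors to the branch lengths of one gene tree; the second computes Watterson's per-site estimate from the number of segregating sites. Function names, argument names and the example values are illustrative assumptions, not part of STEM's interface.

```python
def to_coalescent_units(branch_lengths, theta, rate=1.0):
    """Scale one gene tree's branch lengths by 1/theta and 1/rate.

    branch_lengths : lengths in expected substitutions per site
    theta          : overall theta applied to all loci
    rate           : locus-specific relative rate r_i (1.0 if none is given)
    """
    return [b / (theta * rate) for b in branch_lengths]

def watterson_theta(segregating_sites, n_sequences, n_sites):
    """Watterson's per-site estimate S / (a_n * L), a_n = sum_{i=1}^{n-1} 1/i."""
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)

# Hypothetical alignment: 4 sequences, 1000 sites, 11 segregating sites
theta_hat = watterson_theta(segregating_sites=11, n_sequences=4, n_sites=1000)

# Hypothetical locus with relative rate 0.5
scaled = to_coalescent_units([0.002, 0.004], theta=0.004, rate=0.5)
print(theta_hat, scaled)
```

Keeping the locus rate as a separate argument mirrors the description above, in which a single θ is shared by all loci and each $r_i$ only rescales its own gene tree.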
2.2 STEM output
When the ML estimate of the species tree is requested, STEM returns the MT of Liu (2006) for the particular user-specified values of θ and the gene-specific rates. STEM is also able to evaluate the likelihood for any given species tree rapidly by incorporating a new result that analytically derives ML branch lengths for an arbitrary species tree under (1). The details of this result, which is an extension of the work of Liu (2006), are provided in Supplementary Material 1. In addition, STEM includes an option to search the space of species trees for a set of trees of high likelihood using a simulated annealing algorithm, similar to that used by Salter and Pearl (2001).
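To illustrate the kind of search involved, the following is a generic simulated annealing sketch that keeps the k best states it has evaluated. It is not the algorithm of Salter and Pearl (2001) and not STEM's code. In a species tree setting the state would be a topology, the score its likelihood under ML branch lengths, and the neighbor function a local rearrangement; none of these are implemented here, and a toy integer problem stands in for them. All names and default parameter values are assumptions.

```python
import heapq
import math
import random

def anneal_top_k(start, neighbor, score, k=3, steps=2000,
                 t0=5.0, cooling=0.995, seed=0):
    """Simulated annealing that returns the k best (state, score) pairs seen.

    States must be hashable. neighbor(state, rng) proposes a nearby state
    and score(state) is the quantity to maximise.
    """
    rng = random.Random(seed)
    current, current_score = start, score(start)
    seen = {start: current_score}   # every state evaluated so far
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_score = score(cand)
        seen[cand] = cand_score
        delta = cand_score - current_score
        # always accept improvements; accept a worse state with
        # probability exp(delta / temp), which shrinks as temp cools
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = cand, cand_score
        temp *= cooling
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Toy stand-in for a tree search: integers with a single optimum at 7
best = anneal_top_k(
    start=0,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)),
    score=lambda x: -(x - 7) ** 2,
)
print(best)
```

Recording every evaluated state, rather than only the accepted ones, is one simple way to report a set of k high-scoring candidates instead of a single optimum.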
2.3 Performance
We demonstrate the usefulness of the STEM package using simulated data. First, a sample of 10 gene trees is generated from the species tree in Figure 1a using the program COAL (Degnan and Salter, 2005). Branches y and z were set to 1.0 coalescent units, while branch length x was varied between 0.2 and 1.0 in increments of 0.2, to include settings in which inference of the species tree is known to be difficult (Kubatko and Degnan, 2007). The second step is the simulation of DNA sequence data along the sampled gene trees using Seq-Gen (Rambaut and Grassly, 1997).
Fig. 1. (a) Model tree used for the simulations; (b) results of the simulations comparing the performance of STEM to concatenation in terms of the percentage of times the true species tree is obtained as a function of x.
Once the data are generated, ML estimates of the individual gene trees are obtained using the program PAUP* (Swofford, 2003) and then used as input to STEM. The entire simulation was repeated 100 times for each value of x. Figure 1b compares the results of the STEM program with the naive method of estimating a single ML tree from the concatenated sequence. For both methods (STEM and concatenation), the same mutation model (JC69) was used to generate data and to perform ML estimation in PAUP* in order to remove model misspecification as a source of error in species tree estimates. STEM clearly shows an improvement over concatenation in this setting, even when species tree branch lengths are short.
3 CONCLUSION
As the availability of multi-locus data for inference of species trees increases, the need for development of software to model relationships between gene and species trees is also increasing. STEM provides a computationally efficient method to estimate ML species phylogenies and to explore the likelihood surface under the coalescent model for a given sample of gene trees, and it will serve as a useful complement to the more computationally intensive Bayesian methods (Ane et al., 2007; Liu, 2008) currently available.
ACKNOWLEDGEMENTS
We thank Liang Liu for generously sharing manuscripts during development of this software, and James Degnan and other anonymous reviewers for helpful comments on an earlier version.
Funding: NSF DMS-07-02277 (L.S.K.); NSF DEB-04-47224 (L.L.K).
Conflict of Interest: none declared.
References
Ane et al. Bayesian estimation of concordance among gene trees, Mol. Biol. Evol., 2007, vol. 24 (pg. 412-426)
Chen and Li. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees, Am. J. Hum. Genet., 2001, vol. 68 (pg. 444-456)
Carstens and Knowles. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers, Syst. Biol., 2007, vol. 56 (pg. 400-411)
Degnan and Salter. Gene tree distributions under the coalescent process, Evolution, 2005, vol. 59 (pg. 24-37)
Kingman. The coalescent, Stoch. Proc. Appl., 1982, vol. 13 (pg. 235-248)
Kolaczkowski and Thornton. Performance of maximum parsimony and maximum likelihood phylogenetics when evolution is heterogeneous, Nature, 2004, vol. 431 (pg. 980-984)
Kubatko and Degnan. Inconsistency of phylogenetic estimates from concatenated data under coalescence, Syst. Biol., 2007, vol. 56 (pg. 17-24)
Liu. Reconstructing posterior distributions of a species phylogeny using estimated gene tree distributions, PhD. Dissertation, 2006
Liu. BEST: Bayesian estimation of species trees under the coalescent model, Bioinformatics, 2008, vol. 24 (pg. 2542-2543)
Liu and Pearl. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions, Syst. Biol., 2007, vol. 56 (pg. 504-514)
Maddison. Gene trees in species trees, Syst. Biol., 1997, vol. 46 (pg. 523-536)
Mossel and Roch. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci, IEEE/ACM Trans. Comput. Biol. Bioinform., 2009
Mossel and Vigoda. Phylogenetic MCMC algorithms are misleading on mixtures of trees, Science, 2005, vol. 309 (pg. 2207-2209)
Rambaut and Grassly. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees, Comput. Appl. Biosci., 1997, vol. 13 (pg. 235-238)
Rannala and Yang. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci, Genetics, 2003, vol. 164 (pg. 1645-1656)
Rokas et al. Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature, 2003, vol. 425 (pg. 798-804)
Salter and Pearl. A stochastic search strategy for estimation of maximum likelihood phylogenetic trees, Syst. Biol., 2001, vol. 50 (pg. 7-17)
Swofford. PAUP* Phylogenetic analysis using parsimony (* and other methods), Version 4, 2003, Sunderland, MA: Sinauer Associates
Watterson. On the number of segregation sites, Theor. Popul. Biol., 1975, vol. 7 (pg. 256-276)
Yang. Likelihood and Bayes estimation of ancestral population sizes in Hominoids using data from multiple loci, Genetics, 2002, vol. 162 (pg. 1811-1823)
Author notes
Associate Editor: Martin Bishop
© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org