STEM: species tree estimation using maximum likelihood for gene trees under coalescence

Journal Article

1Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, 2Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803 and 3Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA

*To whom correspondence should be addressed.


Received: 28 November 2008

Revision received: 04 February 2009

Accepted: 04 February 2009

Published: 10 February 2009


Abstract

Summary: STEM is a software package written in the C language to obtain maximum likelihood (ML) estimates for phylogenetic species trees given a sample of gene trees under the coalescent model. It includes options to compute the ML species tree, search the space of all species trees for the k trees of highest likelihood and compute ML branch lengths for a user-input species tree.

Availability: The STEM package, including source code, is freely available at http://www.stat.osu.edu/~lkubatko/software/STEM/.

Contact: lkubatko@stat.osu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The increasing availability of sequence data from multiple loci for inferring phylogenetic trees has led to a growing awareness that the evolutionary histories of individual genes may differ substantially from the underlying species tree. This incongruence can result from numerous processes, including horizontal transfer, gene duplication and incomplete lineage sorting (deep coalescence) (Maddison, 1997). When phylogenetic trees representing the species history are of primary interest, it is therefore necessary either to modify standard phylogenetic methods to handle multi-locus data, or to develop new methods that explicitly model the source of discord (Ane et al., 2007; Liu, 2008; Liu and Pearl, 2007). Although several recent studies have claimed that the commonly used procedure of concatenating multi-gene data prior to phylogenetic analysis performs well (Chen and Li, 2001; Rokas et al., 2003), others have highlighted situations in which such procedures fail (Carstens and Knowles, 2007; Kolaczkowski and Thornton, 2004; Kubatko and Degnan, 2007; Mossel and Vigoda, 2005).

Here, we describe a new software package called STEM that estimates the maximum likelihood (ML) species tree from a sample of gene trees, assuming that discord between the observed gene trees and the species tree arises solely from the coalescent process (Kingman, 1982). As is the case with other available programs for estimating species phylogenies from multilocus data [e.g. BEST (Liu, 2008)], STEM assumes no recombination within loci, free recombination between loci and no gene flow following speciation. STEM provides the analytically derived ML estimate of the species tree when only a single estimate is desired. In addition, STEM can search the space of species trees for a collection of k species trees with high likelihood, where k is set by the user. Finally, STEM can compute ML branch lengths on any given species tree. This reduces the search for high-likelihood trees to a discrete (topology only) space and allows evaluation of any species tree of interest.

As noted above, the programs BEST (Liu, 2008; Liu and Pearl, 2007) and BUCKy (Ane et al., 2007) are related to STEM in that they also seek to provide a species-level phylogenetic estimate. However, STEM is distinct from these in that (i) it uses a maximum likelihood, rather than Bayesian, framework to obtain an estimate; and (ii) the availability of analytic results in the ML case using gene trees as the data allows computations to be carried out more rapidly than the Markov chain Monte Carlo (MCMC)-based analyses utilized by these programs.

2 DESCRIPTION

2.1 Phylogenetic model

Let $g_j$ denote the gene tree topology and branch lengths for the tree representing locus $j$ ($j = 1, 2, \ldots, N$) in a sample of $N$ loci. Assuming that the $N$ loci are sampled independently throughout the genome, the likelihood function is

$$L(S, \tau \mid g_1, g_2, \ldots, g_N) = \prod_{j=1}^{N} f(g_j \mid S, \tau) \qquad (1)$$

where S represents the species tree and τ is the set of branch lengths on that tree. The function f(.|.) is the gene tree density under the coalescent model given by Rannala and Yang (2003). We note that this density is general enough to allow samples of multiple lineages per species-level taxon. Membership of alleles to species-level taxa is specified as input to STEM.
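As a rough illustration of how independence across loci lets the log-likelihood be accumulated term by term, the sketch below sums a per-branch coalescent log-density over the branches and loci. It is a minimal sketch, not STEM's C implementation: the function names and the input layout are hypothetical, and it assumes times are scaled so that each pair of lineages coalesces at rate 1 and that the waiting times within each species tree branch have already been extracted from the gene tree.

```python
def branch_log_density(num_entering, waiting_times, branch_length):
    """Log-density of the coalescent events inside one species tree branch.

    Assumption: time is scaled so each pair of lineages coalesces at rate 1.
    num_entering  -- lineages entering the branch (at its tipward end)
    waiting_times -- successive waiting times to each coalescence in the branch
    branch_length -- branch length (float('inf') is allowed for the root)
    """
    k = num_entering
    elapsed = 0.0
    log_f = 0.0
    for t in waiting_times:
        # With k lineages there are k(k-1)/2 pairs, each coalescing at rate 1.
        log_f -= k * (k - 1) / 2 * t
        elapsed += t
        k -= 1
    if elapsed > branch_length:
        raise ValueError("coalescent events exceed the branch length")
    if k > 1:
        # Remaining lineages must leave the branch without coalescing.
        log_f -= k * (k - 1) / 2 * (branch_length - elapsed)
    return log_f


def log_likelihood(loci):
    """Sum branch contributions over all branches of all loci.

    loci -- list of loci; each locus is a list of
            (num_entering, waiting_times, branch_length) tuples.
    """
    return sum(branch_log_density(*branch) for locus in loci for branch in locus)
```

For example, a locus in which two lineages coalesce after 0.3 time units in a branch of length 1.0 contributes -0.3, and a locus in which three lineages pass through a branch of length 0.5 without coalescing contributes -1.5.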

The likelihood in (1) is a function of the parameter $\theta = 4N_e\mu$, where $N_e$ is the effective population size and μ is the per-site mutation rate. In the most general case, θ may vary along species tree branches. However, it is not uncommon to assume a single θ for the entire tree. For example, Liu (2006) showed that when it can be assumed that there is a single θ for the entire tree, it is possible to analytically derive the joint ML estimate of θ and of the species tree topology and branch lengths. He calls the estimator of the tree obtained in this way the Maximum Tree (MT), and shows that it is a consistent estimator of the species tree when the gene trees and branch lengths are known without error.

Mossel and Roch (2009) also consider a sample of gene trees with branch lengths known without error and derive a consistent estimator of the species tree in the case in which θ is known (but not necessarily equal) for all branches of the species tree, which they call the GLASS tree (an acronym for Global LAteSt Split, which is derived from the method used to compute it). The GLASS tree coincides with MT whenever it can be assumed that the θ along all branches of the species tree are the same and take their value from the MLE for θ. The relationship of the ML tree returned by STEM to these methods is noted below.
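The idea behind a GLASS-style construction can be sketched as follows: for every pair of species, take the smallest coalescence time observed between them in any gene tree, then join clusters in order of increasing minimum time (single-linkage clustering). This is a minimal sketch under stated assumptions, not the code used by STEM or by the cited authors. The function names and the input layout (one dictionary of between-species coalescence times per locus, already minimised over alleles within each locus) are hypothetical.

```python
from itertools import combinations


def min_coalescence_times(per_locus_times):
    """For every species pair, keep the smallest coalescence time over all loci."""
    best = {}
    for locus in per_locus_times:
        for pair, t in locus.items():
            key = frozenset(pair)
            if key not in best or t < best[key]:
                best[key] = t
    return best


def glass_style_tree(species, per_locus_times):
    """Single-linkage clustering on minimum between-species coalescence times.

    Returns the tree as nested tuples and the list of merges
    (sorted member list, merge time) in the order they were made.
    """
    best = min_coalescence_times(per_locus_times)
    # Each cluster maps its set of species to the subtree built so far.
    clusters = {frozenset([s]): s for s in species}

    def dist(a, b):
        # Single linkage: closest pair of species across the two clusters.
        return min(best[frozenset((x, y))] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: dist(*ab))
        t = dist(a, b)
        subtree = (clusters.pop(a), clusters.pop(b))
        clusters[a | b] = subtree
        merges.append((sorted(a | b), t))
    (tree,) = clusters.values()
    return tree, merges
```

With two loci whose minimum times are 0.5 for (A, B), 1.0 for (A, C) and 0.8 for (B, C), the sketch first joins A and B at 0.5 and then joins C to that cluster at 0.8.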

Input to the STEM program requires a sample of gene trees with branch lengths in units of expected number of nucleotide substitutions per site along with an overall value of θ to be applied to all loci. The value of θ is used to convert gene tree branch lengths into coalescent units (number of $2N_e$ generations) by multiplying all gene tree branch lengths by 1/θ. Further, because evolutionary rates may vary across sampled loci, the user may also provide rates to be applied to each locus separately. For example, if rate $r_i$ is specified for locus $i$, then all branch lengths in gene tree $i$ will be additionally multiplied by $1/r_i$. In addition to adjusting for variation in the mutation rate of each locus, the $r_i$ values allow the user to adjust for ploidy in the individual genes (e.g. the rate provided for an mtDNA locus should be divided by 2 to incorporate the haploid status of this marker). While selection of the θ and $r_i$ values is completely at the discretion of the user, reasonable settings for these parameters can be straightforwardly obtained. For example, the θ parameter could be estimated by some available method, such as Watterson's estimator (Watterson, 1975). The $r_i$ values could be estimated by examining average divergence from an outgroup, as suggested by Yang (2002).
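The rescaling described above amounts to multiplying every branch length of gene tree i by 1/(θ r_i). The sketch below pairs that step with a per-site Watterson estimate of θ, computed as the number of segregating sites divided by the harmonic number a_n = 1 + 1/2 + ... + 1/(n-1) and by the alignment length. It is a minimal illustration only: the function names and arguments are hypothetical and are not part of STEM's input format.

```python
def watterson_theta(num_segregating_sites, num_sequences, num_sites):
    """Per-site Watterson estimate of theta from one alignment."""
    if num_sequences < 2:
        raise ValueError("need at least two sequences")
    # a_n = sum_{i=1}^{n-1} 1/i
    a_n = sum(1.0 / i for i in range(1, num_sequences))
    return num_segregating_sites / (a_n * num_sites)


def to_coalescent_units(branch_lengths, theta, rate=1.0):
    """Rescale substitutions-per-site branch lengths by 1/theta and 1/rate."""
    if theta <= 0 or rate <= 0:
        raise ValueError("theta and rate must be positive")
    scale = 1.0 / (theta * rate)
    return [length * scale for length in branch_lengths]
```

For instance, with θ = 0.004 and a locus rate of 0.5, branch lengths of 0.002 and 0.004 substitutions per site are rescaled to 1.0 and 2.0.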

2.2 STEM output

When the ML estimate of the species tree is requested, STEM returns the MT of Liu (2006) for the particular user-specified values of θ and the gene-specific rates. STEM is also able to evaluate the likelihood for any given species tree rapidly by incorporating a new result that analytically derives ML branch lengths for an arbitrary species tree under (1). The details of this result, which is an extension of the work of Liu (2006), are provided in Supplementary Material 1. In addition, STEM includes an option to search the space of species trees for a set of trees of high likelihood using a simulated annealing algorithm, similar to that used by Salter and Pearl (2001).
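A generic simulated annealing search that keeps the k best distinct states it has visited can be sketched as below. This is not STEM's search: the cooling schedule, the parameter defaults and the function names are assumptions made for illustration, and the caller supplies the scoring function (for species trees, the log-likelihood with ML branch lengths) and the move that proposes a neighbouring state (for species trees, a topology rearrangement).

```python
import math
import random


def anneal_top_k(start, score, neighbor, k=3, steps=2000,
                 t0=1.0, cooling=0.995, seed=0):
    """Maximise `score` by simulated annealing; return the k best states seen.

    start    -- initial state (must be hashable)
    score    -- function state -> float, larger is better
    neighbor -- function (state, rng) -> proposed state
    Returns a list of (state, score) pairs sorted by decreasing score.
    """
    rng = random.Random(seed)
    current, current_score = start, score(start)
    visited = {start: current_score}  # distinct accepted states and their scores
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept worse states with prob. exp(delta/T).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = cand, cand_score
            visited[current] = current_score
        temp = max(temp * cooling, 1e-12)  # geometric cooling with a floor
    ranked = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]
```

Because every accepted state is recorded, the returned list is ranked by score no matter where the chain ends, which mirrors the goal of reporting several high-likelihood trees and not just one.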

2.3 Performance

We demonstrate the usefulness of the STEM package using simulated data. First, a sample of 10 gene trees is generated from the species tree in Figure 1a using the program COAL (Degnan and Salter, 2005). Branches y and z were set to 1.0 coalescent units, while branch length x was varied between 0.2 and 1.0 in increments of 0.2, to include settings in which inference of the species tree is known to be difficult (Kubatko and Degnan, 2007). The second step is the simulation of DNA sequence data along the sampled gene trees using Seq-Gen (Rambaut and Grassly, 1997).

Fig. 1. (a) Model tree used for the simulations; (b) results of the simulations comparing the performance of STEM to concatenation in terms of the percentage of times the true species tree is obtained as a function of x.

Once the data are generated, ML estimates of the individual gene trees are obtained using the program PAUP* (Swofford, 2003) and then used as input to STEM. The entire simulation was repeated 100 times for each value of x. Figure 1b compares the results of the STEM program with the naive method of estimating a single ML tree from the concatenated sequence. For both methods (STEM and concatenation), the same mutation model (JC69) was used to generate data and to perform ML estimation in PAUP* in order to remove model misspecification as a source of error in species tree estimates. STEM clearly shows an improvement over concatenation in this setting, even when species tree branch lengths are short.

3 CONCLUSION

As the availability of multi-locus data for inference of species trees increases, the need for development of software to model relationships between gene and species trees is also increasing. STEM provides a computationally efficient method to estimate ML species phylogenies and to explore the likelihood surface under the coalescent model for a given sample of gene trees. It will serve as a useful complement to the more computationally intensive Bayesian methods (Ane et al., 2007; Liu, 2008) currently available.

ACKNOWLEDGEMENTS

We thank Liang Liu for generously sharing manuscripts during development of this software, and James Degnan and other anonymous reviewers for helpful comments on an earlier version.

Funding: NSF DMS-07-02277 (L.S.K.); NSF DEB-04-47224 (L.L.K.).

Conflict of Interest: none declared.

References

Ane et al. (2007) Bayesian estimation of concordance among gene trees. Mol. Biol. Evol., pp. 412-426.

Chen and Li (2001) Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet., pp. 444-456.

Carstens and Knowles (2007) Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers. Syst. Biol., pp. 400-411.

Degnan and Salter (2005) Gene tree distributions under the coalescent process. Evolution.

Kingman (1982) The coalescent. Stoch. Proc. Appl., pp. 235-248.

Kolaczkowski and Thornton (2004) Performance of maximum parsimony and maximum likelihood phylogenetics when evolution is heterogeneous. Nature, vol. 431, pp. 980-984.

Kubatko and Degnan (2007) Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol.

Liu (2006) Reconstructing posterior distributions of a species phylogeny using estimated gene tree distributions. PhD Dissertation.

Liu (2008) BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, pp. 2542-2543.

Liu and Pearl (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol., pp. 504-514.

Maddison (1997) Gene trees in species trees. Syst. Biol., pp. 523-536.

Mossel and Roch (2009) Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Trans. Comput. Biol. Bioinform.

Mossel and Vigoda (2005) Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science, vol. 309, pp. 2207-2209.

Rambaut and Grassly (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci., pp. 235-238.

Rannala and Yang (2003) Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics, vol. 164, pp. 1645-1656.

Rokas et al. (2003) Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature, vol. 425, pp. 798-804.

Salter and Pearl (2001) A stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst. Biol.

Swofford (2003) PAUP* Phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, MA.

Watterson (1975) On the number of segregating sites. Theor. Popul. Biol., pp. 256-276.

Yang (2002) Likelihood and Bayes estimation of ancestral population sizes in Hominoids using data from multiple loci. Genetics, vol. 162, pp. 1811-1823.

Author notes

Associate Editor: Martin Bishop
