STEM: species tree estimation using maximum likelihood for gene trees under coalescence

Journal Article

1Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210, 2Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803 and 3Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA

*To whom correspondence should be addressed.

Received: 28 November 2008

Revision received: 04 February 2009

Accepted: 04 February 2009

Published: 10 February 2009

Abstract

Summary: STEM is a software package written in the C language to obtain maximum likelihood (ML) estimates for phylogenetic species trees given a sample of gene trees under the coalescent model. It includes options to compute the ML species tree, search the space of all species trees for the k trees of highest likelihood and compute ML branch lengths for a user-input species tree.

Availability: The STEM package, including source code, is freely available at http://www.stat.osu.edu/~lkubatko/software/STEM/.

Contact: lkubatko@stat.osu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The increasing availability of sequence data from multiple loci for inferring phylogenetic trees has led to a growing awareness that the evolutionary histories of individual genes may differ substantially from the underlying species tree. This incongruence can result from numerous processes, including horizontal transfer, gene duplication and incomplete lineage sorting (deep coalescence) (Maddison, 1997). When phylogenetic trees representing the species history are of primary interest, it is therefore necessary either to modify standard phylogenetic methods to handle multi-locus data, or to develop new methods that explicitly model the source of discord (Ane et al., 2007; Liu, 2008; Liu and Pearl, 2007). Although several recent studies have claimed that the commonly used procedure of concatenating multi-gene data prior to phylogenetic analysis performs well (Chen and Li, 2001; Rokas et al., 2003), others have highlighted situations in which such procedures fail (Carstens and Knowles, 2007; Kolaczkowski and Thornton, 2004; Kubatko and Degnan, 2007; Mossel and Vigoda, 2005).

Here, we describe a new software package called STEM that estimates the maximum likelihood (ML) species tree from a sample of gene trees, assuming that discord between the observed gene trees and the species tree arises solely from the coalescent process (Kingman, 1982). As is the case with other available programs for estimating species phylogenies from multilocus data [e.g. BEST (Liu, 2008)], STEM assumes no recombination within loci, free recombination between loci and no gene flow following speciation. STEM provides the analytically derived ML estimate of the species tree when only a single estimate is desired. In addition, STEM provides a capability for searching the space of species trees for a collection of k species trees with high likelihood, where k is set by the user. Finally, STEM can compute ML branch lengths on any given species tree, which both reduces the search for high-likelihood trees to a discrete (topology only) space and allows evaluation of any species tree of interest.

As noted above, the programs BEST (Liu, 2008; Liu and Pearl, 2007) and BUCKy (Ane et al., 2007) are related to STEM in that they also seek to provide a species-level phylogenetic estimate. However, STEM is distinct from these in that (i) it uses a maximum likelihood, rather than Bayesian, framework to obtain an estimate; and (ii) the availability of analytic results in the ML case using gene trees as the data allows computations to be carried out more rapidly than the Markov chain Monte Carlo (MCMC)-based analyses utilized by these programs.

2 DESCRIPTION

2.1 Phylogenetic model

Let $g_j$ denote the gene tree topology and branch lengths for the tree representing locus $j$ ($j = 1, 2, \ldots, N$) in a sample of $N$ loci. Assuming that the $N$ loci are sampled independently throughout the genome, the likelihood function is

$$L(S, \tau \mid g_1, g_2, \ldots, g_N) = \prod_{j=1}^{N} f(g_j \mid S, \tau) \qquad (1)$$

where S represents the species tree and τ is the set of branch lengths on that tree. The function f(.|.) is the gene tree density under the coalescent model given by Rannala and Yang (2003). We note that this density is general enough to allow samples of multiple lineages per species-level taxon. Membership of alleles to species-level taxa is specified as input to STEM.
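Because the loci are treated as independent, the likelihood in (1) is usually handled on the log scale as a sum of per-locus terms. The following is a minimal sketch of that structure, not STEM's implementation. The function names are hypothetical, and the density shown covers only a single population with one θ (each pair of lineages coalescing at rate 2/θ, with times in substitutions per site); the full gene tree density multiplies such terms over all populations in the species tree.

```python
import math
from typing import Callable, Optional, Sequence


def single_population_log_density(
    waiting_times: Sequence[float],
    n_start: int,
    theta: float,
    tau: Optional[float] = None,
) -> float:
    """Log density of coalescent waiting times within ONE population.

    waiting_times: successive waiting times between coalescent events,
        starting from n_start lineages (units: substitutions per site).
    tau: length of the population's branch, or None for the root
        population, which has no upper time limit.
    Illustrative simplification: one theta, one population.
    """
    logf = 0.0
    lineages = n_start
    elapsed = 0.0
    for t in waiting_times:
        # Each coalescent event contributes (2/theta) * exp(-j(j-1) t / theta).
        logf += math.log(2.0 / theta) - lineages * (lineages - 1) * t / theta
        elapsed += t
        lineages -= 1
    if tau is not None:
        # Remaining lineages fail to coalesce for the rest of the branch.
        logf -= lineages * (lineages - 1) * (tau - elapsed) / theta
    return logf


def log_likelihood(gene_trees: Sequence, log_density: Callable) -> float:
    """Sum of per-locus log densities, i.e. the log of the product in (1)."""
    return sum(log_density(g) for g in gene_trees)


# Two toy "loci", each a single coalescence of two lineages in a root population.
loci = [[0.5], [0.25]]
total = log_likelihood(
    loci, lambda w: single_population_log_density(w, n_start=2, theta=1.0)
)
print(total)
```

Working with sums of logs avoids numerical underflow when many loci are multiplied together.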

The likelihood in (1) is a function of the parameter $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the per-site mutation rate. In the most general case, θ may vary along species tree branches. However, it is not uncommon to assume a single θ for the entire tree. For example, Liu (2006) showed that when it can be assumed that there is a single θ for the entire tree, it is possible to analytically derive the joint ML estimate of θ and of the species tree topology and branch lengths. He calls the estimator of the tree obtained in this way the Maximum Tree (MT), and shows that it is a consistent estimator of the species tree when the gene trees and branch lengths are known without error.

Mossel and Roch (2009) also consider a sample of gene trees with branch lengths known without error and derive a consistent estimator of the species tree in the case in which θ is known (but not necessarily equal) for all branches of the species tree, which they call the GLASS tree (an acronym for Global LAteSt Split, which is derived from the method used to compute it). The GLASS tree coincides with MT whenever it can be assumed that the θ along all branches of the species tree are the same and take their value from the MLE for θ. The relationship of the ML tree returned by STEM to these methods is noted below.
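The "latest split" idea can be sketched in a few lines. The example below assumes the usual description of the GLASS construction: for every pair of species, take the minimum coalescence time between their alleles over all loci, then cluster the species by single linkage on those minima. The input format (one dictionary of pairwise times per locus) and the function name are hypothetical choices for illustration, and this is not the code used in STEM.

```python
from itertools import combinations


def glass_tree(species, locus_pair_times):
    """Build a GLASS-style tree by single-linkage clustering.

    species: list of species names.
    locus_pair_times: one dict per locus mapping frozenset({a, b}) to the
        smallest coalescence time between any allele of a and any allele of b.
    Returns (newick_like_string, list of (clade, merge_time)).
    """
    # Step 1: minimum interspecies coalescence time across all loci.
    d = {}
    for a, b in combinations(species, 2):
        key = frozenset((a, b))
        d[key] = min(times[key] for times in locus_pair_times)

    # Step 2: repeatedly join the two clusters with the smallest minimum distance.
    clusters = {frozenset([s]): s for s in species}
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(list(clusters), 2):
            dist = min(d[frozenset((a, b))] for a in c1 for b in c2)
            if best is None or dist < best[0]:
                best = (dist, c1, c2)
        dist, c1, c2 = best
        label = "(" + ",".join(sorted([clusters.pop(c1), clusters.pop(c2)])) + ")"
        clusters[c1 | c2] = label
        merges.append((label, dist))
    return next(iter(clusters.values())), merges


locus1 = {frozenset("AB"): 0.3, frozenset("AC"): 1.0, frozenset("BC"): 0.9}
locus2 = {frozenset("AB"): 0.5, frozenset("AC"): 0.8, frozenset("BC"): 1.2}
tree, merges = glass_tree(["A", "B", "C"], [locus1, locus2])
print(tree)    # ((A,B),C)
print(merges)  # [('(A,B)', 0.3), ('((A,B),C)', 0.8)]
```

Taking the minimum across loci reflects the fact that, without gene flow, no pair of alleles can coalesce more recently than the divergence of the species that carry them.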

Input to the STEM program requires a sample of gene trees with branch lengths in units of expected number of nucleotide substitutions per site, along with an overall value of θ to be applied to all loci. The value of θ is used to convert gene tree branch lengths into coalescent units (number of $2N_e$ generations) by multiplying all gene tree branch lengths by 1/θ. Further, because evolutionary rates may vary across sampled loci, the user may also provide rates to be applied to each locus separately. For example, if rate $r_i$ is specified for locus $i$, then all branch lengths in gene tree $i$ will be additionally multiplied by $1/r_i$. In addition to adjusting for variation in the mutation rate of each locus, the $r_i$ values allow the user to adjust for ploidy in the individual genes (e.g. the rate provided for an mtDNA locus should be divided by 2 to incorporate the haploid status of this marker). While selection of the θ and $r_i$ values is completely at the discretion of the user, reasonable settings for these parameters can be straightforwardly obtained. For example, the θ parameter could be estimated by some available method, such as Watterson's estimator (Watterson, 1975). The $r_i$ values could be estimated by examining average divergence from an outgroup, as suggested by Yang (2002).
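The two preprocessing steps described above, choosing θ and rescaling branch lengths, can be illustrated as follows. This is a minimal sketch with hypothetical function names; it does not reproduce STEM's input handling. The per-site form of Watterson's estimator used here is the number of segregating sites divided by the harmonic number $a_n = \sum_{i=1}^{n-1} 1/i$ and by the alignment length.

```python
def watterson_theta(segregating_sites: int, n_sequences: int, n_sites: int) -> float:
    """Per-site Watterson estimate of theta from one alignment."""
    a_n = sum(1.0 / i for i in range(1, n_sequences))
    return segregating_sites / (a_n * n_sites)


def rescale_branch_lengths(branch_lengths, theta: float, rate: float = 1.0):
    """Multiply every branch length by 1/theta and by 1/rate (the locus rate r_i)."""
    factor = 1.0 / (theta * rate)
    return [b * factor for b in branch_lengths]


# Example: 10 segregating sites among 5 sequences of length 1000.
theta_hat = watterson_theta(10, 5, 1000)

# Branch lengths (substitutions per site) for one locus with relative rate 2.0.
scaled = rescale_branch_lengths([0.01, 0.02], theta=0.01, rate=2.0)
print(theta_hat, scaled)
```

Keeping the rate as a separate argument mirrors the text: θ is shared across loci, while the rate adjusts each locus individually.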

2.2 STEM output

When the ML estimate of the species tree is requested, STEM returns the MT of Liu (2006) for the particular user-specified values of θ and the gene-specific rates. STEM is also able to evaluate the likelihood for any given species tree rapidly by incorporating a new result that analytically derives ML branch lengths for an arbitrary species tree under (1). The details of this result, which is an extension of the work of Liu (2006), are provided in Supplementary Material 1. In addition, STEM includes an option to search the space of species tree topologies for a set of species trees of high likelihood using a simulated annealing algorithm, similar to that used by Salter and Pearl (2001).
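A generic simulated annealing loop that retains the k best states visited might look like the sketch below. It is not the algorithm of Salter and Pearl (2001) or STEM's search: the cooling schedule, the parameter names and the toy integer state space are assumptions chosen only to show the accept/reject rule and the bookkeeping for the k best states. In a real search, the state would be a tree topology, `neighbor` a topology rearrangement, and `log_lik` the likelihood evaluated at the ML branch lengths.

```python
import math
import random


def anneal_top_k(start, neighbor, log_lik, k=3, steps=2000,
                 t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing that returns the k highest-scoring states seen.

    States must be hashable. neighbor(state, rng) proposes a new state;
    log_lik(state) scores it (higher is better).
    """
    rng = random.Random(seed)
    current, current_ll = start, log_lik(start)
    best = {start: current_ll}
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_ll = log_lik(cand)
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature falls.
        if cand_ll >= current_ll or rng.random() < math.exp((cand_ll - current_ll) / temp):
            current, current_ll = cand, cand_ll
        # Record every evaluated state, then keep only the k best.
        best[cand] = cand_ll
        if len(best) > k:
            del best[min(best, key=best.get)]
        temp *= cooling
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Toy problem standing in for tree space: integers 0..20 with a peak at 7.
def toy_log_lik(x):
    return -(x - 7) ** 2


def toy_neighbor(x, rng):
    return max(0, min(20, x + rng.choice((-1, 1))))


print(anneal_top_k(0, toy_neighbor, toy_log_lik, k=3))
```

Because every evaluated state is recorded, the list of k best states also includes good candidates that the chain proposed but did not move to.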

2.3 Performance

We demonstrate the usefulness of the STEM package using simulated data. First, a sample of 10 gene trees is generated from the species tree in Figure 1a using the program COAL (Degnan and Salter, 2005). Branches y and z were set to 1.0 coalescent units, while branch length x was varied between 0.2 and 1.0 in increments of 0.2, to include settings in which inference of the species tree is known to be difficult (Kubatko and Degnan, 2007). The second step is the simulation of DNA sequence data along the sampled gene trees using Seq-Gen (Rambaut and Grassly, 1997).
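To see why short internal branches make inference difficult, consider the simplest case of a three-species tree with one sampled lineage per species. This is an illustrative assumption and not the model tree of Figure 1a. Under the coalescent, the probability that a gene tree has the same topology as the species tree is $1 - \tfrac{2}{3}e^{-x}$, where $x$ is the internal branch length in coalescent units. The sketch below evaluates this over the same range of x used in the simulations.

```python
import math


def prob_gene_tree_matches(x: float) -> float:
    """P(gene tree topology = species tree topology) for three species,
    one lineage each, internal branch length x in coalescent units."""
    return 1.0 - (2.0 / 3.0) * math.exp(-x)


for x in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"x = {x:.1f}: P(match) = {prob_gene_tree_matches(x):.3f}")
```

At x = 0.2 fewer than half of the gene trees are expected to match the species tree in this three-species setting, so individual loci frequently disagree with the species history.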

Fig. 1. (a) Model tree used for the simulations; (b) results of the simulations comparing the performance of STEM to concatenation in terms of the percentage of times the true species tree is obtained as a function of x.

Once the data are generated, ML estimates of the individual gene trees are obtained using the program PAUP* (Swofford, 2003) and then used as input to STEM. The entire simulation was repeated 100 times for each value of x. Figure 1b compares the results of the STEM program with the naive method of estimating a single ML tree from the concatenated sequence. For both methods (STEM and concatenation), the same mutation model (JC69) was used to generate data and to perform ML estimation in PAUP* in order to remove model misspecification as a source of error in species tree estimates. STEM clearly shows an improvement over concatenation in this setting, even when species tree branch lengths are short.

3 CONCLUSION

As the availability of multi-locus data for inference of species trees increases, the need for development of software to model relationships between gene and species trees is also increasing. STEM provides a computationally efficient method to estimate ML species phylogenies and to explore the likelihood surface under the coalescent model for a given sample of gene trees. It should serve as a useful complement to the more computationally intensive Bayesian methods (Ane et al., 2007; Liu, 2008) currently available.

ACKNOWLEDGEMENTS

We thank Liang Liu for generously sharing manuscripts during development of this software, and James Degnan and other anonymous reviewers for helpful comments on an earlier version.

Funding: NSF DMS-07-02277 (L.S.K.); NSF DEB-04-47224 (L.L.K.).

Conflict of Interest: none declared.

References

Ane et al. Bayesian estimation of concordance among gene trees, Mol. Biol. Evol., 2007, vol. 24 (pg. 412-426)

Chen and Li. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees, Am. J. Hum. Genet., 2001, vol. 68 (pg. 444-456)

Carstens and Knowles. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers, Syst. Biol., 2007, vol. 56 (pg. 400-411)

Degnan and Salter. Gene tree distributions under the coalescent process, Evolution, 2005, vol. 59 (pg. 24-37)

Kingman. The coalescent, Stoch. Proc. Appl., 1982, vol. 13 (pg. 235-248)

Kolaczkowski and Thornton. Performance of maximum parsimony and maximum likelihood phylogenetics when evolution is heterogeneous, Nature, 2004, vol. 431 (pg. 980-984)

Kubatko and Degnan. Inconsistency of phylogenetic estimates from concatenated data under coalescence, Syst. Biol., 2007, vol. 56 (pg. 17-24)

Liu. Reconstructing posterior distributions of a species phylogeny using estimated gene tree distributions, PhD. Dissertation, 2006

Liu. BEST: Bayesian estimation of species trees under the coalescent model, Bioinformatics, 2008, vol. 24 (pg. 2542-2543)

Liu and Pearl. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions, Syst. Biol., 2007, vol. 56 (pg. 504-514)

Maddison. Gene trees in species trees, Syst. Biol., 1997, vol. 46 (pg. 523-536)

Mossel and Roch. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci, IEEE/ACM Trans. Comput. Biol. Bioinform., 2009

Mossel and Vigoda. Phylogenetic MCMC algorithms are misleading on mixtures of trees, Science, 2005, vol. 309 (pg. 2207-2209)

Rambaut and Grassly. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees, Comput. Appl. Biosci., 1997, vol. 13 (pg. 235-238)

Rannala and Yang. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci, Genetics, 2003, vol. 164 (pg. 1645-1656)

Rokas et al. Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature, 2003, vol. 425 (pg. 798-804)

Salter and Pearl. A stochastic search strategy for estimation of maximum likelihood phylogenetic trees, Syst. Biol., 2001, vol. 50 (pg. 7-17)

Swofford. PAUP* Phylogenetic analysis using parsimony (* and other methods), Version 4, 2003, Sunderland, MA, Sinauer Associates

Watterson. On the number of segregation sites, Theor. Popul. Biol., 1975, vol. 7 (pg. 256-276)

Yang. Likelihood and Bayes estimation of ancestral population sizes in Hominoids using data from multiple loci, Genetics, 2002, vol. 162 (pg. 1811-1823)

Author notes

Associate Editor: Martin Bishop

© The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
