Best-Fit Maximum-Likelihood Models for Phylogenetic Inference: Empirical Tests with Known Phylogenies
Related papers
Wellcome Open Research, 2018
Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined 'true tree' using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. For the most accurate tree, use of either RAxML or IQ-TREE with an alignment of variable sites produced by mapping to a reference genome is best. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology.
We have publicly released our simulated data and code to enable further comparisons.
Phylogenetic inference using molecular data
2009
We review phylogenetic inference methods with a special emphasis on inference from molecular data. We begin with a general comment on phylogenetic inference using DNA sequences, followed by a clear statement of the relevance of a good alignment of sequences. Then we provide a general description of models of sequence evolution, including evolutionary models that account for rate heterogeneity along the DNA sequences or complex secondary structure (i.e., ribosomal genes). We then present an overall description of the most relevant inference methods, focusing on key concepts of general interest. We point out the most relevant traits of methods such as maximum parsimony (MP), distance methods, maximum likelihood (ML) and Bayesian inference (BI). Finally, we discuss different measures of support for the estimated phylogeny and discuss how this relates to confidence in particular nodes of a phylogeny reconstruction.
BMC Evolutionary Biology, 2010
Background: Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory. Results: We demonstrate that the Bayesian information criterion and decision theory are the most appropriate model-selection criteria because of their high accuracy and precision. Our results also indicate that in some situations different models are selected by different criteria for the same dataset. Such dissimilarity was the highest between the hierarchical likelihood-ratio test and Akaike information criterion, and lowest between the Bayesian information criterion and decision theory. The hierarchical likelihood-ratio test performed poorly when the true model included a proportion of invariable sites, while the Bayesian information criterion and decision theory generally exhibited similar performance to each other.
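As a rough illustration of how two of the criteria named above can disagree on the same data set, the sketch below scores three substitution models with the standard AIC and BIC formulas. The log-likelihood values, the alignment length, and the choice to count only substitution-model parameters (not branch lengths) are hypothetical and are not taken from the study; using the number of sites as the BIC sample size is a common convention, assumed here.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits: model name -> (maximised lnL, free substitution parameters).
# JC69 has 0 free parameters, HKY85 has 4 (kappa + 3 base frequencies),
# GTR+G has 9 (5 rates + 3 base frequencies + 1 gamma shape).
fits = {
    "JC69": (-5230.4, 0),
    "HKY85": (-5105.7, 4),
    "GTR+G": (-5099.0, 9),
}
n_sites = 1000  # assumed alignment length, used as the BIC sample size

best_aic = min(fits, key=lambda m: aic(*fits[m]))
best_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))

print("AIC selects:", best_aic)
print("BIC selects:", best_bic)
```

With these made-up numbers the BIC's heavier per-parameter penalty favours the simpler model, while the AIC favours the more parameter-rich one.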
Model use in phylogenetics: nine key questions
Trends in Ecology & Evolution, 2007
Models of character evolution underpin all phylogeny estimations, thus model adequacy remains a crucial issue for phylogenetics and its many applications. Although progress has been made in selecting appropriate models for phylogeny estimation, there is still concern about their purpose and proper use. How do we interpret models in a phylogenetic context? What are their effects on phylogeny estimation? How can we improve confidence in the models that we choose? That the phylogenetics community is asking such questions denotes an important stage in the use of explicit models. Here, we examine these and other common questions and draw conclusions about how the community is using and choosing models, and where this process will take us next.
Impacts of Misspecifying the Evolutionary Model in Phylogenetic Tree Estimation
We consider phylogenetic tree estimation with emphasis on estimating the number of groups (clades). We rarely know the full evolutionary model, so we want to understand the impact of model estimation errors. Sensitivity to misspecifying the model or model parameters depends on how distinct the clades are, so it is important to consider differing degrees of clade resolution. We do this by varying the macroscopic growth rate and microscopic mutation rate of the taxa. We simulate DNA sequence data using coalescent theory to simulate the sample genealogy, and one of several mutation models. For each case, we compute all pairwise distances between sequences using the true and several alternate models. The within-group variance (with distance data represented in principal coordinates) is used to choose the number of clades. We conclude that the estimated number of clades can be sensitive to model estimation errors, to an extent determined by clade resolution.
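The step of computing all pairwise distances under different models can be sketched as follows. This is a minimal example, not the authors' code: it compares the uncorrected proportion of differing sites (p-distance) with the Jukes-Cantor (JC69) corrected distance, and the three toy sequences and function names are invented for illustration.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(a, b):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical aligned sequences (not real data).
seqs = {
    "t1": "ACGTACGTAC",
    "t2": "ACGTACGTAT",
    "t3": "ACGAACGTTT",
}

for (name1, s1), (name2, s2) in combinations(seqs.items(), 2):
    print(name1, name2,
          round(p_distance(s1, s2), 4),
          round(jc69_distance(s1, s2), 4))
```

The corrected distance is never smaller than the p-distance, and the gap widens as sequences diverge, which is one simple way the choice of model changes the distance matrix.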
Success of maximum likelihood phylogeny inference in the four-taxon case
Molecular Biology and Evolution, 1995
We used simulated data to investigate a number of properties of maximum-likelihood (ML) phylogenetic tree estimation for the case of four taxa. Simulated data were generated under a broad range of conditions, including wide variation in branch lengths, differences in the ratio of transition and transversion substitutions, and the absence or presence of gamma-distributed site-to-site rate variation. Data were analyzed in the ML framework with two different substitution models, and we compared the ability of the two models to reconstruct the correct topology. Although both models were inconsistent for some branch-length combinations in the presence of site-to-site variation, the models were efficient predictors of topology under most simulation conditions. We also examined the performance of the likelihood ratio (LR) test for significant positive interior branch length. This test was found to be misleading under many simulation conditions, rejecting too often under some simulation conditions. Under the null hypothesis of zero length internal branch, LR statistics are assumed to be asymptotically distributed χ²₁; with limited data, the distribution of LR statistics under the null hypothesis varies from χ²₁.
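For readers unfamiliar with the LR test mentioned here, the sketch below shows the usual calculation under the asymptotic assumption that the statistic follows a chi-square distribution with one degree of freedom, which is the assumption the abstract reports as unreliable with limited data. The two log-likelihood values are hypothetical, and the function names are illustrative only.

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood ratio statistic: 2 * (lnL_alternative - lnL_null)."""
    return 2.0 * (lnl_alt - lnl_null)

def chi2_1_sf(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    A chi-square(1) variable is the square of a standard normal,
    so the tail probability equals erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical maximised log-likelihoods (not from the paper):
lnl_null = -1234.56  # interior branch constrained to zero length
lnl_alt = -1231.10   # interior branch length estimated freely

stat = lrt_statistic(lnl_null, lnl_alt)
p_value = chi2_1_sf(stat)
print(f"LR statistic = {stat:.2f}, asymptotic p-value = {p_value:.4f}")
```

Because the null distribution can deviate from this reference distribution for short alignments, a p-value computed this way should be read as approximate.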
On the maximum likelihood method in molecular phylogenetics
Journal of Molecular Evolution, 1991
The efficiency of obtaining the correct tree by the maximum likelihood method (Felsenstein 1981) for inferring trees from DNA sequence data was compared with that of distance methods. It was shown that the maximum likelihood method is superior to distance methods in efficiency, particularly when the evolutionary rate differs among lineages.
The impact of sequence parameter values on phylogenetic accuracy
An accurately inferred phylogeny is important to the study of evolution. Factors affecting the accuracy of an inferred tree can be traced to several sequential steps leading to the inference of the phylogeny. We have examined here the impact of some features of nucleotide sequences in alignments on phylogenetic (topological) accuracy rather than any source of error during the process of sequence alignment or choice of the method of inference (as is usually done). Specifically, we have studied (using computer simulation) the implications of changing the values of the following five parameters, individually and in combination: sequence length (l), nucleotide substitution rate (r), nucleotide base composition (θ), the transition-transversion rate ratio (κ), and the substitution rate heterogeneity among the sites (α). An interesting, and unexpected, result was that κ has a strong positive relationship with phylogenetic accuracy, especially at high substitution rates. This simulation-based work has implications for empirical researchers in the field and should enable them to choose from among the multiple genes typically available today for a more accurate inference of the phylogeny being studied.