Standardized Phylogenetic Tree: A Reference to Discover Functional Evolution
Related papers
Zenodo (CERN European Organization for Nuclear Research), 2020
How gene function evolves is a central question of evolutionary biology. It can be investigated by comparing functional genomics results between species and between genes. Most comparative studies of functional genomics have used pairwise comparisons. Yet it has been shown that this can provide biased results, as genes, like species, are phylogenetically related. Phylogenetic comparative methods should be used to correct for this, but they depend on strong assumptions, including unbiased tree estimates relative to the hypothesis being tested. Such methods have recently been used to test the "ortholog conjecture," the hypothesis that functional evolution is faster in paralogs than in orthologs. Although pairwise comparisons of tissue specificity (τ) provided support for the ortholog conjecture, phylogenetic independent contrasts did not. Our reanalysis of the same gene trees identified problems with the time calibration of duplication nodes. We find that the gene trees used suffer from important biases, due to the inclusion of trees with no duplication nodes, to the relative age of speciations and duplications, to systematic differences in branch lengths, and to non-Brownian motion of tissue specificity on many trees. We find that incorrect implementation of phylogenetic methods on empirical gene trees with duplications can be problematic. Controlling for these biases allows successful use of phylogenetic methods to study the evolution of gene function and provides some support for the ortholog conjecture using three different phylogenetic approaches.
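As a rough illustration of the tissue-specificity measure discussed in the abstract above, the sketch below computes the τ index in its commonly used form, τ = Σ(1 − x_i / x_max) / (n − 1), where x_i is a gene's expression in tissue i. The abstract does not state the exact formula or any preprocessing (such as log transformation), so the function name `tau_index` and the input format are assumptions made for illustration only.

```python
def tau_index(expression):
    """Tissue-specificity index tau for one gene.

    `expression` is a sequence of non-negative expression values,
    one per tissue (hypothetical input format). Returns a value in
    [0, 1]: 0 for uniform expression, 1 for a single-tissue gene.
    """
    if len(expression) < 2:
        raise ValueError("need expression values for at least two tissues")
    if any(x < 0 for x in expression):
        raise ValueError("expression values must be non-negative")
    peak = max(expression)
    if peak == 0:
        raise ValueError("gene is not expressed in any tissue")
    # Sum of each tissue's shortfall relative to the peak tissue,
    # normalized by the number of non-peak tissues.
    return sum(1 - x / peak for x in expression) / (len(expression) - 1)


if __name__ == "__main__":
    print(tau_index([5, 5, 5, 5]))    # uniformly expressed gene
    print(tau_index([0, 0, 10, 0]))   # gene expressed in one tissue only
```

A pairwise comparison in the sense of the abstract would then contrast τ between two genes directly, whereas a phylogenetic method would treat τ as a trait evolving along the gene tree.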
Refined Evolutionary Trees Through an Exceptionally Compatible Alignment-Substitution Model
Journal of applied biology and biotechnology, 2024
A phylogenetic tree commonly represents evolutionary relationships within a set of protein sequences. Various methods and strategies have been used to improve the accuracy of phylogenetic trees, but their capacity to derive a biologically credible relationship appears to be overestimated. Although the quality of the protein sequence alignment and the choice of substitution matrix are preliminary constraints that define the biological accuracy of the overlapped residues, the alignment is not iteratively optimized through the statistical testing of residue-substitution models. By default, a server applies the same alignment protocol and substitution model to every sequence set, often producing an irrelevant phylogenetic tree, and no server tailors the phylogenetic strategy to the sequences. By rigorously constructing 270 evolutionary trees with IQ-TREE, based on 13 different alignments (Clustal-Omega, Kalign, MAFFT, MUSCLE, TCoffee, and Promals3D, as well as their HHPred-based hidden Markov model [HMM] alignments) and nine substitution models (Dayhoff, JTT, BLOSUM62 [blocks substitution matrix 62], WAG, probability matrix from blocks [PMB], direct computation with mutability [DCMUT], JTTDCmut, LG, and variable time [VT]), the present study highlights the failure of the current methods and emphasizes the need for closer scrutiny of the entire phylogenetic methodology. MUSCLE alignment and the LG and Dayhoff matrices yield more accurate phylogenetic results for sequences shorter than 500 residues by the log-likelihood measure. Moreover, the Kalign 1 HMM alignment yields the top-ranked tree, with the lowest tree length score, only with the PMB matrix, making this substitution model more accurate in terms of total tree length score. The suggested strategy would be beneficial for understanding the potential pitfalls of phylogenetic inference and would aid us in deriving a more accurate evolutionary relationship for a sequence dataset.
Evolution of genes and taxa: a primer
Plant Molecular Evolution, 2000
The rapidly growing fields of molecular evolution and systematics have much to offer to molecular biology, but like any field have their own repertoire of terms and concepts. Homology, for example, is a central theme in evolutionary biology whose definition is complex and often controversial. Homology extends to multigene families, where the distinction between orthology and paralogy is key. Nucleotide sequence alignment is also a homology issue, and is a key stage in any evolutionary analysis of sequence data. Models based on our understanding of the processes of nucleotide substitution are used both in the estimation of the number of evolutionary changes between aligned sequences and in phylogeny reconstruction from sequence data. The three common methods of phylogeny reconstruction (parsimony, distance and maximum likelihood) differ in their use of these models. All three face similar problems in finding optimal, and reliable, solutions among the vast number of possible trees. Moreover, even optimal trees for a given gene may not reflect the relationships of the organisms from which the gene was sampled. Knowledge of how genes evolve and at what rate is critical for understanding gene function across species or within gene families. The Neutral Theory of Molecular Evolution serves as the null model of molecular evolution and plays a central role in data analysis. Three areas in which the Neutral Theory plays a vital role are: interpreting ratios of nonsynonymous to synonymous nucleotide substitutions, assessing the reliability of molecular clocks, and providing a foundation for molecular population genetics.
Most comparative studies of functional genomics have used pairwise comparisons. Yet it has been shown that this can provide biased results, since genes, like species, are phylogenetically related. Phylogenetic comparative methods should make it possible to correct for this, but they depend on strong assumptions, including unbiased tree estimates relative to the hypothesis being tested. An ongoing trend in comparative genomic studies is to adopt phylogenetic comparative methods to answer a wide range of biological questions, including evolutionary hypothesis testing. Notable among them is the recently controversial 'Ortholog Conjecture', which posits that functional evolution is faster in paralogs than in orthologs. Using pairwise comparisons of the tissue specificity index (tau), we earlier provided support for the ortholog conjecture. In contrast, a recent publication suggested that the ortholog conjecture is not supported by gene expression tissue-specificity using phylogenetic independent cont...
Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation
1994
Using real sequence data, we evaluate the adequacy of assumptions made in evolutionary models of nucleotide substitution and the effects that these assumptions have on estimation of evolutionary trees. Two aspects of the assumptions are evaluated. The first concerns the pattern of nucleotide substitution, including equilibrium base frequencies and the transition/transversion rate ratio. The second concerns the variation of substitution rates over sites. The maximum-likelihood estimate of tree topology appears quite robust to both these aspects of the assumptions of the models, but evaluation of the reliability of the estimated tree by using simpler, less realistic models can be misleading. Branch lengths are underestimated when simpler models of substitution are used, but the underestimation caused by ignoring rate variation over nucleotide sites is much more serious. The goodness of fit of a model is reduced by ignoring spatial rate variation, but unrealistic assumptions about the pattern of nucleotide substitution can lead to an extraordinary reduction in the likelihood. It seems that evolutionary biologists can obtain accurate estimates of certain evolutionary parameters even with an incorrect phylogeny, while systematists cannot get the right tree with confidence even when a realistic, and more complex, model of evolution is assumed.
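To make the transition/transversion distinction mentioned above concrete, the following minimal sketch counts observed transitions (purine to purine, or pyrimidine to pyrimidine) and transversions between two aligned nucleotide sequences. This raw count is only a descriptive statistic, not the maximum-likelihood estimate of the rate ratio that the abstract refers to; the function name `ts_tv_counts` and the rule of skipping gaps and ambiguous bases are assumptions made for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(seq1, seq2):
    """Count observed transitions and transversions between two
    aligned DNA sequences of equal length.

    Sites where either sequence has a gap or a non-ACGT character
    are skipped (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    return transitions, transversions


if __name__ == "__main__":
    ts, tv = ts_tv_counts("AACGT", "GATGA")
    print(ts, tv)
```

Because multiple substitutions at the same site are hidden in such counts, the observed ratio tends to understate the underlying rate ratio at higher divergence, which is one reason explicit substitution models are fitted instead.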
BMC Evolutionary Biology, 2010
Background: Explicit evolutionary models are required in maximum-likelihood and Bayesian inference, the two methods that are overwhelmingly used in phylogenetic studies of DNA sequence data. Appropriate selection of nucleotide substitution models is important because the use of incorrect models can mislead phylogenetic inference. To better understand the performance of different model-selection criteria, we used 33,600 simulated data sets to analyse the accuracy, precision, dissimilarity, and biases of the hierarchical likelihood-ratio test, Akaike information criterion, Bayesian information criterion, and decision theory. Results: We demonstrate that the Bayesian information criterion and decision theory are the most appropriate model-selection criteria because of their high accuracy and precision. Our results also indicate that in some situations different models are selected by different criteria for the same dataset. Such dissimilarity was the highest between the hierarchical likelihood-ratio test and Akaike information criterion, and lowest between the Bayesian information criterion and decision theory. The hierarchical likelihood-ratio test performed poorly when the true model included a proportion of invariable sites, while the Bayesian information criterion and decision theory generally exhibited similar performance to each other.
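Two of the criteria compared above have simple closed forms: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, ln L the maximized log-likelihood, and n the sample size. The sketch below applies both to a set of fitted models. The log-likelihood values are invented for illustration, k counts substitution-model parameters only, and taking n as the alignment length is a common convention rather than something stated in the abstract.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(fits, n_sites, criterion="bic"):
    """Return the name of the model with the lowest score.

    `fits` maps model name -> (log_likelihood, n_params).
    """
    if criterion not in ("aic", "bic"):
        raise ValueError("criterion must be 'aic' or 'bic'")

    def score(item):
        lnl, k = item[1]
        return aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_sites)

    return min(fits.items(), key=score)[0]


if __name__ == "__main__":
    # Hypothetical fits on a 500-site alignment (values are made up).
    fits = {
        "JC69": (-1050.0, 0),
        "HKY85": (-1040.0, 4),
        "GTR+G": (-1037.0, 9),
    }
    print(best_model(fits, n_sites=500, criterion="aic"))
    print(best_model(fits, n_sites=500, criterion="bic"))
```

With these made-up numbers AIC prefers HKY85 while BIC prefers JC69, a small instance of the abstract's observation that different criteria can select different models for the same dataset, since BIC penalizes extra parameters more heavily once ln n exceeds 2.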
Proceedings of the …, 2005
Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where gene-specific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of log-likelihoods over all genes. The other criterion is obtained by dividing the log-likelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a two-stage hierarchical framework. The first level of the hierarchy represents gene-specific evolutionary features, and the second represents site-specific features for given genes. For testing significance of the optimal topology, we suggest a two-stage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for gene-specific evolutionary features. We discuss the applicability of the two-stage bootstrap idea to the Kishino-Hasegawa test and the Shimodaira-Hasegawa test.
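The two criteria and the two-stage resampling described above can be sketched as follows. The per-gene log-likelihoods are assumed to have been computed elsewhere for each candidate topology; all function names, the data layout, and the example numbers are hypothetical, and the resampling step returns only indices (which genes and which columns to draw) rather than re-estimating any likelihoods.

```python
import random


def sum_criterion(gene_lnls):
    """Conventional criterion: sum of log-likelihoods over genes."""
    return sum(gene_lnls)


def mean_per_site_criterion(gene_lnls, gene_lengths):
    """Alternative criterion: arithmetic mean over genes of each
    gene's log-likelihood divided by its sequence length."""
    ratios = [lnl / n for lnl, n in zip(gene_lnls, gene_lengths)]
    return sum(ratios) / len(ratios)


def two_stage_resample(gene_lengths, rng):
    """One two-stage bootstrap replicate: resample genes with
    replacement, then resample alignment columns with replacement
    within each resampled gene. Returns (gene_index, column_indices)
    pairs."""
    n_genes = len(gene_lengths)
    replicate = []
    for _ in range(n_genes):
        g = rng.randrange(n_genes)
        cols = [rng.randrange(gene_lengths[g]) for _ in range(gene_lengths[g])]
        replicate.append((g, cols))
    return replicate


if __name__ == "__main__":
    lengths = [1000, 100]                     # made-up gene lengths
    lnls = {"treeA": [-5000.0, -600.0],       # made-up log-likelihoods
            "treeB": [-5080.0, -560.0]}
    by_sum = max(lnls, key=lambda t: sum_criterion(lnls[t]))
    by_mean = max(lnls, key=lambda t: mean_per_site_criterion(lnls[t], lengths))
    print(by_sum, by_mean)
    print(len(two_stage_resample(lengths, random.Random(0))))
```

In this toy example the summed criterion favours treeA, because the long gene dominates the total, while the per-site mean favours treeB, because each gene contributes equally regardless of length.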
The impact of sequence parameter values on phylogenetic accuracy
An accurately inferred phylogeny is important to the study of evolution. Factors affecting the accuracy of an inferred tree can be traced to several sequential steps leading to the inference of the phylogeny. We have examined here the impact of some features of nucleotide sequences in alignments on phylogenetic (topological) accuracy rather than any source of error during the process of sequence alignment or choice of the method of inference (as is usually done). Specifically, we have studied (using computer simulation) the implications of changing the values of the following five parameters, individually and in combination: sequence length (l), nucleotide substitution rate (r), nucleotide base composition (θ), the transition-transversion rate ratio (κ), and the substitution rate heterogeneity among the sites (α). An interesting, and unexpected, result was that κ has a strong positive relationship with phylogenetic accuracy, especially at high substitution rates. This simulation-based work has implications for empirical researchers in the field and should enable them to choose from among the multiple genes typically available today for a more accurate inference of the phylogeny being studied.