LifePrint: a novel k-tuple distance method for construction of phylogenetic trees

Assessment of Protein Distance Measures and Tree-Building Methods for Phylogenetic Tree Reconstruction

Molecular Biology and Evolution, 2005

Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods to estimate protein distances. Here we present a large-scale assessment of how well the most popular algorithms reconstruct the correct tree topology. The programs BIONJ, FastME, Weighbor, and standard NJ were run using 12 distance estimators, producing 48 tree-building/distance estimation method combinations. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program using three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the overall best results, although the average accuracy differed little between the tree-building methods (normally less than 1%). A noticeable trend was that FastME performed worse than the others on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed when using different distance estimators. Protein-adapted Jukes-Cantor and Kimura distance corrections produced clearly poorer results than the other methods, even worse than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed as well as more complex methods.
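To make the distance corrections named in this abstract concrete, here is a minimal Python sketch of an uncorrected distance together with the textbook protein-adapted Jukes-Cantor and Kimura corrections. The function names are hypothetical, and the formulas are the standard forms, d = -(19/20) ln(1 - (20/19) p) and d = -ln(1 - p - 0.2 p^2). The abstract does not state which exact variants were benchmarked, so treat this as an illustration only.

```python
import math

def p_distance(a: str, b: str) -> float:
    """Uncorrected distance: fraction of differing residues over ungapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor_protein(p: float) -> float:
    """Jukes-Cantor correction with 20 equally likely states (assumed textbook form)."""
    if p >= 19 / 20:
        return math.inf  # saturation: the correction is undefined beyond this point
    return -(19 / 20) * math.log(1 - (20 / 19) * p)

def kimura_protein(p: float) -> float:
    """Kimura's empirical protein correction (assumed textbook form)."""
    arg = 1 - p - 0.2 * p * p
    if arg <= 0:
        return math.inf
    return -math.log(arg)

# Two aligned toy peptides that differ in one of four columns.
p = p_distance("ACDE", "ACDF")
jc = jukes_cantor_protein(p)
km = kimura_protein(p)
```

Both corrections return a value larger than the uncorrected distance, because they try to account for multiple substitutions at the same site.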

Construction of phylogenetic profiles based on the genetic distance of hundreds of genomes

Biochemical and Biophysical Research Communications, 2007

Phylogenetic profiles have been widely applied in functional genomics research, especially in the prediction of protein-protein interactions (PPIs). A key issue in phylogenetic profiling is how to effectively select reference organisms from the available hundreds of genomes. In this study, we performed an assessment of reference organism selection based on the genetic distance between the target organism and 167 reference organisms. We found that inclusion of reference organisms from all distance levels had better performance in the prediction of PPIs than that at each distance level. The PPI prediction reached an optimal level when 70% of the reference organisms at all distance levels were selected; and this performance was similar to that in the optimal condition based on the taxonomy tree in our previous study. Because measurement of genetic distance is direct and simple compared to the topology of the taxonomy tree, we suggest selecting reference organisms based on genetic distance in the construction of phylogenetic profiles.

The neighbor-joining method: a new method for reconstructing phylogenetic trees

Molecular biology and evolution, 1987

A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
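The pair-selection step described in this abstract can be sketched in Python as follows. This is not the authors' code: it uses the widely taught Q-criterion form, Q(i, j) = (n - 2) d(i, j) - r_i - r_j, which is generally understood to pick the same pairs as minimizing total branch length. The function name, the frozenset-keyed distance dictionary, and the nested-tuple output are all assumptions made for illustration. Branch lengths are omitted to keep the sketch short.

```python
from itertools import combinations

def neighbor_joining_topology(names, dist):
    """Return an unrooted topology as nested tuples.

    dist maps frozenset({a, b}) -> distance for every pair of distinct names.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy so the caller's matrix is untouched

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # r[a] is the sum of distances from a to every other current node.
        r = {a: sum(get(a, k) for k in nodes if k != a) for a in nodes}
        # Join the pair with the smallest Q value.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]],
        )
        u = (i, j)  # the new internal node is represented by the joined pair
        rest = [k for k in nodes if k != i and k != j]
        for k in rest:
            d[frozenset((u, k))] = 0.5 * (get(i, k) + get(j, k) - get(i, j))
        nodes = rest + [u]
    return tuple(nodes)

# Additive toy distances for the tree ((A,B),(C,D)).
pairs = {("A", "B"): 3, ("A", "C"): 9, ("A", "D"): 10,
         ("B", "C"): 10, ("B", "D"): 11, ("C", "D"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = neighbor_joining_topology(["A", "B", "C", "D"], dist)
```

On this toy matrix, A and B are joined first, so the returned topology groups them in one nested tuple.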

A DNA sequence distance measure approach for phylogenetic tree construction

2010

The distance measure is an important issue in phylogenetic analysis. Traditional distance measures are time-consuming because they require multiple sequence alignment, whereas the k-tuple distance is easy to compute and has been used in phylogenetic tree reconstruction. However, k-tuple distance is not effective for analysing identical sequences. Considering the occurrences of k-tuples and the sequence structure, which may contain k-tuple locations as well as the order relation among them, we propose a new distance measure for constructing phylogenetic trees in this paper. The experimental results show that the new approach is capable of efficiently building phylogenetic trees.
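For orientation, a plain count-based k-tuple distance can be sketched in a few lines of Python. This is the classic squared Euclidean distance between k-tuple count vectors, not the new measure the abstract proposes. The function names are hypothetical.

```python
from collections import Counter

def ktuple_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def ktuple_distance(a: str, b: str, k: int) -> int:
    """Squared Euclidean distance between the two k-tuple count vectors."""
    ca, cb = ktuple_counts(a, k), ktuple_counts(b, k)
    # Counter returns 0 for missing words, so the union of keys is enough.
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())

same = ktuple_distance("ACGT", "ACGT", 2)
one_change = ktuple_distance("ACGT", "ACGA", 2)
reordered = ktuple_distance("AACG", "GCAA", 1)
```

The last call compares two different sequences with the same 1-tuple composition and yields zero. Counts alone ignore where tuples occur, which is the kind of positional and order information the abstract says its measure takes into account.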

Exploring the Relationship between Sequence Similarity and Accurate Phylogenetic Trees

Molecular Biology and Evolution, 2006

We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10^-60, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees.
For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.

PhyloPat: an updated version of the phylogenetic pattern database contains gene neighborhood

Phylogenetic patterns show the presence or absence of certain genes in a set of full genomes derived from different species. They can also be used to determine sets of genes that occur only in certain evolutionary branches. Previously, we presented a database named PhyloPat which allows the complete Ensembl gene database to be queried using phylogenetic patterns. Here, we describe an updated version of PhyloPat which can be queried by an improved web server. We used a single linkage clustering algorithm to create 241 697 phylogenetic lineages, using all the orthologies provided by Ensembl v49. PhyloPat offers the possibility of querying with binary phylogenetic patterns or regular expressions, or through a phylogenetic tree of the 39 included species. Users can also input a list of Ensembl, EMBL, EntrezGene or HGNC IDs to check which phylogenetic lineage any gene belongs to. A link to the FatiGO web interface has been incorporated in the HTML output. For each gene, the surrounding genes on the chromosome, color coded according to their phylogenetic lineage can be viewed, as well as FASTA files of the peptide sequences of each lineage. Furthermore, lists of omnipresent, polypresent, oligopresent and anticorrelating genes have been included. PhyloPat is freely available at http://www.cmbi.ru.nl/phylopat.

A new sequence distance measure for phylogenetic tree construction

Bioinformatics, 2003

Motivation: Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of an evolutionary model. The multiple alignment strategy does not work for all types of data, e.g. whole genome phylogeny, and the evolutionary models may not always be correct. We propose a new sequence distance measure based on the relative information between the sequences using Lempel-Ziv complexity. The distance matrix thus obtained can be used to construct phylogenetic trees. Results: The proposed approach does not require sequence alignment and is totally automatic. The algorithm has successfully constructed consistent phylogenies for real and simulated data sets.
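The idea of an alignment-free distance built on Lempel-Ziv complexity can be sketched as below. The parsing follows the standard exhaustive-history definition of LZ complexity. The normalized distance, max(c(ST) - c(S), c(TS) - c(T)) / max(c(S), c(T)), is one form commonly associated with this approach and is an assumption here, since the abstract gives no formula. The function names are hypothetical.

```python
def lz_complexity(s: str) -> int:
    """Number of components in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, c, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the candidate while it can be copied from earlier text.
        # Copies may overlap the candidate itself, except for its last symbol.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        c += 1
        i += length
    return c

def lz_distance(s: str, t: str) -> float:
    """Normalized LZ-based distance (assumed form, see the lead-in)."""
    cs, ct = lz_complexity(s), lz_complexity(t)
    if max(cs, ct) == 0:
        raise ValueError("both sequences are empty")
    gain = max(lz_complexity(s + t) - cs, lz_complexity(t + s) - ct)
    return gain / max(cs, ct)

c_example = lz_complexity("0001101001000101")
d_self = lz_distance("aaaa", "aaaa")
d_other = lz_distance("aaaa", "cgcg")
```

The intuition is that appending a related sequence adds few new components to the parsing, while appending an unrelated one adds many. A matrix of such distances can then be passed to any distance-based tree builder.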

Phylogenetic tree construction using trinucleotide usage profile (TUP)

BMC Bioinformatics, 2016

Background: It has been a challenging task to build a genome-wide phylogenetic tree for a large group of species containing a large number of genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution for all words of certain length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length (k) ranges from 6 to 15 and it may not be a multiple of 3 (codon length). The total number of possible words needed for FFP-k can range from 4^6 = 4096 to 4^15. Results: We propose a simple improvement over the popular FFP method using only a typical word length of 3. A new method, called Trinucleotide Usage Profile (TUP), is proposed based only on the (relative) frequency distribution using non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, which is much less than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each of the species by a TUP vector and then using an appropriate distance measure between pairs of the TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each reading frame. We also provide a numerical measure for comparing trees constructed with various methods. Conclusions: Compared to the FFP method, our empirical study showed that the proposed TUP method is more capable of building phylogenetic trees with a stronger biological support. We further provide some justifications on this from the information theory viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known.
Without this information, the FFP method could only rely on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provides less accurate classification.
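The three-row summary described in this abstract can be sketched in Python as a 3 x 64 matrix of relative trinucleotide frequencies, one row per reading frame. This is a minimal illustration under assumptions: the function name and the A, C, G, T codon ordering are hypothetical, words containing other symbols are simply skipped, and the paper's choice of distance between TUP vectors is not reproduced.

```python
from itertools import product

# All 64 trinucleotides in a fixed (assumed) lexicographic order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def tup_matrix(seq: str) -> list[list[float]]:
    """Relative frequencies of non-overlapping 3-mers in reading frames 0, 1 and 2."""
    seq = seq.upper()
    rows = []
    for frame in range(3):
        counts = dict.fromkeys(CODONS, 0)
        # Step by 3 so that windows within a frame never overlap.
        for i in range(frame, len(seq) - 2, 3):
            word = seq[i:i + 3]
            if word in counts:  # skip words with ambiguous bases such as N
                counts[word] += 1
        total = sum(counts.values())
        rows.append([counts[c] / total if total else 0.0 for c in CODONS])
    return rows

m = tup_matrix("ATGATGATG")
```

For the toy input, frame 0 sees only ATG, frame 1 only TGA, and frame 2 only GAT. This shows how the per-frame rows keep information that a single overlapping-window profile would average together.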

Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment

International Journal of Molecular Sciences, 2010

A shortcoming of most alignment-free correlation distance methods based on composition vectors, developed for phylogenetic analysis using complete genomes, is that the "distances" are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees whose topologies are the same as or similar to those obtained with the old "distance", and that these trees agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.

Phylo_dCor: distance correlation as a novel metric for phylogenetic profiling

BMC Bioinformatics

Background: Elaboration of powerful methods to predict functional and/or physical protein-protein interactions from genome sequence is one of the main tasks in the post-genomic era. Phylogenetic profiling allows the prediction of protein-protein interactions at a whole genome level in both Prokaryotes and Eukaryotes. For this reason it is considered one of the most promising methods. Results: Here, we propose an improvement of phylogenetic profiling that enables the handling of large genomic datasets and the inference of global protein-protein interactions. This method uses the distance correlation as a new measure of phylogenetic profile similarity. We constructed robust reference sets and developed Phylo-dCor, a parallelized version of the algorithm for calculating the distance correlation that makes it applicable to large genomic data. Using Saccharomyces cerevisiae and Escherichia coli genome datasets, we showed that Phylo-dCor outperforms phylogenetic profiling methods previously described based on the mutual information and Pearson's correlation as measures of profile similarity. Conclusions: In this work, we constructed and assessed robust reference sets and propose the distance correlation as a measure for comparing phylogenetic profiles. To make it applicable to large genomic data, we developed Phylo-dCor, a parallelized version of the algorithm for calculating the distance correlation. Two R scripts that can be run on a wide range of machines are available upon request.
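As an illustration of the similarity measure named in this abstract, the sample distance correlation of two profiles can be sketched in plain Python. This is the standard double-centering estimator for one-dimensional data, not the parallelized Phylo-dCor implementation, which the abstract says is distributed as R scripts. The function names and toy profiles are hypothetical.

```python
import math

def _double_centered(v):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(v)
    a = [[abs(p - q) for q in v] for p in v]
    row = [sum(r) / n for r in a]  # the matrix is symmetric, so column means match
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_correlation(x, y) -> float:
    """Sample distance correlation between two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    A, B = _double_centered(x), _double_centered(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(A[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(B[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant profile carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))

# Presence/absence profiles of three hypothetical genes across five genomes.
gene_a = [1, 0, 1, 1, 0]
gene_b = [0, 1, 0, 0, 1]  # exact complement of gene_a
gene_c = [1, 1, 1, 1, 1]  # present everywhere
same = distance_correlation(gene_a, gene_a)
complement = distance_correlation(gene_a, gene_b)
constant = distance_correlation(gene_a, gene_c)
```

Unlike Pearson's correlation, the measure is never negative. A perfectly anticorrelated profile therefore scores as strongly dependent, while a gene present in every genome scores zero against everything.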