Parallel implementation of a quartet-based algorithm for phylogenetic analysis
Related papers
A novel quartet-based method for phylogenetic inference
2005
In this paper we introduce a new quartet-based method. It makes use of the Bayesian (quartet) weights of quartets, as used in quartet puzzling; however, the weights from all related quartets are accumulated to form a global quartet weight matrix. This matrix provides integrated information and lets us recursively merge small subtrees into larger ones until a single final tree is obtained. The experimental results show that the probability of the correct tree being among a very small number of trees constructed with our method is very high. These significant results open a new research direction for investigating more efficient algorithms for phylogenetic inference.
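The accumulation idea above can be sketched in a few lines. The exact accumulation rule is not spelled out in the abstract, so the following is an illustrative assumption: each quartet topology's weight is credited to the two taxon pairs that the topology places on the same side, producing the global quartet weight matrix the method then uses for merging.

```python
def global_quartet_weights(n_taxa, quartet_weights):
    """Accumulate per-quartet topology weights into a global matrix.

    quartet_weights maps a quartet (a, b, c, d) to the weights of its
    three possible topologies (ab|cd, ac|bd, ad|bc), e.g. the Bayesian
    weights produced by quartet puzzling.  Here each topology's weight
    is added to the two pairs it groups together -- an illustrative
    accumulation rule, not necessarily the paper's exact one.
    """
    W = [[0.0] * n_taxa for _ in range(n_taxa)]
    for (a, b, c, d), (w_ab_cd, w_ac_bd, w_ad_bc) in quartet_weights.items():
        for (i, j), w in (((a, b), w_ab_cd), ((c, d), w_ab_cd),
                          ((a, c), w_ac_bd), ((b, d), w_ac_bd),
                          ((a, d), w_ad_bc), ((b, c), w_ad_bc)):
            W[i][j] += w  # keep the matrix symmetric
            W[j][i] += w
    return W
```

High entries of W then indicate taxon pairs that many quartets agree belong together, which is the signal driving the recursive subtree merging.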
Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L
Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07, 2007
Phylogenetic inference is a grand challenge in Bioinformatics due to immense computational requirements. The increasing popularity of multi-gene alignments in biological studies, which typically provide a stable topological signal due to a more favorable ratio of the number of base pairs to the number of sequences, coupled with rapid accumulation of sequence data in general, poses new challenges for high performance computing. In this paper, we demonstrate how state-of-the-art Maximum Likelihood (ML) programs can be efficiently scaled to the IBM BlueGene/L (BG/L) architecture, by porting RAxML, which is currently among the fastest and most accurate programs for phylogenetic inference under the ML criterion. We simultaneously exploit coarse-grained and fine-grained parallelism that is inherent in every ML-based biological analysis. Performance is assessed using datasets consisting of 212 sequences and 566,470 base pairs, and 2,182 sequences and 51,089 base pairs, respectively. To the best of our knowledge, these are the largest datasets analyzed under ML to date. The capability to analyze such datasets will help to address novel biological questions via phylogenetic analyses. Our experimental results indicate that the fine-grained parallelization scales well up to 1,024 processors. Moreover, a larger number of processors can be efficiently exploited by a combination of coarse-grained and fine-grained parallelism. Finally, we demonstrate that our parallelization scales equally well on an AMD Opteron cluster with a less favorable network latency to processor speed ratio. We recorded super-linear speedups in several cases due to increased cache efficiency.
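The fine-grained parallelism mentioned above comes from the structure of the ML objective: the log-likelihood of a tree is a sum of independent per-site (per-column) terms, so alignment columns can be scattered across workers and the partial sums combined. The sketch below shows only that decomposition; the per-site scorer is a toy stand-in (a real ML program would run Felsenstein's pruning algorithm over the tree here), and the thread pool stands in for the MPI-style distribution used on BG/L.

```python
from concurrent.futures import ThreadPoolExecutor
from math import log

def site_log_likelihood(column):
    """Toy per-site score: log-probability of a column under its own
    empirical base frequencies.  A placeholder for the real per-site
    tree likelihood, kept trivial so the parallel structure runs."""
    freqs = {b: column.count(b) / len(column) for b in set(column)}
    return sum(log(freqs[b]) for b in column)

def tree_log_likelihood(columns, workers=4):
    """Fine-grained parallelism: per-site terms are independent, so the
    columns are mapped across workers and the partial results summed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(site_log_likelihood, columns))
```

Coarse-grained parallelism then layers on top of this: independent tree searches (e.g. different starting trees) each run their own fine-grained pool, which is how more than 1,024 processors can be kept busy.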
Parallel Phylogenetic Inference
ACM/IEEE SC 2000 Conference (SC'00), 2000
Recent advances in DNA sequencing technology have created large data sets upon which phylogenetic inference can be performed. However, current research is limited by the prohibitive time necessary to perform tree search on even a reasonably sized data set. Some parallel algorithms have been developed, but the biological research community does not use them because it does not trust the results of newly developed parallel software. This paper presents a new phylogenetic algorithm that allows existing, trusted phylogenetic software packages to be executed in parallel using the DOGMA parallel processing system. The results presented here indicate that data sets that currently take as much as 11 months to search using current algorithms can be searched in as little as 2 hours using as few as 8 processors. This reduction in the time necessary to complete a phylogenetic search allows new research questions to be explored in many of the biological sciences.
Building large phylogenetic trees on coarse-grained parallel machines
Algorithmica, 2006
Phylogenetic analysis is an area of computational biology concerned with the reconstruction of evolutionary relationships between organisms, genes, and gene families. Maximum likelihood evaluation has proven to be one of the most reliable methods for constructing phylogenetic trees. The huge computational requirements associated with maximum likelihood analysis mean that it is not feasible to produce large phylogenetic trees using a single processor. We have completed a fully cross-platform, coarse-grained distributed application, DPRml, which overcomes many of the limitations imposed by the current set of parallel phylogenetic programs. We have completed a set of efficiency tests that show how to maximise efficiency while using the program to build large phylogenetic trees. The software is publicly available under the terms of the GNU general public licence from the system webpage at
In biological research, scientists often need to use information about species to infer the evolutionary relationships among them. These relationships are generally represented by a labeled binary tree, called an evolutionary (or phylogenetic) tree. The phylogeny problem is computationally intensive and thus well suited to parallel computing environments. In this paper, a fast algorithm for constructing Neighbor-Joining phylogenetic trees has been developed; its CPU time is drastically reduced compared with sequential algorithms. The new algorithm includes three techniques. Firstly, a linear array A[N] is introduced to store the sum of every row of the distance matrix (the same as SK), which eliminates many redundant computations; the values A[i] are computed only once at the beginning of the algorithm and are updated by three elements in each iteration. Secondly, a very compact formula for the sum of all branch lengths of OTUs (Operational Taxonomic Units) i and j has been designed. Thirdly, multiple parallel threads are used for the computation of the nearest neighboring pair.
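The first technique can be made concrete with the standard NJ selection criterion. Once the row sums A[i] are cached, evaluating Q(i, j) = (N - 2) * D[i][j] - A[i] - A[j] costs O(1) per pair instead of recomputing two row sums each time. The sketch below shows a single selection step under that caching (the incremental three-element update of A after a join, and the threading, are omitted):

```python
def nj_select_pair(D):
    """One Neighbor-Joining selection step with cached row sums.

    D is a symmetric distance matrix (list of lists).  The row sums A
    are computed once up front; in a full NJ run they would be updated
    incrementally after each join rather than recomputed, which is the
    redundancy elimination described above.
    """
    n = len(D)
    A = [sum(row) for row in D]  # row sums, computed once
    best, pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            # Standard NJ criterion; O(1) per pair thanks to A.
            q = (n - 2) * D[i][j] - A[i] - A[j]
            if q < best:
                best, pair = q, (i, j)
    return pair
```

The double loop over pairs is also the natural place for the third technique: it partitions cleanly across threads, with each thread scanning a slice of rows for its local minimum.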
Clustering based Distributed Phylogenetic Tree Construction
Phylogenetic tree construction has received much attention recently due to the availability of vast biological data. In this study, we provide a three-step method to build phylogenetic trees. Firstly, a density-based clustering algorithm is used to provide clusters of the population at hand using the distance matrix, which gives the distances between the species. Secondly, a phylogenetic tree for each cluster is constructed using the neighbor-joining (NJ) algorithm, and finally, the roots of the small phylogenetic trees are connected, again by the NJ algorithm, to form one large phylogenetic tree. To our knowledge, this is the first method for building phylogenetic trees that uses clustering prior to forming the tree. As such, it provides independent phylogenetic tree formation within each cluster as the second step, and hence is suitable for parallel/distributed processing, enabling fast processing of very large biological data sets. The proposed method, clustered neighbor-joining (CNJ), is applied to 145 samples from the Y-DNA Haplogroup G. Distances between male samples are computed from the variation in their sets of Y-chromosomal short tandem repeat (STR) values. We show that the clustering method we use is superior to other clustering methods as applied to Y-DNA data, and that independent, fast, distributed construction of phylogenetic trees is possible with this method.
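The three-step pipeline above can be sketched as a skeleton in which the clustering and NJ routines are injected. The names `cluster_fn` and `nj_fn`, and the use of the minimum inter-cluster distance for step 3, are illustrative assumptions; the paper's density-based clustering and its exact inter-cluster distance are not specified in the abstract. The point of the sketch is the structure: step 2 is independent per cluster, which is what makes the method parallelizable.

```python
def clustered_nj(D, labels, cluster_fn, nj_fn):
    """Three-step clustered NJ (CNJ) pipeline skeleton.

    cluster_fn(D) -> list of clusters (each a list of taxon indices);
    nj_fn(D, labels) -> a tree over those labels.  Both are stand-ins
    for the paper's density-based clustering and NJ implementations.
    """
    # Step 1: partition the taxa using the distance matrix.
    clusters = cluster_fn(D)
    # Step 2: one subtree per cluster.  These calls are independent, so
    # they can run in parallel (e.g. one process or node per cluster).
    subtrees = []
    for members in clusters:
        sub_D = [[D[i][j] for j in members] for i in members]
        subtrees.append(nj_fn(sub_D, [labels[i] for i in members]))
    # Step 3: join the subtree roots with NJ, using the minimum
    # inter-cluster distance as the root-to-root distance (an
    # illustrative choice, not necessarily the paper's).
    k = len(clusters)
    root_D = [[0.0 if a == b else
               min(D[i][j] for i in clusters[a] for j in clusters[b])
               for b in range(k)] for a in range(k)]
    return nj_fn(root_D, subtrees)
```

With N taxa in k balanced clusters, each step-2 NJ run works on a matrix of side N/k, so the cubic cost of NJ shrinks per worker as well as being spread across workers.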