Hadoop Mapreduce Based Distributed Phylogenetic Analysis
Related papers
A Detailed Survey on Approaches of Phylogenetic Analysis
All organisms have evolved from a common ancestor. Phylogenetic analysis measures the evolutionary distance between species and extracts evolutionary relationships from sequence data. These relationships are depicted as phylogenetic trees. This article surveys sequential approaches to sequence alignment and clustering, and describes in detail how MapReduce technology improves the performance of phylogenetic analysis. A comprehensive comparison of these methods is presented.
Reconstructing evolutionary trees in parallel for massive sequences
BMC Systems Biology, 2017
Background: Building evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing an evolutionary tree for ultra-large sequence sets is hard. Massive multiple sequence alignment is also challenging and consumes considerable time and space. Hadoop and Spark have been developed recently and offer new promise for classical computational biology problems. In this paper, we tried to solve multiple sequence alignment and evolutionary reconstruction in parallel. Results: HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users could use HPTree for reconstructing evolutionary trees on computer clusters or cloud platforms (e.g. Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. Conclusions: In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model was employed for evolutionary tree building. We have released our software together with its source code via http://lab.malab.cn/soft/HPtree/.
Phylogenetic analysis of large sequence data sets
2005
Phylogenetic analysis is an integral part of biological research. As the number of sequenced genomes increases, available data sets are growing in number and size. Several algorithms have been proposed to handle these larger data sets. A family of algorithms known as disc covering methods (DCMs) has been selected by the NSF-funded CIPRes project to boost the performance of existing phylogenetic algorithms. Recursive Iterative Disc Covering Method 3 (Rec-I-DCM3) recursively decomposes the guide tree into subtrees, executes a phylogenetic search on each subtree and merges the subtrees, for a set number of iterations. This paper presents a detailed analysis of this algorithm.
Multiple Sequence Alignment Based Method for Construction of Phylogenetic Trees
IJCSMC, 2019
Due to the importance of DNA (genetic material) and protein sequences, their comparison has become a major part of biology. But the presence of large and complex datasets of biological information requires an efficient computational methodology to handle them. Sequence comparison facilitates the identification of genes and conserved sequence patterns to infer the evolutionary relationships among different species. This paper uses a Multiple Sequence Alignment (MSA) method that aligns multiple sequences at a time to depict phylogeny. The p53 protein sequences of ten different species are loaded from the NCBI (National Center for Biotechnology Information) databank in FASTA format. Based on the evolutionary distances of these species, two phylogenetic trees are constructed for the two divided parts of this dataset. A single tree is generated by joining the two trees using a pruning method. To obtain an optimal alignment, each sequence in the pruned alignment is locally aligned with the consensus sequence. The minimum optimal alignment is obtained after performing left and right shift operations.
Algorithms for Molecular Biology, 2017
Background: Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme growth of next-generation sequencing data has produced a shortage of efficient ultra-large biological sequence alignment approaches that cope with different sequence types. Methods: Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files of more than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Results: The experiments on large-scale DNA and protein data sets, with files of more than 1 GB, showed that HAlign-II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. Conclusions: HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source code and datasets was established at http://lab.malab.cn/soft/halign.
Clustering based Distributed Phylogenetic Tree Construction
Phylogenetic tree construction has received much attention recently due to the availability of vast biological data. In this study, we provide a three-step method to build phylogenetic trees. Firstly, a density-based clustering algorithm is used to cluster the population at hand using the distance matrix, which gives the distances between the species. Secondly, a phylogenetic tree for each cluster is constructed by using the neighbor-joining (NJ) algorithm, and finally, the roots of the small phylogenetic trees are connected again by the NJ algorithm to form one large phylogenetic tree. To our knowledge, this is the first method for building phylogenetic trees that uses clustering prior to forming the tree. As such, it provides independent phylogenetic tree formation within each cluster as the second step, and is hence suitable for parallel/distributed processing, enabling fast processing of very large biological data sets. The proposed method, clustered neighbor-joining (CNJ), is applied to 145 samples from the Y-DNA Haplogroup G. Distances between male samples are measured as the variation in their sets of Y-chromosomal short tandem repeat (STR) values. We show that the clustering method we use is superior to other clustering methods as applied to Y-DNA data, and also that independent, fast distributed construction of phylogenetic trees is possible with this method.
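The three steps described in this abstract can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The threshold-based grouping stands in for the density-based clustering, whose algorithm and parameters the abstract does not specify. The average inter-cluster distance used to connect the subtree roots, and the midpoint placement of each root, are likewise assumptions. All function names and the five-taxon distance matrix are hypothetical, and labels are assumed to be distinct non-tuple values.

```python
from itertools import combinations


def neighbor_joining(labels, dist):
    """Build a tree by neighbor joining.

    A leaf is its label; an internal node is a tuple of (child, branch_length)
    pairs. The last two nodes are joined at the midpoint of their connecting
    edge, an arbitrary rooting chosen for this sketch.
    """
    nodes = list(labels)
    if len(nodes) == 1:
        return nodes[0]
    d = {frozenset(p): dist(*p) for p in combinations(nodes, 2)}
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the sum of distances from a to every other active node.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) * d(i, j) - r[i] - r[j].
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        new = ((i, li), (j, dij - li))
        rest = [k for k in nodes if k != i and k != j]
        for k in rest:
            d[frozenset((new, k))] = 0.5 * (
                d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = rest + [new]
    a, b = nodes
    half = d[frozenset((a, b))] / 2
    return ((a, half), (b, half))


def threshold_clusters(labels, dist, eps):
    """Step 1 stand-in: connected components of the graph that links every
    pair of labels whose distance is at most eps (an assumed simplification
    of density-based clustering)."""
    unseen, clusters = list(labels), []
    while unseen:
        stack, members = [unseen.pop(0)], []
        while stack:
            a = stack.pop()
            members.append(a)
            near = [b for b in unseen if dist(a, b) <= eps]
            for b in near:
                unseen.remove(b)
            stack.extend(near)
        clusters.append(members)
    return clusters


def clustered_nj(labels, dist, eps):
    """Steps 2 and 3: one NJ tree per cluster, then NJ over the subtree roots."""
    # Step 2: each cluster is processed independently, so this loop is the
    # part that could be distributed across workers.
    subtrees = {neighbor_joining(members, dist): members
                for members in threshold_clusters(labels, dist, eps)}

    def root_dist(s, t):
        # Assumption: distance between two roots = mean inter-cluster distance.
        pairs = [(a, b) for a in subtrees[s] for b in subtrees[t]]
        return sum(dist(a, b) for a, b in pairs) / len(pairs)

    return neighbor_joining(list(subtrees), root_dist)


def leaves(tree):
    """Return the leaf labels of a tree in left-to-right order."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child, _ in tree for leaf in leaves(child)]


# Hypothetical five-taxon distance matrix, for illustration only.
TAXA = "abcde"
MATRIX = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]


def example_dist(x, y):
    return MATRIX[TAXA.index(x)][TAXA.index(y)]


tree = clustered_nj(list(TAXA), example_dist, eps=5)
```

With `eps=5` the example groups the taxa into three clusters before any tree is built. Only the final call over the subtree roots needs to see all clusters at once.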
Parallel implementation of a quartet-based algorithm for phylogenetic analysis
2006
This paper describes a parallel implementation of our recently developed algorithm for phylogenetic analysis on the IBM BlueGene/L cluster. This algorithm constructs evolutionary trees for a given set of DNA or protein sequences based on the topological information of all possible quartet trees. Our experimental results showed that it has several advantages over many popular algorithms.
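The abstract does not say how the topology of each quartet is inferred. As one common possibility, the sketch below picks each quartet's topology with the four-point method: of the three ways to split four taxa into two pairs, it chooses the split with the smallest sum of within-pair distances. The function names and the distance matrix are hypothetical, and this is not the paper's algorithm.

```python
from itertools import combinations


def quartet_topology(a, b, c, d, dist):
    """Return the pairing of four taxa favoured by the four-point method."""
    sums = {
        ((a, b), (c, d)): dist(a, b) + dist(c, d),
        ((a, c), (b, d)): dist(a, c) + dist(b, d),
        ((a, d), (b, c)): dist(a, d) + dist(b, c),
    }
    # The split with the smallest within-pair sum is taken as the topology.
    return min(sums, key=sums.get)


def all_quartet_topologies(taxa, dist):
    """Infer a topology for every 4-subset of taxa; each is independent."""
    return {q: quartet_topology(*q, dist) for q in combinations(taxa, 4)}


# Hypothetical five-taxon distance matrix, for illustration only.
TAXA = "abcde"
MATRIX = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]


def example_dist(x, y):
    return MATRIX[TAXA.index(x)][TAXA.index(y)]


topologies = all_quartet_topologies(TAXA, example_dist)
```

The number of quartets grows as n choose 4, and each quartet is evaluated independently of the others, so this kind of computation divides naturally across many processors.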
Phylogenetics Algorithms and Applications
Advances in Intelligent Systems and Computing
Phylogenetics is a powerful approach for tracing the evolution of present-day species. By studying phylogenetic trees, scientists gain a better understanding of how species have evolved while explaining the similarities and differences among species. Phylogenetic study can help in analysing the evolution of, and the similarities among, diseases and viruses, and can further help in developing vaccines against them. This paper explores computational solutions for building the phylogeny of species and highlights the benefits of alignment-free methods of phylogenetics. The paper also discusses the application of phylogenetic study in disease diagnosis and evolution.