Computational space reduction and parallelization of a new clustering approach for large groups of sequences (original) (raw)

Clustering Techniques for Biological Sequence Analysis - A Review.pdf

In the present scenario there are a variety of technical tools for supporting and validating wet-lab experiments in the field of science and biotechnology. In order to analyze biological sequences it is necessary to group similar genes. Grouping of genes can be done by using various techniques like pattern matching, classification, clustering etc. In the present study clustering is used as a tool for analyzing biological data. Clustering of Biological sequences is a very interesting and fascinating area as various researchers are working on it. But simple clustering algorithms are not much suitable for sequence analysis problems. Most of the biological sequence analysis problems are NP-hard and some strong optimization algorithm are required for these types of problems. The manuscript presented here is a survey of various clustering techniques useful for analysis of biological sequences. The 3+ stage review process is adopted for the review of literature. To prepare this report 98 papers have been reviewed from year 1997 to 2014 according to the year of publish. The papers reviewed have discussed various issues related to the analysis of biological sequences. The major issues discovered in the reviewed papers were prediction, sequence alignment, motif discovery, cluster boundary prediction etc. Various solution approaches used by researchers for the biological sequence analysis are evolutionary clustering, neural networks, hierarchical clustering, k-means, Go technologies, feature selection, incremental approach, bio-inspired methods, particle swarm optimization, fuzzy techniques, rough set theory and bi-clustering etc. Researchers have applied these solution approaches on various types of datasets. In this communication we have also discussed about these datasets and the parameters used with results mentioned in papers.

A comparative study of clustering algorithms for protein sequences

2009 Fourth International on Conference on Bio-Inspired Computing, 2009

Abstract-With the development of sequencing technologies, more and more protein sequences are uncharacterized. Clustering protein sequences into homologous groups can help to annotate uncharacterized protein sequences. In recent years, many clustering algorithms have ...

SpCLUST: Towards a fast and reliable clustering for potentially divergent biological sequences

Computers in Biology and Medicine

This paper presents SpCLUST, a new C++ package that takes a list of sequences as input, aligns them with MUSCLE, computes their similarity matrix in parallel and then performs the clustering. SpCLUST extends a previously released software by integrating additional scoring matrices which enables it to cover the clustering of amino-acid sequences. The similarity matrix is now computed in parallel according to the master/slave distributed architecture, using MPI. Performance analysis, realized on two real datasets of 100 nucleotide sequences and 1049 amino-acids ones, show that the resulting library substantially outperforms the original Python package. The proposed package was also intensively evaluated on simulated and real genomic and protein data sets. The clustering results were compared to the most known traditional tools, such as UCLUST, CD-HIT and DNACLUST. The comparison showed that SpCLUST outperforms the other tools when clustering divergent sequences, and contrary to the others, it does not require any user intervention or prior knowledge about the input sequences.

Acceleration of sequence clustering using longest common subsequence filtering

BMC Bioinformatics, 2013

Background: Huge numbers of genomes can now be sequenced rapidly with recent improvements in sequencing throughput. However, data analysis methods have not kept up, making it difficult to process the vast amounts of available sequence data. This increased processing time is especially critical in DNA sequence clustering because of the intrinsic difficulty in parallelization. Thus, there is a strong demand for a faster clustering algorithm.

Assessment of the parallelization approach of d2_cluster for high‐performance sequence clustering

Journal of …, 2002

The exponential increase in expressed sequence tag (EST) sequence data amplifies the computational cost of clustering sequences such that new algorithms are required to analyze data at a greater rate. We have parallelized d2_cluster on a SGI Origin 2000 multiprocessor and observed a speedup of approximately 100ϫ on 126 processors when processing a 15,876 EST dataset. The parallelized d2_cluster code is obtainable from the SANBI website

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences

BMC Bioinformatics, 2010

Background: We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created. Results: The performance of the proposed algorithm is validated via comparison to the popular DNA/RNA sequence clustering approach, CD-HIT-EST, and to the recently developed algorithm, UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. The proposed algorithm maintains a comparable CPU execution time with that of CD-HIT-EST which is much slower than UCLUST, and has successfully generated clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets. Conclusions: We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

Partitioning clustering algorithms for protein sequence data sets

Biodata Mining, 2009

Background: Genome-sequencing projects are currently producing an enormous amount of new sequences and cause the rapid increasing of protein sequence databases. The unsupervised classification of these data into functional groups or families, clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs to automatically and accurately classify sequences into families become a necessity. A significant number of methods have addressed the clustering of protein sequences and most of them can be categorized in three major groups: hierarchical, graph-based and partitioning methods. Among the various sequence clustering methods in literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are extremely used in other fields, few applications have been found in the field of protein sequence clustering. It is not fully demonstrated if partitioning methods can be applied to protein sequence data and if these methods can be efficient compared to the published clustering methods.

A novel hierarchical clustering algorithm for gene sequences

2012

Background: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. This method transforms DNA sequences into the feature vectors which contain the occurrence, location and order relation of k-tuples in DNA sequence. Afterwards, a hierarchical procedure is applied to clustering DNA sequences based on the feature vectors.

A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics (Oxford, England), 2018

The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can signi...

An efficient incremental protein sequence clustering algorithm

… 2003. Conference on …, 2003

Abstract-Clustering is the division of data into groups of similar objects. The main objective of this unsupervised learning technique is to find a natural grouping or meaning-ful partition by using a distance or similarity function. Clus-tering is mainly used for dimensionality ...