A Deterministic Algorithm for DNA Sequence Comparison

Similarity evaluation of DNA sequences based on nucleotide similarity

Proceedings of the 3rd International Conference on Information and Communication Systems - ICICS '12, 2012

Background: DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort over the last two decades, and a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency, or geometric representation, each of which has its own advantages and disadvantages. Results: To compute the similarity between DNA sequences effectively, this paper introduces a novel method based on frequency patterns and entropy for constructing representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, comparing it with two recently developed alignment-free methods and the BLASTN tool. When tested on the β-globin genes of 11 species, using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. Conclusions: Our method not only captures fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms the major existing methods and tools.
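
While the paper builds its vectors from maximal frequent patterns weighted by entropy, the underlying idea of alignment-free comparison via frequency vectors can be sketched with plain k-mer counts (a simplified stand-in, not the paper's method):

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=3):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse k-mer count vectors."""
    keys = set(v1) | set(v2)
    dot = sum(v1.get(key, 0) * v2.get(key, 0) for key in keys)
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = kmer_vector("ACGTACGTACGT")
b = kmer_vector("ACGTACGAACGT")
print(round(cosine_similarity(a, a), 3))  # identical sequences → 1.0
print(cosine_similarity(a, b) < 1.0)      # a single change lowers similarity
```

Like the paper's vectors, this representation needs no alignment; unlike them, it ignores pattern location and ordering.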

A deterministic algorithm for alpha-numeric sequence comparison with application to protein sequence detection

This paper describes an extension of a deterministic algorithm [1,2] that was initially designed to measure the rate of similarity between DNA sequences, or between any sequences made up of symbols from an alphabet of cardinality 4. Here, a modified and extended version that handles sequences of symbols from alphabets of cardinality greater than 4 is presented. This extension broadens the algorithm's application area. As a test ground, we search for peptides within a protein database. Computational results on real data and a comparison with BLAST are discussed. Keywords: BLAST, deterministic algorithm, alpha-numeric sequence, numeration system, database
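
The abstract does not spell out the algorithm, but the "numeration system" keyword suggests treating symbols as digits. A hypothetical sketch, assuming a rolling base-|Σ| encoding of fixed-length windows so that window comparisons reduce to integer comparisons (the actual deterministic algorithm of [1,2] is more involved):

```python
def encode_windows(seq, k, alphabet="ACGT"):
    """Map each length-k window of seq to its base-|alphabet| integer value,
    updating the code in O(1) per position (rolling encoding)."""
    digit = {c: i for i, c in enumerate(alphabet)}
    base = len(alphabet)
    codes = []
    code = 0
    for i, c in enumerate(seq):
        code = code * base + digit[c]
        if i >= k:
            # drop the digit that slid out of the window
            code -= digit[seq[i - k]] * base ** k
        if i >= k - 1:
            codes.append(code)
    return codes

# Shared codes between two sequences flag candidate similar regions.
a = set(encode_windows("ACGTAC", 3))
b = set(encode_windows("TTACGT", 3))
print(sorted(a & b))  # → [6, 27, 49] (codes of ACG, CGT, TAC)
```

Extending the alphabet (e.g. to the 20 amino acids for protein search) only changes the base of the numeration system.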

A Comparison of Computation Techniques for DNA Sequence Comparison

This project presents a comparative survey of DNA sequence comparison techniques. The techniques implemented are sequential comparison, multithreading on a single computer, and multithreading with parallel processing. The project examines the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform. Tiling is an important technique for extracting parallelism. Informally, tiling consists of partitioning the iteration space into several chunks of computation called tiles (blocks) such that sequential traversal of the tiles covers the entire iteration space. The idea behind tiling is to increase the granularity of computation and decrease the amount of communication between processors. This makes tiling particularly suitable for distributed-memory architectures, where communication startup costs are very high and frequent communication is therefore undesirable. Our sequence-comparison mechanism and software support the identification of DNA sequences.
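
The parallelism that tiling exposes in a sequence-comparison DP can be sketched as follows (a minimal illustration, not the project's implementation): since cell (i, j) depends on (i-1, j), (i, j-1) and (i-1, j-1), tiles on the same anti-diagonal have no mutual dependence and can be assigned to different processors.

```python
from collections import defaultdict

def tiles(rows, cols, size):
    """Enumerate top-left corners of the size×size tiles that cover
    an rows×cols iteration space."""
    for i0 in range(0, rows, size):
        for j0 in range(0, cols, size):
            yield i0, j0

def wavefronts(rows, cols, size):
    """Group tiles by anti-diagonal index. Tiles within one group are
    independent under the standard DP dependence pattern, so each group
    (wavefront) can be executed in parallel, groups in order."""
    groups = defaultdict(list)
    for i0, j0 in tiles(rows, cols, size):
        groups[i0 // size + j0 // size].append((i0, j0))
    return [groups[d] for d in sorted(groups)]

for wave in wavefronts(8, 8, 4):
    print(wave)  # one line per parallel step
```

Larger tiles mean fewer, bigger messages between processors, which is exactly the trade-off the abstract describes for distributed-memory machines.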

A New Edit Distance Method for Finding Similarity in DNA Sequences

The P-Bigram method is a string comparison method based on an internal two-character similarity measure. The edit distance between two strings is the minimal number of elementary editing operations required to transform one string into the other; the elementary operations are deletion, insertion, and substitution of two characters. In this paper, we apply the P-Bigram method to solve the similarity problem for DNA sequences. The method provides an efficient algorithm that locates all minimum-cost operations in a string. We implemented the algorithm and found that our program computes smaller distances than the single-character approach. We develop the P-Bigram edit distance, relate the edit distance to similarity, and implement both using dynamic programming. The performance of the proposed approach is evaluated using the number of edits and a percentage-similarity measure.
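
For reference, the classical single-character edit distance that P-Bigram extends can be computed with the standard dynamic program (this sketch shows the baseline, not the two-character P-Bigram variant):

```python
def edit_distance(s, t):
    """Levenshtein edit distance via dynamic programming.
    dp[i][j] = minimal number of edits turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[m][n]

print(edit_distance("ACGT", "AGGT"))  # → 1 (one substitution)
```

A percentage similarity can then be derived, e.g. as 1 - distance / max(len(s), len(t)).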

A probabilistic measure for alignment-free sequence comparison

Bioinformatics/computer Applications in The Biosciences, 2004

Motivation: Alignment-free sequence comparison methods are still in the early stages of development compared to alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models.
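
A minimal sketch of the idea of comparing sequences via estimated Markov models (the paper defines a proper probabilistic measure; the Euclidean distance below is only an illustrative stand-in):

```python
from collections import defaultdict

def transition_probs(seq):
    """Order-1 Markov transition probabilities estimated from a sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        for b, c in nxt.items():
            probs[(a, b)] = c / total
    return probs

def model_distance(p, q):
    """Euclidean distance between two transition-probability tables
    (a simple stand-in for the paper's probabilistic measure)."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys) ** 0.5

p = transition_probs("ACGTACGTACGT")
print(model_distance(p, p))  # identical models → 0.0
```

No alignment is needed: each sequence is summarized by its model, and sequences are compared through their models.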

A Fast Optimal DNA Sequence Similarity Search

A routine operation for a biologist is to query a newly discovered DNA sequence against a collection of sequence databases to find a list of similar sequences. The results are used to infer the function of the query sequence. The size of DNA databases grows exponentially every year. Consequently, algorithms that find optimally sensitive sequence-similarity results can be time-consuming. Dynamic programming algorithms with quadratic running time are frequently used to produce a locally optimal sequence alignment; however, they are cost-prohibitive for long DNA sequences. Using local alignment, this paper presents a framework to search for a set of similar sequences in large-scale DNA databases with optimal output and minimum cost. The Knuth-Morris-Pratt (KMP) algorithm is adapted to act as a filtering mechanism before exhaustive dynamic programming is applied: it scans patterns generated from the query sequence against the sequences in the databases. This filtering process generates scores that are used for ranking. The Smith-Waterman algorithm is then applied to each sequence, starting from the top of the constructed ranking. The paper also discusses the pattern length that is most appropriate for the database scanning process. Experimental results show that the filtering mechanism discards irrelevant sequences before the Smith-Waterman algorithm is executed, minimizing the time needed to search for and retrieve the set of sequences similar to the query.
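
The expensive second stage of the framework, Smith-Waterman local alignment, can be sketched in its score-only form (an illustration; the scoring parameters below are assumptions, not taken from the paper):

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, score-only, O(m*n) time
    and O(n) space. Clamping at 0 restarts the alignment anywhere,
    which is what makes the alignment local."""
    m, n = len(s), len(t)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman("ACGTTA", "CGTT"))  # → 8: "CGTT" matches exactly (4 × 2)
```

Because this cost is paid per database sequence, a cheap exact-matching filter such as KMP in front of it pays off quickly.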

A New Combinatorial Approach to Sequence Comparison

Theory of Computing Systems, 2008

In this paper we introduce a new alignment-free method for comparing sequences that is combinatorial in nature and uses neither a compressor nor any information-theoretic notion. The method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of data compression. The extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. Using this transformation, we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output string. The method is tested on a real data set for the whole-mitochondrial-genome phylogeny problem. More generally, the goal of this paper is to introduce a new and general methodology for the automatic categorization of sequences.
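
A minimal sketch of the underlying transform and of the "mixing" idea (the paper uses a proper extension of the BWT to multisets; the per-sequence sentinels and raw mixing count below are a naive stand-in):

```python
def bwt(s):
    """Burrows-Wheeler Transform via sorted rotations ('$' sentinel)."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mixing_score(s1, s2):
    """Sort all rotations of both sequences together (each tagged with
    its origin) and count how often adjacent rotations come from
    different sequences: similar sequences interleave more."""
    tagged = []
    for tag, s in ((0, s1 + "\x00"), (1, s2 + "\x01")):
        tagged += [(s[i:] + s[:i], tag) for i in range(len(s))]
    tags = [t for _, t in sorted(tagged)]
    return sum(a != b for a, b in zip(tags, tags[1:]))

print(bwt("banana"))                                      # → "annb$aa"
print(mixing_score("ACGT", "ACGT"), mixing_score("ACGT", "TTTT"))  # 9 vs 3
```

Identical sequences alternate perfectly in the sorted rotation list, while dissimilar ones cluster into separate blocks, so the count of origin changes serves as a crude similarity signal.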

Recognition of characteristic patterns in sets of functionally equivalent DNA sequences

Bioinformatics, 1987

An algorithm has been developed for the identification of unknown patterns that are distinctive for a set of short DNA sequences believed to be functionally equivalent. A pattern is defined as a string containing fully or partially specified nucleotides at each position. The advantage of this 'vague' definition of a pattern is that it imposes minimal constraints on the characterization of patterns. A new feature of the approach developed here is that it allows a 'fair' simultaneous testing of patterns of all degrees of degeneracy. The analysis is based on an evaluation of inhomogeneity in the empirical occurrence distribution of any such pattern within a set of sequences. The use of Parzen's nonparametric kernel density estimation allows one to assess small disturbances among the sequence alignments. The method also makes it possible to identify sequence subsets with different characteristic patterns. The algorithm was applied to the analysis of patterns characteristic of sets of promoter, terminator and splice-junction sequences. The results are compared with those obtained by other methods.
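
The notion of a pattern with "fully or partially specified nucleotides at each position" can be illustrated with IUPAC-style degenerate codes (a minimal matching example; the paper's statistical inhomogeneity analysis is not reproduced here):

```python
# A subset of the IUPAC nucleotide ambiguity codes: each pattern symbol
# stands for a set of allowed bases ('N' = fully unspecified).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}

def matches(pattern, window):
    """True if every base of the window is allowed at its position."""
    return all(base in IUPAC[p] for p, base in zip(pattern, window))

def occurrences(pattern, seq):
    """Start positions where a (possibly degenerate) pattern occurs."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if matches(pattern, seq[i:i+k])]

# 'TATAWT' allows A or T at the W position, so it hits both variants below.
print(occurrences("TATAWT", "GCTATAATCGTATATT"))  # → [2, 10]
```

The more degenerate symbols a pattern contains, the more often it occurs by chance, which is why the paper's 'fair' testing across degrees of degeneracy matters.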

Comparing DNA Sequences by Dynamic Programming in Sequential and Parallel Computer Environments

Proc. of the 2006 …, 2006

Comparing two sequences by using dynamic programming algorithms is studied. Both serial and (multiple-processor) parallel computer algorithms are discussed. Numerical performance of the developed software is validated through small- to large-scale applications. Results (based upon comparing two large sequences, 40,000 and 36,000 characters in length, respectively, and using 2-24 parallel processors) indicate that the developed software is reliable and