A Deterministic Algorithm for DNA Sequence Comparison

Similarity evaluation of DNA sequences based on nucleotide similarity

Proceedings of the 3rd International Conference on Information and Communication Systems - ICICS '12, 2012

Background: DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort over the last two decades, and a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency, or geometric representation, each of which has its own advantages and disadvantages. Results: To compute the similarity between DNA sequences effectively, this paper introduces a novel method based on frequency patterns and entropy for constructing representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, comparing it with two recently developed alignment-free methods and the BLASTN tool. When tested on the β-globin genes of 11 species, using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. Conclusions: Our method not only captures fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms the major existing methods and tools.
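
While the paper builds its vectors from maximal frequent patterns weighted by entropy, the underlying idea of alignment-free comparison via frequency vectors can be sketched with plain k-mer counts (a simplified stand-in, not the paper's method):

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=3):
    """Count all overlapping k-mers in a DNA sequence."""
    return Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse k-mer count vectors."""
    keys = set(v1) | set(v2)
    dot = sum(v1.get(key, 0) * v2.get(key, 0) for key in keys)
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

a = kmer_vector("ACGTACGTACGT")
b = kmer_vector("ACGTACGAACGT")
print(round(cosine_similarity(a, a), 3))  # identical sequences → 1.0
print(cosine_similarity(a, b) < 1.0)      # a single change lowers similarity
```

Like the paper's vectors, this representation needs no alignment; unlike them, it ignores pattern location and ordering.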

A deterministic algorithm for alpha-numeric sequence comparison with application to protein sequence detection

This paper describes an extension of a deterministic algorithm [1,2] that was initially designed to measure the rate of similarity between DNA sequences, or between any sequences made up of symbols from an alphabet of cardinality 4. Here, a modified and extended version that handles sequences of symbols from alphabets of cardinality greater than 4 is presented. This extension broadens the algorithm's application area. As a test ground, we search for peptides within a protein database. Computational results on real data and a comparison with BLAST are discussed. Keywords: BLAST, deterministic algorithm, alpha-numeric sequence, numeration system, database
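
The abstract does not spell out the algorithm, but the "numeration system" keyword suggests treating symbols as digits. A hypothetical sketch, assuming a rolling base-|Σ| encoding of fixed-length windows so that window comparisons reduce to integer comparisons (the actual deterministic algorithm of [1,2] is more involved):

```python
def encode_windows(seq, k, alphabet="ACGT"):
    """Map each length-k window of seq to its base-|alphabet| integer value,
    updating the code in O(1) per position (rolling encoding)."""
    digit = {c: i for i, c in enumerate(alphabet)}
    base = len(alphabet)
    codes = []
    code = 0
    for i, c in enumerate(seq):
        code = code * base + digit[c]
        if i >= k:
            # drop the digit that slid out of the window
            code -= digit[seq[i - k]] * base ** k
        if i >= k - 1:
            codes.append(code)
    return codes

# Shared codes between two sequences flag candidate similar regions.
a = set(encode_windows("ACGTAC", 3))
b = set(encode_windows("TTACGT", 3))
print(sorted(a & b))  # → [6, 27, 49] (codes of ACG, CGT, TAC)
```

Extending the alphabet (e.g. to the 20 amino acids for protein search) only changes the base of the numeration system.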

A Comparison of Computation Techniques for DNA Sequence Comparison

This project presents a comparative survey of DNA sequence comparison techniques. The techniques implemented are sequential comparison, multithreading on a single computer, and multithreading with parallel processing. The project examines the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform. Tiling is an important technique for extracting parallelism. Informally, tiling consists of partitioning the iteration space into several chunks of computation called tiles (blocks) such that sequential traversal of the tiles covers the entire iteration space. The idea behind tiling is to increase the granularity of computation and decrease the amount of communication between processors. This makes tiling particularly suitable for distributed-memory architectures, where communication startup costs are very high and frequent communication is therefore undesirable. Our sequence-comparison mechanism and software support the identification of DNA sequences.
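
The parallelism that tiling exposes in a sequence-comparison DP can be sketched as follows (a minimal illustration, not the project's implementation): since cell (i, j) depends on (i-1, j), (i, j-1) and (i-1, j-1), tiles on the same anti-diagonal have no mutual dependence and can be assigned to different processors.

```python
from collections import defaultdict

def tiles(rows, cols, size):
    """Enumerate top-left corners of the size×size tiles that cover
    an rows×cols iteration space."""
    for i0 in range(0, rows, size):
        for j0 in range(0, cols, size):
            yield i0, j0

def wavefronts(rows, cols, size):
    """Group tiles by anti-diagonal index. Tiles within one group are
    independent under the standard DP dependence pattern, so each group
    (wavefront) can be executed in parallel, groups in order."""
    groups = defaultdict(list)
    for i0, j0 in tiles(rows, cols, size):
        groups[i0 // size + j0 // size].append((i0, j0))
    return [groups[d] for d in sorted(groups)]

for wave in wavefronts(8, 8, 4):
    print(wave)  # one line per parallel step
```

Larger tiles mean fewer, bigger messages between processors, which is exactly the trade-off the abstract describes for distributed-memory machines.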

A New Edit Distance Method for Finding Similarity in DNA Sequences

The P-Bigram method is a string comparison method based on an internal two-character similarity measure. The edit distance between two strings is the minimal number of elementary editing operations required to transform one string into the other; the elementary operations are deletion, insertion, and substitution of two characters. In this paper, we apply the P-Bigram method to solve the similarity problem for DNA sequences. The method provides an efficient algorithm that locates all minimum-cost operations in a string. We implemented the algorithm and found that our program computes smaller distances than the single-character approach. We develop the P-Bigram edit distance, relate the edit distance to similarity, and implement both using dynamic programming. The performance of the proposed approach is evaluated using the number of edits and a percentage-similarity measure.
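
For reference, the classical single-character edit distance that P-Bigram extends can be computed with the standard dynamic program (this sketch shows the baseline, not the two-character P-Bigram variant):

```python
def edit_distance(s, t):
    """Levenshtein edit distance via dynamic programming.
    dp[i][j] = minimal number of edits turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[m][n]

print(edit_distance("ACGT", "AGGT"))  # → 1 (one substitution)
```

A percentage similarity can then be derived, e.g. as 1 - distance / max(len(s), len(t)).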

A probabilistic measure for alignment-free sequence comparison

Bioinformatics/computer Applications in The Biosciences, 2004

Motivation: Alignment-free sequence comparison methods are still in the early stages of development compared to alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models.
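
A minimal sketch of the idea of comparing sequences via estimated Markov models (the paper defines a proper probabilistic measure; the Euclidean distance below is only an illustrative stand-in):

```python
from collections import defaultdict

def transition_probs(seq):
    """Order-1 Markov transition probabilities estimated from a sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        for b, c in nxt.items():
            probs[(a, b)] = c / total
    return probs

def model_distance(p, q):
    """Euclidean distance between two transition-probability tables
    (a simple stand-in for the paper's probabilistic measure)."""
    keys = set(p) | set(q)
    return sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys) ** 0.5

p = transition_probs("ACGTACGTACGT")
print(model_distance(p, p))  # identical models → 0.0
```

No alignment is needed: each sequence is summarized by its model, and sequences are compared through their models.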

A Fast Optimal DNA Sequence Similarity Search

A routine operation for a biologist is to query a newly discovered DNA sequence against a collection of sequence databases to find a list of similar sequences. The results are used to infer the function of the query sequence. The size of DNA databases grows exponentially every year. Consequently, algorithms that find optimally sensitive sequence-similarity results can be time-consuming. Dynamic programming algorithms with quadratic running time are frequently used to produce a locally optimal sequence alignment; however, they are cost-prohibitive for long DNA sequences. Using local alignment, this paper presents a framework to search for a set of similar sequences in large-scale DNA databases with optimal output and minimum cost. The Knuth-Morris-Pratt (KMP) algorithm is adapted to act as a filtering mechanism before exhaustive dynamic programming is applied: it scans patterns generated from the query sequence against the sequences in the databases. This filtering process generates scores that are used for ranking. The Smith-Waterman algorithm is then applied to each sequence, starting from the top of the constructed ranking. The paper also discusses the pattern length that is most appropriate for the database scanning process. Experimental results show that the filtering mechanism discards irrelevant sequences before the Smith-Waterman algorithm is executed, minimizing the time needed to search for and retrieve the set of sequences similar to the query.
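
The expensive second stage of the framework, Smith-Waterman local alignment, can be sketched in its score-only form (an illustration; the scoring parameters below are assumptions, not taken from the paper):

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, score-only, O(m*n) time
    and O(n) space. Clamping at 0 restarts the alignment anywhere,
    which is what makes the alignment local."""
    m, n = len(s), len(t)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman("ACGTTA", "CGTT"))  # → 8: "CGTT" matches exactly (4 × 2)
```

Because this cost is paid per database sequence, a cheap exact-matching filter such as KMP in front of it pays off quickly.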

A New Combinatorial Approach to Sequence Comparison

Theory of Computing Systems, 2008

In this paper we introduce a new alignment-free method for comparing sequences that is combinatorial in nature and uses neither a compressor nor any information-theoretic notion. The method is based on an extension of the Burrows-Wheeler Transform, a transformation widely used in the context of data compression. The extended transformation takes as input a multiset of sequences and produces as output a string obtained by a suitable rearrangement of the characters of all the input sequences. Using this transformation, we give a general method for comparing sequences that takes into account how much the characters coming from the different input sequences are mixed in the output string. The method is tested on a real data set for the whole-mitochondrial-genome phylogeny problem. More generally, the goal of this paper is to introduce a new and general methodology for the automatic categorization of sequences.
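
A minimal sketch of the underlying transform and of the "mixing" idea (the paper uses a proper extension of the BWT to multisets; the per-sequence sentinels and raw mixing count below are a naive stand-in):

```python
def bwt(s):
    """Burrows-Wheeler Transform via sorted rotations ('$' sentinel)."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mixing_score(s1, s2):
    """Sort all rotations of both sequences together (each tagged with
    its origin) and count how often adjacent rotations come from
    different sequences: similar sequences interleave more."""
    tagged = []
    for tag, s in ((0, s1 + "\x00"), (1, s2 + "\x01")):
        tagged += [(s[i:] + s[:i], tag) for i in range(len(s))]
    tags = [t for _, t in sorted(tagged)]
    return sum(a != b for a, b in zip(tags, tags[1:]))

print(bwt("banana"))                                      # → "annb$aa"
print(mixing_score("ACGT", "ACGT"), mixing_score("ACGT", "TTTT"))  # 9 vs 3
```

Identical sequences alternate perfectly in the sorted rotation list, while dissimilar ones cluster into separate blocks, so the count of origin changes serves as a crude similarity signal.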

Recognition of characteristic patterns in sets of functionally equivalent DNA sequences

Bioinformatics, 1987

An algorithm has been developed for the identification of unknown patterns that are distinctive for a set of short DNA sequences believed to be functionally equivalent. A pattern is defined as a string containing fully or partially specified nucleotides at each position. The advantage of this 'vague' definition of a pattern is that it imposes minimal constraints on the characterization of patterns. A new feature of the approach developed here is that it allows a 'fair' simultaneous testing of patterns of all degrees of degeneracy. The analysis is based on an evaluation of inhomogeneity in the empirical occurrence distribution of any such pattern within a set of sequences. The use of Parzen's nonparametric kernel density estimation allows one to assess small disturbances among the sequence alignments. The method also makes it possible to identify sequence subsets with different characteristic patterns. The algorithm was applied to the analysis of patterns characteristic of sets of promoter, terminator and splice-junction sequences. The results are compared with those obtained by other methods.
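
The notion of a pattern with "fully or partially specified nucleotides at each position" can be illustrated with IUPAC-style degenerate codes (a minimal matching example; the paper's statistical inhomogeneity analysis is not reproduced here):

```python
# A subset of the IUPAC nucleotide ambiguity codes: each pattern symbol
# stands for a set of allowed bases ('N' = fully unspecified).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "N": "ACGT"}

def matches(pattern, window):
    """True if every base of the window is allowed at its position."""
    return all(base in IUPAC[p] for p, base in zip(pattern, window))

def occurrences(pattern, seq):
    """Start positions where a (possibly degenerate) pattern occurs."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if matches(pattern, seq[i:i+k])]

# 'TATAWT' allows A or T at the W position, so it hits both variants below.
print(occurrences("TATAWT", "GCTATAATCGTATATT"))  # → [2, 10]
```

The more degenerate symbols a pattern contains, the more often it occurs by chance, which is why the paper's 'fair' testing across degrees of degeneracy matters.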

Comparing DNA Sequences by Dynamic Programming in Sequential and Parallel Computer Environments

Proc. of the 2006 …, 2006

Comparing two sequences by using dynamic programming algorithms is studied. Both serial and (multiple-processor) parallel computer algorithms are discussed. Numerical performance of the developed software is validated through small- to large-scale applications. Results (based upon comparing two large sequences, 40,000 and 36,000 characters in length, respectively, and using 2-24 parallel processors) indicate that the developed software is reliable and