A deterministic algorithm for alpha-numeric sequence comparison with application to protein sequence detection
Related papers
A Deterministic Algorithm for DNA Sequence Comparison
This paper suggests a novel way for measuring similarity between sequences of symbols from alphabets of small cardinality such as DNA and RNA sequences. The approach relies on finding one-to-one mappings between these sequences and a subset of the real numbers. Gaps in non-identical sequences are easily detected. Computational results on DNA sequences and a comparison with BLAST are included.
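As an illustration of the idea of mapping sequences one-to-one onto real numbers, here is a minimal sketch. The abstract does not give the paper's actual mapping, so the base-5 encoding, the digit assignment in `DIGITS`, and the function name `encode` are all assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical digit assignment; the abstract does not specify one.
# Digits 1..4 in base 5 (never 0) keep the mapping one-to-one even
# across sequences of different lengths.
DIGITS = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(seq):
    """Map a DNA string to an exact rational in [0, 1) via a base-5 expansion."""
    value = Fraction(0)
    for i, base in enumerate(seq, start=1):
        value += Fraction(DIGITS[base], 5 ** i)
    return value

if __name__ == "__main__":
    a, b = "ACGT", "ACGA"
    print(encode(a), encode(b), abs(encode(a) - encode(b)))
```

Exact rationals are used instead of floats so that distinct sequences never collapse to the same value through rounding. Two sequences are identical exactly when their encodings are equal.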
Multiple Sequence Comparison - A Peptide Matching Approach
1995
We present in this paper a peptide matching approach to the multiple comparison of a set of protein sequences. This approach consists in looking for all the words that are common to q of these sequences, where q is a parameter. Words are compared against a reference object called a model. In the case of proteins, a model is a product of subsets of the alphabet Σ of the amino acids. These subsets belong to a cover of Σ, that is, their union is all of Σ. A word is said to be an instance of a model if it belongs to the model. Further flexibility is introduced by allowing up to e errors in the comparison between a word and a model. These errors may be gaps or substitutions not allowed by the cover. A word is then said to be an occurrence of a model if the Levenshtein distance between it and an instance of the model is less than or equal to e. This corresponds to what we call a Set-Levenshtein distance between the occurrences an...
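The definitions of instance and occurrence can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' algorithm: the amino-acid groups in `MODEL` are hypothetical, and the function names `is_instance`, `set_levenshtein` and `is_occurrence` are introduced only for this example. The distance is computed with the standard edit-distance recurrence, in which a substitution costs nothing when the letter belongs to the subset at that position.

```python
def is_instance(word, model):
    """A word is an instance if each letter lies in the subset at its position."""
    return len(word) == len(model) and all(
        ch in subset for ch, subset in zip(word, model)
    )

def set_levenshtein(word, model):
    """Smallest Levenshtein distance between `word` and any instance of `model`.

    Assumes every subset in `model` is non-empty.
    """
    prev = list(range(len(model) + 1))  # cost of building a model prefix from ""
    for i, ch in enumerate(word, start=1):
        curr = [i]  # cost of deleting the first i letters of the word
        for j, subset in enumerate(model, start=1):
            cost = 0 if ch in subset else 1
            curr.append(min(prev[j] + 1,          # delete ch from the word
                            curr[j - 1] + 1,      # insert a letter from subset
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def is_occurrence(word, model, e):
    """A word is an occurrence if it is within e errors of some instance."""
    return set_levenshtein(word, model) <= e

# Hypothetical model: a product of three subsets of the amino-acid alphabet.
MODEL = [set("ILV"), set("DE"), set("KR")]

if __name__ == "__main__":
    for w in ("VEK", "VAK", "VK"):
        print(w, is_instance(w, MODEL), set_levenshtein(w, MODEL))
```

Because the letters of an instance can be chosen independently at each position, minimizing over all instances reduces to a single dynamic-programming pass over the model.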
Multiple sequence comparison — a peptide matching approach
Theoretical Computer Science, 1997
Abstract: We present in this paper a peptide matching approach to the multiple comparison of a set of protein sequences. This approach consists in looking for all the words that are common to q of these sequences, where q is a parameter.
Detection of protein coding sequences using a mixture model for local protein amino acid sequence.
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short ( 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.
International Journal of Engineering & Technology, 2018
Methods for comparing protein sequences based on different classified groups of amino acids contribute significantly to the literature on protein sequence comparison. But the methods vary with the choice of classified groups. Therefore, the purpose of the paper is to develop a unified approach towards the analysis of protein sequence comparison based on classification of amino acids in different groups of different cardinality. The paper considers the 4-group, 5-group and 6-group classifications of amino acids, and in each case it applies the unified method for comparing two types of protein sequences, viz., 9 proteins of ND5 category and 50 Corona virus Spike Proteins. The results agree with those obtained earlier by other methods based on classified groups of amino acids. Anyway it is found that the present unified formula is relatively simpler and fundamentally different from the earlier ones. Further, it can be applied co...
Similarity evaluation of DNA sequences based on nucleotides similarity
Proceedings of the 3rd International Conference on Information and Communication Systems - ICICS 12, 2012
Background: DNA sequence analysis is an important research topic in bioinformatics. Evaluating the similarity between sequences, which is crucial for sequence analysis, has attracted much research effort in the last two decades, and a dozen algorithms and tools have been developed. These methods are based on alignment, word frequency and geometric representation respectively, each of which has its advantages and disadvantages. Results: In this paper, for effectively computing the similarity between DNA sequences, we introduce a novel method based on frequency patterns and entropy to construct representative vectors of DNA sequences. Experiments are conducted to evaluate the proposed method, which is compared with two recently developed alignment-free methods and the BLASTN tool. When testing on the β-globin genes of 11 species and using the results from MEGA as the baseline, our method achieves higher correlation coefficients than the two alignment-free methods and the BLASTN tool. Conclusions: Our method is not only able to capture fine-granularity information (location and ordering) of DNA sequences via sequence blocking, but is also insensitive to noise and sequence rearrangement because it considers only the maximal frequent patterns. It outperforms major existing methods or tools.
Efficient Seeding Techniques for Protein Similarity Search
Communications in Computer and Information Science, 2008
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in . While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.
A survey on improving pattern matching algorithms for biological sequences
Pattern matching is a highly useful procedure in several stages of computational pipelines. Furthermore, some research trends in this domain have contributed to growing biological databases and kept them updated over time. This article presents a comparison and analysis of different pattern matching algorithms in terms of complexity, efficiency, and technique. Which algorithm is best for which DNA sequence, and why? It describes the different algorithms for various activities in which pattern matching is an important aspect of functionality. This article shows that the BM, Horspool, ZT, QS, FS, Smith, and SSABS methods employ the bad character preprocessing function. In addition, the BM, SSABS, TVSBS, and BRFS methods use two approaches in the preprocessing stage, which decreases the preprocessing time. Furthermore, KR, QS, SSABS, BRFS, and Shift-Or are not recommended for long patterns, whereas ZT, FS, d-BM, Raita, and Smith are not recommended for short patterns. This is because they are time-consuming, and certain algorithms, such as ZT and DCPM, use a lot of time and space during the matching and search process, while others, such as d-BM and TSW, save space and time. Although DCPM, BRFS, and QS are quicker than other algorithms, FLPM, PAPM, and LFPM rank highest in terms of time complexity.
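To make the bad character preprocessing step concrete, here is a minimal sketch of the Horspool variant, one of the methods the survey lists as using it. The function names `bad_char_table` and `horspool_search` are chosen for this example and are not taken from the article.

```python
def bad_char_table(pattern):
    """Shift for each character: distance from its last occurrence
    (excluding the final pattern position) to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

def horspool_search(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = bad_char_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry for the text character aligned with the
        # last pattern position; characters absent from the table shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return hits

if __name__ == "__main__":
    print(horspool_search("GATTACAGATTACA", "TTAC"))
```

On a four-letter DNA alphabet almost every character occurs in the pattern, so the shifts tend to be short. That is one reason the choice of algorithm depends on pattern length and alphabet size.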