Identification of consensus patterns in unaligned DNA sequences known to be functionally related

Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation stati

1995

Using log-likelihood statistics to compare sequence alignments, we have been able to determine alignments from multiple, unaligned, functionally related DNA (Stormo and Hartzell. 1989. Proc. Natl. Acad. Sci. USA 86, 1183–1187; Hertz et al. 1990. Comput. Appl. Biosci. 6, 81–92) and protein sequences. In this paper, we reanalyze DNA sequences that bind the E. coli repressor LexA to demonstrate the ability of our scoring scheme to identify patterns when each sequence can contain zero or more binding sites. The scoring formula we have used previously does not allow for insertions and deletions in the alignments. In this paper, we use large-deviation statistics to extend the scoring formula to allow for insertions and deletions. The insertion-deletion penalty of this scoring scheme depends exclusively on the observed alignment rather than on previous observations or the user’s intuition. We also describe the close relationship between our formulas and hidden Markov models. Finally, we p...

Identifying DNA and protein patterns with statistically significant alignments of multiple sequences

Bioinformatics/Computer Applications in the Biosciences, 1999

Motivation: Molecular biologists can frequently obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring, or scoring, the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme. Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and, thus, the statistical significance of the corresponding alignment. Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein.
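
The scoring idea in this abstract can be illustrated with a minimal sketch. It assumes one common form of information content, the sum over alignment columns of f ln(f / p), where f is the observed letter frequency in a column and p is the background frequency; the abstract does not spell out the formula, so this form, the function names, and the uniform background in the usage note are assumptions for illustration only. The second function simply mirrors the statement that the count of possible alignments is multiplied by the P value.

```python
import math
from collections import Counter


def information_content(sites, background):
    """Information content (in nats) of an ungapped alignment.

    Assumed form: sum over columns and letters of f * ln(f / p), where f is
    the letter's frequency in the column and p its background frequency.
    `sites` is a list of equal-length strings; `background` maps each letter
    to a prior probability.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all aligned sites must have the same width")
    total = 0.0
    for col in range(width):
        counts = Counter(s[col] for s in sites)
        for letter, n in counts.items():
            f = n / len(sites)
            # Letters absent from a column contribute nothing (0 * ln 0 = 0).
            total += f * math.log(f / background[letter])
    return total


def expected_frequency(p_value, n_possible_alignments):
    """Expected number of alignments scoring at least this well by chance."""
    return p_value * n_possible_alignments
```

With a uniform background such as `{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}`, a perfectly conserved column contributes ln 4 and a column with all four letters equally represented contributes 0, so higher scores indicate more conserved alignments. Estimating the P value itself is the harder step the abstract describes and is not attempted here.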

A linear programming approach for identifying a consensus sequence on DNA sequences

Bioinformatics, 2005

Motivation: Maximum-likelihood methods for solving the consensus sequence identification (CSI) problem on DNA sequences may only find a local optimum rather than the global optimum. Additionally, such methods do not allow logical constraints to be imposed on their models. This study develops a linear programming technique to solve CSI problems by finding an optimum consensus sequence. This method is computationally more efficient and is guaranteed to reach the global optimum. The developed method can also be extended to treat more complicated CSI problems with ambiguous conserved patterns. Results: A CSI problem is first formulated as a non-linear mixed 0-1 optimization program, which is then converted into a linear mixed 0-1 program. The proposed method provides the following advantages over maximum-likelihood methods: (1) It is guaranteed to find the global optimum. (2) It can embed various logical constraints into the corresponding model. (3) It is applicable to problems with many long sequences. (4) It can find the second and the third best solutions. An extension of the proposed linear mixed 0-1 program is also designed to solve CSI problems with an unknown spacer length between conserved regions. Two examples of searching for CRP-binding sites and for FNR-binding sites in the Escherichia coli genome are used to illustrate and test the proposed method.

An alignment-free model for comparison of regulatory sequences

Bioinformatics, 2010

Some recent comparative studies have revealed that regulatory regions can retain function over large evolutionary distances, even though the DNA sequences are divergent and difficult to align. It is also known that such enhancers can drive very similar expression patterns. This poses a challenge for the in silico detection of biologically related sequences, as they can only be discovered using alignment-free methods. Results: Here, we present a new computational framework called Regulatory Region Scoring (RRS) model for the detection of functional conservation of regulatory sequences using predicted occupancy levels of transcription factors of interest. We demonstrate that our model can detect the functional and/or evolutionary links between some non-alignable enhancers with a strong statistical significance. We also identify groups of enhancers that are likely to be similarly regulated. Our model is motivated by previous work on prediction of expression patterns and it can capture similarity by strong binding sites, weak binding sites and even the statistically significant absence of sites. Our results support the hypothesis that weak binding sites contribute to the functional similarity of sequences.

A probabilistic measure for alignment-free sequence comparison

Bioinformatics/Computer Applications in the Biosciences, 2004

Motivation: Alignment-free sequence comparison methods are still in the early stages of development compared to those of alignment-based sequence analysis. In this paper, we introduce a probabilistic measure of similarity between two biological sequences without alignment. The method is based on the concept of comparing the similarity/dissimilarity between two constructed Markov models.
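
As a rough illustration of comparing two sequences through Markov models built from them, the sketch below fits a first-order Markov chain to each DNA sequence and measures how much worse each sequence is explained by the other's model than by its own. This is not the measure defined in the paper, whose details the abstract does not give; the model order, the pseudocount, the `ACGT` alphabet, and all function names are assumptions chosen for illustration.

```python
import math
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet; sequences must contain only these letters


def transition_probs(seq, pseudocount=1.0):
    """First-order Markov transition probabilities estimated from `seq`."""
    counts = {a + b: pseudocount for a, b in product(ALPHABET, repeat=2)}
    for a, b in zip(seq, seq[1:]):
        counts[a + b] += 1
    probs = {}
    for a in ALPHABET:
        row_total = sum(counts[a + b] for b in ALPHABET)
        for b in ALPHABET:
            probs[a + b] = counts[a + b] / row_total
    return probs


def avg_log_likelihood(seq, probs):
    """Mean log-probability per transition of `seq` under the model `probs`."""
    pairs = list(zip(seq, seq[1:]))  # requires len(seq) >= 2
    return sum(math.log(probs[a + b]) for a, b in pairs) / len(pairs)


def dissimilarity(x, y):
    """Symmetric score: how much worse each sequence fits the other's model."""
    px, py = transition_probs(x), transition_probs(y)
    d_xy = avg_log_likelihood(x, px) - avg_log_likelihood(x, py)
    d_yx = avg_log_likelihood(y, py) - avg_log_likelihood(y, px)
    return 0.5 * (d_xy + d_yx)
```

The score is zero when a sequence is compared with itself and symmetric in its arguments, but because of the pseudocounts it is not guaranteed to be non-negative for every pair, so it should be read as a heuristic rather than a true distance.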

Pattern recognition in several sequences: Consensus and alignment

Bulletin of Mathematical Biology, 1984

The comparison of several sequences is central to many problems of molecular biology. Finding consensus patterns that define genetic control regions or that determine structural or functional themes are examples of these problems. Previously proposed methods, such as dynamic programming, are not adequate for solving problems of realistic size. This paper gives a new and practical solution for finding

Contextual alignment of biological sequences (Extended abstract)

Bioinformatics, 2002

We present a model of contextual alignment of biological sequences. It is an extension of the classical alignment, in which we assume that the cost of a substitution depends on the surrounding symbols. In this model the cost of transforming one sequence into another depends on the order of editing operations. We present efficient algorithms for calculating this cost, as well as reconstructing (the representation of) all the orders of operations which yield this optimal cost. A precise characterization of the families of linear orders which can emerge this way is given. Contact: jty@mimuw.edu.pl [...] decide homology between proteins of quite different sequences. However, despite very intense efforts, predicting the shape of a protein, based on its sequence, is viewed as an extremely difficult problem.

Rigorous pattern-recognition methods for DNA sequences

Journal of Molecular Biology, 1985

The basic nature of the sequence features that define a promoter sequence for Escherichia coli RNA polymerase has been established by a variety of biochemical and genetic methods. We have developed rigorous analytical methods for finding unknown patterns that occur imperfectly in a set of several sequences, and have used them to examine a set of bacterial promoters. The algorithm easily discovers the "consensus" sequences for the -10 and -35 regions, which are essentially identical to the results of previous analyses, but requires no prior assumptions about the common patterns. By explicitly specifying the nature of the search for consensus sequences, we give a rigorous definition to this concept that should be widely applicable. We also have provided estimates for the statistical significance of common patterns discovered in sets of sequences. In addition to providing a rigorous basis for defining known consensus regions, we have found additional features in these promoters that may have functional significance. These added features were located on either side of the -35 region. The pattern 5', or upstream, from the -35 region was found using the standard alphabet (A, G, C and T), but the pattern between the -10 and the -35 regions was detectable only in a sub-alphabet. Recent results relating DNA sequence to helix conformation suggest that the former (upstream) pattern may have a functional significance. Possible roles in promoter function are discussed in this light, and an observation of altered promoter function involving the upstream region is reported that appears to support the suggestion of function in at least one case.

Recognition of characteristic patterns in sets of functionally equivalent DNA sequences

Bioinformatics, 1987

An algorithm has been developed for the identification of unknown patterns which are distinctive for a set of short DNA sequences believed to be functionally equivalent. A pattern is defined as being a string, containing fully or partially specified nucleotides at each position of the string. The advantage of this 'vague' definition of the pattern is that it imposes minimum constraints on the characterization of patterns. A new feature of the approach developed here is that it allows a 'fair' simultaneous testing of patterns of all degrees of degeneracy. This analysis is based on an evaluation of inhomogeneity in the empirical occurrence distribution of any such pattern within a set of sequences. The use of the nonparametric kernel density estimation of Parzen allows one to assess small disturbances among the sequence alignments. The method also makes it possible to identify sequence subsets with different characteristic patterns. This algorithm was implemented in the analysis of patterns characteristic of sets of promoters, terminators and splice junction sequences. The results are compared with those obtained by other methods.

Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Proceedings of the National Academy of Sciences, 2002

Genome-wide comparisons between enteric bacteria yield large sets of conserved putative regulatory sites on a gene-by-gene basis that need to be clustered into regulons. Using the assumption that regulatory sites can be represented as samples from weight matrices (WMs), we derive a unique probability distribution for assignments of sites into clusters. Our algorithm, "PROCSE" (probabilistic clustering of sequences), uses Monte Carlo sampling of this distribution to partition and align thousands of short DNA sequences into clusters. The algorithm internally determines the number of clusters from the data and assigns significance to the resulting clusters. We place theoretical limits on the ability of any algorithm to correctly cluster sequences drawn from WMs when these WMs are unknown. Our analysis suggests that the set of all putative sites for a single genome (e.g., Escherichia coli) is largely inadequate for clustering. When sites from different genomes are combined and all the homologous sites from the various species are used as a block, clustering becomes feasible. We predict 50-100 new regulons as well as many new members of existing regulons, potentially doubling the number of known regulatory sites in E. coli.

Related papers