Detecting DNA regulatory motifs by incorporating positional trends in information content (original) (raw)

Supervised Detection of Regulatory Motifs in DNA Sequences

Statistical Applications in Genetics and Molecular Biology, 2003

Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihoodbased cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE and biologically verified sites were identified.

Subtle Motif Discovery for Detection of Dna Regulatory Sites

Proceedings of the 5th Asia-Pacific Bioinformatics Conference, 2007

We address the problem of detecting consensus motifs, that occur with subtle variations, across multiple sequences. These are usually functional domains in DNA sequences such as transcriptional binding factors or other regulatory sites. The problem in its generality has been considered difficult and various benchmark data serve as the litmus test for different computational methods. We present a method centered around unsupervised combinatorial pattern discovery. The parameters are chosen using a careful statistical analysis of consensus motifs. This method works well on the benchmark data and is general enough to be extended to a scenario where the variation in the consensus motif includes indels (along with mutations). We also present some results on detection of transcription binding factors in human DNA sequences.

Identifying Regulatory Signals in DNA-Sequences with a Non-statistical Approximation Approach

The identification of regulatory signals is one of the most challenging tasks in bioinformatics. The development of gene-profiling technologies now makes it possible to obtain vast data on gene expression in a particular organism under various conditions. This has created the opportunity to identify and analyze the parts of the genome believed to be responsible for transcription control-the transcription factor DNA-binding motifs (TFBMs). Developing a practical and efficient computational tool to identify TFBMs will enable us to better understand the interplay among thousands of genes in a complex eukaryotic organism. This problem, which is mathematically formulated as the motif finding problem in computer science, has been studied extensively in recent years. We develop a new mathematical model and approximation technique for motif searching. Based on the graph theoretic and geometric properties of this approach, we propose a nonstatistical approximation algorithm to find motifs in...

Finding regulatory DNA motifs using alignment-free evolutionary conservation information

Nucleic Acids Research, 2010

As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.

Discovering Motifs with Transcription Factor Domain Knowledge

Biocomputing 2007 - Proceedings of the Pacific Symposium, 2006

We introduce a new motif-discovery algorithm, DIMDom, which exploits two additional kinds of information not commonly exploited: (a) the characteristic pattern of binding site classes, where class is determined based on biological information about transcription factor domains and (b) posterior probabilities of these classes. We compared the performance of DIMDom with MEME on all the transcription factors of Drosophila with at least one known binding site in the TRANSFAC database and found that DOMDom outperformed MEME with 2.5 times the number of successes and 1.5 times in the accuracy in finding binding sties and motifs. * The research was supported in parts by the RGC grant HKU 7120/06E.

Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective

Statistical Science, 2004

The Bayesian approach together with Markov chain Monte Carlo techniques has provided an attractive solution to many important bioinformatics problems such as multiple sequence alignment, microarray analysis and the discovery of gene regulatory binding motifs. The employment of such methods and, more broadly, explicit statistical modeling, has revolutionized the field of computational biology. After reviewing several heuristicsbased computational methods, this article presents a systematic account of Bayesian formulations and solutions to the motif discovery problem. Generalizations are made to further enhance the Bayesian approach. Motivated by the need of a speedy algorithm, we also provide a perspective of the problem from the viewpoint of optimizing a scoring function. We observe that scoring functions resulting from proper posterior distributions, or approximations to such distributions, showed the best performance and can be used to improve upon existing motif-finding programs. Simulation analyses and a real-data example are used to support our observation.

New scoring schema for finding motifs in DNA Sequences

BMC Bioinformatics, 2009

Background: Pattern discovery in DNA sequences is one of the most fundamental problems in molecular biology with important applications in finding regulatory signals and transcription factor binding sites. An important task in this problem is to search (or predict) known binding sites in a new DNA sequence. For this reason, all subsequences of the given DNA sequence are scored based on an scoring function and the prediction is done by selecting the best score. By assuming no dependency between binding site base positions, most of the available tools for known binding site prediction are designed. Recently Tomovic and Oakeley investigated the statistical basis for either a claim of dependence or independence, to determine whether such a claim is generally true, and they presented a scoring function for binding site prediction based on the dependency between binding site base positions. Our primary objective is to investigate the scoring functions which can be used in known binding site prediction based on the assumption of dependency or independency in binding site base positions. Results: We propose a new scoring function based on the dependency between all positions in biding site base positions. This scoring function uses joint information content and mutual information as a measure of dependency between positions in transcription factor binding site. Our method for modeling dependencies is simply an extension of position independency methods. We evaluate our new scoring function on the real data sets extracted from JASPAR and TRANSFAC data bases, and compare the obtained results with two other well known scoring functions. Conclusion: The results demonstrate that the new approach improves known binding site discovery and show that the joint information content and mutual information provide a better and more general criterion to investigate the relationships between positions in the TFBS. Our scoring function is formulated by simple mathematical calculations. By implementing our method on several biological data sets, it can be induced that this method performs better than methods that do not consider dependencies.

Supervised Detection of Conserved Motifs in DNA Sequences with Cosmo

Statistical Applications in Genetics and Molecular Biology, 2007

A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. We here introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, with stronger reliance, however, on the likelihood principle instead of more ad-hoc criteria like the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced ...

Supervised posteriors for DNA-motif classification

2007

Markov models have been proposed for the classification of DNA-motifs using generative approaches for parameter learning. Here, we propose to apply the discriminative paradigm for this problem and study twod ifferent priors to facilitate parameter estimation using the maximum supervised posterior.Considering sevensets of eukaryotic transcription factor binding sites we find this approach to be superior employing area under the ROCcurveand false positive rate as performance criterion, and better in general using sensitivity.I naddition, we discuss potential reasons for the improvedperformance.