Supervised posteriors for DNA-motif classification (original) (raw)

Supervised Detection of Regulatory Motifs in DNA Sequences

Statistical Applications in Genetics and Molecular Biology, 2003

Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihoodbased cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE and biologically verified sites were identified.

Informative priors based on transcription factor structural class improve de novo motif discovery

Bioinformatics, 2006

Motivation: An important problem in molecular biology is to identify the locations at which a transcription factor (TF) binds to DNA, given a set of DNA sequences believed to be bound by that TF. In previous work, we showed that information in the DNA sequence of a binding site is sufficient to predict the structural class of the TF that binds it. In particular, this suggests that we can predict which locations in any DNA sequence are more likely to be bound by certain classes of TFs than others. Here, we argue that traditional methods for de novo motif finding can be significantly improved by adopting an informative prior probability that a TF binding site occurs at each sequence location. To demonstrate the utility of such an approach, we present PRIORITY, a powerful new de novo motif finding algorithm. Results: Using data from TRANSFAC, we train three classifiers to recognize binding sites of basic leucine zipper, forkhead, and basic helix loop helix TFs. These classifiers are used to equip PRIORITY with three class-specific priors, in addition to a default prior to handle TFs of other classes. We apply PRIORITY and a number of popular motif finding programs to sets of yeast intergenic regions that are reported by ChIP-chip to be bound by particular TFs. PRIORITY identifies motifs the other methods fail to identify, and correctly predicts the structural class of the TF recognizing the identified binding sites. Availability: Supplementary material and code can be found at

Bayesian Clustering of Transcription Factor Binding Motifs

Journal of the American Statistical Association, 2008

Genes are often regulated in living cells by proteins called transcription factors that bind directly to short segments of DNA in close proximity to specific genes. These binding sites have a conserved nucleotide appearance, which is called a motif. Several recent studies of transcriptional regulation require the reduction of a large collection of motifs into clusters based on the similarity of their nucleotide composition. We present a principled approach to this clustering problem based on a Bayesian hierarchical model that accounts for both within-and between-motif variability. We use a Dirichlet process prior distribution that allows the number of clusters to vary and we also present a novel generalization that allows the core width of each motif to vary. This clustering model is implemented, using a Gibbs sampling strategy, on several collections of transcription factor motif matrices. Our stochastic implementation allows us to examine the variability of our results in addition to focusing on a set of best clusters. Our clustering results identify several motif clusters that suggest that several transcription factor protein families are actually mixtures of several smaller groups of highly similar motifs, which provide substantially more refined information compared with the full set of motifs in the family. Our clusters provide a means by which to organize transcription factors based on binding motif similarities and can be used to reduce motif redundancy within large databases such as JASPAR and TRANSFAC, which aides the use of these databases for further motif discovery. Finally, our clustering procedure has been used in combination with discovery of evolutionarily conserved motifs to predict co-regulated genes. An alternative to our Dirichlet process prior distribution is presented that differs substantially in terms of a priori clustering characteristics, but shows no substantive difference in the clustering results for our dataset. Despite our specific application to transcription factor binding motifs, our Bayesian clustering model based on the Dirichlet process has several advantages over traditional clustering methods that could make our procedure appropriate and useful for many clustering applications.

Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences

Bioinformatics, 1998

MOTIVATION: Automatic extraction of motifs that occur frequently on a set of unaligned DNA sequences is useful for predicting the binding sites of unknown transcription factors. Several programs for this purpose have been released. However, in our opinion, they are not practical enough to be applied to a large number of upstream sequences. RESULTS: We propose a new program called YEBIS (Yet another Environment for the analysis of BIopolymer Sequences) which is capable of extracting a set of motifs, without any a priori knowledge, from a number of functionally related DNA sequences. Using the hidden Markov model, these motifs are represented in a more general form than other conventional methods, such as the weight matrix method. When applied to several sets of benchmark data, it was found that YEBIS had comparable capability to the existing methods, but was much faster. Moreover, it could extract all known motifs from the LTR sequences (long terminal repeat sequences) in a single ru...

BayesMD: flexible biological modeling for motif discovery

2008

We present BayesMD, a Bayesian Motif Discovery model with several new features. Three different types of biological a priori knowledge are built into the framework in a modular fashion. A mixture of Dirichlets is used as prior over nucleotide probabilities in binding sites. It is trained on transcription factor (TF) databases in order to extract the typical properties of TF binding sites. In a similar fashion we train organism-specific priors for the background sequences. Lastly, we use a prior over the position of binding sites.

DNA motif elucidation using belief propagation

Nucleic acids research, 2013

Protein-binding microarray (PBM) is a high-throughout platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8 $10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. Especially, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source codes are available at the authors' websites: e.g.

Optimized mixed Markov models for motif identification

BMC bioinformatics, 2006

Identifying functional elements, such as transcriptional factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to limited availability of training samples. We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods to allow adjustment of model complexity for different motifs. In comparison with other leading methods, OMiMa can incorporate more than the NNSplice's pairwise dependencies; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our OMiMa system, to our knowledge, is the only motif findi...

Sequence features of DNA binding sites reveal structural class of associated transcription factor

Bioinformatics, 2005

Motivation: A key goal in molecular biology is to understand the mechanisms by which a cell regulates the transcription of its genes. One important aspect of this transcriptional regulation is the binding of transcription factors (TFs) to their specific cis-regulatory counterparts on the DNA. TFs recognize and bind their DNA counterparts according to the structure of their DNA-binding domains (e.g. zinc finger, leucine zipper, homeodomain). The structure of these domains can be used as a basis for grouping TFs into classes. Although the structure of DNAbinding domains varies widely across TFs generally, the TFs within a particular class bind to DNA in a similar fashion, suggesting the existence of class-specific features in the DNA sequences bound by each class of TFs. Results: In this paper, we apply a sparse Bayesian learning algorithm to identify a small set of class-specific features in the DNA sequences bound by different classes of TFs; the algorithm simultaneously learns a true multi-class classifier that uses these features to predict the DNA-binding domain of the TF that recognizes a particular set of DNA sequences. We train our algorithm on the six largest classes in TRANSFAC, comprising a total of 587 TFs. We learn a six-class classifier for this training set that achieves 87% leave-one-out crossvalidation accuracy. We also identify features within cis-regulatory sequences that are highly specific to each class of TF, which has significant implications for how TF binding sites should be modeled for the purpose of motif discovery.

Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis

BMC Bioinformatics, 2010

Background: One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and to deepen our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, and possibly lead to questionable conclusions. Results: With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a-priori information, we derive a generalization of the commonly-used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites. Conclusions: We find that comparisons of different learning principles using the same a-priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. We implement the derived prior in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis. In case of the DNA alphabet, the BDeu metric determines the hyper-parameters for the PWM model to be a c,ℓ,b = 8, while it Keilwagen et al.