Identifying Regulatory Signals in DNA-Sequences with a Non-statistical Approximation Approach

Approximate Algorithms for Regulatory Motif Discovery in DNA

2019

Motif discovery is the problem of finding common substrings within a set of biological strings. It can therefore be applied to finding Transcription Factor Binding Sites (TFBS) that share common patterns (motifs). A transcription factor molecule can bind to multiple binding sites in the promoter regions of different genes, making these genes co-regulated. The Planted (l, d) Motif Problem (PMP) is a classic version of motif discovery where l is the motif length and d is the maximum allowed mutation distance. The quorum Planted (l, d, q) Motif Problem (qPMP) is a version of PMP where the motif of length l occurs in at least q percent of the sequences with up to d mismatches. In this thesis we develop the Strong Motif Finder (SMF) and quorum Strong Motif Finder (qSMF) algorithms and evaluate their performance. SMF returns a list of its highest ranked (strongest) motifs. The performance of SMF is compared with the APMotif and MEME algorithms with respect to execution time and prediction accuracy. Several performance metrics are used at both the nucleotide and the site level. The algorithms are tested on simulated datasets. The time comparisons show that SMF is faster than APMotif and MEME (ANR) and similar in speed to MEME (ZOOPS). MEME with the OOPS option is the fastest but is not practical if no prior knowledge is available. The prediction accuracy results show that SMF outperforms APMotif and matches the best prediction accuracy of MEME (with the OOPS option), even though SMF is given no a priori information. In addition, SMF is tested on real DNA datasets of orthologous regulatory regions from multiple species, without using their phylogenetic tree. The experiments indicate that the SMF results agree with published motifs.
The quorum Strong Motif Finder (qSMF) returns a list of the highest ranked (strongest) motifs occurring in at least q percent of the data sequences. The algorithm is tested on large ChIP-Seq datasets that were sampled using the SamSelect algorithm. In comparison with the FMotif algorithm, the experimental results show that qSMF is faster and returns predicted motifs similar to results in the literature and to motifs discovered by the ENCODE project tool, which uses the established motif finding algorithms AlignACE, MEME, MDscan, Trawler, and Weeder. To determine the strength or significance of the predicted motifs, a scoring function, the Motif Strength Score (MSS), is proposed for ranking the discovered motifs in both algorithms. In future work, this score can be combined with other statistical scores, such as the complexity score, P-value and information content, to better determine motif significance.
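
The (l, d) and (l, d, q) definitions above can be made concrete with a small check of whether a candidate motif satisfies them. This is only a sketch of the problem statement, not of the SMF or qSMF algorithms, whose internals the abstract does not describe. The function names are hypothetical, and q is expressed as a fraction between 0 and 1 rather than a percentage.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, sequence: str, d: int) -> bool:
    """True if some window of len(motif) in `sequence` has at most d mismatches."""
    l = len(motif)
    return any(
        hamming(motif, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def quorum_fraction(motif: str, sequences: Sequence[str], d: int) -> float:
    """Fraction of sequences containing the motif with at most d mismatches."""
    if not sequences:
        return 0.0
    hits = sum(occurs_within(motif, s, d) for s in sequences)
    return hits / len(sequences)


def is_planted_motif(motif: str, sequences: Sequence[str], d: int, q: float = 1.0) -> bool:
    """q = 1.0 corresponds to PMP (every sequence); q < 1.0 to the quorum variant."""
    return quorum_fraction(motif, sequences, d) >= q
```

With q left at 1.0 the check requires an occurrence in every sequence, which is the classic PMP condition; lowering q relaxes it to the quorum version.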

Motif Finding with Application to the Transcription Factor Binding Sites Problem

International Journal of Computer Applications, 2015

DNA sequencing of different species has generated a huge amount of biological data. There is an increasing need to develop computational techniques to search for relevant information in the DNA data. Discovering motifs involves determining short sequence segments which have a high probability of repeated occurrences over many sequences in different species. Motifs are useful in finding transcription factor binding sites, transcriptional regulatory elements and so on. Transcription factor binding sites (TFBSs) are important for understanding the genetic regulatory system. Our method is based on Ant Colony Optimization (ACO) and the Gibbs sampling algorithm to discover DNA motifs (collections of TFBSs) in a set of DNA sequences. We first apply an ACO algorithm to find a set of better candidate positions for the motif. The resulting positions are given as input to the Gibbs sampler, which calculates a score for each sequence. Based on the score, the motif for the TF binding sites is identified.

Supervised Detection of Regulatory Motifs in DNA Sequences

Statistical Applications in Genetics and Molecular Biology, 2003

Identification of transcription factor binding sites (regulatory motifs) is a major interest in contemporary biology. We propose a new likelihood based method, COMODE, for identifying structural motifs in DNA sequences. Commonly used methods (e.g. MEME, Gibbs sampler) model binding sites as families of sequences described by a position weight matrix (PWM) and identify PWMs that maximize the likelihood of observed sequence data under a simple multinomial mixture model. This model assumes that the positions of the PWM correspond to independent multinomial distributions with four cell probabilities. We address supervising the search for DNA binding sites using the information derived from structural characteristics of protein-DNA interactions. We extend the simple multinomial mixture model by incorporating constraints on the information content profiles or on specific parameters of the motif PWMs. The parameters of this extended model are estimated by maximum likelihood using a nonlinear constraint optimization method. Likelihood-based cross-validation is used to select model parameters such as motif width and constraint type. The performance of COMODE is compared with existing motif detection methods on simulated data that incorporate real motif examples from Saccharomyces cerevisiae. The proposed method is especially effective when the motif of interest appears as a weak signal in the data. Some of the transcription factor binding data of Lee et al. (2002) were also analyzed using COMODE and biologically verified sites were identified.
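
The independent-positions PWM model mentioned above can be illustrated with a short sketch: each column is a multinomial over the four bases, and the log-likelihood of a word is the sum of per-position log probabilities. This is not COMODE and includes none of its constraints; the function names and the pseudocount value are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites.

    The pseudocount (an assumed smoothing choice) keeps every probability nonzero.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for j in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in ALPHABET})
    return pwm


def log_likelihood(pwm, word):
    """Log-probability of `word` under the PWM, treating positions as independent."""
    if len(word) != len(pwm):
        raise ValueError("word length must equal PWM width")
    return sum(math.log(column[base]) for column, base in zip(pwm, word))


pwm = build_pwm(["ACG", "ACG", "ACT"])
print(log_likelihood(pwm, "ACG") > log_likelihood(pwm, "TTT"))
```

A word resembling the training sites scores higher than an unrelated one, which is the property that likelihood-based motif finders exploit when searching over PWMs.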

A mathematical model in the study of genes for identifying transcription factor binding sites

Computers & Mathematics with Applications, 2006

This paper deals with the analysis of a mathematical model for a study of genomics concerned with the differential regulation of gene expression. The approach being studied here pioneers the motif-based regression analysis of a single transcriptome. Implemented as the algorithm REDUCE (an acronym that stands for regulatory element detection using correlation with expression), the method naturally takes into account the combinatorial nature of gene expression regulation and provides context-specific information about transcription factor activities. © 2006 Elsevier Ltd. All rights reserved.
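
As a rough illustration of motif-based regression, the sketch below fits expression values against the number of occurrences of a single motif in each gene's promoter by ordinary least squares. It is a simplified single-motif stand-in, not the REDUCE implementation; the function names and the toy data in the usage lines are hypothetical.

```python
def count_occurrences(motif, sequence):
    """Count (possibly overlapping) exact occurrences of `motif` in `sequence`."""
    l = len(motif)
    return sum(sequence[i:i + l] == motif for i in range(len(sequence) - l + 1))


def fit_single_motif(motif, promoters, expression):
    """Least-squares fit of expression = intercept + slope * motif_count.

    Returns (intercept, slope). The slope plays the role of the motif's
    estimated contribution to expression per occurrence.
    """
    counts = [count_occurrences(motif, p) for p in promoters]
    n = len(counts)
    mean_count = sum(counts) / n
    mean_expr = sum(expression) / n
    variance = sum((c - mean_count) ** 2 for c in counts)
    if variance == 0:
        raise ValueError("motif count is constant across genes; slope is undefined")
    covariance = sum(
        (c - mean_count) * (e - mean_expr) for c, e in zip(counts, expression)
    )
    slope = covariance / variance
    return mean_expr - slope * mean_count, slope


# Toy data: three promoters with 2, 1 and 0 copies of the motif.
promoters = ["ACGTACGT", "TTACGTTT", "TTTTTTTT"]
expression = [2.1, 1.1, 0.1]
print(fit_single_motif("ACGT", promoters, expression))
```

A multi-motif version would regress on several count vectors at once, which is where the combinatorial aspect described in the abstract comes in.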

Maximally Efficient Modeling of DNA Sequence Motifs at All Levels of Complexity

Genetics, 2011

Identification of transcription factor binding sites is necessary for deciphering gene regulatory networks. Several new methods provide extensive data about the specificity of transcription factors but most methods for analyzing these data to obtain specificity models are limited in scope by, for example, assuming additive interactions or are inefficient in their exploration of more complex models. This article describes an approach—encoding of DNA sequences as the vertices of a regular simplex—that allows simultaneous direct comparison of simple and complex models, with higher-order parameters fit to the residuals of lower-order models. In addition to providing an efficient assessment of all model parameters, this approach can yield valuable insight into the mechanism of binding by highlighting features that are critical to accurate models.
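
One way to picture the simplex encoding is to place the four bases at the vertices of a regular tetrahedron in three dimensions, so that every pair of bases is equally far apart and no artificial ordering is imposed. The sketch below uses one standard set of tetrahedron coordinates; which base goes on which vertex is an assumption here and may differ from the article's convention.

```python
import math
from itertools import combinations

# Vertices of a regular tetrahedron centred on the origin.
# The base-to-vertex assignment is an illustrative assumption.
SIMPLEX = {
    "A": (1.0, 1.0, 1.0),
    "C": (1.0, -1.0, -1.0),
    "G": (-1.0, 1.0, -1.0),
    "T": (-1.0, -1.0, 1.0),
}


def encode(sequence):
    """Concatenate the 3-D vertex of each base into one flat feature vector."""
    vector = []
    for base in sequence:
        vector.extend(SIMPLEX[base])
    return vector


def pairwise_distances():
    """Euclidean distance between every pair of base vertices."""
    return {
        (a, b): math.dist(SIMPLEX[a], SIMPLEX[b])
        for a, b in combinations(SIMPLEX, 2)
    }


print(encode("AC"))
print(set(round(d, 9) for d in pairwise_distances().values()))
```

A sequence of length L becomes a vector of length 3L, which can then serve as the regressors for a linear specificity model.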

Subtle Motif Discovery for Detection of DNA Regulatory Sites

Proceedings of the 5th Asia-Pacific Bioinformatics Conference, 2007

We address the problem of detecting consensus motifs that occur with subtle variations across multiple sequences. These are usually functional domains in DNA sequences such as transcriptional binding factors or other regulatory sites. The problem in its generality has been considered difficult and various benchmark data serve as the litmus test for different computational methods. We present a method centered around unsupervised combinatorial pattern discovery. The parameters are chosen using a careful statistical analysis of consensus motifs. This method works well on the benchmark data and is general enough to be extended to a scenario where the variation in the consensus motif includes indels (along with mutations). We also present some results on detection of transcription binding factors in human DNA sequences.

An Efficient Combinatorial Approach for Solving the DNA Motif Finding Problem

2009

The detection of an over-represented sub-sequence in a set of (carefully chosen) DNA sequences is often the main clue leading to the investigation of a possible functional role for such a subsequence. Over-represented substrings (with possibly local mutations) in a biological string are termed motifs. A typical functional unit that can be modeled by a motif is a Transcription Factor Binding Site (TFBS), a portion of the DNA sequence apt to the binding of a protein that participates in complex transcriptomic biochemical reactions.

Making connections between novel transcription factors and their DNA motifs

Genome Research, 2005

The key components of a transcriptional regulatory network are the connections between trans-acting transcription factors and cis-acting DNA-binding sites. In spite of several decades of intense research, only a fraction of the estimated ∼300 transcription factors in Escherichia coli have been linked to some of their binding sites in the genome. In this paper, we present a computational method to connect novel transcription factors and DNA motifs in E. coli. Our method uses three types of mutually independent information, two of which are gleaned by comparative analysis of multiple genomes and the third one derived from similarities of transcription-factor-DNA-binding-site interactions. The different types of information are combined to calculate the probability of a given transcription-factor-DNA-motif pair being a true pair. Tested on a study set of transcription factors and their DNA motifs, our method has a prediction accuracy of 59% for the top predictions and 85% for the top t...

Recent Advances in the Computational Discovery of Transcription Factor Binding Sites

Algorithms, 2009

The discovery of gene regulatory elements requires the synergism between computational and experimental techniques in order to reveal the underlying regulatory mechanisms that drive gene expression in response to external cues and signals. Utilizing the large amount of high-throughput experimental data, constantly growing in recent years, researchers have attempted to decipher the patterns which are hidden in the genomic sequences. These patterns, called motifs, are potential binding sites to transcription factors which are hypothesized to be the main regulators of the transcription process. Consequently, precise detection of these elements is required and thus a large number of computational approaches have been developed to support the de novo identification of TFBSs. Even though novel approaches are continuously proposed and almost all have reported some success in yeast and other lower organisms, in higher organisms the problem still remains a challenge. In this paper, we therefore review the recent developments in computational methods for transcription factor binding site prediction. We start with a brief review of the basic approaches for binding site representation and promoter identification, then discuss the techniques to locate physical TFBSs, identify functional binding sites using orthologous information, and infer functional TFBSs within some context defined by additional prior knowledge. Finally, we briefly explore the opportunities for expanding these approaches towards the computational identification of transcriptional regulatory networks.

Detecting DNA regulatory motifs by incorporating positional trends in information content

Genome Biology, 2004

On the basis of the observation that conserved positions in transcription factor binding sites are often clustered together, we propose a simple extension to the model-based motif discovery methods. We assign position-specific prior distributions to the frequency parameters of the model, penalizing deviations from a specified conservation profile. Examples with both simulated and real data show that this extension helps discover motifs as the data become noisier or when there is a competing false motif.
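
The conservation profile referred to above is usually expressed as per-position information content. The sketch below computes it in bits for each column of a frequency matrix, assuming a uniform background of 0.25 per base; it shows only the profile itself, not the paper's position-specific priors, and the function names are hypothetical.

```python
import math


def column_information(column, background=0.25):
    """Information content of one column in bits: sum of p * log2(p / background).

    Zero-probability bases contribute nothing (the limit of p * log p as p -> 0).
    """
    return sum(p * math.log2(p / background) for p in column.values() if p > 0)


def information_profile(frequency_matrix):
    """Per-position information content for a list of column distributions."""
    return [column_information(column) for column in frequency_matrix]


# Toy motif: a fully conserved position, a two-base position, and a uniform one.
matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(information_profile(matrix))
```

With a uniform background the value ranges from 0 bits for an unconstrained position to 2 bits for a fully conserved one, so clustered conserved positions show up as a run of high values in the profile.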