A Biophysical Approach to Transcription Factor Binding Site Discovery

Marko Djordjevic 1,
Anirvan M. Sengupta 2, and
Boris I. Shraiman 2,3
1 Department of Physics, Columbia University, New York, New York 10025, USA
2 Department of Physics and BioMaPS Institute, Rutgers University, Piscataway, New Jersey 08854, USA

Abstract

Identification of transcription factor binding sites within regulatory segments of genomic DNA is an important step toward understanding the regulatory circuits that control gene expression. Here, we describe a novel bioinformatics method that bases classification of potential binding sites explicitly on an estimate of the sequence-specific binding energy of a given transcription factor. The method also estimates the chemical potential of the factor, which defines the threshold of binding. In contrast with the widely used information-theoretic weight matrix method, the new approach correctly describes saturation in the transcription factor/DNA binding probability. This results in a significant reduction in the number of expected false positives, particularly in the ubiquitous case of low-specificity factors. In the strong binding limit, the algorithm is related to the “support vector machine” approach to pattern recognition. The new method is used to identify likely genomic binding sites for the E. coli transcription factors collected in the DPInteract database. In addition, for CRP (a global regulatory factor), the likely regulatory modality (i.e., repressor or activator) of predicted binding sites is determined.
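The saturation effect mentioned in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that the binding energy E(S) is a sum of independent per-position contributions and that site occupancy follows the standard two-state equilibrium form 1/(1 + exp(E(S) − μ)), with energies and the chemical potential μ measured in units of kT. The matrix values, the value of `MU`, and all function names are hypothetical and are not taken from the article.

```python
import math

BASES = "ACGT"

# Hypothetical energy matrix (units of kT): one row per site position,
# one column per base in BASES order. 0.0 marks the preferred base.
ENERGY_MATRIX = [
    [0.0, 2.0, 2.0, 2.0],
    [1.5, 0.0, 1.5, 1.5],
    [2.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 0.0],
]
MU = 3.0  # hypothetical chemical potential (units of kT)


def binding_energy(site, energy_matrix=ENERGY_MATRIX):
    """Sum per-position energy contributions (assumed additive model)."""
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the number of matrix rows")
    return sum(row[BASES.index(base)] for row, base in zip(energy_matrix, site))


def binding_probability(energy, mu=MU):
    """Two-state equilibrium occupancy; approaches 1 when energy << mu."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def boltzmann_weight(energy, mu=MU):
    """Unsaturated weight, shown for comparison; it can exceed 1."""
    return math.exp(mu - energy)


if __name__ == "__main__":
    for site in ("ACGT", "ACGA", "CAAA"):
        e = binding_energy(site)
        print(site, e, binding_probability(e), boltzmann_weight(e))
```

In this sketch, sites with energy well below `MU` all have occupancy close to 1, so they are classified together as bound, whereas the unsaturated weight keeps growing as the energy decreases. A site with energy equal to `MU` has occupancy exactly 0.5, which is the sense in which the chemical potential acts as a binding threshold.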

Footnotes

[Supplemental material is available online at www.genome.org. The complete list of predicted sites may be found at http://www.biomaps.rutgers.edu/bioinformatics/QPMEME.htm.]
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1271603.
4 For brevity, from now on we refer to the free energy of binding simply as “binding energy.” In biophysical literature, the commonly used notation for this quantity would be ΔG(S) rather than E(S).
5 Although it is convenient to refer to TFs with variable binding sites as “low-specificity” factors, it must be remembered that the variability of binding sites is likely to result from these TFs being present at higher concentration than “high-specificity” factors, rather than from an intrinsically weaker sequence dependence of the TF/DNA interaction.
6 This provides a possible explanation for the case of FNR (see Fig. 4), where we remarked that the chemical potential we deduced from the search may be too low. This could happen if the experiments that generated the collection of FNR binding sites did not include the physiological condition of maximal FNR activation.
3 Corresponding author. E-MAIL shraiman{at}physics.rutgers.edu; FAX (805) 893-4111.
- Accepted September 3, 2003.
- Received March 1, 2003.
Cold Spring Harbor Laboratory Press