Transcriptional regulation by the numbers: models

Author manuscript; available in PMC: 2012 Oct 27.

Published in final edited form as: Curr Opin Genet Dev. 2005 Apr;15(2):116–124. doi: 10.1016/j.gde.2005.02.007

Abstract

The expression of genes is regularly characterized with respect to how much, how fast, when and where. Such quantitative data demands quantitative models. Thermodynamic models are based on the assumption that the level of gene expression is proportional to the equilibrium probability that RNA polymerase (RNAP) is bound to the promoter of interest. Statistical mechanics provides a framework for computing these probabilities. Within this framework, interactions of activators, repressors, helper molecules and RNAP are described by a single function, the ‘regulation factor’. This analysis culminates in an expression for the probability of RNA polymerase binding at the promoter of interest as a function of the number of regulatory proteins in the cell.

Introduction

The biological literature on the regulation and expression of genes is, with increasing frequency, couched in the language of numbers. Four key ways in which gene expression is characterized quantitatively are through measurement of: (i) the level of expression relative to some reference value; (ii) how fast a given gene is expressed after induction; (iii) the precise relative timing of expression of different genes; and (iv) the spatial location of expression. In the first section of this review we revisit particular examples of such measurements in the bacterial setting. These provide the motivation for the models that form the main substance of this and the companion article [1••]. Through much of these reviews we call attention to particular revealing case studies rather than giving a thorough coverage of the literature.

How much, when and where?

One class of particularly well-characterized examples of gene expression levels includes cases associated with bacterial metabolism and the infection of bacteria by phage [2••,3]. This group will serve as the centerpiece of this and the companion article. In the classic case of the lac operon, several beautiful measurements have been taken. These characterize the extent to which the genes are repressed as a function of the strength of the operators, their spacing and the number of repressor molecules [4–6]. Similar measurements have been made for other genes implicated in bacterial metabolism, in addition to those tied to the decision between the lytic and lysogenic pathways after infection of Escherichia coli by phage lambda [7–11]. A second way by which the regulatory status of a given system is quantified is by measuring when genes of interest are being expressed. The list of examples is long and inspiring, and several representative case studies can be found in the literature [12–14]. A third way in which an increasingly quantitative picture of gene expression is emerging is based on the ability to make precise statements about the spatial location of the expression of different genes. Here, too, the number of different examples that can be mustered to prove the general point is staggering [15–17]. The key point of these examples is to note the growing pressure head of quantitative in vivo data, which calls for more than a cartoon-level description of expression.

The physicochemical modeling of the type of quantitative data described above is still in its infancy. One class of models, which will serve as the basis of this article, comprises the so-called ‘thermodynamic models’ [18–20]. The conceptual basis of this class of models is the idea that the expression level of the gene of interest can be deduced by examining the equilibrium probabilities that the DNA associated with that gene is occupied by various molecules, including RNAP and a battery of transcription factors (TFs) such as repressors and activators. There is a long-standing tradition of using these ideas to unravel the dynamics of gene expression systems; particularly important examples are associated with the famed lac operon and phage lambda systems [18,21–26]. Importantly, the thermodynamic models can serve as input to more general chemical kinetic models.

The key aim of this and the accompanying article [1••] is to show how the thermodynamic models yield a general conceptual picture of regulation using what we call the ‘regulation factor’ (see Glossary). Such arguments are useful because they enable direct comparison with quantitative experiments, such as those discussed above. The purpose of models is not just to ‘fit the data’ (although such fits can reveal which mechanisms are operative) but also to provide a conceptual scheme for understanding measurements and, more importantly, for suggesting new experiments. It is also worth noting that when such models fall short it provides an opportunity to find out why and learn something new.

This article is, to a large extent, pedagogical and aims to demonstrate how a microscopic picture of the various states of the gene of interest can be mathematized using statistical mechanics. The companion article [1••] is built around the analysis of case-studies in bacterial transcription and centers specifically on how the activity of a given promoter is altered (the ‘fold-change’ [see Glossary] in promoter activity) by the presence of transcription factors.

Thermodynamic models of gene regulation: the regulation factor

The fundamental tenet of the thermodynamic models for gene regulation is that we can replace the difficult task of computing the level of gene expression, as measured by the concentration of gene product ([protein]), with the more tractable question of the probability (pbound) that RNAP occupies the promoter of interest. More precisely, these models are founded on the idea that the instantaneous disposition of the gene of interest can be established from the probability that various molecules (RNAP, activators, repressors and inducers) are bound to their relevant targets.

Such models are based on a variety of different assumptions, all of which can and should be evaluated critically. Perhaps the most glaring assumption is that of equilibrium itself. This assumption can be examined quantitatively on the basis of the relative rates of transcription factor binding, RNAP binding, open complex formation, transcript formation and translation itself. For example, if the rate for open complex formation is much smaller than the rates for RNAP binding and unbinding from the promoter, then the probability of finding the polymerase on the promoter will be given by its equilibrium value. A second key assumption of this class of models is the idea that the probability of promoter occupancy by RNAP is simply proportional to the level of expression of a given gene. The difficulty lies in the fact that there are several different mechanisms that can intervene between RNAP binding and the existence of a functional gene product. Despite these caveats, we argue that this class of models is both instructive and predictive and, in those cases where the models are found wanting, provides an opportunity to learn something.

In this review, we first analyse the probability that RNAP will be bound at the promoter of interest in the absence of any activators or repressors. This is followed by cases of increasing complexity that involve batteries of transcription factors. Although our preliminary discussion is focused on the statistical mechanics of polymerase binding, the framework is the same for generic protein–DNA and protein–protein interactions. For the purposes of this review, we make the simplifying assumption that the key molecular players (RNAP and TFs) are bound to the DNA either specifically or non-specifically. This question has been addressed in the context of the λ switch [27], for the lac repressor [21,28] and for RNAP [29]. Stated differently, as a simplification, we will ignore the contribution of ‘free’ polymerase in the cytoplasm, in addition to those RNAP molecules that are engaged in transcription on other promoters. Relaxing this assumption has no effect on the framework developed below. Hence, to evaluate the probability of promoter occupancy in this simple model, the reservoir of RNAPs will be the non-specifically bound molecules (as shown in Figure 1a).

Figure 1.

Probability of promoter occupancy. (a) Schematic showing how, in the simple model, the DNA molecule serves as a reservoir for the RNAP molecules, almost all of which are bound to DNA. (b) Illustration of the states of the promoter, either with RNAP not bound or bound, and the remaining polymerase molecules distributed among the non-specific sites. The statistical weights associated with these different states of promoter occupancy are also shown. (c) Probability of binding of RNAP to promoter as a function of the number of RNAP molecules for two different promoters. We assume the number of non-specific sites is NNS = 5 × 10^6, and calculate the binding energy difference using the simple relation Δεpd = kBT ln(KpdS/KpdNS), where the equilibrium dissociation constants for specific binding (KpdS) and non-specific binding (KpdNS) are taken from in vitro measurements. In particular, making the simplest assumption that the genomic background for RNAP is given only by the non-specific binding of RNAP with DNA, we take KpdNS = 10000 nM [37], for the lac promoter KpdS = 550 nM [38] and for the T7 promoter KpdS = 3 nM [39]. For the lac promoter, this results in Δεpd = −2.9 kBT and for the T7 promoter, Δεpd = −8.1 kBT.
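The energy differences quoted in this legend follow directly from the dissociation constants. The short Python sketch below reproduces that arithmetic; the function name `delta_eps_kT` is ours, introduced purely for illustration, and the constants are the in vitro values cited in the legend.

```python
import math

def delta_eps_kT(kd_specific_nM, kd_nonspecific_nM):
    """Binding energy difference in units of kBT: ln(Kd_S / Kd_NS)."""
    return math.log(kd_specific_nM / kd_nonspecific_nM)

KD_NONSPECIFIC_NM = 10000.0  # non-specific RNAP-DNA binding (legend value)

delta_lac = delta_eps_kT(550.0, KD_NONSPECIFIC_NM)  # lac promoter
delta_t7 = delta_eps_kT(3.0, KD_NONSPECIFIC_NM)     # T7 promoter

print(f"lac: {delta_lac:.1f} kBT")  # lac: -2.9 kBT
print(f"T7:  {delta_t7:.1f} kBT")   # T7:  -8.1 kBT
```

Because only the ratio of the two dissociation constants enters, the result does not depend on the concentration unit as long as both constants use the same one.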

To evaluate the probability of polymerase binding (pbound) we must sum the Boltzmann weights (see Glossary) over all possible states of P polymerase molecules on DNA [30••,31••]. P is the effective number of RNAP molecules available for binding to the promoter. Estimating this number in vivo is fraught with difficulty because many RNAPs are engaged in transcription at any given time and, as such, are not available for binding. Fortunately, this problem is avoided when calculating the fold-change for all the cases of interest, as we do in the accompanying paper [1••]. This is because, in these cases, the absence of activators results in a very small pbound value and so P drops out of the problem.

We calculate pbound by considering the distribution of P RNAP on the non-specific sites (NNS), which make up the genome itself, and a single promoter. Then we distinguish two classes of outcomes (shown in Figure 1b): all P RNAP molecules bound non-specifically, or one RNAP bound to the promoter and P−1 RNAP bound non-specifically. Next, we count the number of different ways that these outcomes can be realized. Once these states have been enumerated, we weight each of them according to the Boltzmann law: if ε is the energy of a state, its statistical weight is exp(−ε/kBT). Finally, to compute the probability of promoter occupancy, we construct the ratio of the sum of the weights for the favorable outcome (i.e. promoter occupied) to the sum over all of the weights.

As noted above, this simple model includes two broad classes of microscopic outcomes: (i) those in which all P polymerase molecules are distributed among the non-specific sites, and (ii) those in which the promoter is occupied and the remaining P−1 polymerase molecules are distributed among the non-specific sites. To evaluate the probabilities of these two eventualities we need to know the number of different ways that each outcome can be realized. The statistical question of how many ways there are to distribute P polymerase molecules among NNS non-specific sites on the DNA is a classic problem in combinatorics, and the result is

$$
\frac{N_{NS}!}{P!\,(N_{NS}-P)!}.
$$

The overall statistical weight of these states is based not just on how many of them there are but also on their Boltzmann weights according to

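$$
\underbrace{Z(P)}_{\text{statistical weight, promoter unoccupied}} = \underbrace{\frac{N_{NS}!}{P!\,(N_{NS}-P)!}}_{\text{number of arrangements}} \times \underbrace{e^{-P\varepsilon_{pd}^{NS}/k_{B}T}}_{\text{Boltzmann weight}}, \tag{1}
$$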

where εpdNS is an energy that represents the average binding energy of RNAP to the genomic background. The correct treatment of the genomic background requires explicit consideration of the distribution of binding energies of RNAP, and TFs, to different sites — both specific and non-specific — on the DNA. The question of how to treat this problem more generally than the simpleminded treatment given here can be found in [32,33]. The total statistical weight can now be written as

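$$
\underbrace{Z_{\mathrm{tot}}(P)}_{\text{total statistical weight}} = \underbrace{Z(P)}_{\text{promoter unoccupied}} + \underbrace{Z(P-1)\,e^{-\varepsilon_{pd}^{S}/k_{B}T}}_{\text{RNAP on promoter}}, \tag{2}
$$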

where εpdS is the binding energy for RNAP on the promoter (the S stands for ‘specific’). The states and corresponding weights, normalized by the weight of the promoter-unoccupied states, Z(P), are shown in Figure 1b.
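The counting argument behind Equations 1 and 2 can be checked numerically. The sketch below is illustrative only: the function names are ours, energies are expressed in units of kBT, and the toy values (10 non-specific sites, 3 polymerases and the two binding energies) are arbitrary assumptions chosen so that brute-force enumeration is feasible.

```python
import math
from itertools import combinations

def log_Z(P, n_ns, eps_ns):
    """Logarithm of Equation 1 (energies in units of kBT)."""
    log_arrangements = (math.lgamma(n_ns + 1) - math.lgamma(P + 1)
                        - math.lgamma(n_ns - P + 1))
    return log_arrangements - P * eps_ns

def occupied_over_unoccupied(P, n_ns, eps_s, eps_ns):
    """Z(P-1) exp(-eps_s) / Z(P): weight of the RNAP-on-promoter states,
    normalized by the weight of the promoter-unoccupied states."""
    return math.exp(log_Z(P - 1, n_ns, eps_ns) - eps_s - log_Z(P, n_ns, eps_ns))

# Toy system (assumed values, small enough to enumerate explicitly).
P, n_ns = 3, 10
eps_s, eps_ns = -5.0, -2.0

# Count the arrangements of P polymerases on n_ns sites by brute force.
n_arrangements = sum(1 for _ in combinations(range(n_ns), P))
print(n_arrangements, math.comb(n_ns, P))  # both are 120

# The normalized weight agrees with P/(n_ns - P + 1) * exp(-(eps_s - eps_ns)).
ratio = occupied_over_unoccupied(P, n_ns, eps_s, eps_ns)
closed_form = P / (n_ns - P + 1) * math.exp(-(eps_s - eps_ns))
print(ratio, closed_form)
```

Working with logarithms (via `math.lgamma`) keeps the factorials manageable when the number of non-specific sites is of order 10^6. For P much smaller than the number of sites, the prefactor P/(NNS − P + 1) reduces to P/NNS.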

To find the probability of RNAP being bound to the promoter of interest, we calculate

$$
p_{\mathrm{bound}} = \frac{Z(P-1)\,e^{-\varepsilon_{pd}^{S}/k_{B}T}}{Z_{\mathrm{tot}}(P)}. \tag{3}
$$

Note that the numerator in this case is the statistical weight of all microscopic states in which the promoter is occupied, and the denominator is the statistical weight of all microscopic states. If we now divide top and bottom by Z(P−1) exp(−εpdS/kBT) and use the functional form given in Equation 1, the probability of promoter occupancy is given by the simple form

$$
p_{\mathrm{bound}} = \frac{1}{1 + \dfrac{N_{NS}}{P}\,e^{\Delta\varepsilon_{pd}/k_{B}T}}, \tag{4}
$$

where we have introduced the notation Δεpd = εpdS − εpdNS [34]. To obtain the last equation we made the simplifying assumption that P ≪ NNS. The results computed above can be depicted in graphical form (as shown in Figure 1c) by plotting the probability of promoter occupancy as a function of the number of RNAP molecules for two different promoters. For this particular case we have used several rough estimates, explained in the figure legend, concerning the binding energies of RNAP molecules to specific and non-specific sites on the DNA in a typical bacterial cell. One interesting speculation is that the high probability of RNAP occupancy for the T7 promoter, even in the absence of transcription factors, could be related to the infection mechanism of T7 phage [35]. In contrast, it is also interesting to note the very low probability of occupancy of the lac promoter in this simple model in the absence of activation. We view Equation 4 as characterizing the ‘basal’ transcription rate in this simple model. In light of this result, the key conceptual outcome of the remainder of this review is the idea that the presence of transcription factors (activators and repressors, etc.) has the effect of altering Equation 4 to the simple form

$$
p_{\mathrm{bound}} = \frac{1}{1 + \dfrac{N_{NS}}{P\,F_{\mathrm{reg}}}\,e^{\Delta\varepsilon_{pd}/k_{B}T}}, \tag{5}
$$

where we introduce the regulation factor, Freg. The regulation factor should be seen as describing an effective increase (for Freg > 1) or decrease (for Freg < 1) of the number of RNAP molecules that are available to bind the promoter.
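To make the contrast between the two promoters concrete, the following Python sketch evaluates Equation 5, which reduces to Equation 4 when the regulation factor equals one. The function name `p_bound`, the keyword `f_reg` and the particular polymerase numbers are our illustrative assumptions; the energies and the number of non-specific sites are those of the Figure 1 legend.

```python
import math

N_NS = 5e6  # number of non-specific sites, as in Figure 1

def p_bound(P, delta_eps_pd, f_reg=1.0, n_ns=N_NS):
    """Equation 5; with f_reg = 1 it reduces to Equation 4.
    delta_eps_pd is in units of kBT, and P << n_ns is assumed."""
    return 1.0 / (1.0 + n_ns / (P * f_reg) * math.exp(delta_eps_pd))

for name, delta in (("lac", -2.9), ("T7", -8.1)):
    for P in (100, 1000, 10000):  # illustrative polymerase numbers (assumed)
        print(f"{name:>3}  P={P:>5}  p_bound={p_bound(P, delta):.4f}")

# A regulation factor of 2 acts exactly like doubling the polymerase number.
print(p_bound(1000, -2.9, f_reg=2.0), p_bound(2000, -2.9))
```

With these inputs the lac promoter stays well below 1% occupancy at P = 1000, whereas the T7 promoter is occupied a substantial fraction of the time, in line with the qualitative statement above.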

To illustrate precisely the idea of the regulation factor, we show how activators recruit [3] RNAP to the promoter of interest. The recruitment concept is illustrated in schematic form in Figure 2a, where it is seen that the activator molecule recruits the polymerase through favorable contacts characterized by an adhesive energy, εap. The point of the schematic is to show how the various states of occupancy of the promoter and activator binding site can be assigned Boltzmann weights, which can then be used to compute their probabilities.

Figure 2.

Statistical mechanics of recruitment. (a) Schematic showing the relationship between the various states of the promoter and its regulatory region, and their corresponding weights within the statistical mechanics framework. (b) Fold-change in promoter activity as a function of the number of activated (inducer-bound) CRP molecules, according to Equations 5 and 8, for different values of the adhesive interaction energy between activator and RNAP. As in Figure 1, Δεad = kBT ln(KadS/KadNS), with KadNS = 10000 nM [40] and KadS = 0.02 nM [41]. These in vitro numbers are chosen as a representative example to provide intuition for the action of activators. Applications to in vivo experiments are given in the accompanying paper [1••]. Several different representative values of the adhesive interaction εap that are consistent with measured activation are chosen to illustrate how activation depends upon this parameter.

Once again, the first step in our analysis is to determine the total statistical weight. This is obtained by summing the Boltzmann weights of all of the eventualities associated with the activators and polymerase molecules being distributed on the DNA (both non-specific sites and the promoter). As seen in Figure 2a, there are four classes of outcomes: (i) both the activator site and promoter unoccupied; (ii) just the promoter occupied by polymerase; (iii) just the activator site occupied by activator; and (iv) both of the specific sites occupied. This is represented mathematically as

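$$
Z_{\mathrm{tot}}(P,A) = \underbrace{Z(P,A)}_{\text{empty sites}} + \underbrace{Z(P-1,A)\,e^{-\varepsilon_{pd}^{S}/k_{B}T}}_{\text{RNAP on promoter}} + \underbrace{Z(P,A-1)\,e^{-\varepsilon_{ad}^{S}/k_{B}T}}_{\text{activator on specific site}} + \underbrace{Z(P-1,A-1)\,e^{-(\varepsilon_{pd}^{S}+\varepsilon_{ad}^{S}+\varepsilon_{ap})/k_{B}T}}_{\text{RNAP and activator bound specifically}}, \tag{6}
$$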

where the statistical weight for P polymerase molecules and A activator molecules distributed among NNS non-specific sites is given by

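$$
Z(P,A) = \underbrace{\frac{N_{NS}!}{P!\,A!\,(N_{NS}-P-A)!}}_{\text{number of arrangements}} \times \underbrace{e^{-P\varepsilon_{pd}^{NS}/k_{B}T}\,e^{-A\varepsilon_{ad}^{NS}/k_{B}T}}_{\text{weight of each state}}. \tag{7}
$$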

In Figure 2a the weights of the four states are normalized by the weight of the empty state Z(P,A). In Equation 7 we use the notation εxd to characterize the binding energy of molecule X to DNA, and superscripts S and NS to signify specific or non-specific binding, respectively. Δεxd = εxdS − εxdNS is the difference between the two. For the purposes of this simple model we have assumed that the reservoir for the activator molecules is the genomic DNA, although there is strong evidence that, in the case of the lac operon, many of the activators (cAMP receptor proteins; CRPs) are actually in the cytoplasm [36]. In contrast, as will be seen in the following paper [1••], in our actual applications of thermodynamic models to real operons, the question of whether the reservoir is non-specific DNA or the cytoplasm never arises.

As usual, to compute the probability of interest, we construct the ratio of the sum of weights for all those outcomes that are favorable (i.e. polymerase bound to the promoter) to the sum of weights over the total set of outcomes Ztot(P,A). This results in a value of pbound that adopts precisely the form described in Equation 5. The regulation factor, Freg(A), is given by

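$$
F_{\mathrm{reg}}(A) = \frac{1 + \dfrac{A}{N_{NS}}\,e^{-\Delta\varepsilon_{ad}/k_{B}T}\,e^{-\varepsilon_{ap}/k_{B}T}}{1 + \dfrac{A}{N_{NS}}\,e^{-\Delta\varepsilon_{ad}/k_{B}T}}, \tag{8}
$$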

where we have made the additional assumption that NNS ≫ P, A. Note that if the adhesive interaction between polymerase and activator goes to zero, the regulation factor itself goes to unity. Furthermore, for negative values of this adhesive interaction (i.e. activator and polymerase like to be near each other) the regulation factor is greater than one, which translates into an apparent increase in the number of polymerase molecules available for binding to the promoter. This claim can be seen more clearly if we define the fold-change in promoter activity as the ratio of the probability that RNAP is bound in the presence of transcription factors to the probability that it is bound in the absence of transcription factors: fold-change = pbound(P, A)/pbound(P, A = 0). The fold-change is plotted in Figure 2b for typical values of the adhesive interaction εap and the other binding parameters, for the simple model in which the reservoir for CRP is assumed to be non-specific DNA.
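The behaviour of Equation 8 and of the fold-change can be explored with a few lines of Python. This is a sketch under stated assumptions rather than a reproduction of Figure 2b: the function names are ours, the adhesive energy of −3 kBT and the polymerase number of 1000 are hypothetical values picked for illustration, and the dissociation constants are the in vitro numbers quoted in the legends of Figures 1 and 2.

```python
import math

N_NS = 5e6  # number of non-specific sites, as in Figure 1

def f_reg_activator(A, delta_eps_ad, eps_ap, n_ns=N_NS):
    """Equation 8 (energies in units of kBT; n_ns >> A is assumed)."""
    x = (A / n_ns) * math.exp(-delta_eps_ad)
    return (1.0 + x * math.exp(-eps_ap)) / (1.0 + x)

def p_bound(P, delta_eps_pd, f_reg=1.0, n_ns=N_NS):
    """Equation 5 (reduces to Equation 4 for f_reg = 1)."""
    return 1.0 / (1.0 + n_ns / (P * f_reg) * math.exp(delta_eps_pd))

def fold_change(P, A, delta_eps_pd, delta_eps_ad, eps_ap):
    """p_bound(P, A) / p_bound(P, A = 0)."""
    f = f_reg_activator(A, delta_eps_ad, eps_ap)
    return p_bound(P, delta_eps_pd, f) / p_bound(P, delta_eps_pd)

DELTA_EPS_PD = math.log(550.0 / 10000.0)  # lac promoter (Figure 1 legend)
DELTA_EPS_AD = math.log(0.02 / 10000.0)   # CRP (Figure 2 legend)
EPS_AP = -3.0  # hypothetical adhesive energy, in kBT
P = 1000       # hypothetical effective number of RNAP molecules

for A in (0, 1, 10, 100, 1000):
    f = f_reg_activator(A, DELTA_EPS_AD, EPS_AP)
    fc = fold_change(P, A, DELTA_EPS_PD, DELTA_EPS_AD, EPS_AP)
    print(f"A={A:>4}  F_reg={f:7.3f}  fold-change={fc:7.3f}")
```

Two limits are worth noting. The regulation factor saturates at exp(−εap/kBT) once the activator site is almost always occupied, and for a weak promoter such as lac the fold-change stays close to, but slightly below, the regulation factor.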

Similar arguments can be made for the action of repressor molecules. Consider repression by R repressor molecules that can bind to an operator (with energy εrdS) that overlaps with the promoter. By enumerating the different states with their associated weights in a way similar to that used in Figure 2a and noting that the state where both the repressor and RNAP bind to their sites is not allowed, we can again derive the form for promoter occupation, Equation 5, but this time with the regulation factor,

$$
F_{\mathrm{reg}}(R) = \frac{1}{1 + \dfrac{R}{N_{NS}}\,e^{-\Delta\varepsilon_{rd}/k_{B}T}}. \tag{9}
$$
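The state enumeration that leads to Equation 9 is short enough to verify directly. In the Python sketch below, the three allowed states (empty, RNAP on the promoter, repressor on the operator) are given weights normalized by the empty state, and the resulting occupancy is compared with Equation 5 evaluated with the regulation factor of Equation 9. The function names, the repressor number, the polymerase number and the operator binding energy are all hypothetical choices made for illustration.

```python
import math

N_NS = 5e6  # number of non-specific sites, as in Figure 1

def f_reg_repressor(R, delta_eps_rd, n_ns=N_NS):
    """Equation 9 (energies in units of kBT; n_ns >> R is assumed)."""
    return 1.0 / (1.0 + (R / n_ns) * math.exp(-delta_eps_rd))

def p_bound_from_states(P, R, delta_eps_pd, delta_eps_rd, n_ns=N_NS):
    """Occupancy from the three allowed states; the state with both RNAP
    and repressor bound is excluded because the sites overlap."""
    w_empty = 1.0
    w_rnap = (P / n_ns) * math.exp(-delta_eps_pd)
    w_repressor = (R / n_ns) * math.exp(-delta_eps_rd)
    return w_rnap / (w_empty + w_rnap + w_repressor)

def p_bound_eq5(P, delta_eps_pd, f_reg, n_ns=N_NS):
    """Equation 5."""
    return 1.0 / (1.0 + n_ns / (P * f_reg) * math.exp(delta_eps_pd))

P, R = 1000, 10        # hypothetical molecule numbers
DELTA_EPS_PD = -2.9    # lac promoter (Figure 1 legend), in kBT
DELTA_EPS_RD = -15.0   # hypothetical operator binding energy, in kBT

f = f_reg_repressor(R, DELTA_EPS_RD)
print(f"F_reg = {f:.3f}")
print(p_bound_from_states(P, R, DELTA_EPS_PD, DELTA_EPS_RD),
      p_bound_eq5(P, DELTA_EPS_PD, f))
```

The two printed occupancies coincide, which is the sense in which repression can be absorbed into an effective reduction of the number of available polymerases.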

The above scheme can be extended further to describe co-regulation by two or more activators and/or repressors. For example, in the case of activation considered above, if the binding of the activator to its operator site is assisted itself by a helper protein, which might bind to an adjacent site [1••], then the regulation factor still has the form given in Equation 8 but with the number of activators, A, replaced by an effective number of activators

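$$
A' = A\,\frac{1 + \dfrac{H}{N_{NS}}\,e^{-\Delta\varepsilon_{hd}/k_{B}T}\,e^{-\varepsilon_{ha}/k_{B}T}}{1 + \dfrac{H}{N_{NS}}\,e^{-\Delta\varepsilon_{hd}/k_{B}T}}. \tag{10}
$$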

Note that the multiplicative factor in Equation 10 has the same form as in Equation 8 except that now the number of helper molecules, H, appears in the expression, and the interaction energy εha refers to that between the helper molecules and activators. In fact, this is the generic expression describing the recruitment of one DNA-binding protein by another, and it is not limited to activator–RNAP recruitment.
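Because Equations 8 and 10 share the same functional form, a single helper function can serve both. The Python sketch below illustrates this composition; the name `recruitment_factor`, the helper-protein parameters and the molecule numbers are hypothetical and are not taken from any measurement, while the CRP energy difference uses the constants from the Figure 2 legend.

```python
import math

N_NS = 5e6  # number of non-specific sites, as in Figure 1

def recruitment_factor(n, delta_eps_d, eps_int, n_ns=N_NS):
    """Generic recruitment expression shared by Equations 8 and 10.
    n: number of recruiting molecules; delta_eps_d: their specific minus
    non-specific DNA binding energy; eps_int: protein-protein interaction
    energy (all energies in units of kBT)."""
    x = (n / n_ns) * math.exp(-delta_eps_d)
    return (1.0 + x * math.exp(-eps_int)) / (1.0 + x)

def effective_activators(A, H, delta_eps_hd, eps_ha):
    """Equation 10: effective number of activators given H helpers."""
    return A * recruitment_factor(H, delta_eps_hd, eps_ha)

def f_reg_with_helper(A, H, delta_eps_ad, eps_ap, delta_eps_hd, eps_ha):
    """Equation 8 with A replaced by the effective number from Equation 10."""
    a_eff = effective_activators(A, H, delta_eps_hd, eps_ha)
    return recruitment_factor(a_eff, delta_eps_ad, eps_ap)

DELTA_EPS_AD = math.log(0.02 / 10000.0)  # CRP (Figure 2 legend)
EPS_AP = -3.0                            # hypothetical, in kBT
DELTA_EPS_HD, EPS_HA = -12.0, -2.0       # hypothetical helper parameters
A = 1                                    # hypothetical activator number

for H in (0, 10, 100):
    a_eff = effective_activators(A, H, DELTA_EPS_HD, EPS_HA)
    f = f_reg_with_helper(A, H, DELTA_EPS_AD, EPS_AP, DELTA_EPS_HD, EPS_HA)
    print(f"H={H:>3}  A'={a_eff:6.3f}  F_reg={f:6.3f}")
```

With no helper molecules, or with a vanishing helper–activator interaction, the effective number of activators equals A and Equation 8 is recovered unchanged.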

The introduction of the regulation factor enables a discussion of various regulatory motifs in a unified way, as made explicit by Table 1. These examples will be discussed in the context of particular bacterial gene-regulatory systems in the ensuing paper. The main point captured by this table is that the conceptual picture of thermodynamic models is identical regardless of regulatory motif and involves summing all of the relevant states. It culminates in the regulation factor which, as will be shown in the companion article [1••], is equal to the measurable fold-change of promoter activity.

Table 1.

Regulation factors for several different regulatory motifs.

As a final example, we consider the way in which DNA looping can play a role in dictating the regulation factor. Indeed, recent work by Vilar and Leibler [31••] and Vilar and Saiz [42••] and others [25,43] has shown how the thermodynamic models can be applied to regulatory control by looping. In the accompanying paper [1••], we apply these ideas to the particular question of how such regulation depends upon the distance between the two binding sites, but content ourselves here with a discussion of the conceptual basis. Two distinct looping scenarios are shown in Figure 3. In case (a), a repressor molecule, which can bind to two distinct regions on the DNA, loops out the intervening region. The classic example of this mode of action is the Lac repressor. In case (b), one protein, such as CRP, favorably bends the DNA so that a second activator can contact RNAP, although paying a lower free energy cost than it would if it were acting alone. In both cases, the free energy cost associated with making a DNA loop is outweighed by the benefit of additional binding energy between the repressor and DNA [case (a)] and between the activator and RNAP [case (b)].

Figure 3.

DNA bending in transcription regulation. (a) DNA looping enables Lac repressor to bind to the main and the auxiliary operators simultaneously, thereby increasing the weight of the states in which the promoter is unoccupied. This leads to stronger repression than in the single operator case. (b) DNA bending by the activator leads to cooperative binding of the two activators because the free energy cost of bending is paid only once. This leads to a boost in activation above that provided by independent binding of the two activators [45].

In summary, the statistical mechanical framework described here can be used to consider several different regulatory motifs [11,26,30••,32,33,44], as showcased in Table 1. In each of the cases considered in the table, the probability of promoter occupancy is given by Equation 5, with the sole change from one case to the next being the form adopted by the regulation factor itself.

Conclusions and future prospects

We argue that as a result of the increasingly quantitative character of data on gene expression there is a corresponding need for predictive models. We have reviewed a series of general arguments about the way in which batteries of transcription factors work in generic ways to mediate transcriptional regulation. The models described here result in several important classes of predictions. The application of these ideas to particular bacterial scenarios forms the substance of the second article [1••].

Though ideas like those presented here have the potential to serve as a quantitative framework for thinking about transcriptional regulation, there are several outstanding issues. Some especially troubling features of these models are: (i) what are the precise conditions under which equilibrium assumptions are acceptable? (ii) When can the probability of RNAP binding at a promoter serve as a surrogate for gene expression itself? (iii) What is the role of fluctuations? (iv) These models pretend that the basal transcription apparatus is a single molecule that interacts with transcription factors, whereas the transcription apparatus is a complex that is itself probably subject to recruitment for its assembly. Despite these concerns, our view is that thermodynamic models have long demonstrated their utility and it will be of great interest to carefully explore their consequences experimentally. Case studies using the thermodynamic models are reviewed in the accompanying paper [1••].

Acknowledgments

We are grateful to several people for explaining their work and that of others to us, including Michael Welte, Jon Widom, Mark Ptashne, Phil Nelson, Jeff Gelles, Ann Hochschild, Mitch Lewis, Bob Schleif, Michael Elowitz, Paul Wiggins, Mandar Inamdar, Scott Fraser, Richard Ebright, Eric Davidson and Titus Brown. Of course, any errors in interpretation are our own. We are also thankful to Nigel Orme for his extensive contributions to the figures in this paper. We gratefully acknowledge the support of the NIH Director’s Pioneer Award (RP), NSF through a NIRT award (RP), DMR9984471 (JK) and DMR0403997 (JK). JK is a Cottrell Scholar of Research Corporation. UG acknowledges an ‘Emmy Noether’ research grant from the DFG. TH is grateful for financial support by the NSF through grants 0211308, 0216576 and 0225630.

Glossary

Boltzmann factor

For a given state of a thermal system, the Boltzmann factor is the exponential of minus its energy, measured in units of kBT. The ratio of equilibrium probabilities for any two states is given by the ratio of their Boltzmann factors

Partition function

The sum of the Boltzmann factors for all the states available to a thermal system. The equilibrium probability of observing a state of the system is its Boltzmann factor divided by the partition function

Regulation factor

The effective change of the number of RNA polymerases available for binding to the promoter, resulting from the action of transcription factors. The regulation factor is a function of transcription factor concentrations, operator distances, protein–DNA and protein–protein interactions. It is smaller than one for repression, and larger than one for activation

Fold-change

The ratio of gene expression (e.g. transcription rate) in the presence and absence of transcription factors. Within the thermodynamic model, this fold-change is given by the ratio of occupation probability of the promoter of interest by the RNA polymerase holoenzyme, in the presence and absence of transcription factors. For weak promoters that control the transcription of typical bacterial genes, the fold-change in gene expression is given approximately by the regulation factor

Papers of particular interest, published within the annual period of review, have been highlighted as:

• of special interest

•• of outstanding interest