From DNA Sequence to Transcriptional Behavior: A Quantitative Approach

Author manuscript; available in PMC: 2009 Aug 3.

Published in final edited form as: Nat Rev Genet. 2009 Jul;10(7):443–456. doi: 10.1038/nrg2591

Abstract

Complex transcriptional behaviors are encoded in the DNA sequences of gene regulatory regions. Advances in our understanding of these behaviors have been gained recently by quantitative models that describe how molecules such as transcription factors and nucleosomes interact with the genomic sequence. An emerging view is that every regulatory sequence is associated with a unique binding affinity landscape for each molecule and, consequently, with a unique set of molecule binding configurations and transcriptional outputs. We present a quantitative framework based on existing methods that unifies these ideas. This framework explains many experimental observations regarding the binding patterns of factors and nucleosomes, and the dynamics of transcriptional activation. It can also be used to model more complex phenomena such as transcriptional noise and the evolution of transcriptional regulation.


Many cellular and organismal processes depend on the establishment of complex patterns of gene expression at precise times and spatial locations, with inaccuracies in carrying out such transcriptional programs often being deleterious and leading to disease. The information for directing such expression patterns is encoded in regulatory DNA sequences — for example, reporter genes attached directly to such regulatory sequences adopt the expression pattern of the endogenous gene1,2,3, and DNA binding and gene expression patterns of an entire human chromosome are essentially unchanged in mice that carry this human chromosome4.

Given the centrality of transcriptional programs to many biological processes, a predictive and quantitative understanding of the transcriptional behaviors encoded by DNA sequences is highly desirable. Such an understanding would allow us to go beyond merely identifying the transcription factors and regulatory DNA elements that are involved, and replace the existing qualitative and phenomenological descriptions by a mechanistic view of the process that integrates the involved components into physically realistic mechanistic models. Indeed, our ability to quantitatively predict the behavior of a regulatory system is a useful objective measure of the extent to which we truly understand how the system works. At a more practical level, the ability to accurately predict transcriptional behaviors from DNA sequences should allow us to predict the effect that sequence variation among individuals in the population has on gene expression and thus on more complex phenotypes and disease; and it would allow for improved rational design of transgenes for biotechnology and gene therapy.

Recent work has significantly advanced our understanding of how genomic sequences are translated into transcriptional outputs. Progress has been made possible by the availability of vast amounts of data on gene regulation, through the development of quantitative models that explain how molecules such as transcription factors5,6,7 and nucleosomes8,9 bind DNA sequences, and how these binding events give rise to expression patterns10,11. In this review, we unify these studies into a single conceptual framework, based on existing methods, that quantitatively models the process of transcriptional regulation. The framework is founded on the idea that transcriptional regulation can be explained by an ‘equilibrium competition’ between nucleosomes and other DNA binding proteins; the details of this competition are specified by every regulatory DNA sequence, through the unique binding affinity ‘landscape’ that every sequence defines for each molecule. Each transcription factor or nucleosome ‘views’ every regulatory sequence in a unique way, depending on its recognition specificity; at any given set of concentrations of the DNA-binding molecules, the range of affinities that the molecules have for any sequence (the binding affinity ‘landscape’) dictates their particular cooperative and competitive binding interactions. This leads to a distinct distribution of molecule BINDING CONFIGURATIONS on that sequence, and consequently to a distinct transcriptional behavior (Fig. 1).

Figure 1. Overview of quantitative models for computing expression from DNA sequences.

Flow diagram of the computational approach, for a simplified regulatory sequence, with nucleosomes and one transcription factor as the input binding molecules. Each of the input molecules has intrinsic binding affinities for every possible sequence of length k (top panels, left and right), where k is the number of basepairs recognized by the binding molecule. These intrinsic molecule affinities dictate how every DNA sequence is ‘translated’ into a unique binding affinity landscape for each molecule along the sequence (top panel, centre). For each factor concentration (bottom panel, left), the model uses these binding affinity landscapes to compute a probability distribution over configurations of bound molecules (see Box 1 for details); a small subset of these configurations is illustrated (bottom panel, centre). Configurations in which two bound molecules overlap are not allowed due to steric hindrance constraints, thereby modeling binding competition between molecules (see bottom-most configuration with probability 0). Finally, each configuration results in a particular transcriptional output (bottom panel, right); the final expression is then the sum of the expression contributions of all configurations, each weighted by its probability.

Since transcriptional regulation across organisms utilizes the same types of molecules, which interact according to the universal laws of physical chemistry, the basic rules of this framework apply broadly. Indeed, different aspects of the approach presented here were demonstrated in organisms from bacteria11,12,13,14 and yeast7,8,15,16,17 to flies10 and mammals18,19.

We start by reviewing the significant progress made to understand the intrinsic affinities of various molecules to DNA. This has been achieved through experiments that directly measure the binding affinity landscapes of different types of molecule, and through computational models that identify the sequence rules that underlie and predict these affinity landscapes across several organisms. We then present different models that aim to connect these affinity landscapes to molecule binding configurations and transcriptional outputs. Substantially less is known about the mapping into transcriptional outputs, and we thus highlight where critical information is needed. We show that the framework presented here explains a broad range of experimental observations related to transcriptional regulation, including the binding patterns of transcription factors and nucleosomes, and the dynamics of transcriptional activation. We end by discussing how these models can be used to understand more complex features of transcriptional regulation such as TRANSCRIPTIONAL NOISE and expression divergence across evolution. The models presented here thus provide a concrete framework for understanding transcriptional behavior from DNA sequence.

Binding affinity landscapes

Nucleosomes

Measurements of the sequence-dependent affinity of NUCLEOSOMES20 carried out over the past two decades show that nucleosome affinities for different DNA sequences can vary by at least 5,000-fold21,22. These differences are thought to reflect the energetic cost of sharply bending each DNA sequence around the histone octamer to conform to the nucleosome structure20.

Since directly measuring nucleosome affinity for all possible sequences of length 147 is not feasible, approaches for comprehensively characterizing nucleosome affinities are based on computational models that generalize from a manageable number of nucleosome affinity measurements. Early characterizations of nucleosome sequence preferences identified ~10 bp periodicities of specific dinucleotides along the nucleosome length, first observed in alignments of ~200 in vivo nucleosome sequences from chicken23, and then again in similarly-sized collections from yeast8,24, worm25, fly26, and human26. These dinucleotide periodicities formed the basis of earlier models for predicting the binding affinity landscape of nucleosomes8,9. More recently, the availability of genome-wide measurements of nucleosome occupancy allowed researchers to identify longer sequence motifs that are generally favored or disfavored by nucleosomes, regardless of their position within the nucleosome length. Incorporating such signals into models that describe nucleosome sequence preferences has resulted in significantly improved predictions26,27,28,29.

All of these models were based on measurements of in vivo nucleosomes, the positions of which are determined by multiple factors, including transcription factors30, CHROMATIN REMODELERS31, transcription32, DNA replication33,34, and the sequence preferences of the nucleosomes themselves8,9,15,23,27,28. A general question has therefore been the extent to which these models truly represent nucleosome sequence preferences, as opposed to also capturing the sequence preferences of other factors. A recent study addressed this issue by measuring the genome-wide occupancy of nucleosomes assembled on purified yeast genomic DNA15. The resulting map, in which the positions of nucleosomes are governed only by the intrinsic sequence preferences of nucleosomes, provides a direct experimental measurement of the binding affinity landscape of nucleosomes. A computational model constructed from these data predicted the experimentally measured affinity landscape with a high per-basepair correlation of 0.89, and thus allows us to predict binding affinity landscapes of nucleosomes from DNA sequence alone. Moreover, the nucleosome organization predicted by this model is highly predictive of in vivo nucleosome organization in both yeast and worm, demonstrating that nucleosome sequence preferences are a dominant determinant of in vivo nucleosome organizations15.

Transcription factors

Compared to nucleosomes, transcription factors recognize and bind much shorter stretches of DNA (typically 5–15 basepairs in length). It should therefore be theoretically feasible to measure directly the binding affinity of a given factor to most, if not all, possible recognition sequences. Earlier methods based on FOOTPRINTING, GEL-SHIFT ANALYSIS, SOUTHWESTERN BLOTTING, SELEX, and reporter constructs, could only measure the affinity of factors to a relatively small number of sequences. Significant progress was recently achieved with the use of high-throughput technologies such as CHIP-CHIP4 and CHIP-SEQ35, which essentially measure all of the in vivo bound targets of a given factor. However, the genomic regions identified by these methods are typically hundreds of basepairs, and so identifying the much shorter sequence motifs that are common to the bound regions still requires post-processing computation36.

Another limitation of using in vivo data is that, as in the nucleosome case, the derived binding specificities may also reflect the specificities of other factors. Here too, high-throughput in vitro methods such as PROTEIN BINDING MICROARRAYS37,38 and MICROFLUIDIC PLATFORMS39 are being used, in which binding is measured across all possible ~8–10 basepair sequences and is governed only by the intrinsic sequence preferences of a factor. Protein binding microarrays were recently applied to derive such binding specificities for 168 homeodomain transcription factors from mouse40 and for 112 transcription factors from yeast41,42. Although transforming the resulting microarray intensities into binding affinities is not trivial, the binding affinity of many factors for any location on any DNA sequence can now be characterized with high accuracy.
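To make the idea of a binding affinity landscape concrete, the following minimal sketch slides a window of length k along a sequence and looks up each k-mer in an affinity table. The function name, the toy 3-mer table, and the default value assigned to unlisted k-mers are all hypothetical and chosen only for illustration; real tables would come from measurements such as protein binding microarrays, and a realistic scan would also consider the reverse strand.

```python
def affinity_landscape(sequence, kmer_affinity, k, default=0.0):
    """Return one affinity value per start position of a k-bp window.

    kmer_affinity maps k-mers to intrinsic affinities (arbitrary units);
    k-mers missing from the table receive `default`.
    Only the given strand is scanned (a simplifying assumption).
    """
    sequence = sequence.upper()
    return [kmer_affinity.get(sequence[p:p + k], default)
            for p in range(len(sequence) - k + 1)]


# Hypothetical 3-mer affinities, for illustration only.
TOY_AFFINITY = {"TGA": 10.0, "TGC": 2.0}
LANDSCAPE = affinity_landscape("ATGACTGCA", TOY_AFFINITY, k=3, default=0.1)
for position, value in enumerate(LANDSCAPE):
    print(position, value)
```

Each position in the output corresponds to one possible placement of the molecule, which is the form of input that the configuration models discussed later require.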

Comparing affinity landscapes: nucleosomes versus transcription factors

The intrinsic nucleosome affinity landscape explains many aspects of the binding patterns of nucleosomes in vivo15. In contrast, the sequence specificity of many transcription factors is low compared to the size of the genome on which they operate, such that canonical binding sites for such factors will occur by mere happenstance countless times across the genome. For example, a factor whose total binding specificity is five basepairs will probably have over one million canonical binding sites in the human genome, yet the number of factor molecules present in the cell may typically be only one tenth to one thousandth that number. Thus, although a minimal level of binding affinity is a prerequisite for factor binding, most binding sites that meet such criteria occur by happenstance and are not bound in vivo; and consequently, the binding affinity landscape of most factors is a poor predictor of their in vivo bound locations. Nevertheless, as we discuss below, knowing the binding specificities of transcription factors together with other information, especially clustering of transcription factor binding sites and the relation of these binding sites to the nucleosome landscape, allows us to integrate factor specificities together with nucleosomes into highly predictive models for transcriptional regulation.
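The estimate of chance occurrences can be checked with a back-of-the-envelope calculation. The sketch below assumes a fully specified site, equal and independent base frequencies, and a round haploid genome size of 3 × 10^9 bp; these are simplifying assumptions made here for illustration, not values taken from the text.

```python
def expected_chance_sites(site_length, genome_size, both_strands=True):
    """Expected number of exact matches to one fixed site of given length,
    assuming each base occurs independently with probability 1/4."""
    per_position = 0.25 ** site_length
    strands = 2 if both_strands else 1
    return strands * (genome_size - site_length + 1) * per_position


HUMAN_GENOME_BP = 3.0e9  # assumed round figure for illustration

ONE_STRAND = expected_chance_sites(5, HUMAN_GENOME_BP, both_strands=False)
BOTH_STRANDS = expected_chance_sites(5, HUMAN_GENOME_BP)
print(f"one strand:   {ONE_STRAND:.3g}")
print(f"both strands: {BOTH_STRANDS:.3g}")
```

Under these assumptions a five-basepair site is expected a few million times, consistent with the statement that such a factor has over one million canonical sites in the human genome.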

From binding affinity landscapes to binding configurations

Binding affinity landscapes describe how each molecule ‘translates’ an input DNA sequence into a binding potential that is specific to that molecule. The next step in decoding the transcriptional behavior of a regulatory sequence is to understand the configurations of molecules that are actually bound to the sequence. Several quantitative frameworks5,10,43 addressed exactly this problem. These models consider all possible configurations of molecules on the input sequence. They then associate a statistical weight with each such configuration, which is computed from the concentrations of the participating molecules and the strengths (affinities) of the binding sites that they occupy in the configuration. The probability of each configuration can then be computed exactly, by dividing its statistical weight by the partition function, which equals the sum of the statistical weights of all possible configurations (Box 1).

Box 1. Computing gene expression from DNA sequence.

Several models were proposed10,11,16 for translating DNA sequences into transcriptional behavior. These models are all based on an assumption of thermodynamic equilibrium (Box 2). They use the intrinsic equilibrium affinities and concentrations of the various DNA binding molecules (activators, repressors, etc.) to compute the probability of RNA polymerase occupancy, and then assume that the expression level is proportional to polymerase occupancy.

The computation is divided into two steps: one that computes the occupancy distribution of molecules on the target DNA sequence, and another that translates this occupancy distribution into a level of expression. Denoting each possible configuration of molecules on the DNA by c, and the probability of polymerase binding by P(E), the overall probability of polymerase binding is the sum, over all ‘LEGAL’ CONFIGURATIONS, of the probability of polymerase binding given a particular configuration c, P(E|c), weighted by the probability of the configuration itself, P(c):

$$P(E) = \sum_{c \in C} P(E \mid c)\, P(c)$$

Step 1: Occupancy distribution of molecules on target DNA

Under the equilibrium assumption, the probability of a configuration is:

$$P(c) = \frac{W(c)}{\sum_{c' \in C} W(c')},$$

where W(c) represents the statistical weight of c. The simplest model assumes that molecules bind independently, and so the statistical weight of a configuration is the product of the contribution of each molecule bound in the configuration. In turn, a molecule’s contribution is determined by its concentration, and by its affinity to the sequence at the bound position. Thus, for a configuration with k molecules $m_1, \dots, m_k$ bound at positions $p_1, \dots, p_k$, the statistical weight W(c) of the configuration is:

$$W(c) = F_0 \prod_{i=1}^{k} \tau(m_i)\, F(m_i, p_i, p_i + L_i),$$

where $\tau(m_i)$ is the concentration of the molecule bound at position $p_i$, $F_0$ is the statistical weight of the empty configuration, and $F(m_i, p_i, p_i + L_i)$ is the energetic contribution from the binding of molecule $m_i$ from position $p_i$ to position $p_i + L_i$, with $L_i$ being the binding site length of molecule $m_i$. The normalizing term $\sum_{c' \in C} W(c')$, also known as the partition function, sums over all possible legal configurations of molecules. The Box Figure illustrates the computation of W(c) for one toy configuration.
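Step 1 can be illustrated by brute-force enumeration on a very short toy sequence. The sketch below is a minimal illustration under stated assumptions: the molecule names, lengths, concentrations, and per-position affinities are hypothetical, molecules bind independently, and overlapping molecules are forbidden. Exhaustive enumeration grows exponentially with sequence length, so it serves only to show the definition of W(c) and P(c), not as a practical method.

```python
def enumerate_configurations(seq_len, molecules, start=0):
    """Yield every legal (non-overlapping) configuration as a tuple of
    (molecule_name, start_position) pairs, ordered by position."""
    yield ()
    for p in range(start, seq_len):
        for name, (length, _tau, _affinity) in molecules.items():
            if p + length <= seq_len:
                # Anything placed after this molecule must start at or
                # beyond its end, which enforces steric exclusion.
                for rest in enumerate_configurations(seq_len, molecules,
                                                     p + length):
                    yield ((name, p),) + rest


def statistical_weight(config, molecules, f0=1.0):
    """W(c) = F0 * product over bound molecules of tau(m) * F(m, p)."""
    weight = f0
    for name, p in config:
        _length, tau, affinity = molecules[name]
        weight *= tau * affinity[p]
    return weight


def configuration_probabilities(seq_len, molecules, f0=1.0):
    """Return {configuration: P(c)} with P(c) = W(c) / partition function."""
    configs = list(enumerate_configurations(seq_len, molecules))
    weights = [statistical_weight(c, molecules, f0) for c in configs]
    partition = sum(weights)
    return {c: w / partition for c, w in zip(configs, weights)}


# Hypothetical toy system: name -> (site length, concentration, affinities
# indexed by start position). All numbers are illustrative only.
SEQ_LEN = 8
MOLECULES = {
    "TF": (2, 0.5, [1.0, 1.0, 8.0, 1.0, 1.0, 1.0, 1.0, 0.0]),
    "NUC": (5, 1.0, [2.0, 2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]),
}
PROBS = configuration_probabilities(SEQ_LEN, MOLECULES)
BEST = max(PROBS, key=PROBS.get)
print(len(PROBS), "legal configurations; most probable:", BEST)
```

For sequences of realistic length, the same probabilities are usually obtained with dynamic programming over positions rather than explicit enumeration.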

Step 2: From occupancy distribution to expression level

The second model component, P(E|c), which translates configurations into expression levels, is much less understood. One simple model10 assumes that each factor bound in the configuration contributes independently to the expression outcome, with activators contributing positively and repressors contributing negatively. This model uses the logistic function to translate these contributions into expression. From a mechanistic perspective, this expression component represents the total attractive force that acts to recruit the polymerase to the sequence. Thus, for a configuration with k molecules $m_1, \dots, m_k$ bound at positions $p_1, \dots, p_k$, the probability of expression is:

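$$P(E \mid c) = \operatorname{logistic}\Big(w_0 + \sum_{i=1}^{k} w_{m_i}\Big) = \frac{1}{1 + \exp\big(-(w_0 + \sum_{i=1}^{k} w_{m_i})\big)},$$

where $w_0$ represents the basal expression level, and $w_{m_i}$ represents the expression contribution of the bound transcription factor $m_i$.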

The models above compute the expression of one sequence at one particular set of binding-molecule concentrations. To compute the expression pattern of a sequence across a spatial or temporal axis, along which molecule concentrations change, these models are applied separately to every point along the axis, and the expression values at these points are then combined to produce the entire expression pattern along the axis.
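The two steps can be combined in a minimal sketch for the simplest possible case: a sequence with a single activator site, so that only two configurations exist (empty and bound). The parameter values for the affinity, the basal term, and the activator contribution are hypothetical, and the weight of the empty configuration is set to 1. Applying the function at each concentration along an axis yields an expression profile.

```python
import math


def logistic(x):
    """Logistic function used to map summed contributions to P(E | c)."""
    return 1.0 / (1.0 + math.exp(-x))


def expression(tau, affinity, w0, w_act):
    """P(E) for one activator site: sum over the two configurations of
    P(E | c) * P(c), with the empty configuration given weight 1."""
    partition = 1.0 + tau * affinity
    p_empty = 1.0 / partition
    p_bound = tau * affinity / partition
    return p_empty * logistic(w0) + p_bound * logistic(w0 + w_act)


# Hypothetical activator concentrations along a spatial or temporal axis.
CONCENTRATIONS = [0.01, 0.1, 1.0, 10.0, 100.0]
PROFILE = [expression(t, affinity=1.0, w0=-3.0, w_act=6.0)
           for t in CONCENTRATIONS]
for tau, level in zip(CONCENTRATIONS, PROFILE):
    print(f"tau={tau:g}  P(E)={level:.3f}")
```

With more sites or more molecule types, only the enumeration of configurations changes; the weighting of P(E|c) by P(c) stays the same.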

Such frameworks model several important aspects of the binding process. First, by allowing molecules to bind anywhere along the input sequence, the entire affinity range is considered, thereby allowing contributions from both strong and weak binding sites16,44. Second, the binding sites of any two molecules are not allowed to overlap in the same configuration, and thus binding competition between molecules, resulting from steric hindrance constraints, is modeled explicitly. Third, conventional cooperative binding interactions can be modeled explicitly, by assigning higher statistical weights to configurations in which two molecules are bound in close proximity10. Finally, a novel cooperativity that arises between factors when both nucleosomes and factors are integrated10,45 is captured automatically.

A noteworthy consequence of these frameworks is that in a system comprising both factors and nucleosomes, the locations at which nucleosomes intrinsically “want” to bind influences the locations at which factors will be bound; and conversely, the constellation of factors trying to bind influences where the nucleosomes will be bound, and with what occupancies. The DNA sequence defines the outcome of this competition, which will change in response to changing constellations of active transcription factors that are induced, for example, by cellular signaling during development or in response to a change in the environment.

Applications and limitations

The models above were used to identify regulatory sequences in fly5 and transcription-factor target interactions in human19, and for predicting, from regulatory sequences, expression patterns in the fly embryo10 and in yeast16. For example, in the segmentation gene network of the fly embryo, the model predicted that cooperative interactions and contributions from both strong and weak sites are important for generating the expression patterns10. These predictions were later supported by large-scale measurements of transcription factor binding in the fly embryo, which revealed prevalent binding to weak sites46. Despite these successes, several aspects of the modeling framework are not well understood, such as the effect of higher-order chromatin structure, the exact way in which factors compete with each other and with nucleosomes, and the actual mechanism and quantitative magnitude of the resulting cooperative interactions.

The assumption of binding equilibrium

The models above assume that molecules bind at thermodynamic equilibrium, such that the probability of any DNA binding configuration is simply its equilibrium probability, equal to the statistical weight of the configuration divided by the partition function. The success of quantitative modeling and prediction of gene regulation in prokaryotes13,14 has rested largely on this equilibrium hypothesis11,47. However, the question of how and even whether regulatory systems truly equilibrate remains unclear. The equilibrium question is an especially daunting problem in eukaryotes, due to added complexities such as nucleosomes; yet, models based on an assumption of equilibrium competition also have high predictive value in eukaryotes10,16, where ATP-dependent nucleosome remodeling mechanisms may contribute to facilitating or subverting such an equilibrium (Box 2).

Box 2. The equilibrium assumption and the role of nucleosome remodelers.

The models we review assume that molecules bind at thermodynamic equilibrium. Although models based on this assumption are highly predictive of gene regulation in both prokaryotes13,14 and eukaryotes10,16, its validity has not been demonstrated. This assumption is particularly challenging to justify (or imagine) in eukaryotes, due to the presence of nucleosomes and many ATP-dependent nucleosome remodeling factors. In vitro, these nucleosome remodeling factors can control the spacing between nucleosomes on long DNA88 and drive nucleosomes to disfavored locations on shorter DNA fragments, e.g., from favored positioning sequences to DNA ends89. In vivo, inactivation of the ATP-dependent chromatin remodeling complex Isw2 leads to an average shift of ~15bp in the location of nucleosomes relative to wild-type cells at some loci, suggesting that Isw2 might determine the positions of nucleosomes at these loci90. In principle, such remodeling activities could invalidate the equilibrium assumption on which the models discussed here are based, and require instead detailed and unique kinetic models for every sequence.

However, other evidence suggests that the equilibrium hypothesis may be a good approximation in vivo. The high similarity of in vivo nucleosome organizations to that obtained in a purified in vitro reconstitution system15, which is believed to achieve and then freeze in a true thermodynamic equilibrium22, shows directly that much of the in vivo nucleosome organization is closely similar to an equilibrium distribution. Moreover, even when remodelers drive nucleosomes to different locations on short DNA fragments in vitro, the positions adopted by nucleosomes remain the same as those favored intrinsically by the nucleosomes, and only the degree to which nucleosomes occupy those different favored positions changes91.

One way of understanding these facts collectively is if the remodeling factors do not themselves determine the destinations of the nucleosomes that they mobilize, but instead act as catalysts of nucleosome mobility, allowing nucleosomes to rapidly sample alternative positions. In this view, the role of ATP hydrolysis is not to force nucleosomes to unfavorable locations; rather, ATP hydrolysis is required to provide sufficient energy for a nucleosome to cross the transition state free energy barrier(s) separating occupancy at thermodynamically favored locations. The same logic is presently used to explain the requirement of ATP for movement of kinesin along a microtubule, and of helicases along DNA92; and indeed, the ATP-dependent motor domains of all of the known nucleosome remodeling factors are members of helicase protein superfamilies.

Thus, in this view, the result of remodeler action is a thermodynamic equilibrium between the nucleosomes and the transcription factors that compete with nucleosomes for occupancy along the genome. When the constellation of transcription factors changes, for example, during development or following some environmental fluctuation, remodeler action allows the system to rapidly re-equilibrate to a new distribution of bound molecule configurations. In this view, the changed nucleosome positions resulting from Isw2 inactivation90 would be interpreted as a failure to establish an equilibrium distribution in the absence of remodeler activity recruited specifically to the affected regions.

From binding configurations to transcriptional output

The final step in modeling transcriptional behavior from sequence is to understand the transcriptional output that results from each configuration of bound molecules. Naively, this should be simple: configurations with bound activators should recruit the transcription initiation machinery and result in high transcription rates, whereas configurations with bound repressors should result in low transcription rates. Indeed, current approaches for translating binding configuration to transcriptional output are based on this intuition: they model transcriptional output as being proportional to the binding probability of the transcription initiation machinery11,16, or proportional to the weighted sum of the bound molecules, where activators have positive weights and repressors have negative weights10. These oversimplified approaches, however, do not model the effect of architectural features of the configuration, such as DNA LOOPING, the location and orientation of bound transcription factors relative to a nucleosome, and the distance of factors from the transcription start site48,49. Current models also assume that once regulatory regions are in transcriptionally active configurations, their rates of transcription will be the same. This ignores the additional layers of regulation that are enabled by trans-acting factors, such as regulation that the transcriptional initiation complex undergoes after it is bound50,51, regulation of transcriptional elongation, and the effects that nucleosomes positioned within the transcript may have on the elongation process. The quantitative details of these additional effects are poorly understood and thus, the simplicity with which existing models translate binding configurations to transcriptional output mainly reflects gaps in our knowledge of this process.

To summarize, the intrinsic binding specificities of each molecule determine the particular binding affinity landscape that the molecule will ‘feel’ on an input DNA sequence. At a given concentration of binding molecules, these landscapes, and the competitive and cooperative interactions between the molecules, dictate the probabilities of all possible configurations of bound molecules. Finally, the transcriptional output of a regulatory sequence is simply the sum of the transcriptional output of all binding configurations, with each configuration weighted by its probability. Having presented the general modeling framework, we now review the experimental observations that it explains, starting with observations regarding nucleosome organizations.

Deciphering the determinants of nucleosome organizations in vivo

As mentioned in an earlier section, in vivo measurements alone make it difficult to estimate the relative contributions of the multiple factors that shape nucleosome organization in vivo. Advances in this direction were made possible by comparing the organization of nucleosomes in vivo to the genome-wide organization of nucleosomes assembled in vitro on purified yeast genomic DNA15.

Direct consequences of the nucleosome landscape: distinct nucleosome positioning

This comparison revealed that the large nucleosome depletions around gene ends26,52,53,54 and around transcription factor binding sites24,28,55 that are observed in vivo are largely encoded by nucleosome sequence preferences. The nucleosome affinity landscape might therefore assist in directing transcription factors to their appropriate sites in the genome8,56,57 (Fig. 2a). For Abf1 and Reb1, two highly abundant transcription factors known to influence chromatin structure, the nucleosome affinity landscape encodes only minor nucleosome depletion; the large depletion seen around these factors’ binding sites in vivo is therefore probably due to these factors’ own ability to out-compete nucleosomes15. Nucleosome depletion around gene starts was also found to be encoded by the intrinsic nucleosome affinity landscape, but in this case, the action of chromatin remodelers and the binding of transcription factors and of the transcription initiation machinery also contribute measurably to the depletion15 (Fig. 2a).

Figure 2. Main determinants of in vivo nucleosome organization.

(a) Shown is the nucleosome occupancy in vivo in yeast (blue) and the nucleosome affinity landscape as measured in vitro by assembling purified histones on purified yeast genomic DNA15 (green), averaged across all genes. Occupancy around gene transcription start sites is shown on the left, and around gene translation end sites on the right. Also shown below each graph is a schematic illustration of the key components that contribute to the in vivo nucleosome occupancy. Nucleosome depletion around gene ends is largely encoded by the nucleosome affinity landscape, while nucleosome depletion around gene starts results both from the encoded nucleosome affinity landscape and from the binding action of transcription factors. (b) Across one genomic region from worm with well-positioned nucleosomes, shown is the average nucleosome occupancy for that region in vivo58 (blue) and the average nucleosome affinity landscape for that region as predicted by a model constructed from in vitro data in yeast15 (green). (c) Same as (b), across a genomic region from worm with less well-defined nucleosome locations (“fuzzy nucleosomes”). The agreement between predictions of a model based on nucleosome sequence preferences and the experimental measurements, both at regions with well-positioned nucleosomes (b) and at regions with fuzzy nucleosomes (c), suggests that both types of regions may be encoded by the genomic sequence, through peaked nucleosome affinity landscapes (b) or relatively flat landscapes (c). (d) Nucleosome-disfavoring sequences can have a long-range effect on the nucleosome organization. This example sequence contains a strong nucleosome-disfavoring sequence (yellow diamond); such sequences are highly abundant in eukaryotic genomes93. When such a nucleosome affinity landscape is combined with a high nucleosome concentration, as is the case in vivo, the bound nucleosomes automatically organize into ordered arrays, whose order decays with the distance from the original disfavoring sequence (bottom graph and schematic bottom sequence). This phenomenon is termed ‘statistical positioning’59. (e) Illustration of how a single sequence may potentially encode different nucleosome organizations in different cell types or biological conditions, by encoding different outcomes of nucleosome-factor competition at different factor concentrations. Shown is a sequence having a uniform landscape for nucleosomes and a landscape for one factor that includes a single strong binding site. In condition 1, where the hypothetical factor is expressed at low levels, the most likely configurations have nucleosomes covering the factor binding site, whereas in condition 2, where the factor is expressed at high levels, the most likely configurations have the factor binding to its site, causing a displacement of nucleosomes from their cognate sites.

Another notable feature of the in vivo nucleosome organization is that some regions of the genome have a small number of well-positioned nucleosomes, whereas others have “fuzzy” nucleosomes, in which many nucleosome positions are observed55,58. The existence of many regions with fuzzy nucleosomes in worm could indicate that much of the nucleosome organization is not dictated by DNA sequence58. However, in principle, regions with well-positioned nucleosomes and regions with fuzzy nucleosomes can both be encoded by the genomic sequence, if we assume a peaked nucleosome affinity landscape in the case of well-positioned regions, and a relatively flat landscape in the case of fuzzy regions. Indeed, both types of regions exist in the map of nucleosomes assembled on purified yeast DNA, and a model of nucleosome sequence preferences constructed from this yeast data is significantly correlated with the in vivo nucleosome organization of worm15 (Fig. 2b,c).

Indirect consequences of the nucleosome landscape: Long-range ordering of nucleosomes

The examples above represent cases where the nucleosome landscape directly accounts for the experimental observations. Other observations can also be explained by using the part of the framework that converts binding landscapes to binding configurations. For example, several studies observed a long-range ordering of nucleosomes downstream of gene starts, which decays with the distance from the gene start. There are strong nucleosome-disfavoring sequences and nucleosome-positioning sequences upstream and over gene starts, respectively26, which act to create, with high probability, a nucleosome-depleted region upstream of the gene start, together with a well-positioned nucleosome over the gene start. Introducing boundary constraints, such as these nucleosome-disfavoring and positioning sequences, into the framework presented here automatically results in a long-range periodic ordering of nearby nucleosomes, simply as a consequence of the high concentrations of nucleosomes along the DNA and the steric hindrance between them59 (Fig. 2d). This ordering, or “statistical positioning”, is greatest immediately adjacent to the boundary constraint, and decays with the distance away from it. Thus, the long-range ordering of nucleosomes near gene starts can also be explained by the intrinsic nucleosome landscape26, but in this case, it can be due in part simply to indirect long-range effects that the sequences surrounding gene starts exert on nucleosome configurations, at the high nucleosome concentrations that exist in vivo.
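The statistical positioning argument can be illustrated numerically. The following is a minimal sketch, not the implementation used in the cited studies: it assumes hard-core nucleosomes of 147 bp on a one-dimensional lattice at thermodynamic equilibrium, a uniform statistical weight `W` per start position, and a hypothetical 50-bp disfavoring element modeled as a stretch that no nucleosome may cover. The function name `start_probabilities` and all numerical values (`N`, `W`, the barrier coordinates, the two windows) are illustrative assumptions.

```python
def start_probabilities(weights, length):
    """Equilibrium probability that a rod of `length` bp starts at each bp.

    weights[s] is the statistical weight (concentration x Boltzmann factor)
    of a rod covering bp s .. s+length-1; rods are not allowed to overlap.
    """
    n = len(weights)
    fwd = [1.0] * (n + 1)            # fwd[i]: total weight of configurations on bp 0..i-1
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1]          # bp i-1 left uncovered
        if i >= length:
            fwd[i] += fwd[i - length] * weights[i - length]
    bwd = [1.0] * (n + 1)            # bwd[i]: total weight of configurations on bp i..n-1
    for i in range(n - 1, -1, -1):
        bwd[i] = bwd[i + 1]          # bp i left uncovered
        if i + length <= n:
            bwd[i] += weights[i] * bwd[i + length]
    z = fwd[n]                       # partition function over all legal configurations
    return [fwd[s] * weights[s] * bwd[s + length] / z if s + length <= n else 0.0
            for s in range(n)]


NUC = 147                            # nucleosome footprint (bp)
N = 6000                             # assumed sequence length (bp)
W = 1.4                              # assumed uniform nucleosome weight
BARRIER = range(1000, 1050)          # hypothetical nucleosome-disfavoring element

weights = [W] * N
for s in range(N):
    # Forbid any nucleosome whose footprint s..s+NUC-1 overlaps the barrier.
    if s + NUC > BARRIER.start and s < BARRIER.stop:
        weights[s] = 0.0

p = start_probabilities(weights, NUC)

# Contrast (max minus min) of the start probability in a window next to the
# barrier and in a window far from both the barrier and the sequence end.
near = p[1050:1250]
far = p[3400:3600]
near_contrast = max(near) - min(near)
far_contrast = max(far) - min(far)
```

Under these assumptions the start probability is strongly peaked next to the barrier and nearly flat far from it, so `near_contrast` should greatly exceed `far_contrast`. The ordering arises only from steric exclusion and the high nucleosome density, since the landscape away from the barrier is uniform.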

As another intriguing example, nucleosomes may have different organizations in different conditions54,60 or cell types. The framework presented here can in principle explain such observations because the concentrations of the nucleosomes, the transcription factors, or both change between conditions or cell types, resulting in a different distribution of binding configurations (Fig. 2e). Thus, a single binding affinity landscape can in principle encode many different distributions of binding configurations, depending on the concentrations of the molecules.

Relating nucleosome landscapes and transcription factors

Explaining the repressive function of nucleosomes

Aside from explaining DNA-binding patterns of molecules such as nucleosomes, we need to understand the dynamic transcriptional behavior that genes exhibit in response to changes in the concentration of the regulating factors. The quantitative framework presented here can be used to directly read the DNA sequence and predict these responses: this is achieved by computing, from the encoded affinity landscape, the probability of factor binding at increasing factor concentrations. Consider first a hypothetical DNA sequence with a landscape for only one transcription factor, which in turn recognizes a single binding site. Activation dynamics of such a target gene are determined only by the affinity of the single site. When nucleosomes are added to the equation with a uniform energy landscape (no intrinsic favored locations), they compete with the factor for binding, resulting in a lower probability of factor binding; a given level of activation then requires a higher factor concentration61 (Fig. 3a,b). This model thus provides a simple explanation for why nucleosomes are considered to be general repressors62,63,64.
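The single-site case can be written out as a short worked equation. This is a sketch under the equilibrium assumption of the framework. The symbols $c$, $K$, $f$ and $Z_{\mathrm{nuc}}$ are notation introduced here for illustration and are not taken from the review.

```latex
% Assumed notation:
%   c            factor concentration
%   K            association constant of the single factor site
%   Z_nuc        total weight of all legal nucleosome-only configurations
%   Z_nuc^free   the same sum restricted to configurations that leave the site uncovered
\begin{align}
f &= \frac{Z_{\mathrm{nuc}}^{\mathrm{free}}}{Z_{\mathrm{nuc}}}, \qquad 0 < f \le 1, \\
P_{\mathrm{bound}}(c) &= \frac{cK\,Z_{\mathrm{nuc}}^{\mathrm{free}}}{Z_{\mathrm{nuc}} + cK\,Z_{\mathrm{nuc}}^{\mathrm{free}}}
  = \frac{cKf}{1 + cKf}, \\
c_{1/2} &= \frac{1}{Kf}.
\end{align}
```

Without nucleosomes $f = 1$ and the familiar single-site binding curve is recovered. With competing nucleosomes $f < 1$, so the concentration needed for half-maximal binding rises by a factor $1/f$, which is the repressive effect described above.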

Figure 3. Reading gene expression dynamics from DNA sequence.

**(a)** Nucleosomes act as general repressors. Shown are two example sequences with a transcription factor landscape containing a single binding site, and with either a uniform but moderate-affinity landscape for nucleosomes (sequence ‘1’) or a uniform but low-affinity landscape for nucleosomes (sequence ‘2’). (b) For the two sequences from (a), shown is the probability of transcription factor binding at different factor concentrations, computed by applying the framework presented here to the binding landscapes of those two sequences. (c) Nucleosome disfavoring sequences determine the threshold of activation. Shown are three example sequences with differing nucleosome and factor landscapes: (‘1’) a uniform nucleosome landscape; (‘2’) a landscape with a sequence that strongly disfavors nucleosome formation, located 10bp from the single transcription factor site; (‘3’) same as ‘2’, but where the disfavoring sequence is located 135bp from the factor site. (d) The probability of transcription factor binding at each of the three sequences from (c). (e) For each of the three sequences from (c), shown is the most likely molecule binding configuration at three different factor concentrations (c). (f) Proximal factor sites exhibit cooperative or destructive binding. Shown are three example sequences with a uniform nucleosome affinity landscape and differing factor landscapes: (‘1’) a single factor site; (‘2’) two factor sites separated by 10bp; (‘3’) two factor sites separated by 135bp. (g) The probability of transcription factor binding to the left (red) site at each of the three sequences from (f). (h) Shown are the cooperative and destructive binding effects in sequences ‘2’ and ‘3’, respectively, displayed as the ratio between the factor binding probability at sequence ‘2’ or ‘3’ compared to sequence ‘1’.

Competition between nucleosome and transcription factor binding

In the more realistic setting in which a non-uniform nucleosome landscape dictates a non-uniform nucleosome occupancy distribution, activation dynamics depend on the nucleosome occupancy around the factor site, with activation occurring at lower factor concentration for sequences in which the binding site for the factor is located in a region of low intrinsic nucleosome occupancy61. Indeed, a recent study showed that lower nucleosome occupancy at sites for the yeast transcription factor Pho4, in conditions where Pho4 concentration is low, is predictive of an earlier onset (in both factor concentration and in time itself) for activation65,66. The importance of nucleosome occupancy for activation65, combined with the reliability with which nucleosome occupancy can be predicted from sequence15, means that the dependence of activation on factor concentrations can be predicted directly from sequence, although the accuracy of such predictions remains to be shown. For example, nucleosome-disfavoring sequences generate low nucleosome occupancy in their immediate vicinity and indeed, for nearly all yeast transcription factors, the subset of their sites that are near nucleosome-disfavoring sequences have lower nucleosome occupancy26. This context-dependent accessibility of factor sites thus provides a mechanism, predicted directly from the DNA sequence, by which the same factor may regulate its different targets with different activation dynamics, by positioning some of its sites near nucleosome-disfavoring sequences26,61 (Fig. 3c–e).
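The dependence of the activation threshold on the distance between a factor site and a nucleosome-disfavoring element can be sketched numerically. The example below is an illustration under stated assumptions, not the method of the cited studies: nucleosomes are hard-core 147-bp particles with a uniform weight `W`, the disfavoring element is a hypothetical 20-bp stretch that nucleosomes cannot cover, and the single factor site is 10 bp long. For a single site, the half-maximal factor concentration is taken as 1/(`K` × f), where f is the probability that the site is nucleosome-free when no factor is bound. All names and values are illustrative.

```python
NUC = 147    # nucleosome footprint (bp)
SITE = 10    # assumed length of the factor binding site (bp)
N = 4000     # assumed length of the modeled sequence (bp)
W = 1.4      # assumed uniform nucleosome weight: concentration x Boltzmann factor
K = 1.0      # assumed association constant of the factor site (arbitrary units)


def nucleosome_partition(blocked):
    """Total weight of all non-overlapping nucleosome arrangements.

    blocked[i] is True if bp i may not be covered by a nucleosome.
    """
    z = [1.0] * (len(blocked) + 1)   # z[i]: total weight on bp 0..i-1
    last_blocked = -1                # index of the last blocked bp seen so far
    for i in range(1, len(blocked) + 1):
        if blocked[i - 1]:
            last_blocked = i - 1
        z[i] = z[i - 1]              # bp i-1 not covered by a nucleosome
        start = i - NUC
        if start >= 0 and last_blocked < start:
            z[i] += W * z[start]     # a nucleosome covers bp start..i-1
    return z[-1]


def site_free_probability(site_start, barrier=None):
    """Probability that the factor site is nucleosome-free, with no factor bound."""
    base = [False] * N
    if barrier is not None:
        for bp in range(*barrier):
            base[bp] = True
    with_site = list(base)
    for bp in range(site_start, site_start + SITE):
        with_site[bp] = True
    return nucleosome_partition(with_site) / nucleosome_partition(base)


SITE_START = 2000
f_uniform = site_free_probability(SITE_START)                      # uniform landscape
f_near = site_free_probability(SITE_START, barrier=(2020, 2040))   # element 10 bp away
f_far = site_free_probability(SITE_START, barrier=(2145, 2165))    # element 135 bp away

# Half-maximal factor concentration for each sequence, c_half = 1 / (K * f).
c_half = {name: 1.0 / (K * f)
          for name, f in [("uniform", f_uniform), ("near", f_near), ("far", f_far)]}
```

In this sketch the site next to the disfavoring element is the most accessible and therefore has the lowest `c_half`. A site 135 bp away is less accessible than on the uniform landscape, because a nucleosome packed against the boundary tends to cover it.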

Many regulatory sequences contain multiple sites for multiple transcription factors. In sequences where multiple sites are close to each other, each of the corresponding binding factors separately competes with nucleosomes, resulting in indirect binding cooperativity between factors67 and, again, gene activation at lower factor concentration. Indeed, such cooperativity was demonstrated in yeast, between an endogenous yeast transcription factor and two foreign transcription factors from E. coli45. Note that this obligate cooperativity is predicted by the framework directly from the affinity landscape of the sequence, and without invoking specific cooperative mechanisms such as protein-protein interactions61 (Fig. 3f–h). This may be part of the mechanism of cooperative binding that was demonstrated in regulatory sequences from yeast16,68,69 and fly10,70 that contain clusters of factor binding sites. Intriguingly, since this obligate cooperativity occurs between any two factors, it may also explain why some transcription factors exhibit both activatory and inhibitory roles71,72: for instance, a transcriptional repressor can seemingly act as an activator if its competition with nucleosomes promotes the binding of a nearby activator (Fig. 3f–h).
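The condition under which a second site helps binding at the first can be made explicit. The following derivation is a sketch under the equilibrium assumption, for two factor sites with no direct protein-protein interaction. The partition-sum symbols are notation introduced here, not taken from the review.

```latex
% Assumed notation: total statistical weights of all legal configurations in which
%   Z_0    neither site is factor-bound,
%   Z_1    only site 1 is factor-bound,
%   Z_2    only site 2 is factor-bound,
%   Z_12   both sites are factor-bound
% (nucleosomes are summed over in every term).
\begin{align}
P_1^{\text{two sites}} &= \frac{Z_1 + Z_{12}}{Z_0 + Z_1 + Z_2 + Z_{12}}, \qquad
P_1^{\text{one site}} = \frac{Z_1}{Z_0 + Z_1}, \\
P_1^{\text{two sites}} > P_1^{\text{one site}}
  &\iff Z_0\,Z_{12} > Z_1\,Z_2
  \iff \frac{Z_{12}}{Z_2} > \frac{Z_1}{Z_0}.
\end{align}
```

The last inequality says that site 1 must be more likely to be nucleosome-free when a factor occupies site 2 than when it does not. This holds when the sites are close together, because a factor at site 2 evicts the nucleosomes that would cover site 1. It is reversed when the inequality flips, which corresponds to the destructive case.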

Encoding distinct modes of transcriptional regulation

Chromatin remodelers are important in transcriptional regulation73 and target specific sets of genes74. Although the mechanism by which chromatin remodelers are recruited to specific loci is not well understood, a recent study26 suggested that the differential requirement for remodelers at different loci may be partly explained from the DNA sequence. This study defined two categories of yeast genes based only on sequence information. The differences in the affinity landscapes of these two gene categories suggest that they undergo different modes of regulation, according to whether transcription factors do or do not compete with nucleosomes for access to the DNA. Indeed, genes in the category where competition is predicted to take place have higher rates of histone turnover75 and TRANSCRIPTIONAL NOISE76, consistent with an ongoing dynamic competition between nucleosome assembly and factor binding. Correspondingly, these genes contain more targets of chromatin remodeling complexes74 (Fig. 4).

Figure 4. Distinct modes of transcriptional regulation encoded by DNA sequence.

**(a)** Two sets of yeast genes were defined based on their DNA sequence26: one set by the absence of strong nucleosome disfavoring sequences and the presence of TATA sequences (left), and one by the presence of strong nucleosome disfavoring sequences and the absence of TATA sequences (right). Shown is the nucleosome occupancy in vivo (blue) and the nucleosome affinity landscape as measured in vitro by assembling purified histones on purified yeast genomic DNA15 (green), averaged across all genes of each gene set. Also shown is the approximate affinity landscape for all transcription factors across all genes of each of the two gene sets, using the spatial distribution of factor binding site occurrences as a proxy for the spatial distribution of affinity. (b) Schematic illustration of the most likely configurations of each gene set. In gene set one (left), the nucleosome landscape exhibits high nucleosome occupancy and the transcription factor landscape has a relatively large number of binding sites spread across the regulatory region, suggesting that nucleosomes and factors are in competition for access to the DNA. Supporting this suggestion is the high transcriptional noise, high rate of histone turnover, and enrichment for chromatin remodeler activity, that were found for this gene set26. In contrast, in gene set two (right), the nucleosome landscape shows strong nucleosome depletion around the transcription start site, and the factor landscape has fewer binding sites, but with a preference for these sites to be located at the nucleosome depleted region. These landscapes suggest little competition between factors and nucleosomes, and supporting this is the low noise, low histone turnover, and absence of enrichment for chromatin remodeler targets, that were found for this gene set26.

Thus, by partitioning genes based on the affinity landscapes, two modes of transcriptional regulation can be identified that provide a partial explanation for the differential requirement for chromatin remodelers at different genomic loci. A corollary of these facts is that genomes can encode different dynamic responses – even to the same transcription factor – at two different regulatory sequences, by embedding the factor’s cognate binding sites in two different nucleosome affinity landscapes.

Evolution of binding affinity landscapes

A genetic mechanism for achieving phenotypic diversity

The examples above demonstrate that many diverse aspects of transcriptional regulation can be understood directly from the affinity landscapes encoded in DNA sequences. Since changes in transcriptional regulation are important for generating phenotypic diversity among species, an intriguing possibility is that the genetic mechanisms that underlie these regulatory changes involve changes in the encoded affinity landscapes.

Overall, changes in factor binding site content alone account for only a small fraction of the observed expression divergence in both yeast77,78 and mammals78. However, in the case of the yeast mating system, which is regulated by a single transcription factor, variations in the predicted binding sites indeed explain much of the expression divergence across yeast species78.

A recent study found that a major change in the transcriptional program of yeast species, connected with the capacity for rapid anaerobic growth, is accompanied by corresponding changes in the DNA-encoded nucleosome affinity landscape of the orthologous regulatory sequences79. In aerobic yeast species, where cellular respiration genes are active under typical growth conditions, the regulatory sequences encode a nucleosome landscape with a strong nucleosome-depleted region, whereas in anaerobic yeast species, where cellular respiration genes are inactive under typical growth, the orthologous regulatory sequences encode a relatively nucleosome-occupied landscape. This suggests that DNA sequence changes that directly alter the encoded nucleosome affinity landscape of regulatory sequences may be a general genetic mechanism for achieving phenotypic diversity across evolution.

Explaining transcriptional noise

Noise in gene expression levels

Levels of cell-to-cell expression variability vary across genes76,80,81, and so it is interesting to address whether these noise levels can be predicted directly from the regulatory sequence of a gene. TATA SEQUENCES are predictive of high noise, presumably by amplifying fluctuations in gene activation through facilitation of transcription re-initiation81,82,83,84, whereas nucleosome-disfavoring sequences are predictive of low noise26. This latter observation can be understood using the framework presented here, by examining the effect on activation dynamics that it predicts for sequences that contain nucleosome-disfavoring elements. As discussed above, regulatory target sites that are located close to nucleosome-disfavoring elements will be occupied at lower factor concentrations than are target sites that are far from such elements. Thus, the noisy regime, at which the probability of factor binding is ~0.5, is reached at lower factor concentrations in target sequences where the site is near nucleosome-disfavoring elements61 (Fig. 5a–c). If we assume that the physiological concentration of the regulating factor is such that it binds both types of target sequences with sufficient probability, then at such a concentration, targets with nucleosome-disfavoring sequences have already ‘escaped’ the noisy regime, providing a plausible explanation for the relatively low noise that has been observed for these targets26. In addition, kinetic models show that regulatory sequences with a higher frequency of transition from the transcriptionally inactive state to the transcriptionally active state exhibit less noise81. Since nucleosome-disfavoring elements create nucleosome-depleted regions, sequences that contain such elements have a lower requirement for and are less targeted by chromatin remodelers26, which may result in more rapid transitions between the active and inactive states, providing another plausible explanation for the lower noise of these sequences.
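The link between a binding probability of about 0.5 and maximal variability follows from a one-line calculation. This is a sketch that treats factor occupancy of the site as a two-state (bound or unbound) variable at equilibrium. The symbols $b$, $p$, $c$, $K$ and $f$ are notation introduced here, with $f$ denoting the probability that the site is nucleosome-free when no factor is bound.

```latex
% Assumed notation: b = 1 if the factor is bound, b = 0 otherwise; p = P(b = 1).
\begin{align}
\operatorname{Var}(b) &= p\,(1 - p) \le \tfrac{1}{4}, \quad \text{with equality at } p = \tfrac{1}{2}, \\
p(c) &= \frac{cKf}{1 + cKf} \;\Longrightarrow\; p = \tfrac{1}{2} \text{ at } c = \frac{1}{Kf}.
\end{align}
```

A site near a nucleosome-disfavoring element has a larger $f$, so the concentration at which the variance peaks is lower. This is the shift of the noisy regime described above. The variance of occupancy is only one contributor to expression noise.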

Figure 5. Explaining transcriptional noise from DNA sequence.

**(a)** Nucleosome disfavoring sequences determine the range of factor concentrations at which high transcriptional noise occurs. Shown are two example sequences, one with a uniform nucleosome landscape (‘1’), and one with a nucleosome landscape containing a sequence that strongly disfavors nucleosome formation, located 10bp from the single transcription factor site (‘2’). (b) For the two sequences from (a), shown is the probability of transcription factor binding at different factor concentrations, computed by applying the framework presented here to the binding landscapes of those two sequences61. Under this equilibrium framework, the regime of high transcriptional noise is where the probability of transcription factor binding is ~0.5 (highlighted by the brown rectangle), since at this regime the variance of factor binding is maximal. Note, however, that while the variance of factor binding is one of the determinants of noise levels, other determinants exist as well. (c) For four different factor concentrations (c), shown are the most likely molecule binding configurations at each of the two sequences from (a). Note that at each of the two intermediate concentrations, one of the two sequences is noisy, i.e., the configurations in which the factor is bound and the configurations in which the factor is not bound have near-equal probability. (d) Cooperative binding reduces the range of factor concentrations at which there is high transcriptional noise. Shown are two example sequences with a uniform nucleosome landscape, where one sequence has a single factor site (‘1’), and the other two factor sites separated by 10bp (‘2’). (e) The probability of transcription factor binding to the left (red) site at each of the two sequences from (d)61. The regime of high noise is highlighted (brown rectangle). The range of factor concentrations at which each sequence exhibits high noise is depicted. The range of factor concentrations in which sequence ‘2’ (the sequence with cooperative binding) is noisy is smaller than the corresponding range for sequence ‘1’.

Noise in replication initiation

The ideas above are based on modeling the binding of molecules to DNA and as such, they may apply beyond the context of transcriptional regulation. For example, DNA replication origins also exhibit cell-to-cell variability, with some origins initiating replication in most cell divisions and others initiating only occasionally. Thus, analogous to transcriptional noise, the framework predicts that origins that are close to nucleosome-disfavoring elements would have lower nucleosome occupancy, and thus be more accessible to the replication initiation complex and initiate replication with higher efficiency. Indeed, in fission yeast, where replication efficiency was measured85, origins that are close to nucleosome-disfavoring sequences initiate with higher efficiency (probability)26, and a systematic sequence deletion study around one replication origin found that deletion of a strong nucleosome-disfavoring element resulted in the largest reduction in replication efficiency86.

In summary, by examining the activation dynamics that the framework predicts directly from DNA sequence, measurements of cell-to-cell expression and replication variability can be partly explained from sequence alone. More generally, the same approach can be used to predict the noise of sequences with more complex affinity landscapes, thereby generating testable hypotheses. As one such example, the framework predicts lower noise in sequences in which multiple sites are clustered in close proximity, because as discussed above, site clustering leads to cooperative factor binding and a sharper activation curve, and thus the noisy regime spans a smaller range of factor concentrations61 (Fig. 5d,e).
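The claim that cooperative binding narrows the noisy regime can be checked with a small calculation. This is a minimal sketch under several assumptions that are not from the review: two identical sites, an effective single-site binding weight `u` (factor concentration times affinity times site accessibility), an indirect cooperativity factor `omega` (with `omega = 1` reproducing the single-site curve), and a "noisy regime" defined as binding probabilities between 0.25 and 0.75. The value `omega = 4.0` is purely hypothetical.

```python
def left_site_probability(u, omega):
    """Probability that the left of two identical sites is factor-bound.

    u     : effective single-site binding weight
    omega : cooperativity factor; omega = 1 gives the single-site curve u / (1 + u)
    """
    return (u + omega * u * u) / (1.0 + 2.0 * u + omega * u * u)


def solve_u(target, omega):
    """Find u such that left_site_probability(u, omega) == target.

    Uses bisection on a logarithmic scale; the probability increases with u.
    """
    lo, hi = 1e-9, 1e9
    for _ in range(200):
        mid = (lo * hi) ** 0.5
        if left_site_probability(mid, omega) < target:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def noisy_fold_range(omega, low=0.25, high=0.75):
    """Fold-change in concentration needed to cross the assumed noisy regime."""
    return solve_u(high, omega) / solve_u(low, omega)


single_site = noisy_fold_range(1.0)   # no cooperativity
clustered = noisy_fold_range(4.0)     # hypothetical cooperativity factor
```

With these definitions the single-site curve needs a 9-fold change in concentration to pass from 0.25 to 0.75 binding probability, whereas any `omega` above 1 gives a smaller fold-range. The noisy regime therefore spans a narrower range of factor concentrations when sites are clustered.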

Summary and future directions

This review presents a unifying quantitative and conceptual framework for translating DNA sequences into transcriptional behaviors. The key idea behind this translation process is that DNA-binding molecules have intrinsic affinities to DNA sequences that are specific to each molecule, and thus, every sequence defines a unique affinity landscape with respect to each molecule. The binding of these interacting molecules to DNA results in a unique distribution of molecule binding configurations at each sequence, and in a resulting transcriptional output.

Recent studies have determined the intrinsic affinities of nucleosomes and of many transcription factors, allowing us to translate DNA sequences into affinity landscapes with high accuracy. These landscapes, either directly, or through the application of the framework, partly explain many experimental observations regarding binding patterns of nucleosomes, dynamics of transcriptional activation, and transcriptional noise, demonstrating that diverse sets of transcriptional behaviors can be read directly from the DNA sequence.

Many challenges remain. A large number of predictions generated by current models still need to be validated experimentally. Several aspects of the framework regarding the translation of binding landscapes into binding configurations, and especially the translation of binding configurations into transcriptional output, are still poorly understood and require targeted experiments for determining them at a quantitative level. In particular, the treatment of DNA sequences as being one-dimensional should ultimately be replaced by modeling binding in three dimensions, taking into account both the long-range DNA looping that allows distant enhancers to interact with promoters and the short-range looping that allows factors within the same regulatory module to interact with each other87. Regulatory modules may prove to have as-yet unrecognized highly specific three-dimensional architectures, which could depend on the detailed locations of nucleosomes and other factors that bend or twist DNA. Any such specific architectures will influence the interactions that are mediated by short- and long-range DNA looping. Experiments targeted at measuring the sequence-dependent energetic costs of DNA looping can be used to assign statistical weights to the expanded set of configurations that include short DNA loops. Molecular mechanics studies of longer chromatin regions in vitro or in vivo can supply data needed to model the free energy costs of longer loops. Detailed studies of architectures of regulatory sequences carried out at high spatial resolution will also be required.

Current models assume that the system is in thermodynamic equilibrium and make other simplifying assumptions regarding steric hindrance and integration of effects of multiple factors, all of which need to be validated experimentally or modified appropriately. Current models also assume that different histone variants, different histone posttranslational modifications, and the absence or presence of histone H1, do not significantly influence the nucleosome sequence preference; these assumptions, too, need to be validated or modified. The activity of key components such as chromatin remodelers has thus far not been incorporated into current models. Experiments in purified in vitro systems should be useful for directly measuring the quantitative effect of these components at a genome-wide scale, thereby allowing their integration into future models. Since most of the experimentation and modeling has been done only in bacteria and yeast, it is important to test and develop the models in higher eukaryotes. Although the basic rules of molecular interactions modeled by current approaches should be universal, genomes of multi-cellular organisms may encode landscapes that combine these building blocks in ways that differ from unicellular organisms, leading to new design principles in transcriptional regulation. Finally, while we have focused here on approaches that quantitatively model the translation of DNA sequences into transcriptional behaviors, a major remaining challenge is to understand the functional consequences that these expression patterns have, and ultimately, to be able to tell which deviations from the wild-type patterns may be deleterious and lead to disease. Future experiments, combined with the development of improved quantitative models, should allow us to address these and other challenges, and bring us closer to a mechanistic and predictive understanding of transcriptional regulation in all organisms.

Glossary

BINDING CONFIGURATION

A particular arrangement of molecules along a DNA sequence, including specification of the precise position and orientation (or DNA strand) at which each molecule is bound

TRANSCRIPTIONAL NOISE

The variability in the transcription rate (or in steady state mRNA levels) of genes across different cells from an isogenic cell population grown in the same condition

NUCLEOSOME

The basic unit of chromatin, containing 147bp of DNA wrapped around a histone protein octamer

CHROMATIN REMODELERS

Proteins or protein complexes that have the capacity to alter the structure of chromatin. Some remodelers require ATP hydrolysis for their activity

FOOTPRINTING

A method for detecting protein-DNA interactions by using an enzyme to cut DNA, followed by analysis of the resulting cleavage pattern. The method is based on the fact that a protein bound to DNA protects that DNA from enzymatic cleavage

GEL-SHIFT ANALYSIS

A technique that uses native gel electrophoresis to determine whether, and how tightly, a protein of interest can bind a given DNA sequence

SOUTHWESTERN BLOTTING

A method that involves identifying DNA-binding proteins after SDS polyacrylamide gel electrophoresis and transfer to a membrane, by their ability to bind to specific oligonucleotide probes

SYSTEMATIC EVOLUTION OF LIGANDS BY EXPONENTIAL ENRICHMENT (SELEX)

A combinatorial technique for producing DNAs that bind specifically and with high affinity to a DNA-binding protein of interest

CHIP-CHIP

ChIP-on-chip (also known as ChIP-chip) is a technique that combines chromatin immunoprecipitation (“ChIP”) with microarray technology (“chip”). It is a high-throughput method for identifying, on a genome-wide scale, DNA regions that are bound in vivo by a target protein of interest

CHIP-SEQ

Same as ChIP-chip, but where the resulting interactions are read out by high-throughput parallel sequencing and not by microarrays as in ChIP-chip

PROTEIN BINDING MICROARRAYS

A method that allows high-throughput characterization of the in vitro DNA binding-site sequence specificities of transcription factors. In this approach, a DNA-binding protein of interest is expressed, purified and then bound directly to a dsDNA microarray spotted with a large number of different potential DNA-binding sites

MICROFLUIDIC PLATFORMS FOR MEASURING DNA-BINDING INTERACTIONS

A high-throughput platform for measuring protein-DNA affinities on the basis of mechanically induced trapping of molecular interactions

DNA LOOPING

A conformation of a dsDNA sequence in which two regions of the DNA that are separated along the DNA in one dimension are brought close together in three dimensional space

TATA SEQUENCES

A DNA sequence with a core of 5′-TATA-3′ found in the promoter region of many genes. It is typically bound by a corresponding TATA-binding protein in the process of recruiting RNA polymerase to a promoter

LEGAL CONFIGURATION

An arrangement of molecules along a DNA sequence in which there is no steric overlap between any two molecules on the DNA

References