RNA-binding proteins: modular design for efficient function

Author manuscript; available in PMC: 2017 Jul 12.

Published in final edited form as: Nat Rev Mol Cell Biol. 2007 Jun;8(6):479–490. doi: 10.1038/nrm2178

Abstract

Many RNA-binding proteins have modular structures, being composed of multiple repeats of just a few basic domains that are arranged in a variety of ways to satisfy their diverse functional requirements. Recent studies have investigated how different modules cooperate in regulating the RNA binding specificity and the biological activity of these proteins. They have also investigated how multiple modules cooperate with enzymatic domains to regulate the catalytic activity of enzymes acting upon RNA. These studies have shown how multiple modules define, for many RNA-binding proteins, the fundamental structural unit that is responsible for their biological function.

Introduction

RNA is rarely at a loss for companions; as soon as RNA is transcribed, ribonucleoproteins (RNPs) form co-transcriptionally on the nascent transcript and participate in processing, nuclear export, transport and localization1. The dynamic association of these proteins with RNA defines the lifetime, cellular localization, processing and the rate at which a specific mRNA is translated.

The diversity of functions of RNA-binding proteins would suggest a correspondingly large diversity in the structures that are responsible for RNA recognition. However, most RNA-binding proteins are built from relatively few RNA-binding modules (Table 1). The large structural diversity of substrates is accommodated instead by the presence of multiple copies of these RNA-binding domains presented in a variety of structural arrangements to expand the functional repertoire of these proteins (Figure 1)2. Modules of the same or different structural type combine to create versatile macromolecular binding surfaces to define the specificity of these proteins and combine with enzymatic domains to define the enzymes’ target and regulate catalytic activity (Figure 2). In order to understand the function of RNA-binding proteins, it is therefore important to know how these domains function together as RNA recognition units.

Table 1.

Common RNA binding domains and general features of their RNA binding sites and RNA recognition properties.

| Domain | Topology | RNA recognition surface | Protein-RNA interactions | Representative structures (PDB ID) |
|---|---|---|---|---|
| RRM | αβ | Surface of β-sheet | Interacts with 4 nucleotides of ssRNA through stacking, electrostatics and hydrogen bonding | U1A N-terminal RRM18 (1URN) |
| KH (Type I and Type II) | αβ | Hydrophobic cleft formed by variable loop between β2 and β3 and GXXG loop; Type II: same as type I, except variable loop is between α2 and β2 | Recognizes 4 nucleotides of ssRNA through hydrophobic interactions between non-aromatic residues and the bases; sugar-phosphate backbone contacts from GXXG loop, and hydrogen bonding to bases | Nova-1 KH3 (Type I)41 (1EC6); NusA (Type II)37 (2ASB) |
| dsRBD | αβ | α-helix 1, N-terminal portion of α-helix 2, and loop between β1–β2 | Shape-specific recognition of dsRNA’s minor-major-minor groove pattern through contacts to sugar-phosphate backbone; specific contacts from N-terminal α-helix to RNA in some proteins | dsRBD3 from Staufen48 (1EKZ) |
| ZnF-C2H2 | αβ | Primarily residues in α-helices | Protein side chain contacts to bulged bases in loops and through electrostatic interactions between side chains and RNA backbone | Fingers 4–6 of TFIIIA56 (1UN6) |
| ZnF-CCCH | Little regular secondary structure | Aromatic side chains form hydrophobic binding pockets for bases that make direct H-bonds to protein backbone | Stacking interactions between aromatic residues and bases create kink in RNA that allows for direct recognition of Watson-Crick edges of the bases by the protein backbone | Fingers 1 and 2 of Tis11d57 (1RGO) |
| S1 | β | Core formed by two β-strands with contributions from surrounding loops | Stacking interactions between base and aromatic residues and hydrogen bonding to the bases | Ribonuclease II118 (2IX1), Exosome99 (2NN6) |
| PAZ | αβ | Hydrophobic pocket formed by OB-like β-barrel and small αβ motif | Recognizes single-stranded 3′ overhangs of siRNA through stacking interactions and hydrogen bonds | PAZ73 (1SI3), Argonaute76 (1U04), Dicer72 (2FFL) |
| PIWI | αβ | Highly conserved pocket including a metal ion that is bound to the exposed C-terminal carboxylate | Recognizes the defining 5′ phosphate group in siRNA guide strand with highly conserved binding pocket that includes a metal ion | AfPIWI75 (1YTU), Argonaute (1U04) |
| TRAP | β | Edges of β-sheets between each of the 11 subunits that form the entire protein structure | Recognizes GAG triplet through stacking interactions and hydrogen bonding to bases; limited contacts to the backbone | TRAP119 (1C9S) |
| Pumilio | α | Two repeats combine to form binding pocket for individual bases; helix α2 provides specificity-determining residues | Binding pockets for bases provided by stacking interactions; specificity dictated by hydrogen bonds to Watson-Crick face of base by two amino acids in helix α2 | Pumilio84 (1M8Y) |
| SAM | α | Hydrophobic cavity between three helices surrounded by an electropositive region | Shape-dependent recognition of RNA stem-loop, mainly through interactions with sugar-phosphate backbone and a single base in loop | Vts1p120 (2ESE) |

Figure 1. Many RNA-binding proteins have a modular structure.

Representative examples from some of the most common RNA-binding protein families, as illustrated here, demonstrate the variability in the number of copies (as many as 14 in vigilin) and arrangements that exist in RNA-binding proteins. This variability has direct functional implications. For example, Dicer and RNase III both contain an endonuclease catalytic domain followed by an RNA-binding dsRBD; thus, both proteins recognize double-stranded RNA, but Dicer has evolved to interact specifically with RNA species produced through the RNA interference pathway through additional domains that recognize the unique structural features of these RNAs. Different domains are schematically represented as colored boxes, including the RRM (RNA Recognition Motif; green), by far the most common RNA-binding protein module; the KH (K-Homology) domain (blue), capable of binding both single-stranded RNA and DNA; the dsRBD (double-stranded RNA-binding domain; red), a sequence-independent dsRNA-binding module; and RNA-binding zinc finger domains (light blue or pink). Enzymatic domains and less common functional modules are indicated in a variety of colors.

Figure 2. RNA-binding modules are combined to perform multiple functional roles.

RNA-binding domains function in a variety of ways. (a) They recognize RNA sequences with a specificity and affinity that would not be possible through a single domain or if multiple domains did not cooperate. Multiple domains combine to recognize a longer RNA sequence (left), sequences separated by many nucleotides (centre), or RNAs belonging to different molecules altogether (right). (b) RNA-binding domains can organize mRNAs topologically by interacting simultaneously with multiple RNA sequences or (c) they can function as spacers to properly position other modules for recognition. (d) They can combine with enzymatic domains to define the substrate specificity for catalysis or regulate enzymatic activity. In this figure, the RNA-binding modules are represented as ellipses with their RNA-binding surfaces colored in light blue, with the corresponding binding sites within the RNA colored in red; individual domains are colored differently.

In this review, we begin by illustrating general themes as to how modularity facilitates function. We then briefly summarize the principles of RNA recognition by individual RNA-binding domains as a necessary prologue to the subsequent discussion of how specific combinations of modules cooperate functionally and structurally. The reader is referred to several excellent reviews that discuss in greater detail the molecular mechanisms used by individual domains to recognize specific RNAs3–6. The focus of this review is on how RNA-binding modules are combined and arranged to facilitate a myriad of different interactions and regulatory events.

Modularity facilitates function

Many cellular processes, for example intracellular signaling and the extracellular matrix7–9, rely on proteins that are constructed through multiple repeats of a few basic modular units. The advantages to constructing a protein with a modular architecture arise from the resulting versatility. By existing in multiple copies (Figure 1), these modules endow a protein with the ability to bind RNA with higher specificity and affinity than would be possible with individual domains, which often bind short RNA stretches with relatively weak affinity. Thus, by constructing an interaction surface through multiple modules, high affinity and specificity for a particular target can be obtained by combining multiple weak interactions. These weak interactions make it easier to regulate formation of these complexes by disassembling them when needed. Furthermore, these multiple binding sites have the ability to evolve independently. The modular architecture is also ideally suited to construct proteins that match in their RNA specificity the relatively poorly conserved sequence features observed in splicing and 3′-end processing sites of eukaryotic mRNAs10–12.

The first effect of providing a protein with multiple domains is therefore that the protein itself can recognize a much longer stretch of nucleic acids than would be possible with a single domain (Figure 2A, left). This modularity also allows proteins to recognize sequences that are separated either by an intervening stretch of nucleotides (Figure 2A, centre) or that belong to different RNAs (Figure 2A, right).

The specificity of individual domains within a protein is obviously functionally important, but so is the way in which domains are arranged relative to each other. This is reflected in evolution: higher levels of conservation are often found between domains that occupy the same position in orthologous proteins than between domains that occupy different positions within the same protein. For example, in both the splicing factor U2AF65 and the poly(A) binding protein (PABP), RRM1 of the yeast protein is more similar to RRM1 of the human protein than to RRM3 or RRM4 of the yeast protein.

Much of the ability of these proteins to recognize RNA specifically is dependent upon the linker between the two domains. Long linkers are generally disordered and allow two domains to recognize a diverse set of targets, as seen in the centre and right panels of Figure 2A, while short linkers predispose the domains to bind to a contiguous stretch of nucleic acids (Figure 2A, left side). When this occurs, the linker generally becomes ordered, forming a short α-helix in response to RNA binding that positions the two domains relative to one another and sometimes contacts RNA directly13–16. In these situations, inter-domain sequences are as well conserved as the domains themselves17, or better, because the precise positioning of domains facilitates their function.

The modular architecture allows a protein to topologically arrange the generally flexible RNA for a particular function (Figure 2B). Conversely, the proteins themselves can be topologically organized to interact with a particular RNA structure (Figure 2C), for example by utilizing additional domains (yellow oval, Figure 2C) to organize the RNA-binding domains.

Finally, the combination of enzymatic and RNA-binding domains provides ways to regulate catalytic activity. In Figure 2D, we outline a situation where the active site of an enzyme is occluded by the presence of an RNA-binding domain. In the presence of the substrate RNA, the RNA-binding domain binds its target, thereby releasing the enzyme from its inactive state.

RNA recognition by RNA-binding modules

RRM

The RNA recognition motif (RRM; also known as the RNA-binding domain, RBD, or the ribonucleoprotein motif, RNP) is by far the most common and best characterized of the RNA-binding modules. In this review, we will refer to it as RRM, and use the term RNA-binding domain for any domain that binds to RNA. The RRM is composed of 80–90 amino acids that form a four-stranded anti-parallel β-sheet with two helices packed against it, giving the domain the split αβ (βαββαβ) topology18 (Figure 3A). More than 9,000 RRMs have been identified that function in most, if not all, post-transcriptional gene expression processes; in humans, ~0.5–1% of genes contain an RRM, often in multiple copies within the same polypeptide19.

Figure 3. How RNA-binding modules recognize RNA.

a) Structure of the N-terminal RNA-recognition motif (RRM) of human U1A bound to RNA18; in this structure, and in many other RRM-RNA complexes, single-stranded bases are specifically recognized through the protein β-sheet and two loops connecting the secondary structure elements. b) The hnRNP K homology 3 (KH3) domain of Nova-2 bound to 5′-AUCAC-3′41; these domains bind to both single-stranded DNA and RNA through a conserved GXXG sequence located in an exposed loop (light blue). c) The Rnt1 double-stranded RNA-binding domain (dsRBD) bound to an RNA helix capped by an AGNN tetraloop50; a conserved protein loop (left-most part of the structure) interacts with 2′-OH groups in the RNA minor groove while highly conserved Lys and Arg residues at the end of the longer helix recognize the position of phosphate atoms characteristic of an A-form helix. d) The two zinc fingers of Tis11d bound to an AU-rich RNA element57; the identity of the single-stranded RNA is recognized by the protein backbone through hydrogen bonds with the Watson-Crick face of each base. In all panels, the RNA backbone is represented with an orange ribbon, α-helices are in red and β-sheets in yellow; the Zn atom in the Tis11d structure is in magenta.

In the approximately 20 available structures of RRM–RNA complexes, RNA recognition usually occurs on the surface of the β-sheet13–16, 18, 20–28. Binding is mediated in most cases by three conserved residues: an Arg/Lys that forms a salt bridge to the phosphodiester backbone, and two aromatic residues that make stacking interactions with the nucleobases. These amino acids reside in the two highly conserved motifs, termed ribonucleoprotein motif 1 and 2 (RNP1 and RNP2), that define the domain at the sequence level and are located in the two central β-strands18. This conserved platform allows for recognition of two nucleotides in the center of the β-sheet, and two additional nucleotides on either side6. However, a single RRM can recognize anywhere from 4 to 8 nucleotides by using exposed loops and additional secondary structure elements that are not present in the canonical structure3, 6. This general mechanism of recognition is found in many RRMs, but not all22, 28; some of these domains even interact with proteins and not RNA29–35. Thus, some individual RRMs can bind to RNA with great specificity, but in many cases multiple domains are needed to define specificity because the number of nucleotides recognized by an individual RRM is generally too small to define a unique binding sequence3.
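The point that four nucleotides are too few to define a unique site can be illustrated with simple arithmetic. The sketch below assumes a random RNA in which the four bases are equally likely and independent (an idealization that is not part of the original text) and estimates how many sites would match one specific motif purely by chance. The function name and the 2,000-nucleotide transcript length are illustrative only.

```python
def expected_chance_sites(motif_length: int, transcript_length: int) -> float:
    """Expected number of chance matches to one specific motif.

    Assumes (hypothetically) a random sequence in which A, C, G and U
    are equally likely and independent at every position.
    """
    if motif_length < 1 or transcript_length < motif_length:
        return 0.0
    # Number of positions at which a motif of this length can start.
    start_positions = transcript_length - motif_length + 1
    # Each start position matches with probability (1/4) ** motif_length.
    return start_positions / 4 ** motif_length


if __name__ == "__main__":
    transcript_length = 2000  # illustrative length, not taken from the source
    for motif_length in (4, 8):
        sites = expected_chance_sites(motif_length, transcript_length)
        print(f"{motif_length}-nt motif: about {sites:.2f} chance sites")
```

Under these assumptions a 4-nucleotide site recurs roughly once every 256 nucleotides, whereas an 8-nucleotide site recurs roughly once every 65,536 nucleotides, which is consistent with the need for more than one domain to specify a site.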

KH

The hnRNP K homology domain (KH domain) binds to both ssDNA and ssRNA36–42 and is ubiquitously found in eukaryotes, eubacteria and archaea43. The domain is composed of ~70 amino acids with a signature sequence of (I/L/V)-I-G-X-X-G-X-X-(I/L/V) near the center of the domain that is functionally very important. Mutations within this region of the Fmr1 protein cause Fragile X mental retardation syndrome44. All KH domains form a three-stranded β-sheet packed against three α-helices, but can be separated into two subfamilies on the basis of their topology45 (type I: βααββα topology; type II: αββααβ topology). In both classes, four nucleotides are recognized in a cleft formed by the GXXG loop, the flanking helices, the β-strand that follows helix 2 (type I) or helix 3 (type II), and the variable loop between β2 and β3 (type I) or between α2 and β2 (type II) (Figure 3B). Quite unlike the RRM, this binding platform is free of aromatic amino acids; recognition is achieved instead by hydrogen bonding, electrostatic interactions and shape complementarity.
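The signature sequence quoted above can be written as a regular expression for scanning a protein sequence given in one-letter code. This is a minimal sketch, not a substitute for profile-based domain detection: it treats X as any residue, and both the function name and the example peptide are invented for illustration.

```python
import re

# (I/L/V)-I-G-X-X-G-X-X-(I/L/V), with X taken to mean any residue.
KH_SIGNATURE = re.compile(r"[ILV]IG[A-Z]{2}G[A-Z]{2}[ILV]")


def find_kh_signatures(protein_sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for each non-overlapping match."""
    sequence = protein_sequence.upper()
    return [(m.start(), m.group()) for m in KH_SIGNATURE.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical peptide, made up for this example (not a real protein).
    example = "MKTLIIGKGGSAIKELR"
    for start, text in find_kh_signatures(example):
        print(f"signature {text} at position {start}")
```

A match to this pattern only flags a candidate; whether the surrounding residues actually fold as a KH domain has to be established separately.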

dsRBD

The double-stranded RNA-binding domain (dsRBD) is another small αβ domain of 70–90 amino acids that is widely found in both bacteria and eukaryotes. However, it interacts with double-stranded (ds)RNA without making specific contacts with the nucleobases. The protein binds across two successive minor grooves and the intervening major groove on one face of the dsRNA helix (Figure 3C)46. Unlike the RRM or KH domains, the majority of intermolecular contacts are sequence independent and involve 2′-OH groups and the phosphate backbone46. The presence of multiple dsRBDs can impart specificity for certain structures because of their ability to recognize certain arrangements of RNA helices49, 51, 52. In addition, the specificity of at least some dsRBDs is mediated in part by an N-terminal helix that binds to irregular helical elements within A-form RNA such as stem-loops, base mismatches and bulges (Figure 3C)47–50.

Zinc Fingers

Zinc fingers are classical DNA-binding proteins that can also bind to RNA53, 54, as eloquently demonstrated by several recent structures55–57. They are typically classified based on the residues used to coordinate zinc (Cys2His2 (C2H2), CCCH or CCHC) and are generally present in multiple repeats within a protein. Thus, TFIIIA (where the motif was first identified) contains nine C2H2 zinc fingers: fingers 1–3, 5 and 7–9 interact with DNA, while fingers 4–6 interact with 5S RNA58, 59. C2H2 zinc fingers interact with DNA primarily by forming direct hydrogen bonds to Watson–Crick base pairs in the major groove, using residues within their recognition α-helix60, while TFIIIA binds RNA by making specific contacts to two RNA loops through the recognition helices of fingers 4 and 6. Thus, zinc fingers can use some of the same residues to recognize both nucleic acids, but the different DNA and RNA structures dictate a distinct structural arrangement of the zinc fingers on the nucleic acid template.

A second family of RNA-binding zinc fingers contains CCCH motifs61. Remarkably, in the structure of Tis11d bound to an AU-rich RNA element (ARE), sequence-specific RNA recognition occurs primarily through hydrogen bonding to the protein backbone (Figure 3D)57. Thus, the shape of the protein is the primary determinant of specificity by providing a rigid hydrogen-bonding template. This mode of recognition is reminiscent of a third type of zinc finger, with a CCHC zinc-binding motif, that is found in the nucleocapsid domain of the retroviral Gag proteins and in the HIV-1 nucleocapsid protein62, 63.

S1 domain

S1 domains were first identified in ribosomal protein S1 (hence the name), but have since been found in other RNA-binding proteins, including several exonucleases64. The domain is composed of approximately 70 amino acids arranged in a five-stranded antiparallel β-barrel capped by a short 3₁₀ helix65. The fold is similar to the oligonucleotide/oligosaccharide binding (OB) fold superfamily, which also contains the related RNA-binding Cold Shock Domain66. The S1 domain uses the common OB-fold binding surface to recognize nucleic acids through two β-strands surrounded by several loops67. Thus, RNA binding by the S1 domain is somewhat reminiscent of RNA recognition by the RRM, where a two-stranded β-sheet core contributes several conserved aromatic residues for stacking interactions with the nucleic acid bases, which are augmented by interactions provided by the surrounding loops and secondary structure elements65, 68.

PAZ and PIWI domains

RNA processing during RNAi and microRNA biogenesis generates species with unique structural and chemical features that must be recognized specifically but in a sequence-independent manner. These functional requirements are fulfilled by a specialized set of domains encountered in proteins involved in processing microRNA (miRNA) and small interfering RNA (siRNA) precursors.

The 110-amino-acid PAZ domain contains a β-barrel domain that resembles an OB or S1 fold juxtaposed to a small αβ domain that forms a clamp-like structure where RNA binds (Table 1)69–71. It selectively binds to the 2-nucleotide overhangs and probably serves as an anchor to position the miRNA for proper cleavage by Dicer72, 73. PAZ domains in Argonaute proteins facilitate cleavage of the target strand by the RISC complex responsible for degradation of the RNA targeted for silencing. The additional PIWI domain in Argonaute adopts instead an RNase H fold and anchors the unique 5′ end of the guide strand to position the target strand for degradation (Table 1)74–78.

Expanding conventional RNA-binding surfaces

The type of RNA that can be recognized by RNA-binding domains is increased not only by providing multiple domains within a protein (as discussed in the next section), but also by expanding a canonical RNA-binding surface through additional secondary structures or loops6, 50. In the reverse situation, a canonical recognition surface can be occluded by secondary structure elements, leading to the regulation of the RNA-binding activity. Thus, many proteins that are involved in spliceosome assembly have RNA-binding modules that differ from their canonical structure. For example, the SF1 protein that binds to the branch-point sequence has an additional QUA2 domain that defines an enlarged KH domain by making extensive hydrophobic interactions with the KH domain itself. By increasing the recognition surface, SF1 is able to bind to the seven single-stranded nucleotides that define the branch-point sequence42.

The structures of the first two quasi-RRMs (qRRMs) from heterogeneous nuclear ribonucleoprotein (hnRNP) F demonstrate instead how an RRM can use a different surface for RNA recognition when the β-sheet surface is occluded79. This member of the hnRNP family is involved in the recognition of G-rich sequences (G-tracts) that are often found at recognition elements responsible for 5′ splice site recognition80–82. In the structure of the hnRNP F protein bound to the G-tract in Bcl-x pre-mRNA, each domain resembles a canonical RRM despite the absence of the RNP1 and RNP2 motifs normally used to bind RNA. Furthermore, the β-sheet surface is occluded by the presence of a C-terminal α-helix packed against it. Thus, the first two qRRMs of hnRNP F recognize RNA through a novel surface composed of a small β-hairpin between α2 and β4 and the β1–α1 and β2–β3 loops79. Perhaps the requirement for binding through a different surface in this complex stems from the necessity to recognize G-quadruplex RNA while at the same time preventing nonspecific binding to single-stranded RNAs normally recognized by RRM proteins.

An additional α-helix C-terminal to the canonical domain is common in RRMs. The La protein C-terminal domain, Cleavage Stimulation Factor 64 (CstF-64) and U1A all have a helix at the C terminus of the domain (Figure 3A)12, 20, 83. Many other domains form such a helix when bound to RNA, for example Hrp1, HuD and Polyadenylate Binding Protein14, 16, 25. The C-terminal RRM of La does not interact with RNA at all and, in the U1A and CstF-64 structures, the helix moves away from the β-sheet to allow RNA recognition using the canonical site (Figure 3A), suggesting that these helices perform primarily a regulatory role.

Multiple domains specify RNA recognition

Tandem domains

Isolated RNA-binding domains generally have limited ability to interact with RNA in a sequence-specific manner because their recognition sequences are too short6. Thus, multiple domains (typically two) are tethered together on a single polypeptide to create a much larger binding interface that recognizes a longer sequence. Perhaps the most extreme example of this concept comes from the Pumilio (Puf) family of proteins. Each domain recognizes a single nucleotide on its own, but by combining multiple repeats, the protein can bind with high affinity and specificity to as many as eight nucleotides (Table 1, Figure 4A)84. In fact, the three amino acids that recognize a particular nucleotide provide a reasonably predictive recognition code that can be exploited to engineer proteins that recognize different RNA sequences from those specified by the wild-type proteins84,85.
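The idea of a per-repeat recognition code can be sketched as a lookup table that maps the base-contacting residues of each repeat to a preferred nucleotide. The table below is an assumption for illustration: it follows the two-residue code commonly quoted for Pumilio repeats (the two helix α2 side chains that contact the Watson-Crick edge), it is not taken from this review, and it ignores the stacking residue. The function name and the example protein are hypothetical.

```python
# Illustrative code table (an assumption, not taken from this review):
# pair of edge-contacting side chains in one repeat -> preferred base.
EDGE_RESIDUES_TO_BASE = {
    ("N", "Q"): "U",
    ("C", "Q"): "A",
    ("S", "Q"): "A",
    ("S", "E"): "G",
}


def predict_target(repeat_residue_pairs) -> str:
    """Predict the RNA bases read by a series of repeats.

    `repeat_residue_pairs` lists one residue pair per repeat, given in the
    order of the bases they contact (5' to 3' along the RNA). Pairs that
    are missing from the table are reported as '?'.
    """
    bases = []
    for pair in repeat_residue_pairs:
        bases.append(EDGE_RESIDUES_TO_BASE.get(tuple(pair), "?"))
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical four-repeat protein, invented for this example.
    designed_repeats = [("N", "Q"), ("S", "E"), ("N", "Q"), ("C", "Q")]
    print(predict_target(designed_repeats))
```

Re-engineering specificity then amounts, in this simplified picture, to swapping the residue pair of one repeat for a pair that prefers a different base; real designs also have to account for context effects that a lookup table cannot capture.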

Figure 4. RNA-binding modules function together or alone to recognize a specific RNA.

a) The structure of human Pumilio provides an example of how multiple repeats (eight in this case) that individually recognize a few nucleotides (one in this case) combine to specifically recognize a much longer RNA sequence. Repeats are alternately colored in magenta and blue; the RNA is colored similarly in all other structures with the backbone shown as an orange tube. b) In the structures of the two RNA recognition motifs (RRMs) from Hrp116 and c) the two K-homology (KH) domains of NusA37, a short linker (in gray) allows the two domains to position themselves with respect to one another upon binding RNA; for Hrp1, the first RRM is in yellow and the second in red; for NusA, the first KH domain is cyan and the second is purple. d) Flexibility within the linker between two double-stranded RNA-binding domains (dsRBDs) allows recognition of separated binding sites. The two dsRBDs of ADAR2 are connected by a flexible linker (dashed gray line) that may allow the protein to interact with a variety of targets of different structure49. e) The RNA-recognition motifs RRM3 (yellow) and RRM4 (red) of PTB form interdomain interactions involving the face of the protein opposite to the β-sheet involved in RNA recognition. This interaction positions the two domains in such a way that interacting RNA sequences are looped away from each other, as indicated by the orange dotted line connecting the two RNAs28. f) The structure of the TFIIIA-RNA complex illustrates how zinc finger ZF5 (blue) functions as a spacer that properly positions zinc fingers 4 (teal) and 6 (tan) for recognition of loops E and A, respectively, within 5S rRNA55.

Inter-domain arrangement

Multiple domains associate with each other in a variety of ways to generate extended RNA recognition interfaces. The recent structure of Hrp1 (Figure 4B) exemplifies the structural principles involved in RNA recognition by two RRMs in tandem. In the free protein, both domains function as independent, rigid structures separated by a short flexible linker. Upon binding, both protein and RNA undergo significant changes in structure, with the linker forming a short helix and several inter-domain contacts creating a compact surface for recognition of adjacent stretches in the RNA16 (Figure 4B). The same is observed in Sxl, PABP, nucleolin and HuD proteins13–15, 25.

In contrast, when the zinc finger protein Tis11d binds to AU-rich RNA, there are few inter-domain interactions. However, a pre-organized linker between the two zinc fingers orients the two domains for recognition of an eight-nucleotide RNA by the protein main chain with little side chain involvement57 (Figure 3D). In a third example, in the structure of NusA bound to RNA, the two KH domains make extensive inter-domain contacts with each other, burying 1270 Å² of surface area86. This association of the KH domains creates an extended RNA-binding surface that allows the two domains to recognize an 11-nucleotide RNA37 (Figure 4C). Thus, each of the KH domains of NusA specifically recognizes four nucleotides, as is canonical for KH domains; their separation by a three-nucleotide linker that also makes interactions with the protein generates the complete recognition sequence37. This binding interface is further extended by an S1 domain N-terminal to the first KH domain that makes extensive inter-domain contacts and, in doing so, may provide an additional surface for RNA recognition.

The zinc-finger domains of TFIIIA provide another example of how linkers between RNA-recognition domains play a crucial role in substrate recognition. Quite remarkably, the linker in this case is a zinc-finger module! In the TFIIIA-5S RNA complex, fingers 4 and 6 interact extensively with the RNA, while finger 5 acts as a spacer that makes sequence-independent contacts involving the side chains of its α-helix and the RNA backbone. Effectively, it serves as a bridge between loops E and A within 5S RNA, which are directly recognized by fingers 4 and 6, respectively56 (Figure 4F).

While the previous examples illustrate the importance of an ordered linker, the presence of a long flexible linker can be favored (Figure 2A) because it allows RNA-binding proteins to recognize sites that have a variable number of nucleotides between them, that are quite separated from each other on the same RNA or on different RNA molecules altogether. In these cases, ordering of the linker upon binding RNA is not likely to occur. A good illustration of this situation is provided by the two dsRBDs of the RNA-editing enzyme ADAR2, where the two domains do not interact and are separated by a flexible linker in the free or bound protein49 (Figure 4D). Since ADAR2 is required to edit multiple RNAs, interdomain flexibility allows each dsRBD to bind to its preferred site within RNAs of varying length and structure.

Yet another example of the potential advantages of connecting domains with flexible linkers can be found in complexes where conformational flexibility is required for function. In the FBP–FUSE complex, a 30-residue linker separates the KH3 and KH4 domains of FBP, so that they can move independently of each other even when the protein is bound to DNA39. This property is likely to be functionally important because FBP binds to and modulates the helicase activity of the general transcription factor TFIIH. Since this protein might function as a torque-generating machine, it is important for FBP to bind to the dynamic TFIIH molecule while maintaining its interaction with DNA.

This theme is observed even in proteins containing RRM domains, a departure from the common and canonical arrangement described above for Hrp1 and other proteins13–16, 25. The structure of polypyrimidine tract binding (PTB) protein shows that RRMs 3 and 4 are connected by a long linker and interact with each other in a way that forces their respective RNA-binding surfaces to face in opposite directions28. This orientation is essentially the opposite of what is observed in many di-domain proteins, yet may be functionally critical in splicing regulation by causing the exon or branch-point sequence to loop out, preventing binding of spliceosomal components and repressing splicing (Figure 4E).

The linker length is important

The considerations of the previous paragraph indicate that one of the major determinants for the affinity and specificity of RNA-binding proteins containing multiple domains resides within the amino acids linking the domains. The length and rigidity of the linker can have dramatic effects on RNA affinity87 and may influence whether a protein binds a single RNA or multiple RNAs (Figure 2A, right). Using the assumption that the free energies of binding of the individual domains are additive, we would expect the affinity of a protein with multiple RNA-binding domains to be the product of the affinities of the individual domains. However, because the linker remains flexible in hnRNP A1, the affinity of the two-domain protein is 1000-fold lower than the product of the affinities of the individual domains88. When the first RNA-binding domain is bound, the second RNA-binding domain sweeps a volume determined by the length of the linker. Within this sphere, the effective concentration of the second domain is different from that in free solution, leading to altered affinity. A simple model was developed to calculate how the length of the linker affects affinity; using this model, long linkers (more than 50–60 residues) are predicted to have a negligible impact on affinity, because the two domains act independently of each other. As the linker gets shorter, the affinity for RNA increases between 10- and 1000-fold, when compared to the affinity of individual RRMs added together87.
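The reasoning in this paragraph can be made concrete with a back-of-the-envelope calculation. The sketch below is not the published model of ref. 87. It assumes, for illustration only, that the tethered second domain is spread uniformly over a sphere whose radius is the fully extended linker length (3.5 Å per residue is an assumed value), and that the dissociation constant of the tandem protein is approximately Kd1 × Kd2 / C_eff. All function names and the example dissociation constants are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23        # molecules per mole
ANGSTROM3_PER_LITRE = 1.0e27    # 1 L = (1e9 Angstrom) ** 3


def effective_concentration_molar(linker_residues: int,
                                  angstrom_per_residue: float = 3.5) -> float:
    """Effective molar concentration of a tethered second domain.

    Assumption: one molecule spread uniformly over a sphere whose radius
    equals the fully extended linker length.
    """
    if linker_residues < 1:
        raise ValueError("linker must contain at least one residue")
    radius = linker_residues * angstrom_per_residue           # Angstrom
    volume_litres = (4.0 / 3.0) * math.pi * radius ** 3 / ANGSTROM3_PER_LITRE
    return 1.0 / (AVOGADRO * volume_litres)


def tandem_kd_molar(kd1: float, kd2: float, c_eff: float) -> float:
    """Approximate Kd of two tethered domains: Kd1 * Kd2 / C_eff."""
    return kd1 * kd2 / c_eff


if __name__ == "__main__":
    kd_single = 1.0e-6  # illustrative 1 micromolar per domain, not from the source
    for n_residues in (10, 20, 60):
        c_eff = effective_concentration_molar(n_residues)
        kd = tandem_kd_molar(kd_single, kd_single, c_eff)
        print(f"{n_residues:>2} residues: C_eff = {c_eff:.2e} M, Kd = {kd:.2e} M")
```

In this simplified picture the effective concentration falls with the cube of the linker length, so the gain from tethering shrinks rapidly as the linker grows, which is the qualitative trend described in the text.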

This simple model assumes that the linker does not contact the RNA, but in many cases the linker becomes ordered upon binding RNA. In the example of nucleolin, the model would have predicted a 100-fold increase in affinity compared to that of the two individual nucleolin RRMs, but an increase of between 1000- and 100,000-fold was observed depending on the RNA sequence tested89. Part of the increase in affinity was attributable to the ordering of the linker into an α-helix, which effectively shortens its length by half. When the prediction was repeated with this correction, predicted and measured affinities agreed to within 10-fold for some RNAs. However, because of direct interactions between the linker and target RNAs, even this calculation could not account for the 1000-fold difference between predicted and observed affinities for other RNAs89.
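As a rough illustration only, suppose the effective concentration of the tethered domain scales inversely with the volume of a sphere whose radius is the linker length $L$. This scaling is an assumption of the sketch and not necessarily the exact form of the published model. Halving the linker length then gives:

```latex
\frac{C_{\mathrm{eff}}(L/2)}{C_{\mathrm{eff}}(L)}
  = \frac{L^{3}}{(L/2)^{3}}
  = 2^{3}
  = 8
```

Under that assumption, helix formation alone would raise the predicted affinity by close to an order of magnitude, which is of the size needed to move a predicted 100-fold increase towards the lower end of the observed range. Direct contacts between the linker and the RNA are not captured by this scaling.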

Protein-protein interactions and RNA recognition

Homo- and hetero-dimerization of RNA-binding proteins

In addition to expanding the ways in which RNA can be recognized, multiple modules also allow RNA-binding proteins to interact simultaneously with other proteins and with RNA. The simplest example of this is dimerization. Two proteins involved in the viral response to RNA silencing provide exquisite examples of how dimerization allows specific interactions to be established that would not be possible in the isolated proteins.

The p19 protein is required for tombusvirus virulence in plants, and can also suppress RNA silencing when expressed in both Drosophila and human cells90, 91. It functions by specifically binding to siRNAs and preventing their loading into the RISC complex92. Two structures of p19 proteins bound to 21-nucleotide siRNAs demonstrate that the protein adopts an αβ topology and binds RNA as a homodimer. The RNA-binding surface is a continuous 8-stranded β-sheet contributed by the two monomers. Each monomer measures the length of the siRNA by providing a Trp that forms stacking interactions with the bases at the 5′ and 3′ ends of the siRNA; the position of the Trp is defined by the structure of the homodimer. Thus, dimerization of p19 allows this protein to measure the length of the siRNA with great precision by positioning the two critical Trp side chains92, 93.

Another potent viral suppressor of RNAi is the Flock House Virus B2 protein. Its structure is composed of three α-helices that dimerize to create a four-helix bundle that recognizes RNA along one face of an A-form helix94, 95. Structural and biochemical evidence demonstrated that this protein suppresses silencing in two ways: by binding to siRNAs and preventing loading into RISC, and by coating longer dsRNA precursors and protecting them from cleavage by Dicer. For both p19 and Flock House Virus B2, the conserved features of the siRNAs (their size and double helical character)92–95 are recognized because dimerization generates extended binding sites out of small protein domains and because it establishes the relative position of amino acids involved in RNA recognition.

These two examples illustrate the role of dimerization in RNA recognition, but there are other examples of RNA-binding domains that function by dimerization or by forming protein-protein interactions. In the structure of the N-terminal RRM of U1A bound to an RNA regulatory element within its own 3′-untranslated region (UTR), two separate RRMs interact through their C-terminal helices to form a homodimer after binding to the RNA. This cooperative binding event can only occur in the presence of RNA because the C-terminal helix is associated with the β-sheet surface of the RRM in the free protein. Interestingly, this dimerization also creates an interface that inhibits polyadenylation by direct interaction with poly(A) polymerase24. In the Nova-1 KH3 domain, changes in the rigidity of the protein are observed upon dimerization, and this stiffening of the entire protein may aid in nucleic acid recognition by reducing the entropic cost of binding to RNA. Furthermore, dimerization presents two recognition sites for RNA binding and thus can provide a cooperative interaction that strengthens the affinity of the protein for the RNA96.

The formation of heterodimers through interactions between an RNA-binding domain and another protein can increase the specificity of RNA interaction. For example, the binding of the spliceosomal U2B″ RRM to a stem-loop within U2 snRNA requires an interaction with the U2A′ protein23. In a different example, the CBP80 subunit of the cap-binding complex must interact with the RRM of CBP20 if this RRM is to bind with high affinity to the 7-methylguanosine cap of mRNA22, 97. The recent structures of the archaeal and eukaryotic exosomes have revealed extensive protein-protein interactions between proteins containing both KH and S1 domains and the core of the protein complex98, 99. These interactions may position the S1 domains of specific exosome subunits to recognize the RNAs targeted for exosomal degradation.

Protein-protein interactions define RNA specificity

RNA-binding domains from different proteins can cooperate to recognize an RNA through a combination of weak protein-RNA and protein-protein interactions. The recent dissection of a complex derived from the spliceosome demonstrates this principle and illustrates how even relatively small sequence and structural alterations in RNA-binding domains can modulate their RNA recognition properties indirectly by altering protein-protein interactions (Figure 5).

Figure 5. Protein-protein interactions and protein-RNA interactions define the site of spliceosomal assembly.

Proper definition of the splicing site requires a number of cooperative binding events that are mediated by both protein-protein and protein-RNA contacts between various RNA-binding modules. a) Schematic of the interactions between various proteins and RNA at the splicing site. Some of the key domains involved in these interactions, whose structures are shown below, are labeled. Within the RNA, the branch-point sequence (BPS), pyrimidine tract (Py-tract), and the 3′ splice site (3′ ss) are labeled, with the intron colored gray and the exon colored dark blue. b) SF1 recognizes the BPS through its KH-QUA2 domains. The additional QUA2 domain creates an extended KH domain that can recognize the full BPS RNA42. c) This interaction is strengthened by protein-protein interactions between its N-terminus and the non-canonical RRM3 of U2AF6535, d) which is bound to the pyrimidine tract through its first two canonical RRMs10. e) Finally, the U2AF65 interaction is also aided by protein-protein interactions between its N-terminus and the non-canonical RRM of U2AF35 bound at the 3′ splice site33. All of the protein and peptide structures are colored as shown in the schematic (a).

During initial steps in spliceosome assembly, the splicing factor 1 (SF1) and U2 auxiliary factor (U2AF) proteins cooperatively bind to sequences at the 3′ splice site and upstream of it (Figure 5A). Recognition of RNA _cis_-acting elements by the two U2AF subunits, U2AF65 and U2AF35, commits the pre-mRNA to the splicing reaction. Specifically, U2AF65 recognizes the polypyrimidine tract within the pre-mRNA primarily through its two central canonical RRMs (Figure 5A, D); this interaction is strengthened by the interaction between a third non-canonical RRM in this protein and the SF1 protein (Figure 5A, C), which is itself bound at the branch-point sequence through a KH domain (Figure 5A, B). Additional cooperativity in the assembly of this complex is provided by protein–protein interactions between a non-canonical RRM in U2AF35 (Figure 5A, E), bound at the 3′ splice site, and the N terminus of U2AF65.

Protein-protein interaction surfaces

As described in the previous paragraph, RRM domains can form protein-protein as well as protein-RNA interactions. The protein-protein interactions occur via non-canonical RRM domains within both U2AF65 and U2AF35 that have a much longer α1 helix than other RRMs; this helix is the primary mediator of the protein–protein interactions observed in this complex33, 35 (Figure 5C, E). Closer inspection of these U2AF structures reveals a few common themes that may indicate whether an RRM binds to protein or to RNA: poor conservation of the RNP motifs, an Arg-X-Phe motif in the last loop of the RRM, and conserved acidic residues in the α1 helix100. These features define a novel functional class, the U2AF-homology motifs (UHMs), that are capable of forming protein–protein interactions.

The UHM class does not exhaust all possible ways in which two RRMs can interact. The interactions of other RRMs with proteins (for example the Y14-Magoh structure from the exon-junction complex and the Upf2–Upf3 RNA surveillance complexes29, 31, 32, 34, 101, 102) occur on the surface of the β-sheet through residues that are involved in RNA binding in other RRMs. Until more structures of such protein-protein complexes become available, the sequence and structural features in such RRMs that allow them to bind to other proteins rather than RNA will remain unclear.

RNA-binding domains other than the RRM also have the ability to participate in protein-protein interactions. As previously described, a number of the KH domains can dimerize, and dsRBD domains form protein-protein interactions that regulate the assembly of complexes involved in RNA localization and the catalytic activity of enzymes acting upon double-stranded RNA. One dsRBD example is illustrated by Staufen, a protein involved in RNA localization in early development and in neurons. Staufen proteins contain up to five dsRBDs; some domains are capable of binding dsRNA48, while other domains bind other proteins during embryogenesis103. Remarkably, surface-exposed amino acids involved in RNA recognition are conserved among Staufen dsRBDs that bind to dsRNA, but not in protein-binding dsRBDs. For these domains it is the surface opposite to dsRNA in the canonical dsRBD-dsRNA structure that is conserved instead48. Thus, the ability of these proteins to bind to other proteins can be as important functionally as their RNA-binding activity.

Catalytic domains acting upon RNA

Positioning catalytic domains onto their substrate

Modularity allows RNA-binding domains to target a substrate, and to promote or repress the enzymatic activity of catalytic domains within the same polypeptide (Figure 2D). The way in which RNA-binding and enzymatic modules are positioned within a protein can define how a particular protein recognizes RNA. However, the enzymatic activity can also be enhanced or repressed through mutually exclusive or cooperative interactions between RNA-binding domains, catalytic domains and RNA.

An elegant example of how domain positioning facilitates enzymatic function comes from the RNAi pathway. In the first step of the cascade leading to gene silencing, Drosha and Pasha process primary miRNAs to stem-loops of ~70 nucleotides; Dicer subsequently binds to these miRNA precursors by recognizing the two-nucleotide 3′ overhangs generated by Drosha104. A minimal Dicer structure from Giardia (lacking the N-terminal helicase and the C-terminal dsRBD, Figure 1) demonstrates that Dicer likely functions as a molecular ruler that positions the catalytic RNase III domains ~25 nucleotides from where the 3′ overhanging nucleotides are recognized by its PAZ domain72, the approximate length of siRNAs.
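The ruler idea amounts to a simple conversion between a fixed distance on the protein and a number of base pairs along the helix. The sketch below assumes an A-form helical rise of about 2.6 Å per base pair, a textbook approximation rather than a value taken from the Dicer structure, and uses hypothetical function names.

```python
A_FORM_RISE_ANGSTROM = 2.6  # assumed rise per base pair of A-form dsRNA


def ruler_length_angstrom(n_base_pairs, rise=A_FORM_RISE_ANGSTROM):
    """Distance spanned along the helix axis by n_base_pairs of dsRNA."""
    return n_base_pairs * rise


def product_length_bp(distance_angstrom, rise=A_FORM_RISE_ANGSTROM):
    """Number of base pairs measured out by a fixed protein distance."""
    return round(distance_angstrom / rise)


if __name__ == "__main__":
    # ~25 base pairs between the PAZ domain and the RNase III active sites
    print(ruler_length_angstrom(25))
```

With the assumed rise, a separation of about 25 base pairs corresponds to roughly 65 Å between the 3′-end binding pocket and the catalytic center; a different rise value would shift this estimate proportionally.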

Another particularly beautiful example of this principle is found in the recent structure of a complete archaeal Box H/ACA small nucleolar RNP (snoRNP)105. These particles are responsible for the catalytic conversion of uridine to pseudouridine in ribosomal and other RNAs106. In this structure, the site of pseudouridylation is juxtaposed to the catalytic center of the protein enzyme Cbf5/dyskerin by two protein clamps at either end of the RNA. The 3′-terminal ‘clamp’ (the ACA sequence motif that defines this class of non-coding RNAs) is recognized by the PUA domain of Cbf5, while the second clamp (the apical loop of the non-coding RNA) is recognized by a complex of Cbf5 with two other protein components of the particle.

Activating and repressing enzymes acting on dsRNA

The dsRNA-dependent protein kinase PKR (Figure 6A) and the RNA-editing enzyme ADAR2 (Figure 6B) provide examples of how RNA-binding domains can modulate enzymatic activity by interacting with both the substrate RNA and with the catalytic domain (Figure 2D). PKR is an interferon-induced kinase that plays a key role in controlling viral infections and maintaining cellular homeostasis by becoming activated in response to double-stranded viral RNAs. In the active form, it phosphorylates the α subunit of eukaryotic initiation factor 2 (eIF-2), thereby inhibiting translation and suppressing viral spread107. ADARs act on dsRNA to catalyze the conversion of adenosine to inosine, which is then recognized as guanosine, affecting both the primary sequence and the structure of the edited RNA108.

Figure 6. Modular architecture allows for regulation of the catalytic activity of enzymatic domains.

In both PKR and ADAR proteins, inter-domain interactions between the RNA-binding module and the catalytic domain maintain the proteins in an inactive state. a) The kinase domain of PKR is inhibited by an interaction with the double-stranded RNA-binding domain (dsRBD2). Binding to dsRNA releases the kinase from its inactive state allowing it to inhibit translation by phosphorylating eIF2α. b) The activity of ADAR2 is controlled by a mechanism similar to PKR, but in this case dsRBD1 is involved in the inactivation of the catalytic domain. When double-stranded RNA (dsRNA) binds to both dsRBDs, the protein dimerizes and the catalytic domain becomes exposed to convert adenosine to inosine.

Both proteins have two N-terminal dsRBDs that bind to dsRNA; in each case, the dsRBDs function both as an RNA-recognition unit and as an auto-inhibitor of the catalytic domain109, 110. In PKR, the second dsRBD masks the kinase domain by binding to it directly, thereby maintaining its inactive state (Figure 6A)109, 111, 112. In ADAR2, the proposed inhibitory element is the first dsRBD110 (Figure 6B). In both proteins, however, RNA binding causes enzyme activation by relieving the auto-inhibition caused by the interactions between the RNA-binding and catalytic domains (Figure 6A, B). Since both ADAR and PKR require RNA of sufficient length for activation, the two dsRBDs may be necessary for fully de-repressing the catalytic activity110. In PKR, the presence of a sufficiently long dsRNA (for example, viral RNAs such as HIV TAR) allows both dsRBDs to cooperatively bind to RNA113, 114, relieving the structural block and allowing the kinase domain to be activated through autophosphorylation and dimerization115–117. The initial event in this cascade is likely to be binding of the first dsRBD to dsRNA, because this domain has much higher affinity for RNA than the second domain114. Only in the presence of a sufficiently long dsRNA can the second dsRBD bind as well, thereby releasing the kinase from its inactive state.

Conclusions

Many RNA-binding proteins are composed of relatively few modules of conserved structure but often limited sequence specificity. By combining these motifs in a variety of structural arrangements, evolution has generated proteins that are capable of recognizing RNA with the affinity and selectivity required to find cognate RNAs in the cellular medium, while at the same time retaining the versatility required to regulate, assemble and disassemble RNA-processing complexes. Structural biology has provided the molecular details concerning how individual domains recognize RNA, but many of these proteins require multiple copies of one of several common domains to function (Figure 1). It is therefore important to understand how multiple modules bind RNA, and how the modular nature of these proteins specifies their biological function. We have described here some of the structural principles of how multiple domains recognize an RNA (Figure 2), but there are still relatively few structures of proteins containing multiple RNA-binding domains. Recent studies have also led to the observation that RNA-binding modules can regulate the biological activity of enzymes acting upon RNAs in ways that go beyond the identification of the target RNA, but full understanding of these regulatory mechanisms will require detailed structural characterization that is not yet available. We expect that future structural analysis will expand upon the diverse ways in which combinations of RNA-binding domains augment protein function.

Acknowledgments

Work in our laboratories is supported by grants from NIH-NIGMS (GV and CM). We apologize to the many colleagues whose work could not be properly referenced due to lack of space.

Glossary Terms

Ribonucleoprotein (RNP)

Complexes that contain both proteins and RNA. The ribonucleoprotein motif refers to the two conserved sequence elements found within the RNA Recognition Motif (within its two central β-strands) that participate in RNA recognition and identify the RRM domain at the sequence level.

Zinc finger

A class of DNA- and RNA-binding proteins characterized by a Cys- and His-rich domain that chelates a zinc ion. Different classes of zinc-finger proteins contain different combinations of metal-binding amino acids; thus, C2H2 zinc fingers contain two Cys and two His residues, while CCCH and CCHC zinc-binding motifs contain three Cys and a single His in a different topological arrangement.

AU-rich element (ARE)

Sequences rich in A and U nucleotides found in the 3′-untranslated regions of mRNAs that promote stability or degradation of their associated RNAs, thus providing a mechanism for the control of gene expression.

RISC complex

A protein complex responsible for degradation of RNA species targeted by small interfering RNAs. Argonaute protein is the catalytic component of RISC.

Exon junction complex

This is a multi-subunit protein complex that is deposited on the mRNA during the splicing reaction near the splice site. It remains bound to the RNA during subsequent gene expression events, and serves as a platform to recruit nuclear and cytoplasmic factors that influence mRNA localization, transport, stability and translation.

Orthology

Orthologous proteins are direct evolutionary counterparts that retain the same function in different organisms and that have arisen due to speciation events but not through the process of gene duplication (paralogy).

References