Finding specific RNA motifs: function in a zeptomole world? - PubMed

Finding specific RNA motifs: function in a zeptomole world?

Rob Knight et al. RNA. 2003 Feb.

Abstract

We have developed a new method for estimating the abundance of any modular (piecewise) RNA motif within a longer random region. We have used this method to estimate the size of the active motifs available to modern SELEX experiments (picomoles of unique sequences) and to a plausible RNA World (zeptomoles of unique sequences: 1 zmole = 602 sequences). Unexpectedly, activities such as specific isoleucine binding are almost certainly present in zeptomoles of molecules, and even ribozymes such as self-cleavage motifs may appear (depending on assumptions about the minimal structures). The number of specified nucleotides is not the only important determinant of a motif's rarity: The number of modules into which it is divided, and the details of this division, are also crucial. We propose three maxims for easily isolated motifs: the Maxim of Minimization, the Maxim of Multiplicity, and the Maxim of the Median. These maxims together state that selected motifs should be small and composed of as many separate, equally sized modules as possible. For evenly divided motifs with four modules, the largest accessible activity in picomole scale (1-1000 pmole) pools of length 100 is about 34 nucleotides; while for zeptomole scale (1-1000 zmole) pools it is about 20 specific nucleotides (50% probability of occurrence). This latter figure includes some ribozymes and aptamers. Consequently, an RNA metabolism apparently could have begun with only zeptomoles of RNA molecules.
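The pool sizes quoted in the abstract follow from Avogadro's number. The sketch below is only an illustration of that arithmetic; the helper name `sequences_in` and the unit table are hypothetical and do not come from the paper.

```python
# Convert an amount of RNA into a count of molecules.
# AVOGADRO is the exact SI value; the function and unit names are
# illustrative assumptions, not part of the paper's method.
AVOGADRO = 6.02214076e23  # molecules per mole

UNIT_EXPONENTS = {"pmol": -12, "zmol": -21}  # picomole, zeptomole


def sequences_in(amount, unit):
    """Return the number of molecules in `amount` of the given unit."""
    return amount * AVOGADRO * 10.0 ** UNIT_EXPONENTS[unit]


print(round(sequences_in(1, "zmol")))       # 1 zmole, about 602 sequences
print(round(sequences_in(1000, "zmol")))    # upper end of the zeptomole scale
print(f"{sequences_in(1000, 'pmol'):.2e}")  # upper end of the picomole scale
```

If every molecule in the pool is a unique sequence, as the abstract assumes, these counts are also the number of distinct sequences searched.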

Figures

FIGURE 1.

Examples of modular RNA sequences isolated from SELEX. Capital letters indicate initially random regions, while lower-case letters indicate fixed regions such as primer complements. Critical motifs are highlighted in bold. Diagrams at right show the motifs in their required structural context for the active site. Numbers in brackets indicate the position where the displayed sequence begins; ellipsis indicates that the sequence continues 3′ from the displayed region. (A) The minimal isoleucine motif (Majerfeld and Yarus 1998); although the conserved CUAC was originally found in the 5′ primer, subsequent reselection recovered this module from random sequence (S. Changayil and M. Yarus, pers. comm.), giving a total of 12 bases in two modules. If paired but otherwise unspecified regions are included, the site increases to 18 bases in two modules. (B) The hammerhead ribozyme, reselected from random sequence (Salehi-Ashtiani and Szostak 2001). The single U that comprises module 1 was supplied in a constant region; x marks the cleavage site. Thus, the hammerhead consists of 11 fixed nucleotides in three modules (the GC pair shown in plain text is conserved in natural hammerhead sequences, but was not recovered in SELEX). If paired but otherwise unspecified regions are included, this increases to 37 fixed nucleotides in three modules. Note that this upper limit is unrealistically large, because there are many possible sequences compatible with the required paired regions.

FIGURE 2.

Number of spacer divisions D as a function of spacer length s and modularity m. Modularity ranges from 1 to 6; each label refers to the line above it. Note logarithmic scale on the y-axis. The number of divisions increases very rapidly at high modularity: for example, there are more than 100,000 ways to divide 40 bases of spacer among four modules.

FIGURE 3.

Agreement between calculations and simulations. The number of unique motifs (i.e., sequences that differ in at least one module, y-axis) grows dramatically as the sequence length, x-axis, and modularity grow, although not as fast as does the number of trials (cf. Fig. 2). Lines are the results of the calculations as derived in Materials and Methods; dots are 25 runs of simulations in which randomly generated sequences were divided into modules in every possible way for a given length and configuration. Dark lines denote evenly divided motifs (e.g., [5,5,5,5] represents a motif of 20 divided into four equal modules); light lines denote unevenly divided motifs (e.g., [17,1,1,1] represents a motif of 20 divided into four modules in which the difference between the largest and smallest modules is as great as possible). Note the dramatic effect (orders of magnitude) of unequal division of the motif. The spread of the dots (each from an individual random sequence) gives an idea of the sampling error: large for short sequences and high modularity; very low once the sequence reaches 60 nucleotides. The model gives excellent agreement with the simulations over a wide range of modularity, sequence length, and size of individual modules. It is impractical to collect simulation data for longer sequences due to the running time (approx. 6 h and 500 MB RAM for modularity 4 and sequence length 100 on a 1.8 GHz Pentium 4; more than 4 d and 4 GB RAM/swap space for sequence length 200).

FIGURE 4.

Importance of evenly divided modules. Individual lines show different divisions of a 15-nucleotide motif into three modules. For a random region of 100 nucleotides, only 51,000 molecules would need to be searched to have a 99% chance of finding a motif divided into three 5-mers (heavy line at bottom of graph), but nearly four million molecules (a factor of almost 100) would need to be searched to have the same chance of finding a motif divided into a 13-mer and two monomers (heavy line at top of graph). The other lines show the other 87 ways of dividing the motif. Out of 91 total ways, there is one way to divide it into three 5-mers, and there are three ways to divide it into a 13-mer and two monomers: [13, 1, 1], [1, 13, 1], and [1, 1, 13]. Similarly, there are three ways to divide it into any other configuration in which two of the pieces are equal. There are six ways to divide it into any particular configuration where all three pieces are unequal: for example, [9, 4, 2], [9, 2, 4], [4, 9, 2], [4, 2, 9], [2, 9, 4], and [2, 4, 9]. Only the size of the pieces, not their order, affects the probability, so only 19 distinct lines are visible on the graph (some are very close together). The top two lines on the graph (the [13, 1, 1] family and the [12, 2, 1] family) are clear outliers; most divisions are closer to the best case of [5, 5, 5] than to the worst case. This effect becomes more extreme at higher modularity. The horizontal gray bar shows 1000 zeptomoles (602,000 sequences).
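The counts in this caption (91 ordered divisions, 19 distinct families, and 1, 3, or 6 orderings per family) can be checked by brute-force enumeration. The following is a minimal sketch; the function name `compositions` is an illustrative choice, not something taken from the paper.

```python
from collections import Counter


def compositions(total, parts):
    """All ordered ways to split `total` into `parts` positive integers."""
    if parts == 1:
        return [(total,)]
    result = []
    # The first piece must leave at least 1 nucleotide for each later piece.
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            result.append((first,) + rest)
    return result


ordered = compositions(15, 3)
# Group divisions that differ only in the order of their pieces.
families = Counter(tuple(sorted(c, reverse=True)) for c in ordered)

print(len(ordered))            # ordered divisions of a 15-mer into 3 modules
print(len(families))           # distinct families (lines on the graph)
print(families[(5, 5, 5)], families[(13, 1, 1)], families[(9, 4, 2)])
```

Grouping by sorted piece sizes mirrors the caption's point that only the sizes of the pieces, not their order, affect the probability.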

FIGURE 5.

Pool sizes required to find the isoleucine aptamer (black) and the hammerhead ribozyme (gray) (Fig. 1), making different assumptions about sequence requirements. The horizontal gray line represents 1000 zeptomoles, the limit of the Zeptomole World. The isoleucine aptamer is almost certainly a Zeptomole World molecule; the hammerhead may or may not be, depending on how much helix is added to its required sequences. However, its essential sequence components should certainly appear in zeptomole-scale pools. Thin lines show the minimal sites (fixed sequence only); dark lines show the maximal sites (counting paired bases as fixed in one state); and medium lines show the average (counting half the paired bases as fixed; details in how the paired bases are assigned are not visible on this scale). The graph shows pools required for 50% probability of occurrence; for 99% occurrence, multiply all pool sizes by a factor of 6. The maximal sequence for the hammerhead is clearly not a realistic case, or it would not be possible to reproducibly recover this motif from SELEX.

FIGURE 6.

Number of sequences required to find evenly divided motifs in random pools of (A) length 40, (B) length 80, and (C) length 120, for modularities of 1 to 4. Results shown for 50% probability of occurrence; for 99%, multiply all pool sizes by a factor of 6. Horizontal dark line indicates 1000 zeptomoles. x-Axis shows total length of motif; y-axis shows number of molecules required.

FIGURE 7.

Largest accessible evenly divided motifs in (A) zeptomole-scale pools, and (B) SELEX-scale pools (602,000 and 10^15 sequences, respectively). x-Axis shows length of the sequence; y-axis shows largest motif accessible for each modularity at probability 0.5 (increasing the probability to 0.99 decreases the length of the accessible motif by at most two nucleotides in this range).

FIGURE 8.

Calculations for D, the number of ways of dividing a sequence to look for a set of shorter modules. First, partition the number of bases in the sequence n into the number of bases in modules, l, and the number of bases in spacer, s = (n − l). Each possible position of the modules within the longer sequence can be thought of as a particular way of choosing m places to cut the spacer, with the provisions that two cuts cannot occur in the same place and that one cut can occur after the last base of the spacer (i.e., the last module can be at the 3′ end of the sequence). However, the order in which the m cuts are chosen does not matter (even if the cut for the last module was made first, the modules will still be looked for in order). Thus, D is equal to the number of ways of choosing m items from s + 1.
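The counting rule in this caption, D equal to the number of ways of choosing m items from s + 1, can be written directly with a binomial coefficient. The sketch below is illustrative only; the function name `spacer_divisions` is an assumption. It also checks the Figure 2 statement that 40 bases of spacer can be divided among four modules in more than 100,000 ways.

```python
from math import comb


def spacer_divisions(n, module_lengths):
    """D for a sequence of length n and modules of the given lengths.

    Follows the caption: l bases sit in modules, s = n - l bases sit in
    spacer, and D is the number of ways of choosing m items from s + 1.
    """
    l = sum(module_lengths)   # bases in modules
    m = len(module_lengths)   # number of modules
    s = n - l                 # bases in spacer
    return comb(s + 1, m)


# Four modules with 40 bases of spacer (here a 60-mer holding a 20-base motif).
print(spacer_divisions(60, [5, 5, 5, 5]))
```

Note that D depends only on s and m, not on how the motif is split among the modules.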

FIGURE 9.

Dependence of the number of positions each module can take on the number of modules m and the size of the problem z. The top line in each block shows all the positions that each module can occupy. Each subsequent line in the block shows a single valid position for each module (dark dashes, numbered according to module), along with the possible alternative positions for the last module (light dashes). This highlights the fact that the successive left-most positions of the last module correspond to successive sizes of the one-dimensional problem. Note that to keep the size constant it is necessary to add another spacer position for each additional module. The Current column shows the number of positions contributed by the current size of the (m − 1)-dimensional case, while the Sum column shows the total number of positions for the current size of the m-dimensional case. Horizontal arrows show the contribution of each new term (larger size) to the sum: adding a base of spacer is the same as adding the case with the new number of bases of spacer in one lower dimension. Horizontal and vertical arrows show that each successively larger term in a given dimension is the sum of the previous term in that dimension and the larger term in one fewer dimension. Oblique arrows show the relationships between terms in successive dimensions.

FIGURE 10.

Effect of increasing the size or dimension of the problem. Starting with the case where z = 3 and m = 2, incrementing z to get D(4,2) is equivalent to adding D(4,1); in other words, the next larger case in one less dimension (top). Conversely, incrementing m to get D(3,3) is equivalent to adding D(2,2) and D(1,2) (side: left of arrow). This is equivalent to adding D(2,3) (because a given term in d dimensions is the same as the sum of all terms up to that size in d − 1 dimensions as shown in Fig. 9); in other words, this is the same as adding the next smaller case in the same number of dimensions (side: right of arrow).
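The relationships described in Figures 9 and 10 amount to the recurrence D(z, m) = D(z − 1, m) + D(z, m − 1). The sketch below implements it under two assumed base cases that seem to follow from the captions: D(z, 1) = z (one module can take z positions) and D(1, m) = 1. The function name `positions` is hypothetical. The match with the binomial coefficient C(z + m − 1, m) for small cases is an inference from the recurrence, not a statement from the captions.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def positions(z, m):
    """D(z, m): number of positions for m modules in a problem of size z."""
    if m == 1:
        return z  # assumed base case: one module, z possible positions
    if z == 1:
        return 1  # assumed base case: smallest problem, one arrangement
    # Previous term in the same dimension plus the same-size term
    # in one fewer dimension.
    return positions(z - 1, m) + positions(z, m - 1)


# The two worked cases from the caption, starting from z = 3, m = 2.
print(positions(4, 2), positions(3, 2) + positions(4, 1))
print(positions(3, 3), positions(3, 2) + positions(2, 2) + positions(1, 2))
print(positions(2, 3), positions(2, 2) + positions(1, 2))

# Compare against the inferred closed form for small cases.
print(all(positions(z, m) == comb(z + m - 1, m)
          for z in range(1, 12) for m in range(1, 7)))
```

Memoizing the recurrence keeps the cost proportional to the number of (z, m) pairs rather than to the number of arrangements.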

FIGURE 11.

Violation of Poisson sampling assumptions. Dividing a motif of constant modularity into pieces affects the number of sequences that need to be searched to minimize the probability that a motif will be missed (log P, y-axis: log P[not found] of −2 is equivalent to a 99% chance that the motif is found). The vertical line at 115 sequences is the Poisson prediction for a 99% chance of finding an 8-mer divided into two pieces in a random region of length 80. Thin solid lines show the progression for the fastest-changing [4,4] and slowest-changing [7,1] and [1,7] configurations: independent sequences have a constant probability of finding each motif, and so the relationship is log-linear. The thick solid line shows the probability of missing a sequence when all configurations of the motif (all divisions into two modules) are combined: the nonlinearity shows that the results for a single sequence do not scale to multiple sequences, because different configurations saturate at different rates. This line is derived from two runs of the simulation (diamonds and crosses). Dashed lines show extrapolation for the combined configurations either from the results for a single sequence (steeper slope) or from 500 sequences (shallower slope). Note the large discrepancy (two orders of magnitude) between the projection from a single sequence and the actual results for a sample of 500 sequences.
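The 115-sequence figure can be reproduced with a simple Poisson estimate. The sketch below assumes that the expected number of motif occurrences per sequence is D × 4^(−l), with D chosen from s + 1 places as in Figure 8, and that the probability of missing the motif in N sequences is exp(−N × that expectation). This reading of "Poisson prediction" is an assumption, and the function name `poisson_pool_size` is hypothetical.

```python
from math import comb, log


def poisson_pool_size(n, module_lengths, p_found=0.99, alphabet=4):
    """Sequences needed to find the motif with probability p_found.

    Assumes a Poisson model with an expected D * alphabet**(-l) occurrences
    per random sequence of length n (an illustrative simplification).
    """
    l = sum(module_lengths)            # specified bases in the motif
    s = n - l                          # bases of spacer
    d = comb(s + 1, len(module_lengths))
    expected_hits = d * alphabet ** -l
    return -log(1 - p_found) / expected_hits


# An 8-mer in two modules within a random region of length 80.
print(round(poisson_pool_size(80, [4, 4])))  # 115
print(round(poisson_pool_size(80, [7, 1])))  # 115
```

Under this estimate the result depends only on the total motif length and the number of modules, so [4,4] and [7,1] give the same value, whereas the simulated lines in the figure differ between configurations.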

FIGURE 12.

Probability of matching a set of modules. Example cases as for Figure 4, but note change in numbering (positions are now relative to the first position that the module can occupy under any circumstance, rather than relative to the first position that it can occupy relative to the positions of the other modules in the current case). The top line in each set shows the left-most position each module can take (given a particular state of the first module), and hence, the left-most possible match for each module. The position of the left-most match for the first module determines the size of the problem to solve for the remaining modules (in one fewer dimension). P_n is the probability of a match in the nth module; Q_n = (1 − P_n). For m = 1 (top), the probability of a match at the ith position is P_1 Q_1^(i − 1). For m = 2 (middle), the probabilities for the first module remain the same; however, depending on the position, a different size subproblem must be solved in one dimension to find the probability that the second module also matched. Similarly, for m = 3 (bottom), the position of the left-most match of the first module determines the size of the two-dimensional subproblem that needs to be solved to find the probability that all three modules matched. In general, to solve for m modules, it is necessary to solve all smaller problems in (m − 1) dimensions, and to weight each of these solutions by the probability that the first module matched in a position compatible with it.
The diagrams to the right show the probabilities of each of the allowed combinations of positions (order the same as the ordering of the lines to the left); to find the probability that a particular combination was the left-most set of matches (e.g., first module at its second position, second module at its second position, third module at its fourth position), multiply the individual terms together (here, P_1 Q_1 × P_2 × P_3 Q_3^2, as can be seen either by examining the individual line corresponding to this case or by examining the relevant cell in the table). Arrows show the correspondence of terms in lower dimensions as parts of higher dimensional problems.
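The recursion described in this caption can be sketched as a small dynamic program. The version below rests on several assumptions that are not confirmed by the caption: bases are uniformly random, so a module of length k matches at a given position with probability P = 4^(−k); matches at different positions are independent; and a left-most match of the first module at position i of a size-z problem leaves a subproblem of size z − i + 1 for the remaining modules. The function names are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from math import log, log1p


def match_probability(z, module_lengths, alphabet=4):
    """Probability that all modules match, in order, in a problem of size z."""
    # prob[k]: probability that the modules handled so far (a suffix of the
    # motif) all match within a subproblem of size k. With no modules it is 1.
    prob = [1.0] * (z + 1)
    for length in reversed(module_lengths):
        p = alphabet ** -length   # P_n: match at one position
        q = 1.0 - p               # Q_n: no match at one position
        new = [0.0] * (z + 1)
        for k in range(1, z + 1):
            # Weight each smaller subproblem by the probability that this
            # module's left-most match falls at position i: P_n * Q_n^(i - 1).
            new[k] = sum(p * q ** (i - 1) * prob[k - i + 1]
                         for i in range(1, k + 1))
        prob = new
    return prob[z]


def pool_size(per_sequence_probability, p_found=0.5):
    """Independent sequences needed to find the motif with probability p_found."""
    return log(1 - p_found) / log1p(-per_sequence_probability)


f = match_probability(84, [5, 5, 5])
print(f, pool_size(f, 0.99))
```

For a single module the loop reduces to 1 − Q_1^z, the sum of the P_1 Q_1^(i − 1) terms given in the caption for m = 1. The problem size of 84 in the usage line is an arbitrary illustrative value.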
