mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences (original) (raw)

Abstract

An examination of 51 mRNA sequences in GenBank has revealed that calculated mRNA folding is more stable than expected by chance. Free energy minimization calculations of native mRNA sequences are more negative than randomized mRNA sequences with the same base composition and length. Randomization of the coding region of genes yields folding free energies of less negative magnitude than the original native mRNA sequence. Randomization of codon choice, while still preserving original base composition, also results in less stable mRNAs. This suggests that a bias in the selection of codons favors the potential formation of mRNA structures which contribute to folding stability.

Introduction

Control of gene expression is known to occur at any of the events from promotion of transcription to stabilization of the mature polypeptide product. Several studies have demonstrated that mRNA stability may be an important factor in gene expression for certain genes (1–3). Structural RNA features are suspected to be involved in the regulation of mRNA degradation in those cases. Several authors have suggested that the choice of codons in eukaryotic genes may be constrained by effects other than the frequency of codons in the whole gene (4–6). This is shown by the occurrence of specific codons at specific positions in a large set of genes with a frequency larger than expected by chance (5). Statistical non-randomness in the occurrence of codons was demonstrated for 96 eukaryotic genes, including the actin and β-globin gene families (7). For a message coding 100 amino acids, there are ∼3100 or 1048 different combinations of bases using synonymous codons coding for the same polypeptide. This work tested the hypothesis that codon choice is biased to generate mRNAs with greater negative folding free energies.

Materials and Methods

mRNA sequences were selected from GenBank using programs in the Wisconsin Group GCG software package (8). mRNA sequence files were randomly selected from GenBank with short Locus descriptors (limited to eight or nine characters) and which possessed sufficient information in the Features annotation to reconstruct the sequence of the mRNA. Fifty-one mRNA sequences were selected from the GenBank sequence database possessing the following properties identified in the Features annotation: (i) mRNA +1 start site identified; (ii) MET start codon identified; (iii) termination codon identified; (iv) poly-A site or signal identified; and (v) the mRNA sequence must be <1200 bases long. A variety of sequence files were examined from diverse species including prokaryotes, plants, invertebrates and higher animals (Table 1). These mRNA sequences were in silico folded using Zuker's MFOLD program from UWGCG using a VAXStation 4000 or SUN Ultra computer (9).

Each in silico folding free energy of a native mRNA was compared with folding free energies calculated from mRNA sequences randomized by one of six different procedures. In the first randomization procedure, each native mRNA sequence was randomized at least 10 times using the SHUFFLE program of UWGCG. SHUFFLE randomizes the order of bases in a sequence keeping the composition constant. These randomized sequences (termed ‘whole-random’ sequences) were folded and the free energies averaged. In the second randomization procedure, the native sequences were randomized only within the coding region, yielding ‘CDS-random’ sequences. These sequences contained unmodified 5′ and 3′ untranslated regions (UTRs). In the third randomization procedure, codons were shuffled within the coding sequence only, yielding ‘codon-shuffled’ sequences. These contain unmodified UTR sequences of the respective native mRNA, and code for a polypeptide with identical amino acid composition yet different amino acid sequence. A program (RNAshuffle) was written in FORTRAN using the GCG software library that randomized only the codon choice to produce ‘codon-random’ sequences for the fourth randomization procedure. Codon-random sequences have the same nucleotide base composition and translated polypeptide product as the respective native mRNA. The fifth randomization procedure was a modification of the previous codon-random algorithm without constraining the base composition. All codons were allowed to be equally likely. The resulting sequences also coded for the same polypeptide as the native mRNA, but the base composition was generally more G+C rich. These sequences were labeled as ‘codon-flat’. The final randomization procedure tested the UTRs by shuffling both of the UTRs together while leaving the CDS unchanged. The lengths of the UTRs remained the same in this procedure called ‘UTR-random’.

Characteristics of selected mRNA sequences

Table 1

Characteristics of selected mRNA sequences

Statistical significance was tested for the biases observed in calculated folding free energy between native and randomized sequences. The statistical significance of the differences in free energy was measured in standard deviation units, termed the segment score from Le and Maizel (10). Large sets of randomized mRNA folding free energies were found to be normally distributed. Standard hypothesis tests were employed using statistical analysis software. All thermodynamic energies are free energies expressed as kcal/mol. A greater negative free energy indicates that a more stable folding configuration is possible.

Results

Fifty-one mRNA sequences were selected from a variety of plant, animal and bacteria sequences in GenBank (Table 1). These sequences and their respective randomized sequences were in silico folded using Zuker's MFOLD program. To ensure consistency of the folding algorithm with the selected set of mRNA sequences, the calculated native folding free energies were plotted as a function of sequence length in Figure 1. As expected, folding free energies become more negative for increasing sequence length since more bonding interactions are possible in longer molecules. A linear regression equation computed from data in Figure 1 gives a slope of −0.21 (kcal/mol per nucleotide) with an R-squared value of 0.58. This indicates that sequence length accounts for over half of the free energy of each mRNA. A further check of the data set examined normalized folding free energies (free energy divided by sequence length) as a function of C+G percent content in Figure 2. As expected the folding free energy per base becomes more negative as the C+G content of the mRNA increases due to the greater interaction energy between C-G pairs compared with A-U pairs. A linear regression equation computed from data in Figure 2 gives a slope of −0.005 (kcal/mol/base per %G+C) with a 0.66 regression coefficient. Thermodynamic data from others indicates that an A-U pair contributes −0.9 kcal/mol while a C-G pair contributes −2.9 kcal/mol (11).

The native folding free energies and the mean of free energies from sets of 10 randomized sequences are shown in Table 2 labeled as ‘whole-random’. Of the 51 mRNAs examined, 40 (78%) are more negative than the mean of the whole-randomized set. Native mRNA sequences then are generally more stable than the corresponding whole-randomized sequences. A segment score calculated from the standard deviation of each randomized set is listed in Table 2. These values are the number of standard deviations the means of the randomized set are away from the native free energy. The average segment score for the whole-randomized set is −1.23 with a 95% confidence interval of 0.45, indicating a significant bias. Four mRNAs have segment scores greater than −4, while 13 (25% of the 51 mRNAs) have scores greater then −2. These cases indicate significantly greater folding stability of the native mRNA compared to randomized sequences.

Of the set of mRNAs with positive segment scores, none has a score greater than +2. Included in this set of mRNAs, less stable than expected are tomrbcsd (Rubisco) and xelhish1 (a histone). Interestingly, these classes of genes are suspected to be at least partially regulated by post-transcriptional events. Rubisco small subunit has been demonstrated to be post-transcriptionally regulated (12). The cell-cycle regulation of histone mRNA stability has also been demonstrated (13), so it is not unreasonable to suspect that the folding stability for these mRNAs may be different from other mRNAs not so regulated. In addition, equal G and C contents in histone genes were hypothesized by Huynen to indicate selection pressures on mRNA secondary structure (14). The free energy of histone mRNA secondary structures was noted in Huynen's study to be only slightly lower than expected on the basis of nucleotide frequencies.

mRNA folding free energy dependence on sequence length. Data points are calculated free energy values of mRNA sequences listed in Table 1. Fitted regression line is shown.

Figure 1

mRNA folding free energy dependence on sequence length. Data points are calculated free energy values of mRNA sequences listed in Table 1. Fitted regression line is shown.

Sequence length normalized folding free energy dependence on G+C content. Units for free energy are kcal/mol/base. Fitted regression line is shown.

Figure 2

Sequence length normalized folding free energy dependence on G+C content. Units for free energy are kcal/mol/base. Fitted regression line is shown.

To determine if the above observed bias toward more stable mRNAs resides in the coding region or the flanking untranslated regions of the mRNAs, native sequences were randomized only within the coding region, yielding ‘CDS-random’ sequences. These sequences contain identical 5′ and 3′ UTRs of the respective native mRNA. Again the native mRNA sequences are usually more negative than the corresponding CDS-random sequences (Table 2). The average segment score is −0.87 with a 95% confidence interval of 0.48. Twelve (24% of the 51 mRNAs) have scores greater than −2. This demonstrates there is also a significant difference in folding free energies between native and partially randomized mRNA sequences. Of the 51 mRNAs examined, 37 or 73% are more negative than the CDS-randomized sequences. The 51 mRNAs possessed a total 5′ UTR length of 3014 nt, a total coding length of 27 306 nt and a total 3′ UTR length of 8533 nt. Therefore the coding regions comprise 70% of the total mRNA nucleotides, yet randomization of the coding region does not substantially alter the number of mRNAs observed with a bias toward more negative folding free energies. The significance level of this bias is also only slightly reduced by CDS-randomization compared to whole-sequence randomization. In only nine cases the sign of the segment score changed. Six mRNA scores changed from minus to positive (indicating the native mRNA changed to being less stable than the randomized set), while three changed from positive to negative. In almost all of these changes, the segment scores were close to zero. Of the group with more negative folding free energies, 23 or 45% are significantly more negative by one standard deviation, and 13 or 25% are more negative by two standard deviations.

The above randomization procedure was modified to shuffle the codons while preserving the native amino acid composition and UTR sequences. These ‘codon-shuffled’ mRNAs again are generally less stable than the respective native mRNA sequence. Thirty-two of the 51 mRNAs (63%) have negative segment scores, with 13 (or 25%) being greater than −1. The mean of the segment scores is lower than the other two randomization procedures, yet the 95% confidence interval (0.47) still does not include zero.

To determine if the observed bias toward more stable mRNAs resides in the choice of codons within the coding sequence, a fourth randomization procedure was performed. Native sequences were randomized only by codon choice within the coding region, yielding ‘codon-random’ sequences with unmodified base composition. These sequences contained identical 5′ and 3′ UTRs of the respective native mRNA and coded for the same polypeptide. Again the native mRNA sequences tend to fold more negative in free energy than the corresponding codon-random sequences (Table 3). Since only a relatively small number of bases will change under this randomization procedure, the resulting sequences in the randomized set will be similar. As a consequence the folding free energies are very close in each set of randomized sequences, resulting in a very low standard deviation. This results in a very large segment score for several mRNAs, so instead the percent difference from the native free energy becomes a more appropriate measure of randomization effects. The mean of the percent difference of native from codon-random free energies is −2.2%, with a 95% confidence interval from −4.0 to −0.4%. The confidence interval does not include zero, demonstrating a significant difference in folding free energies between native and codon-randomized mRNA sequences. Of the 51 mRNAs examined, 35 or 69% are more negative than their codon-randomized mRNAs.

To investigate the effect of the choice of codons within the coding sequence, a fifth randomization procedure was performed. Native sequences were randomized by an equal selection of codons within the coding region, yielding ‘codon-flat’ sequences. These sequences contained identical 5′ and 3′ UTRs of the respective native mRNA and coded for the same polypeptide as with the codon-random sets. Since the choice of codons was unbiased, the resulting base composition was usually different from the native sequences. Again the native mRNA sequences tend to fold more negative in free energy than the corresponding codon-flat sequences (Table 3). The mean of the percent difference of native from codon-flat free energies is −6.6%, with a 95% confidence interval from −10.0 to −3.2%. The confidence interval does not include zero, demonstrating a significant difference in folding free energies between native and codon-flat mRNA sequences. Of the 51 mRNAs examined, 37 or 73% are more negative than their codon-flat mRNAs.

Much data suggest that UTRs of mRNA contain important RNA secondary structures. A great deal of evidence shows that the 5′ UTR is critically involved in regulating translation initiation (15). Alterations in translation regulation not only directly affect the amount of a protein that is finally synthesized, but can also significantly modify the stability characteristics of the message, and therefore modify protein levels by this mechanism as well. A non-viral example of a regulatory element in the 5′ UTR that has been well studied and has been shown to control message translation is the iron response element (IRE) that is found in ferritin mRNA (16). This sequence provides a portion of a stem-loop structure to which an iron regulatory protein (IRP) can bind (17). There are also many examples of elements in the 3′ UTR part of the message that bind to _trans_-acting proteins to control mRNA turnover rates (18). To detect the general presence of mRNA structures in the UTRs, a final randomization procedure was tried. Native sequences were randomized only within the UTRs, yielding ‘UTR-random’ sequences. Again the native mRNA sequences are usually more negative than the corresponding UTR-random sequences (Table 2). The average segment score is −0.50 with a 95% confidence interval of 0.39. Nine of the mRNAs have segment scores greater than −2. Although this randomization procedure produces lower segment scores than the CDS-random case, it is based on a smaller region for randomization. Since the UTRs comprise only 30% of the total mRNA nucleotides, the observed bias toward more negative folding free energies demonstrates there is still a significant difference in folding free energies between native and randomized mRNA sequences.

Discussion

The biases in calculated mRNA folding free energies observed in Table 2 and 3 are small yet significant. For a 400 nt mRNA that is 50% basepaired and 50% G+C, a 5% increase in folding free energy could be caused by changes in only 7–20 bp. Since the CDS comprises ∼70% of the mRNA examined, more of the bias is due to amino acid sequence and codon choice than due to the UTR sequences alone. If the amino acid sequence is constrained by considerations of protein function, then the bias is most likely due to subtle arrangements in codon choice. The effect observed here may be due to a selective advantage for mRNAs to be more basepaired, perhaps to resist degradation or modification. If indeed RNA was the original genetic material as suggested by the research of Joyce (19), then the genetic code may be arranged in such a manner as to encourage intermolecular bonding for single-stranded RNA. This may be of advantage for protoorganisms with genetic information stored in the more labile RNA compared to DNA.

The fact that these biases are observed from an empirical molecular calculation also suggests that local secondary structures are the causative agents. Most algorithms for predicting RNA secondary structure from base sequence are based on a nearest neighbor model of interaction (20,21). Experimental evidence indicates short-ranged stacking and hydrogen bonding are important determinants of RNA stability while hydrophobic bonding is of lesser importance (21). Numerous algorithms have been developed to predict RNA secondary structure by minimizing the configurational free energy. The quality of these predictions depends upon: (i) the accuracy of the thermodynamic data which describe the free energies of various secondary structural features; (ii) the folding rules that an algorithm uses to find the lowest free energy structure; and (iii) the degree to which environmental conditions stabilize alternate structures of equivalent or higher energy. Similar results were obtained using the older FOLD program, indicating the conclusions from this work are not sensitive to the algorithm nor energy data sets used.

Folding free energies and segment scores for mRNAs

Table 2

Folding free energies and segment scores for mRNAs

Folding free energies and percent difference for mRNAs

Table 3

Folding free energies and percent difference for mRNAs

Free energy minimizing algorithms such as Zuker's MFOLD program output a family of structures that have the same or nearly the same free energy. This study has compared the optimal free energy of folding with a reference set of optimal free energies obtained by sequence randomization. Thus what are being compared are the locations of the minima in the configurational energy profiles for folding. Different regions of mRNA sequence were randomized, including codon choice, resulting in destabilization of the folding free energy. The under-representation of 5′-CG-3′ doublets in most native sequences would be changed upon such randomization. The 5′-CG-3′ doublet contributes ∼2 kcal/mol in helix propagation, whereas the 5′-GC-3′ and 5′-GG-3′ doublet each contribute ∼3 kcal/mol (from the GCG energy file for MFOLD). Even though randomization often increases the number of 5′-CG-3′ doublets compared with 5′-GC-3′ doublets, there is no correlation seen between 5′-CG-3′ content and free energy within the randomized sets. Pearson coefficients of free energy versus CG-doublet content are positive and negative, with magnitudes mostly below 0.25 (data not shown). This suggests that local secondary structure interactions are causing the observed bias in folding free energies, not changes in CG content.

This bias in mRNA folding free energy calculations has potential applications for gene finding algorithms. It is not known if open reading frames (ORFs) that are not transcribed show the bias observed in Table 2. It would be difficult to ensure that an ORF was not a region of some mRNA, due to the limited (and sometimes incorrect) information contained in sequence annotation. Although the folding calculations are computationally intensive, the results may be useful to improve the predictive accuracy of gene finding programs. The same level of bias for the same set of mRNAs was observed using the faster (but older) FOLD program (data not shown).

The thermodynamic treatment in this work concerns not the actual mRNA structures but the depth of the free energy potential well that a mRNA molecule could fold due to its sequence. The sequence of mRNAs has been found in this study to generally give rise to more stable secondary structures than expected by chance. Codon choice has been shown to contribute to this observed bias. Although the average bias is statistically significant for all of the randomization procedures used, the average segment scores are not strong in the sense of being greater than −3.0 or more. Yet individually, several sequences were found with very strong segment scores greater than −4.0. The conclusions drawn, however, must be tempered by the fact that the set of genes examined was small considering the 51 sequences were grouped into categories such as invertebrate, human, etc., leaving few sequences per category to draw major conclusions. These results could be useful to develop a classification of mRNAs based on whether mRNAs are more or less stable compared to their randomized sequences. Gene regulation mechanisms or gene product types may be related to the mRNA folding class based on one of the five randomization procedures examined here.

Conclusion

A survey of 51 mRNA sequences reveals a bias in the coding and UTRs that allows for greater negative folding free energies than predicted by sequence length or nucleotide base content. A free energy reference state is taken to be a large enough set of randomized mRNA sequences. Randomization can be implemented over the whole sequence or over sections such as the CDS, UTR or codon choice. Randomization of most regions of the mRNA sequences display lower folding stability as measured by calculated free energy values. Randomization of codon choice while still preserving original base composition also results in less stable mRNAs. This suggests that a bias in the selection of codons favors mRNA structures which contribute to folding stability.

Acknowledgements

I thank Jonathan Arnold (University of Georgia) for helpful discussions concerning this work. This work was supported (or partially supported) by NIH grant GM08247, by a Research Centers in Minority Institutions award, G12RR03062, from the Division of Research Resources, National Institutes of Health, and NSF CREST Center for Theoretical Studies of Physical Systems (CTSPS) Cooperative Agreement #HRD-9632844.

References

1

,

Prog. Nucleic Acid Res. Mol. Biol.

,

1990

, vol.

38

(pg.

1

-

35

)

2

,

J. Mol. Biol.

,

1998

, vol.

274

(pg.

589

-

600

)

3

,

Mol. Cell. Biol.

,

1988

, vol.

8

(pg.

427

-

432

)

4

,

Nucleic Acids Res.

,

1982

, vol.

10

(pg.

3760

-

3780

)

5

,

Nucleic Acids Res.

,

1986

, vol.

14

suppl.

(pg.

r119

-

r150

)

6

,

Nucleic Acids Res.

,

1992

, vol.

20

(pg.

5289

-

5295

)

7

,

J. Mol. Biol.

,

1987

, vol.

197

(pg.

379

-

388

)

8

,

Nucleic Acids Res.

,

1984

, vol.

12

(pg.

387

-

395

)

9

,

Nucleic Acids Res.

,

1981

, vol.

9

(pg.

133

-

148

)

10

,

J. Theor. Biol.

,

1989

, vol.

138

(pg.

495

-

510

)

11

,

Proc. Natl Acad. Sci. USA

,

1986

, vol.

83

(pg.

9373

-

9377

)

12

,

Mol. Cell. Biol.

,

1985

, vol.

5

(pg.

2238

-

2246

)

13

,

Cell

,

1987

, vol.

4

(pg.

615

-

626

)

14

,

J. Mol. Evol.

,

1992

, vol.

34

(pg.

280

-

291

)

15

,

Microbiol. Rev.

,

1995

, vol.

59

(pg.

423

-

450

)

16

,

Science

,

1989

, vol.

246

(pg.

870

-

872

)

17

,

J. Biol. Chem.

,

1995

, vol.

270

(pg.

20509

-

20515

)

18

,

Biochim. Biophys. Acta

,

1997

, vol.

1332

(pg.

M31

-

M38

)

19

,

Nature

,

1989

, vol.

338

(pg.

217

-

224

)

20

,

Proc. Natl Acad. Sci. USA

,

1994

, vol.

91

(pg.

9218

-

9222

)

21

,

Molecular Modeling of Nucleic Acids

,

1998

Washington, DC

American Chemical Society Symposium Series 682

(pg.

246

-

257

)

© 1999 Oxford University Press