Highly efficient concerted evolution in the ribosomal DNA repeats: Total rDNA repeat variation revealed by whole-genome shotgun sequence data

Genome Res. 2007 Feb; 17(2): 184–191.

Austen R.D. Ganley

1 National Institute for Basic Biology, Okazaki, 444-8585 Japan;

3 Department of Biology, Duke University, Durham, North Carolina 27708, USA

Takehiko Kobayashi

1 National Institute for Basic Biology, Okazaki, 444-8585 Japan;

2 SOKENDAI, Okazaki, 444-8585 Japan;

4Present address: Division of Cytogenetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan.

Received 2006 May 1; Accepted 2006 Nov 16.

Copyright © 2007, Cold Spring Harbor Laboratory Press

Abstract

Repeat families within genomes are often maintained with similar sequences. Traditionally, this has been explained by concerted evolution, where repeats in an array evolve “in concert” with the same sequence via continual turnover of repeats by recombination. Another form of evolution, birth-and-death evolution, can also explain this pattern, although in this case selection is the critical force maintaining the repeats. The level of intragenomic variation is the key difference between these two forms of evolution. The prohibitive size and repetitive nature of large repeat arrays have made determination of the absolute level of intragenomic repeat variability difficult; thus, there is little evidence to support concerted evolution over birth-and-death evolution for many large repeat arrays. Here we use whole-genome shotgun sequence data from the genome projects of five fungal species to reveal absolute levels of sequence variation within the ribosomal RNA gene repeats (rDNA). The level of sequence variation is remarkably low. Furthermore, the polymorphisms that are detected are not functionally constrained and seem to exist beneath the level of selection. These results suggest the rDNA is evolving via concerted evolution. Comparisons with a repeat array undergoing birth-and-death evolution provide a clear contrast in the level of repeat array variation between these two forms of evolution, confirming that the rDNA indeed does evolve via concerted evolution. These low levels of intragenomic variation are consistent with a model of concerted evolution in which homogenization is very rapid and efficiently maintains highly similar repeat arrays.

Repetitive elements are an abundant feature of genomes (Britten and Kohne 1968) and play critical roles in cell biology, genome structure, and adaptive evolution (Andersson et al. 1998; Shapiro and von Sternberg 2005). In many cases, repeats from a repeat family are highly similar to one another within a genome, a pattern that persists through evolutionary time even though sequence differences are apparent between species (Brown et al. 1972). Thus, the repeats seem to be maintained as a coherent family. This pattern is contrary to the expectations of conventional evolution, where repeats are expected to evolve independently (Fig. 1) and diverge with time. This unusual mode of evolution, where repeats within a genome are more similar to each other than they are to “orthologous” repeats in a related species, is defined as concerted evolution (Zimmer et al. 1980). The molecular process responsible for concerted evolution is known as homogenization (Dover 1982), and although not fully elucidated, is thought to involve continual turnover of repeat copies by unequal recombination (e.g., Smith 1973; Szostak and Wu 1980; Kobayashi et al. 1998). However, recent studies have shown that several repeat families previously thought to evolve by concerted evolution actually evolve via a different evolutionary process known as birth-and-death evolution (e.g., Nei et al. 1997, 2000; Rooney and Ward 2005). In birth-and-death evolution it is selection, not homogenization, that maintains the repeats as a coherent family, and occasional duplication/deletion is also thought to play a role. Therefore, relatively high levels of intragenomic repeat variation are expected in regions of low selective constraint (e.g., synonymous sites and noncoding regions) under birth-and-death evolution. To distinguish between concerted evolution and birth-and-death evolution, knowledge of the level of repeat variation is critical. Despite this, the level of intragenomic repeat variation has not been reported for large repeat arrays; therefore, evidence for concerted evolution of these arrays, as per the classical definition, is lacking.

Figure 1.

Classical evolution vs. concerted evolution. Repeats (individual boxes) in a repeat array are initially formed by a gene amplification event. The repeats accumulate mutations (alternatively colored boxes) through time. Under classical evolution these mutations persist and therefore, after speciation events, the orthologous relationships of the repeats remain (e.g., repeat #1 from species 1 will resemble repeat #1 from species 2 more closely than the other repeats in species 1; indicated schematically by different shades of the same color). However, under concerted evolution, homogenization continuously sweeps one repeat variant (at random) to fixation within the array. Therefore, the repeats within a genome are all expected to be similar, but differ in sequence from the repeats in a closely related species. Birth-and-death evolution is a more complex form of classical evolution mixed with some aspects of concerted evolution, and is not depicted.

The major problems in detecting variation within large tandem-repeat arrays are the prohibitive size and high levels of similarity of these arrays. Therefore, they represent uncharted territories in the genome, and are usually assumed to be comprised of repeats with identical sequences (e.g., Goffeau et al. 1996). However, it is possible that substantial cryptic variation exists within these repeat arrays, as the power to detect such cryptic variation is low (e.g., Williams et al. 1990; Copenhaver and Pikaard 1996; O’Donnell and Cigelnik 1997; Gonzalez and Sylvester 2001) and it is not likely to be noticed during routine applications (such as sequencing for phylogenetic analyses). The advent of whole-genome shotgun sequencing (WGSS) now gives us a way to determine the total level of variation in repeat arrays. In principle, WGSS methodology sequences the entire genome, with all parts receiving equal coverage (in contrast to map-based genome sequencing, where highly repetitive regions are usually not sequenced [Goffeau et al. 1996]). Therefore, information on the absolute level of sequence variation present within repeat arrays is contained within WGSS data.

The ribosomal RNA repeats (rDNA) are an extensively studied repetitive gene family. Each repeat unit contains three ribosomal RNA (rRNA) genes, the large subunit (LSU), small subunit (SSU), and 5.8S rRNA genes, as well as two transcribed spacers (the ITS1 and ITS2) and a large intergenic spacer (IGS) (Long and Dawid 1980). Variably, another rRNA gene, the 5S, may be present within the IGS. rDNA copy number varies widely; in most eukaryotes it is between 30 and 30,000 copies (Prokopowich et al. 2003) and the repeats are organized tandemly at one or more sites per haploid genome.

Fungi are an ideal group of eukaryotes to study at a genomic level due to their relatively small genome sizes. We chose five fungi encompassing a diverse range of species for which WGSS data were available. These species were chosen as all have a simple rDNA organization, i.e., a single rDNA array. The first four are members of the phylum Ascomycota. Saccharomyces cerevisiae is a budding yeast from the family Saccharomycetaceae, and is a widely used model organism for biological research (the other model yeast, Schizosaccharomyces pombe, was not included as it contains two rDNA arrays). The haploid genome sequence was completed by the Broad Institute (http://www.broad.mit.edu/annotation/genome/saccharomyces_cerevisiae/Home.html). Saccharomyces paradoxus is a close relative of S. cerevisiae. The diploid genome sequence was completed by the Broad Institute as part of a comparative study of Saccharomycetaceae genomes (Kellis et al. 2003) (http://www.broad.mit.edu/annotation/fungi/comp_yeasts/). Ashbya gossypii is a filamentous fungus from the Saccharomycetaceae family. It is a mild pathogen of cotton, and has been used industrially to produce high vitamin B2 levels. The haploid genome sequence was recently completed (Dietrich et al. 2004) (http://agd.unibas.ch/). Aspergillus nidulans is a common soil mold from the order Eurotiales. It has been a popular model organism for genetics for over 50 yr and is filamentous in growth form. The haploid genome sequence was completed by the Broad Institute (http://www.broad.mit.edu/annotation/fungi/aspergillus/). The final species, Cryptococcus neoformans, is a member of the phylum Basidiomycota, which includes the mushrooms. However, C. neoformans grows as a yeast and is a severe pathogen of immunocompromised humans. The haploid genome sequence was recently completed by the Stanford Genome Technology Center (Loftus et al. 2005) (http://www-sequence.stanford.edu/group/C.neoformans/index.html).

In this study we identified all of the rDNA reads from the WGSS projects of these five fungal species, assembled them, and scanned the assemblies for polymorphisms to determine the total level of variation in the rDNA. The results give us the first complete picture of the level of variation within a large, tandemly arrayed repeat family, and this variation is very low. We discuss the implications of these results for the evolution of the rDNA repeats.

Results

The rDNA reads from the five genome projects were each assembled into an rDNA unit alignment. The high similarity of the repeats in the rDNA means there is no way to determine which repeat in the array any given sequence read comes from (e.g., whether a given read comes from repeat #1 or repeat #100). Instead, the sequence data collapse down to a single rDNA unit alignment whose sequence coverage is a product of the genome coverage level and rDNA copy number. We performed automated searches, followed by manual corrections, on these alignments to identify polymorphisms (see Methods for details). The polymorphisms that were identified were put into two classes, high-confidence and low-confidence polymorphisms, and results for both are presented. The low-confidence polymorphisms occur in areas of low sequence quality, and thus we believe that most of them are unlikely to be real, but we have included them for completeness. We found no evidence for any rearranged rDNA units, such as those found recently in humans (Caburet et al. 2005), in any of the five species.

Low sequence variation within the rDNA repeats

Surprisingly, we found the number of polymorphisms within the rDNA arrays of all five species is very low. In A. gossypii, only three polymorphisms (six, including low-confidence polymorphisms) were found; S. cerevisiae contained four (or seven); S. paradoxus contained 13 (or 16); A. nidulans 11 (or 20); and the number in C. neoformans was slightly higher at 37 (or 43). The numbers of polymorphisms and their locations within the rDNA unit are presented in Figure 2. Most polymorphisms are present on only one or a few reads; however, three polymorphisms (two from S. cerevisiae and one from S. paradoxus) are present on a relatively high number of reads, indicating high copy number in the array (Fig. 2).

Figure 2.

Positions of the polymorphisms in the rDNA unit. Polymorphic sites for each species are plotted onto a map of a single rDNA unit. rDNA features are color coded as indicated beneath the bottom rDNA unit. High-confidence polymorphisms are shown as black lines, low-confidence polymorphisms as gray lines, and the total numbers are boxed (low-confidence polymorphisms in parentheses). Although the polymorphisms are shown in a single repeat, in reality they are likely to be scattered throughout the array. Polymorphisms present in more than one sequence read are indicated by black balls above the line, with the number of reads shown in the ball. In many cases these are likely to result from the coverage level of the whole-genome shotgun sequence data. The rDNA unit length and copy number (from Fig. 3) are also indicated (S. paradoxus is the diploid copy number). Diagram is to scale.

To make meaningful interpretations of these polymorphism levels, it is necessary to relate the polymorphism level to the copy number of the rDNA (as more copies are expected to harbor a greater absolute polymorphism level). Therefore, we determined rDNA copy number for these species. A. gossypii was previously reported to contain ∼50 rDNA copies (Wendland et al. 1999) and S. cerevisiae is known to contain ∼150 copies (Kobayashi et al. 1998). To determine rDNA copy number from the remaining three species, pulsed-field gel electrophoresis was performed on genomic DNA digested with restriction enzymes that do not cut within the rDNA (Fig. 3). The haploid rDNA copy number can then be calculated from the size of the resultant rDNA band. Using S. cerevisiae chromosomes as size markers, the sizes of the rDNA-containing bands are estimated as follows: S. paradoxus = ∼850 kb; A. nidulans = ∼360 kb; and C. neoformans = ∼440 kb. From these values, we calculated rDNA copy number: S. paradoxus contains ∼90 rDNA copies (per haploid genome); A. nidulans ∼45 copies; and C. neoformans ∼55 copies. Two sharp bands are present in S. paradoxus, consistent with slight allelic variation in rDNA copy number in this diploid. The rDNA band is much more diffuse in A. nidulans, consistent with more copy-number variation. This is expected, as this fungus is filamentous; therefore, the time back to the last single nucleus is expected to be large, resulting in greater variation as a result of the continuous copy-number changes that occur during growth (Cowen et al. 2000).
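The copy-number calculation described above amounts to dividing the size of the rDNA-containing band by the length of one repeat unit. The following Python sketch illustrates that arithmetic only. The function name is hypothetical, and the 8 kb unit length is a round illustrative value, not a measured one (the actual unit lengths are reported in Fig. 2). The sketch also ignores any flanking sequence contained in the restriction fragment.

```python
def rdna_copy_number(array_size_kb: float, unit_length_kb: float) -> float:
    """Estimate rDNA copy number as array size divided by repeat-unit length.

    Both arguments are in kilobases. Flanking non-rDNA sequence on the
    restriction fragment is ignored, so this is an approximation.
    """
    if array_size_kb < 0 or unit_length_kb <= 0:
        raise ValueError("sizes must be positive")
    return array_size_kb / unit_length_kb


# ASSUMPTION: 8.0 kb is an illustrative unit length, not a value from the paper.
ASSUMED_UNIT_KB = 8.0

# Band sizes (kb) as estimated from the pulsed-field gels in the text.
band_sizes_kb = {"A. nidulans": 360.0, "C. neoformans": 440.0}

for species, size_kb in band_sizes_kb.items():
    copies = rdna_copy_number(size_kb, ASSUMED_UNIT_KB)
    print(f"{species}: ~{copies:.0f} copies (assuming {ASSUMED_UNIT_KB} kb units)")
```

Because band sizes read from a gel are themselves approximate, estimates of this kind are best treated as rounded values, as in the text.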

Figure 3.

Determination of rDNA array sizes by pulsed-field gel electrophoresis. (A) Ethidium bromide-stained gels showing the sizes of the rDNA arrays from A. nidulans, S. paradoxus, and C. neoformans after digestion of chromosomal plugs with HindIII, BamHI, and AgeI, respectively. (B) Southern blot of the gels from A probed with a conserved region of the LSU from S. cerevisiae to confirm the rDNA bands. Array sizes (kilobase) are indicated, as are rDNA copy numbers (in parentheses), and these were calculated using S. cerevisiae chromosomes as size markers (M).

To put these levels of variation in perspective, we compared the level of variation found here in the rDNA with the level of variation present in a repeat family undergoing birth-and-death evolution. We used the nucleotide diversity (π) statistic (Nei and Li 1979) to represent the level of repeat array variation, as it can be used to compare repeat families with different copy numbers. The polyubiquitin gene repeats (poly-u repeats) were chosen for comparison, as they were previously shown to be undergoing birth-and-death evolution rather than concerted evolution (Nei et al. 2000). Within a repeat array the amino acid sequences of the poly-u repeats are usually invariant, but synonymous sites in the nucleotide sequence are variable. DNA sequences of the poly-u repeats from the five fungal species used in the rDNA analyses were obtained from the relevant genome projects. These have four to six poly-u repeats in a single tandem array. The level of π within each array was calculated for each species for both the rDNA and poly-u repeats (Table 1). The contrast in variation between the rDNA and poly-u repeats is stark, with the poly-u repeats showing three to four orders of magnitude more diversity than the rDNA repeats. Many of the rDNA polymorphisms we observe are present on only a single sequence read, and these are probably sequence errors (see Discussion). If so, they artificially inflate the level of π, and therefore, we have also calculated π for the rDNA polymorphisms using only those polymorphisms present in more than one read (Table 1). To rule out drastic differences in evolutionary rates being responsible for these results (i.e., to rule out the possibility that the lower rate of intragenomic variation in the rDNA is the result of sequence change intolerance), we measured the interspecific divergences for these two loci. These were calculated by comparing the consensus sequences between all five species separately for both loci. The levels of divergence are very similar (average pairwise nucleotide similarity between the five species is 79% for the polyubiquitin gene repeat vs. 76% for the rDNA repeat). Therefore, the overall level of sequence constraint in these two loci is very similar, but the intragenomic variation is markedly different, demonstrating that the within-array evolutionary forces of these two loci differ greatly.
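For readers unfamiliar with the statistic, nucleotide diversity (π) is the average number of nucleotide differences per site between pairs of sequences. The Python sketch below computes it for a small set of aligned, equal-length repeat sequences. It is a generic illustration, not the authors' procedure: the function name and the toy sequences are hypothetical, and the text does not describe how π was derived from the collapsed rDNA read alignment.

```python
from itertools import combinations


def nucleotide_diversity(seqs: list[str]) -> float:
    """Average pairwise differences per site among aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty, and equal length")

    pairs = list(combinations(seqs, 2))
    # Count mismatched positions for every pair of sequences.
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_differences / (len(pairs) * length)


# Hypothetical toy repeats: two identical copies and one single-site variant.
toy_repeats = ["ACGT", "ACGT", "ACGA"]
print(nucleotide_diversity(toy_repeats))
```

Because π is normalized by both the number of pairs and the sequence length, it can be compared between arrays that differ in copy number and unit length, which is the property the comparison in the text relies on.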

Table 1.

Intragenomic nucleotide diversity (π) between repeats in the array

Polymorphisms exist beneath the radar of selection

We next looked to see how the polymorphisms we detected are distributed with respect to functional constraint. If selection acts directly on rDNA mutations, we would expect to preferentially find polymorphisms in rDNA residues that show low levels of constraint. Visual inspection of the positions of the polymorphisms (Fig. 2) reveals no obvious pattern of localization. Importantly, there is no bias of polymorphisms toward the IGS regions, which are the least selectively constrained regions of the rDNA. However, it is still possible that the polymorphisms found in the rRNA genes mostly fall on residues with low selective constraint. To test this, we used the LSU and SSU variability maps from the European Ribosomal RNA Database (Van de Peer et al. 1997; Ben Ali et al. 1999). These maps categorize each site in the rRNA gene-coding regions into one of six “bins” based on the level of conservation of that site across the eukaryotes. If the polymorphisms we observe are influenced by selection, we would expect a bias of these polymorphisms toward residues with low levels of conservation. We mapped every high-quality polymorphism from the rRNA gene-coding regions onto these variability maps. Pooling the results from all five species demonstrates that the numbers of polymorphisms found in each conservation “bin” are not significantly different from the expected distribution (P > 0.95; Table 2). This is also true (P > 0.6) if only polymorphisms present on more than one read are used, although the numbers are too low for statistical power (results not shown). Thus, there is no bias of these polymorphisms toward more variable sites in the rRNA genes, and we can conclude that at least most of the polymorphisms are not influenced by selective constraint. Finally, the most frequently found polymorphism in the C. neoformans data set (found in 10 reads) is a substitution in a site that is invariant across the fungi (using the data set of Berbee and Taylor 2001) (http://www.treebase.org/treebase/). Therefore, this polymorphism is likely to be highly deleterious, yet it seems to be present in more than one rDNA unit, indicating that polymorphisms present at relatively low frequencies are beneath the “radar” of selection, presumably due to redundancy of the rDNA (Hadjiolov 1984).
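The comparison of observed and expected polymorphism counts per conservation bin can be sketched as a goodness-of-fit calculation. The text does not name the statistical test that was used, so the chi-square statistic below is an assumption, and all counts are hypothetical placeholders rather than the values in Table 2. Expected counts are taken to be proportional to the number of sites in each bin.

```python
def expected_counts(sites_per_bin: list[int], n_polymorphisms: int) -> list[float]:
    """Expected polymorphisms per bin if they fall on sites at random."""
    total_sites = sum(sites_per_bin)
    if total_sites == 0:
        raise ValueError("no sites supplied")
    return [n_polymorphisms * n / total_sites for n in sites_per_bin]


def chi_square_statistic(observed: list[int], expected: list[float]) -> float:
    """Pearson chi-square statistic: sum of (O - E)^2 / E over bins."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HYPOTHETICAL data: six conservation bins, from most to least conserved.
sites_per_bin = [500, 400, 300, 300, 250, 250]   # sites in each bin
observed = [5, 4, 3, 3, 3, 2]                    # polymorphisms seen per bin

expected = expected_counts(sites_per_bin, sum(observed))
statistic = chi_square_statistic(observed, expected)
print(f"expected per bin: {expected}")
print(f"chi-square statistic (5 degrees of freedom): {statistic:.3f}")
```

A small statistic relative to the chi-square distribution with (bins − 1) degrees of freedom corresponds to a large P value, that is, no detectable bias toward any conservation class. With counts as low as those reported here, an exact or simulation-based test may be more appropriate than the chi-square approximation.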

Table 2.

Variability level of the sites in which polymorphisms are found in rRNA genes

Mutational spectrum of the rDNA polymorphisms

If the polymorphisms we observe are appearing beneath the level of selection, they may give us insight into the spectrum of mutations offered by the mutation process. Therefore, we looked at the types of mutations that have arisen. Polymorphisms were divided into five classes of mutation (transitions, transversions, insertions, deletions, and complex mutations), and the proportions of these for each species (except A. gossypii and S. cerevisiae, as there are too few polymorphisms) are presented (for high-confidence polymorphisms) (Fig. 4). Each species has its own unique profile of mutation, although the polymorphism levels in S. paradoxus and A. nidulans are too low for strong conclusions. However, we find, as expected, that transitions are the most common form of substitution in all species. The most striking pattern is that of C. neoformans, which shows a very strong deletion bias (62% of all polymorphisms are deletions). If deletion/insertion bias predicts the direction of genome size evolution (Mira et al. 2001), then C. neoformans may be in the process of a genome size reduction, although current gene density is not especially high (Loftus et al. 2005).

Figure 4.

Mutational profile of the rDNA polymorphisms. The proportion of each of the five classes of mutation (listed in the boxed legend) that form the high-confidence polymorphisms are graphed for S. paradoxus, A. nidulans, and C. neoformans. Absolute numbers and percentages are given for each class. “Complex” mutations are defined as those involving more than 3 bp. For the full list of polymorphisms, see Supplemental Table 1.
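The distinction between the two substitution classes can be made concrete with a short Python sketch. A transition exchanges one purine for the other (A and G) or one pyrimidine for the other (C and T); any other single-base change is a transversion. The function name is hypothetical, and the sketch covers only single-base substitutions: the rules used to separate insertions, deletions, and complex mutations are not fully specified in the text, so they are not implemented here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and variant bases are identical")
    # Both purines or both pyrimidines: the change stays within a base class.
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(f"{ref} -> {alt}: {substitution_class(ref, alt)}")
```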

C. neoformans is also characterized by several highly complex mutations. These mutations typically involve replacement of one large sequence tract (10–19 bp) with a similar-sized tract of unrelated sequence (10–28 bp). We speculated that these mutations were the result of sequence substitution mutations (Yoshiyama et al. 2001). However, complementary sequences near the inserted tracts were not found, ruling this explanation out. Attempts were made to detect two of these complex mutations by PCR, but the correct product was not detected (results not shown). Therefore, these complex mutations are possibly artifacts of some kind.

Discussion

This study has given us the first quantitative picture of the level of polymorphism present within rDNA arrays. Combining the levels of polymorphism with rDNA copy number data, it is clear the level of variation across the rDNA arrays in all five fungal species is extremely low. Indeed, in each species many repeats must be identical in sequence across the entire unit (∼7.5–9 kb of sequence), as the total number of polymorphisms is less than the rDNA copy number in every species. We also show that this level of variation is orders of magnitude lower than that of a repeat family undergoing birth-and-death evolution. Furthermore, there is no bias of the few observed polymorphisms to areas of low selective constraint, and polymorphisms present in a few copies seem to exist beneath the level of selection. Together these results demonstrate that the rDNA is evolving via concerted evolution, rather than birth-and-death evolution, and suggest that homogenization is highly efficient at maintaining the rDNA with near-identical repeats.

At first glance it may seem paradoxical that variation within the rDNA is very low when some regions of the rDNA (notably the IGS) evolve very rapidly. However, these features are easily reconciled under a model of rapid homogenization (Fig. 5). In this model there are three phases of homogenization in a hypothetical repeat array. In the first phase (mutation), mutations can occur stochastically anywhere within the repeat unit. The rDNA is highly redundant, so no selective pressure acts on these “unique” mutations and they can persist for some time. Indeed, most polymorphisms we observe seem to fall into this class: low-frequency polymorphisms that are located randomly throughout the repeat unit, irrespective of the level of constraint. In the second phase (transition), continual repeat turnover by homogenization (unequal recombination) results in mutated repeats being either deleted or duplicated (again stochastically). Deletion obviously removes that mutation from the array (homogenizes the array). Duplication starts the mutated repeat on a process where it may increase in copy number through successive duplications. This is where natural selection comes in. A deleterious mutation will only be able to increase in copy number up to a certain threshold, above which the mutation will compromise fitness. Therefore, only mutations tolerated by natural selection can increase to high copy numbers in the array. Interestingly, the three high copy-number polymorphisms found in this study (two from S. cerevisiae and one from S. paradoxus) (Fig. 2) are present in the IGS and ITS, the regions of the rDNA with the lowest selective constraint, fully consistent with them being neutral variants that are being spread by homogenization. All other polymorphisms are present at low copy number, and therefore are likely to have arisen recently or are unable to spread to high copy number because of functional constraint. The final phase is fixation, where a “tolerated” mutant repeat completely replaces the previous repeats. Thus a new, variant sequence becomes homogenized in the array.

Figure 5.

Three phases of repeat homogenization under a rapid homogenization model. First, a mutation occurs at either a selectively constrained (e.g., a coding part of the repeat), or a nonselectively constrained (e.g., a noncoding part of the repeat) site in a single unit from the stylized array. In the transition phase, only the unit with the nonselectively constrained mutation can increase to high copy number by homogenization. This mutation is able to sweep to fixation in the array. Thus, only mutations tolerated by selection can spread throughout the array, explaining why within the same repeat some regions are highly polymorphic while others are highly conserved, even though the entire repeat unit is subject to the identical homogenization process. See text for details.

A rapid homogenization model of concerted evolution is consistent with the results from numerous previous studies (Liao et al. 1997; Ganley and Scott 1998, 2002; Skalicka et al. 2003; Averbeck and Eickbush 2005; Kovarik et al. 2005). Furthermore, rDNA polymorphisms between individuals in a population (e.g., Carbone and Kohn 2001; James et al. 2001; Ganley et al. 2005) are also evidence for rapid homogenization. Our results suggest that individuals are likely to have homogenized arrays; therefore, polymorphisms between individuals in a population represent cases where homogenization has spread a mutation to all the repeats in the array of one individual, creating a fixed polymorphism between individuals. This homogenization must have occurred in the evolutionary time separating the two polymorphic individuals. Other studies also support the existence of a threshold rDNA copy number, above which deleterious mutations are not tolerated. A notable case is the R1/R2 insertion elements in Drosophila, retrotransposons that insert into and inactivate the LSU. There is a dramatic level of turnover of these elements in the rDNA, and this is consistent with inserted units being unable to achieve high copy number (because selection keeps the number of inactive units beneath a certain threshold), yet the elements are able to avoid extinction at the hand of homogenization through continual retrotransposition (e.g., Hollocher and Templeton 1994; Pérez-González and Eickbush 2002; Averbeck and Eickbush 2005). Also, it has long been known that deletions of large numbers of rDNA units (a dramatic form of inactivation) are still viable (e.g., Ritossa et al. 1966; Russell and Rodland 1986; Takeuchi et al. 2003), and more minor copy-number variation in tandemly repeated rDNA arrays is routinely observed. Conversely, a repeat unit with a disrupted coding region was recently found at high frequencies in the 5S rRNA gene repeats in Trypanosoma cruzi (Westenberger et al. 2006); it is not clear why these seemingly deleterious repeats are maintained at high copy numbers.

One unusual feature of our data, as noted in the Results section, is that the majority of polymorphisms are found in only one read. Given that these genomes were sequenced to multiple levels of coverage, we expect “true” polymorphisms to be present in multiple reads at a frequency similar to the coverage level (i.e., for a threefold coverage level, we expect to find single-copy polymorphisms in approximately three reads). We suspect that these single-read polymorphisms are the result of unknown sequencing errors. The possibility that these polymorphisms result from heavily methylated, inactive rDNA copies that are biased against in the genomic libraries, and which contain many mutations, was ruled out by digestion of S. paradoxus, A. nidulans, and C. neoformans genomic DNA with methylation-sensitive/insensitive isoschizomers. No obvious rDNA methylation was revealed in any of these species (results not shown). Another possibility is that these polymorphisms are real and are the result of cell-to-cell variation due to ongoing mutation in the rDNA during cell growth, resulting in polymorphisms between cells in a colony. In either case, only counting polymorphisms present on more than one read will be a more realistic measure of total array variation.
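The expectation that a real single-copy polymorphism should appear on roughly as many reads as the coverage level can be illustrated with a simple probability sketch. The Python code below assumes that the number of reads covering a given site in a given repeat copy follows a Poisson distribution with mean equal to the coverage. This Poisson model is an assumption introduced here for illustration and is not part of the original analysis; the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k reads when the mean read depth is `mean`."""
    if k < 0 or mean < 0:
        raise ValueError("k and mean must be non-negative")
    return mean ** k * math.exp(-mean) / math.factorial(k)


def prob_at_least(k: int, mean: float) -> float:
    """Probability of k or more reads under the same Poisson model."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


coverage = 3.0  # the threefold example used in the text

print(f"P(exactly 1 read)  = {poisson_pmf(1, coverage):.3f}")
print(f"P(2 or more reads) = {prob_at_least(2, coverage):.3f}")
```

Under these assumptions, a true single-copy variant at threefold coverage would be seen on two or more reads far more often than on exactly one, which is in line with the argument that a large excess of single-read polymorphisms points to sequencing error or to variants present in only a fraction of cells.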

No fixed (allelic) differences were found between the homologous arrays in S. paradoxus, even though fixed differences are expected in diploid organisms with low rates of sexual recombination (K. Klein, pers. comm.). We did observe one high-frequency IGS polymorphism in this species, but the polymorphism is present equally on both homologous rDNA arrays (see Supplemental Fig. 1). Although this is consistent with the idea of the cohesive spread of variants through populations by concerted evolution (Ohta and Dover 1984), we think it more likely results from a high frequency of intra-ascus mating and/or auto-diploidization (Johnson et al. 2004). High levels of inbreeding will severely limit heterozygous rDNA arrays.

Some differences between the five species were observed, particularly the high proportion of polymorphisms found on multiple reads in S. paradoxus. However, if single-read polymorphisms are excluded, the results for all five species are similar. All five species have a single rDNA array, and it will be interesting to see what pattern emerges from species with more complicated rDNA organizations. Also, sex in these species is thought to be rare or absent, and they all have a history of vegetative lab cultivation, features that will limit variation (K. Klein, pers. comm.). Finally, variation in the structural arrangement of repeat units has been detected within some species (Caburet et al. 2005), but we do not yet know how widespread this phenomenon is.

In summary, this study shows that the level of intragenomic variation in the rDNA arrays of these five fungal species is extremely low. This low level of variation provides a clear distinction between repeats evolving by concerted evolution and those evolving by birth-and-death evolution, and suggests that concerted evolution is very dynamic and efficient. The WGSS data methodology used here can be applied to other repeat arrays as well as to the rDNA of other species. Furthermore, the contrast between the variation observed in repeats undergoing concerted evolution versus repeats undergoing birth-and-death evolution provides a framework for determining the evolutionary behavior of other repeat families. These results are reassuring for researchers using the rDNA for phylogenetic purposes, as there is no evidence for major cryptic variation within the repeats, although the situation may be different in species with more complicated rDNA arrangements (e.g., multiple loci), and different life-history traits (e.g., higher rates of sexual recombination). Extending these analyses to more and varied species will help clarify this. Our results also provide critical data for the formulation of theoretical models of concerted evolution, as now both the rate of rDNA recombination (e.g., Szostak and Wu 1980; Gangloff et al. 1996) and the level of rDNA array heterogeneity are known. Finally, although our results seem to justify the decision of some genome projects not to sequence long repeat arrays such as the rDNA, such data are useful and can help us decipher the evolutionary dynamics of these intriguing regions of the genome.

Methods

Identifying polymorphisms

rDNA sequence reads were either obtained directly from the genome sequencing center (A. gossypii and C. neoformans) or (for S. paradoxus and A. nidulans) identified as follows. First, complete rDNA unit sequences were constructed by taking a portion of the rDNA (from GenBank), BLASTing (Altschul et al. 1997) this to the genome sequence using the NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?) and using the results to “walk” out in both directions until a complete unit was obtained. For S. cerevisiae, GenBank sequence U53879 was used. The complete unit was then used in a subsequent BLAST back to the genome sequence, with the searches performed conservatively to identify all rDNA reads, and these were downloaded from the NCBI Trace Archive. The consensus rDNA unit sequences for S. paradoxus and C. neoformans are deposited in the DNA Data Bank of Japan (accession nos. BR000309 and BR000310), and those of S. cerevisiae and A. nidulans are given in the Supplemental material. The A. gossypii sequence already exists (accession no. AF113137). Details for each species’ rDNA reads are as follows: A. gossypii, 2658 reads for ∼3.3-fold coverage of the rDNA; S. cerevisiae, 10,764 reads for ∼5.4-fold coverage; S. paradoxus, 5571 reads for ∼1.7-fold coverage (diploid coverage level); A. nidulans, 4996 reads for ∼7.7-fold coverage; and C. neoformans, 4206 reads for ∼4.7-fold coverage. rDNA coverage level was calculated using total rDNA array length (from rDNA copy-number results) and the total rDNA read length (obtained by multiplying rDNA read number by average read length). Average read length was obtained either directly from the genome sequencing center or calculated from total read number, sequenced genome length, and coverage level. This was 500 bp for A. gossypii, S. cerevisiae, and S. paradoxus, and 534 bp for A. nidulans. The value for C. neoformans is not known, but we used 500 bp as an estimate. rDNA coverage level calculated in this way was consistently lower than overall genome coverage level, but this is not surprising as regions of unusual chromatin structure such as centromeres and telomeres are usually underrepresented in genomic libraries (e.g., Mais et al. 2005). The value is particularly low for S. paradoxus because this is diploid rDNA coverage level, whereas haploid coverage level is normally quoted for whole genomes.
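The coverage calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (total rDNA read length divided by total rDNA array length), not the code used in this study. The function name, its parameter names, and the unit length and copy number in the example are hypothetical placeholders rather than values from the paper.

```python
def rdna_coverage(read_count, avg_read_len_bp, unit_len_bp, copy_number):
    """Return the fold coverage of an rDNA array.

    Total read length  = number of rDNA reads x average read length.
    Total array length = rDNA unit length x rDNA copy number.
    For a diploid, copy_number is assumed to be the total over both homologs.
    """
    values = (read_count, avg_read_len_bp, unit_len_bp, copy_number)
    if any(v <= 0 for v in values):
        raise ValueError("all inputs must be positive")
    total_read_len_bp = read_count * avg_read_len_bp
    array_len_bp = unit_len_bp * copy_number
    return total_read_len_bp / array_len_bp


# Hypothetical example: 2000 reads of 500 bp over an array of
# 100 copies of a 9000-bp unit (illustrative numbers only).
example = rdna_coverage(read_count=2000, avg_read_len_bp=500,
                        unit_len_bp=9000, copy_number=100)
print(f"{example:.2f}-fold coverage")
```

Substituting the read numbers and average read lengths listed above, together with the measured copy number and unit length for a given species, should give the corresponding coverage estimate.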

rDNA reads from each genome were assembled into a single unit with a high density of sequence reads using the Phred/Phrap/Consed software cluster (http://www.phrap.org/phredphrapconsed.html). Due to the large amount of data, this often required assemblies of subsets of the data. Any minor contigs formed were manually checked, and any reads that were real rDNA reads were subsequently incorporated into the main contig. Polymorphisms were then automatically identified using Consed, and every polymorphic chromatogram was checked by eye. From this, “true” polymorphisms were identified and put subjectively into two classes: high-confidence, and low-confidence polymorphisms. Reads containing polymorphisms are presented in Supplemental Table 1.

Molecular biology techniques

Strains used for molecular analyses in this study were S. cerevisiae RM11–1A, S. paradoxus NRRL Y-17217, A. nidulans FGSC-A4, and C. neoformans B-3501A, the sequenced strains obtained from the respective genome centers. Routine growth of S. paradoxus and C. neoformans was performed in YPD (1% yeast extract, 2% peptone, 2% glucose), while A. nidulans was grown on 5YEG (0.5% yeast extract; 1% glucose) plates. Genomic DNA extractions from C. neoformans were performed based on the method of Hoffman and Winston (1987). Briefly, cells suspended in Winston buffer with glass beads were first frozen at –20°C and then disrupted by shaking after addition of phenol/chloroform using a cell-disrupter (Multi-Bead Shocker, Yasui Kikai) with a 30-sec shake followed by a 60-sec rest for 12 cycles. The remainder of the procedure is identical.

Determination of rDNA copy number

Preparation of chromosomal DNA in agarose plugs for CHEF analyses was performed as follows. For S. paradoxus, CHEF plugs were prepared as previously described using 2 mg/mL of Zymolyase 20-T (Seikagaku) and a 6-h digestion (Birren et al. 1997). The method of Yelton et al. (1984) was used to prepare protoplasts from A. nidulans, with the following modifications: ∼5 × 10^7 conidia were inoculated into 100 mL of minimal medium (Scott and Kafer 1982) and grown at 37°C for 24 h; OM buffer contained 1.0 M MgSO4; mycelia were digested using Yatalase (20 mg/mL; Takara) and Kitalase (5 mg/mL; Wako); and digestion was performed for 6 h at 30°C with shaking at 150 rpm. Preparation of CHEF plugs from these protoplasts then followed that of S. paradoxus (Birren et al. 1997). CHEF plugs from C. neoformans were prepared according to the method of Smith et al. (1988), except that 100 mg/mL of lysing enzyme (Sigma) for 1 h at 37°C was used for protoplast formation. For restriction enzyme in-plug digestion, CHEF plugs were first washed three times in TE (10/1) buffer for 20 min each, followed by three washes in 300 μL of restriction enzyme buffer solution using the appropriate manufacturer’s buffer (for HindIII and BamHI, Toyobo; for AgeI, Promega) at 37°C for 20 min each. This was followed by incubation with 50–100 U of enzyme/half agarose plug and 0.01% BSA in 500 μL reaction volume for 3–16 h at 37°C, using either BamHI (S. paradoxus), HindIII (A. nidulans), or AgeI (C. neoformans). rDNA-containing fragments were separated on 1% PFC agarose (BioRad) gels using the CHEF Mapper (BioRad) with the following parameters: 6.0 V/cm; 0.2–204-sec ramped pulse time; 15.2 h; 0.5× TBE buffer; and 14°C. Southern hybridizations were performed using standard procedures (Sambrook and Russell 2001) and rDNA bands were detected using a probe from a conserved region of the LSU derived from S. cerevisiae.

Nucleotide diversity

Sequences of the polyubiquitin repeat (poly-u) arrays (after removal of introns; monomer ubiquitin sequences were not used; for S. paradoxus, homologous diploid arrays were used) are deposited in the DNA Data Bank of Japan (accession nos. BR000305 to BR000308) or, for S. cerevisiae, can be found in the Supplemental material. Nucleotide diversity values (π) were calculated using the program DnaSP version 4 (Rozas et al. 2003). rDNA arrays were constructed from rDNA copy number and high-confidence polymorphism number data after correcting for genome coverage level.
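For readers unfamiliar with π, the following Python sketch shows the standard definition: the average number of nucleotide differences per site over all pairs of sequences. It is a minimal illustration and is not the DnaSP implementation. The function name is hypothetical, the handling of gaps and ambiguous bases (columns containing them are dropped entirely) is a simplifying assumption, and the toy array of four repeats is invented for the example.

```python
def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (pi) for aligned sequences.

    Assumption: any alignment column containing a character other than
    A, C, G, or T in any sequence is excluded from the calculation.
    """
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    seqs = [s.upper() for s in aligned_seqs]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    # Keep only columns in which every sequence has an unambiguous base.
    columns = [col for col in zip(*seqs) if all(b in "ACGT" for b in col)]
    if not columns:
        raise ValueError("no usable alignment columns")

    n = len(seqs)
    n_pairs = n * (n - 1) // 2
    total_diffs = 0
    for col in columns:
        counts = {}
        for base in col:
            counts[base] = counts.get(base, 0) + 1
        # Pairs that differ at this site = all pairs - pairs sharing a base.
        identical_pairs = sum(k * (k - 1) // 2 for k in counts.values())
        total_diffs += n_pairs - identical_pairs
    return total_diffs / (n_pairs * len(columns))


# Toy array: four 10-bp repeats, one of which carries a single substitution.
toy_array = ["ACGTACGTAC"] * 3 + ["ACGTACGTAT"]
print(nucleotide_diversity(toy_array))
```

An array built in this way, with identical repeats apart from those carrying a polymorphism, mirrors in miniature the constructed rDNA arrays described above.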

Acknowledgments

Particular thanks are owed to Fred Dietrich (Duke Univ. Medical Center) for the inspiration for this project, and for providing access to the A. gossypii genome sequence data. We also thank Eula Fung (Stanford Genome Tech. Center) for providing the C. neoformans rDNA reads and for helpful discussions; Alexey Egorov (NCBI Trace Archive) for retrieving the rDNA reads for A. nidulans and S. paradoxus; and Tim James and Rytas Vilgalys (Duke Univ.), and Takashi Horiuchi (Natl. Inst. Basic Biol.) for helpful discussions. Thanks are also due to K. Klein (Minnesota State Univ.) for helpful comments on the manuscript, and the communication of unpublished results. This work was supported by grants from the Clark Fellowship for Molecular Evolution and Comparative Genomics and the Japan Society for the Promotion of Science to A.R.D.G., grants 17080010, 17370065, and 18207013 from the Ministry of Education, Science and Culture, Japan, and by a Human Frontier Science Program grant to T.K.

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press