Extremely Intron-Rich Genes in the Alveolate Ancestors Inferred with a Flexible Maximum-Likelihood Approach (original) (raw)

Journal Article

,

*Department of Computer Science and Operations Research, Université de Montréal, Montréal, Québec, Canada

†Collegium Budapest Institute for Advanced Study, Budapest, Hungary

Search for other works by this author on:

,

‡National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland

Search for other works by this author on:

‡National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland

Search for other works by this author on:

Accepted:

30 January 2008

Published:

21 February 2008

Cite

Miklós Csűrös, Igor B. Rogozin, Eugene V. Koonin, Extremely Intron-Rich Genes in the Alveolate Ancestors Inferred with a Flexible Maximum-Likelihood Approach, Molecular Biology and Evolution, Volume 25, Issue 5, May 2008, Pages 903–911, https://doi.org/10.1093/molbev/msn039
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

Chromalveolates are a large, diverse supergroup of unicellular eukaryotes that includes Apicomplexa, dinoflagellates, ciliates (three lineages that form the alveolate branch), heterokonts, haptophytes, and cryptomonads (three lineages comprising the chromist branch). All sequenced genomes of chromalveolates have relatively low intron density in protein-coding genes, and few intron positions are shared between chromalveolate lineages. In contrast, genes of different chromalveolates share many intron positions with orthologous genes from other eukaryotic supergroups, in particular, the intron-rich orthologs from animals and plants. Reconstruction of the history of intron gain and loss during the evolution of chromalveolates using a general and flexible maximum-likelihood approach indicates that genes of the ancestors of chromalveolates and, particularly, alveolates had unexpectedly high intron densities. It is estimated that the chromalveolate ancestor had, approximately, two-third of the human intron density, whereas the intron density in the genes of the alveolate ancestor is estimated to be slightly greater than the human intron density. Accordingly, it is inferred that the evolution of chromalveolates was dominated by intron loss. The conclusion that ancestral chromalveolate forms had high intron densities is unexpected because all extant unicellular eukaryotes have relatively few introns and are thought to be unable to maintain numerous introns due to intense purifying selection in their, typically, large populations. It is suggested that, at early stages of evolution, chromalveolates went through major population bottlenecks that were accompanied by intron invasion.

Introduction

Spliceosomal introns that interrupt most of the protein-coding genes and the concurrent splicing machinery that mediates intron excision and exon splicing are among the defining features of eukaryotes (Doolittle 1978; Gilbert 1978; Mattick 1994; Deutsch and Long 1999). To date, all eukaryotes with sequenced genomes, including parasitic protists with compact genomes, previously suspected to be intronless, have been shown to possess at least a few introns (Nixon et al. 2002; Simpson et al. 2002; Vanacova et al. 2005) and a (nearly) full complement of spliceosomal proteins (Collins and Penny 2005). Different species dramatically vary in their intron density, ranging from a few introns per genome to over 8 per gene (Logsdon 1998; Mourier and Jeffares 2003; Jeffares et al. 2006). Despite the ubiquity of introns in eukaryotic genomes, their natural history is poorly understood. To what extent introns are to be regarded as “junk DNA” as opposed to functional parts of the genome remains an open question. There are many reports on the contribution of introns to the regulation of gene expression (Bourdon et al. 2001; Le Hir et al. 2003; Rose 2004; Ying and Lin 2005), but it is unclear how general such functional roles of introns might be.

Much uncertainty also remains with regard to the origin and subsequent evolution of introns. For the last 30 years, the study of intron evolution had been coached, primarily, as a debate between the so-called introns-early and introns-late concepts. The introns-early view (more recently revived in the form of “introns-first”) holds that introns were part of the very first protein-coding genes and contributed to the emergence of proteins via recombination between RNA molecules that encoded short peptide (Doolittle 1978; Gilbert 1978; Gilbert and Glynias 1993; Gilbert et al. 1997; Jeffares et al. 2006). The introns-late concept counters that the primordial genes were intronless, and prokaryotic genes have remained so throughout their history, whereas eukaryotic genes have been invaded by introns only after (or during) the onset of the eukaryotic lineage (Stoltzfus et al. 1994; Logsdon et al. 1995; Logsdon 1998). Considering the absence of the spliceosome and spliceosomal introns in prokaryotes, the failure of key predictions, such as those about differences in intron phase distributions among ancient and more recent introns (Rogozin et al. 2003) and conservation of intron positions between ancient paralogs (Cho and Doolittle 1997; Sverdlov et al. 2007), and the uncertainty surrounding other types of evidence such as intron-domain correspondence (Roy and Gilbert 2006), the original introns-early concept hardly seems tenable anymore (Koonin 2006). However, comparative-genomic studies show that numerous intron positions in orthologous genes are conserved at great evolutionary depths, for example, between plants and animals (Fedorov et al. 2002; Rogozin et al. 2003). Furthermore, increasingly sophisticated reconstructions of intron gain and loss during eukaryotic evolution suggest that the protein-coding genes of ancient eukaryotic ancestors, including the Last Eukaryotic Common Ancestor (LECA), already possessed intron density comparable to that found in modern, moderately intron-rich genomes (Csuros 2005; Nguyen et al. 2005; Roy and Gilbert 2005a; Roy and Gilbert 2005b; Carmel, Wolf, et al. 2007). Accordingly, the history of eukaryotic genes, with respect to the dynamics of introns, appears to be, to a large extent, dominated by losses, perhaps, punctuated with a few episodes of major gain (Roy 2006; Carmel, Wolf, et al. 2007).

Currently, the major phylogenetic divisions of eukaryotes are conservatively envisaged as 5 supergroups, the relationships between which remain uncertain (Keeling et al. 2005; Keeling 2007). The intron-rich organisms (animals and plants, respectively) belong to two supergroups, unikonts and plantae, which also include many (relatively) intron-poor species such as, respectively, fungi and green and red algae. The finding that orthologous genes of plants and animals share ∼25% of the intron positions led to the inference of a relatively high intron content for the common ancestor of these two supergroups, which, depending on the adopted phylogeny, may or may not be the same as LECA (Rogozin et al. 2003; Roy and Gilbert 2005b; Carmel, Wolf, et al. 2007). The remaining eukaryotic supergroups so far are known to include only (relatively) intron-poor, unicellular species. Given the widespread intron loss during eukaryotic evolution, a major question is, did the evolution of these eukaryotic lineages start from an intron-poor state such that their subsequent history involved limited and, more or less, balanced loss and gain of introns, or was the ancestral state intron-rich state such that subsequent evolution comprised, mostly, differential intron loss.

Here, we address this question in the case of the chromalveolates, a vast supergroup that is an assemblage of diverse unicellular eukaryotes and encompasses up to half of all protist and algal species (Cavalier-Smith 1999; Cavalier-Smith 2004). The monophyly of chromalveolates has been originally suggested on the basis of a parsimonious scenario for plastid evolution under which the common ancestor of chromalveolates engulfed a red alga and thus acquired the plastid through secondary endosymbiosis (Cavalier-Smith 1999; Archibald 2005). Subsequently, this hypothesis received strong support from phylogenetic analysis of both plastid and nuclear genes (Fast et al. 2001; Fast et al. 2002; Harper and Keeling 2003; Harper et al. 2005). These phylogenetic studies have also established the tree topology within the chromalveolate supergroup. The chromalveolates are subdivided into 2 major groups, each consisting of 3 subgroups, all of which are diverse collections of organisms in their own right. The alveolate group encompasses Apicomplexa (including a variety of important pathogens, such as malarial plasmodium, toxoplasma, and cryptosporidium), dinoflagellates, and ciliates, whereas the chromist group consists of cryptomonads, haptophytes, and heterokonts (also known as stramenopiles). Here, we apply a general and flexible maximum-likelihood (ML) technique to the comparative-genomic analysis of 11 genomes of chromalveolates and 12 other eukaryotes and show that the common ancestors of chromalveolates and, particularly, alveolates had unexpectedly intron-rich genes.

Materials and Methods

Data

We collected gene structure data from all publicly available, complete, annotated chromalveolate genomes in which spliceosomal introns are not uncommon. As outgroups, we used all available land plant genomes and two green algal genomes, as well as a comparable number of genomes from animals and fungi. Among the available fungal and animal genomes, we selected a diverse set of intron-rich species (which are expected to convey more information about ancestral introns than intron-poor genomes). Throughout the evolutionary analyses, a fixed organismal phylogeny that includes, mostly, uncontested evolutionary relationships was used; the tree includes 3 relevant eukaryotic supergroups, Chromalveolata, Plantae, and unikonts, with the relationship between them remaining unresolved and represented as a trifurcation (Adl et al. 2005; Keeling et al. 2005). The genome sequences were extracted from GenBank, the NCBI RefSeq database, or the Joint Genome Institute database; the details on the sources of protein sequences and exon–intron structure are given in supplementary table S1 (Supplementary Material online). The following eukaryotic species were included in the analysis: Plasmodium berghei (Pber), Plasmodium chabaudi chabaudi (Pcha), Plasmodium falciparum (Pfal), Plasmodium yoelii yoelii (Pyoe), Theileria annulata (Tann), Theileria parva (Tpar), Paramecium tetraurelia (Ptet), Tetrahymena thermophyla (Tthe), Phaeodactylum tricornutum (Ftri), Phytophthora ramorum (Pram), Phytophthora sojae (Psoj), Arabidopsis thaliana (Atha), Oryza sativa ssp. japonica (Osat), Populus trichocarpa (Ptri), Chlamydomonas reinhardtii (Crei), Ostreococcus tauri (Otau), Apis mellifera (Amel), Homo sapiens (Hsap), Tribolium castaneum (Tcas), Coprinus cinereus (Ccin), Phycomyces blakesleeanus (Pbla), Phanerochaete chrysosporium (Pchr), and Rhizopus oryzae (Rory).

Paralogous Gene Sets

Sets of paralogous genes were constructed by updating and extending the database of eukaryotic clusters of orthologous genes (KOGs) as follows. First, the KOG database covering 7 eukaryotic genomes (Tatusov et al. 2003) was downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/COG/KOG/. Subsequently, each KOG was used as query to search clade-specific databases of protein sequences using the PSI-Blast program (Altschul et al. 1997; Schaffer et al. 2001). The searches were performed using command-line tools of the NCBI software development kit (Version 6.1, obtained from ftp://ftp.ncbi.nlm.nih.gov/toolbox/ncbi\_tools/ncbi.tar.gz). Using a Blast database for each of fungi, chromalveolates, and insects, the searches were performed using three iterations (switch –j 3) of PSI-Blast (blastpgp executable); for human and plant sequences, no iterations were used. For each KOG query, sequences with an E-value <10−9 were retained if they had a Blast hit score within 50% of the best hit for the species. In a further filtering step, reversed position-specific Blast search (Marchler-Bauer et al. 2007) was used to query each retained protein sequence against the Conserved Domain Database (CDD) of KOGs (rpsblast executable with default parameters). Only those sequences passed this filter for which the highest scoring KOG hit was the same as the KOG used in the initial PSI-Blast search and the second highest scoring KOG had less than 90% of the highest score. Sequences from the same genome that were thus assigned to the same KOG comprised paralogous sets.

Orthologous Genes

Within each set of paralogs, a set of putative orthologs was selected by reconciling gene and species phylogenies using the following procedure. First, we employed a novel “weaving” method to select a plausible orthologous set, which was then validated using a likelihood-based phylogeny comparison. The weaving method (Supplementary Material online) constructs a phylogeny of molecular sequences within a fixed species tree. The key technique consists of building a rooted evolutionary tree from sequences associated with two organismal lineages resulting from a speciation event, in the following manner. First, pairwise distances are computed from a multiple alignment of the sequences (the alignment is computed on the fly for each application of this technique.) Second, the tree is built by applying the Neighbor-Joining algorithm (Saitou and Nei 1987; Studier and Keppler 1988) to the distances. Inner nodes of the tree are subsequently classified as speciation or duplication nodes. Duplication nodes that have descendant speciation nodes are split so that paralogous gene lineages are identified for which duplication predates the speciation event. Only one sequence is kept as a representative from each gene lineage. This technique is applied to each bifurcation of the species tree by proceeding from the terminal taxa towards the root. The result of this weaving procedure is a set of putative orthologous lineages, which are presumably the result of gene duplication predating the root of the species tree.

The largest set of putative orthologs was elected from each set of paralogs, and a phylogeny was constructed using Neighbor-Joining. The resulting distance-based phylogeny was compared with the species tree. For this comparison, the PAML package (Yang 2007) was used to compute likelihood scores for protein sequence evolution along both phylogenies (with the software option of the WAG+Γ amino acid replacement model). A set of putative orthologs was considered valid if it contained representatives of at least 18 species and the log-likelihood score with the distance-based phylogeny was greater than the log-likelihood score with the species tree by at most 0.4 (this threshold was established by surveying the distribution of these scores across all KOGs).

Throughout the ortholog identification phase, sequences were aligned using MUSCLE (Edgar 2004) and distances were computed using the heuristic of Sonnhammer and Hollich (Sonnhammer and Hollich 2005) in conjunction with the VTML240 amino acid scoring matrix (Muller et al. 2002).

Orthologous Intron Sites

For each set of orthologous proteins, a multiple alignment was constructed using MUSCLE (Edgar 2004), the corresponding coding sequences were aligned using the protein alignment as the guide, and the intron sites were projected onto the alignment as described previously (Rogozin et al. 2003). Aligned intron-containing sites with identical phases were considered orthologous. Sites were propagated to further analysis by computationally inspecting sequence conservation around them. For each intron site within each sequence, the number of nongap amino acid positions had to be at least 4 on both the left- and right-hand sites to be categorized solid. If the number of solid positions at a site was at least 18, then it was included in the intron data set. In solid positions, 0 and 1 were used to encode absence and presence of the intron, respectively, whereas in nonsolid positions, and for missing sequences, an ambiguity character was used. The intron data set was compiled by concatenating the intron site information from all orthologous sets.

Likelihood-Based Analysis of Intron Evolution

The intron data set was analyzed in a likelihood framework described previously (Csuros 2005; Csuros et al. 2007). Briefly, the procedure is as follows. It is assumed that intron sites evolve independently under a Markov model (Steel 1994). The intron state (encoded by 0 and 1 for absence and presence) changes on each branch e of the phylogeny according to the probabilities

graphic

(1)

where λ denotes branch-specific intron gain rate, μ denotes branch-specific loss rate, and t stands for branch length. These latter parameters were set by numerical optimization of the likelihood function, while taking into account a correction for missing intron sites (Felsenstein 1992). The intron density at an ancestral node was computed as an expected value conditioned on the observed data, by summing posterior probabilities (Csuros et al. 2007). The extent of intron gains and losses along individual branches are estimated analogously, using conditional expectations.

We experimented with rate variation models in which intron sites belong to discrete loss and gain rate categories. In rate variation models, each site category is defined by a pair of gain and loss rate–modifying factors (α, β) that apply to all branches of the tree such that loss and gain rates αμ and βλ are plugged into the state transition probabilities of equation (1). We used the Bayesian Information Criterion (Schwarz 1978) to select the best rate variation model, which had two loss rates’ classes.

Confidence Intervals

Confidence interval (CIs) for estimates of ancestral intron density were obtained by using 1 000 bootstrap replicates. In each iteration, a new data set was generated by randomly selecting the same number of intron sites (independently and uniformly, with replacement). The likelihood of the new data set was maximized numerically to set gain and loss rates, as well as the rate variation parameters. Ancestral intron densities were estimated as conditional expectations. The CIs were obtained by discarding the 25 largest and the 25 smallest values from the bootstrap estimates.

Results

Shared and Unique Intron Positions in Orthologous Genes of Chromalveolates and Other Eukaryotes

The data set analyzed here consisted of 394 orthologous gene sets from 23 eukaryotes, including 11 chromalveolates, where each set was represented in at least 18 species. The species were selected to combine the chromalveolates with complete annotated genome sequences available with a maximum representation of intron-rich outgroups. The data set contained 7 030 intron-bearing sites in conserved, unambiguously aligned regions of the orthologous protein sequences (Materials and Methods).

A crucial observation is that introns are rarely found in the same position between distant chromalveolate species, with the exception of introns in Plasmodium, which often share positions with introns in Theileria, as reported previously (Roy and Penny 2006). Previous analyses have shown similar patterns of intron sharing at slightly lower levels, due to sparser taxonomic sampling (Rogozin et al. 2003; Nguyen et al. 2007; Roy and Penny 2007a). Surprisingly, in many cases, chromalveolate introns are more likely to share position with introns in orthologous genes of animals, fungi, or plants than with other chromalveolates (table 1). Thus, almost half of Phytophthora intron positions coincide with those in orthologous genes of animals, fungi, or plants. This pattern of intron sharing suggests that differential lineage-specific intron loss was a substantial, if not the primary, contributor to the observed differences in the exon–intron structure of orthologous genes among the chromalveolates.

Table 1

Introns Shared between Taxaa

Theileria Plasmodium Ciliates Phytophthora Viridiplantae Animals Opisthokonts AFV
Theileria (692) 100 13 9 4 17 17 23 29
Plasmodium (195) 46 100 16 8 21 17 23 29
Ciliates (824) 7 4 100 4 17 18 23 28
Phytophthora (288) 10 6 12 100 32 31 45 48
Theileria Plasmodium Ciliates Phytophthora Viridiplantae Animals Opisthokonts AFV
Theileria (692) 100 13 9 4 17 17 23 29
Plasmodium (195) 46 100 16 8 21 17 23 29
Ciliates (824) 7 4 100 4 17 18 23 28
Phytophthora (288) 10 6 12 100 32 31 45 48

a

Numbers denote percentages, computed as the fraction of all introns from taxa in the row's clade that coincide with at least one member of the column's set. “AFV” column refers to all animals, fungi, plants, and green algae in the data set. The numbers of intron-bearing sites are shown in parentheses in the row headers.

Table 1

Introns Shared between Taxaa

Theileria Plasmodium Ciliates Phytophthora Viridiplantae Animals Opisthokonts AFV
Theileria (692) 100 13 9 4 17 17 23 29
Plasmodium (195) 46 100 16 8 21 17 23 29
Ciliates (824) 7 4 100 4 17 18 23 28
Phytophthora (288) 10 6 12 100 32 31 45 48
Theileria Plasmodium Ciliates Phytophthora Viridiplantae Animals Opisthokonts AFV
Theileria (692) 100 13 9 4 17 17 23 29
Plasmodium (195) 46 100 16 8 21 17 23 29
Ciliates (824) 7 4 100 4 17 18 23 28
Phytophthora (288) 10 6 12 100 32 31 45 48

a

Numbers denote percentages, computed as the fraction of all introns from taxa in the row's clade that coincide with at least one member of the column's set. “AFV” column refers to all animals, fungi, plants, and green algae in the data set. The numbers of intron-bearing sites are shown in parentheses in the row headers.

Intron Gain and Loss Dynamics in Chromalveolate Lineages

The gain and loss of introns in chromalveolates were reconstructed using a likelihood framework that incorporated branch-specific intron loss and gain rates, as well as rate variation across sites embodied by two loss-rate categories where about one-fifth of modern intron sites lose introns at a 60% lower rate than the rest of the sites. The parameters of the rate categories were set by numerical optimization along with other model parameters. The number of rate categories was picked using a correction for model complexity to the likelihood score (Materials and Methods). The model imposes no constraints on the sequence of events occurring in the same site, that is, introns in a given position of an orthologous gene set can be lost and regained or gained independently in different lineages.

The reconstruction revealed a remarkable variation in intron loss and gain dynamics among chromalveolate lineages (fig. 1; the reconstructions for each of the individual set of orthologs are available at http://www.iro.umontreal.ca/~csuros/introns/Chroma23/). The exon–intron structure of orthologous genes has not changed much within the Apicomplexan genera, that is, individual species of Theileria and Plasmodium maintained the same intron density, with balanced gains and losses affecting 3–4% of their introns. This is in agreement with the recent findings of Roy and Hartl who demonstrated the stasis of gene structures within the Plasmodium genus (Roy and Hartl 2006). The branch leading to the Theileria ancestor, where ∼20% of modern Theileria introns were gained, is characterized by a slight net loss, with losses outnumbering gains, approximately, 2-fold. Intron abundance was reduced more drastically in other alveolate branches, where losses outnumber gains 3- to 6-fold (the ciliate branch and the Alveolata–Apicomplexa branch), or even more than 20-fold (the Apicomplexa-Plasmodium branch), in agreement with the previous conclusions on the high prevalence of intron loss in Apicomplexa (Roy and Penny 2007a). The present reconstruction indicates that evolution of gene structure in heterokont lineages was also dominated by massive loss of introns (fig. 1).

Inferred intron gains and losses in the evolution of eukaryotes. On each branch, the positive number is the intron gain estimate and the negative number is the intron loss estimate. Double lines highlight branches dominated by intron gain. The discs indicate the inferred intron density at deep ancestors. The boxed numbers next to the chromalveolate lineages give the intron count in diatom or the inferred intron content in the group's common ancestor.

FIG. 1.—

Inferred intron gains and losses in the evolution of eukaryotes. On each branch, the positive number is the intron gain estimate and the negative number is the intron loss estimate. Double lines highlight branches dominated by intron gain. The discs indicate the inferred intron density at deep ancestors. The boxed numbers next to the chromalveolate lineages give the intron count in diatom or the inferred intron content in the group's common ancestor.

The extensive intron loss is often accompanied by modest but nonnegligible intron gain. Among the chromalveolates, in the extreme case of the diatom P. tricornutum, these recent gains account for ∼90% of the few introns present in the genes of this organism (fig. 2). A similar pattern has been detected and thoroughly discussed by Roy and Penny for the diatom Thalassiosira pseudonana (Roy and Penny 2007b). Phytophthora is estimated to have gained a comparable number of introns in the same time interval but underwent a less extreme reduction such that about 50% of the introns predate the chromalveolate ancestor (fig. 1). In alveolates, recent lineage-specific gains (on branches below the apicomplexan and ciliate ancestors) account for 20–30% of the extant introns in Theileria, Plasmodium, and Tetrahymena and ∼46% of the extant introns in Paramecium (fig. 2).

Origin of alveolate introns. Bars show the predicted branch on which modern introns were gained in the respective species. Numbers next to the organism names show the number of introns in the data set for the species; numbers within the bars show the inferred numbers of introns of different provenance. For each species, the surveyed branches are the terminal species-specific branch, the branch descending from the alveolate ancestor (to the apicomplexan or ciliate ancestor), as well as the branches Chromalveolata–Alveolata, Bikonta–Chromalveolata, root-Bikonta, and the protoeukaryote branch leading to the root (branches are identified by the descendant taxon). For Plasmodium falciparum and Theileria annulata, the branch leading from the apicomplexan ancestor to the genus ancestor (Plasmodium or Theileria) is also considered.

FIG. 2.—

Origin of alveolate introns. Bars show the predicted branch on which modern introns were gained in the respective species. Numbers next to the organism names show the number of introns in the data set for the species; numbers within the bars show the inferred numbers of introns of different provenance. For each species, the surveyed branches are the terminal species-specific branch, the branch descending from the alveolate ancestor (to the apicomplexan or ciliate ancestor), as well as the branches Chromalveolata–Alveolata, Bikonta–Chromalveolata, root-Bikonta, and the protoeukaryote branch leading to the root (branches are identified by the descendant taxon). For Plasmodium falciparum and Theileria annulata, the branch leading from the apicomplexan ancestor to the genus ancestor (Plasmodium or Theileria) is also considered.

High Intron Density in Deep Ancestors of Chromalveolates

We considered four alveolate lineages: Plasmodium, Theileria, Paramecium, and Tetrahymena. The key aspects of chromalveolate intron evolution are apparent in the pattern of intron sharing between these lineages (table 2). First, introns that are shared between alveolate and nonalveolate organisms most often appear in only one alveolate lineage (specifically, in 72% of the cases). Considering the relatively low level of parallel intron gain in the same position (estimated at <20% even for the most distant eukaryotes; Sverdlov et al. 2005; Carmel, Rogozin, et al. 2007), these shared introns were, most likely, present in the alveolate ancestor, and so their presence in only a subset of the chromalveolate lineages attests to extensive, lineage-specific intron loss. Second, introns that are unique to chromalveolates exhibit an even more skewed distribution among lineages than introns that are conserved outside the supergroup. Indeed, introns that are shared with other eukaryotes are significantly more likely to appear in multiple chromalveolate lineages than supergroup-specific introns. Introns that appear in, at least, two chromalveolate lineages and are shared with nonchromalveolates are significantly more likely to appear in three or more lineages than chromalveolate-specific introns (P < 6.2 × 10−3, one-tailed Fisher's exact test). This difference is likely to stem from a combination of the substantial between-sites variation of the intron loss rate (Carmel, Wolf, et al. 2007) and the relatively recent origin of some chromalveolate-specific introns (Roy and Penny 2007b).

Table 2

Distribution of Shared Introns in Alveolate Lineages (Plasmodium, Theileria, Paramecium, and Tetrahymena)a

Lineages Shared Unique
1 299 999
2 96 125
3 or 4 22 10
Lineages Shared Unique
1 299 999
2 96 125
3 or 4 22 10

a

The “Shared” column shows the distribution of the number of alveolate lineages in which a site is occupied by introns that are also shared with at least one nonalveolate lineage. The “Unique” column shows the same distribution for intron-bearing sites that are unique to alveolates.

Table 2

Distribution of Shared Introns in Alveolate Lineages (Plasmodium, Theileria, Paramecium, and Tetrahymena)a

Lineages Shared Unique
1 299 999
2 96 125
3 or 4 22 10
Lineages Shared Unique
1 299 999
2 96 125
3 or 4 22 10

a

The “Shared” column shows the distribution of the number of alveolate lineages in which a site is occupied by introns that are also shared with at least one nonalveolate lineage. The “Unique” column shows the same distribution for intron-bearing sites that are unique to alveolates.

The inferred intron densities in the ancestors of alveolates and chromalveolates are remarkably high (fig. 3). Specifically, the alveolate ancestor is estimated to have had a slightly greater intron density than humans, whereas the ancestor of the chromalveolate supergroup would have ∼65% of that density. Strikingly, the estimated intron density of the alveolate ancestor is somewhat greater than the intron density in the plant (∼91% of the human density) and opisthokont (∼78% of the human density) ancestors estimated with the same method (fig. 3). The latter estimates were only slightly higher than those obtained previously with more constrained ML models (Nguyen et al. 2005; Roy and Gilbert 2005b; Carmel, Wolf, et al. 2007; Csuros et al. 2007).

Intron densities in the analyzed eukaryotic species and inferred intron densities for ancestral forms. For the ancestral forms (internal nodes of the tree), 95%CIs are shown by the error bars, which were established by bootstrapping (Materials and Methods). Scaling on the right-hand side was obtained by projecting the intron counts onto density by using humans as reference (4 603 introns in coding sequences of total length 671 877 bp).

FIG. 3.—

Intron densities in the analyzed eukaryotic species and inferred intron densities for ancestral forms. For the ancestral forms (internal nodes of the tree), 95%CIs are shown by the error bars, which were established by bootstrapping (Materials and Methods). Scaling on the right-hand side was obtained by projecting the intron counts onto density by using humans as reference (4 603 introns in coding sequences of total length 671 877 bp).

Although modern alveolates have an intron density that is at least 60% lower than the current estimate for the ancestral form, ∼72% of the inferred ancestral alveolate intron positions are shared by at least one extant, nonalveolate eukaryote. The uncertainty of the ancestral alveolate density estimate is relatively high (∼19% coefficient of variation in bootstrap experiments), but even conservative estimates exceed two-thirds of the modern human intron density (P < 0.05 in bootstrap experiments). Similar high estimates were obtained with the three possible branching orders for the supergroups and when different conservation criteria were applied for identification of homologous intron sites (Supplementary Material online). Furthermore, the possibility of a numerical optimization artifact was ruled out by examining the surface of the likelihood function (Supplementary Material online).

Discussion

The results of this study appear counterintuitive in that a very high intron density is confidently predicted for the ancestors of eukaryotic groups that (at least, so far) do not contain a single intron-rich species. This prediction became possible because a general and flexible ML method was applied to a diverse set of species. Adequate taxon sampling is particularly crucial for the reconstruction of evolution characterized by dramatic reduction of intron frequency in multiple lineages within a eukaryotic group such as the chromalveolates. In a case like this, evidence of a high ancestral intron density in the examined group can be obtained only through analysis of a diverse set of species because the genomes within the group share very few intron positions with each other but, collectively, retain many ancestral intron positions shared with some intron-rich genomes outside the group. The estimates of the rates of parallel intron gain obtained here are, generally, compatible with the previous estimates (Sverdlov et al. 2005; Carmel, Rogozin, et al. 2007) and indicate that the shared introns are predominantly ancestral rather than those acquired in different lineages independently. The high level of intron conservation between chromalveolates and representatives of other eukaryotic supergroups, such as plants and animals, suggests that the intriguing possibility that, at least, some of the conserved introns retain ancestral functions throughout eukaryotic evolution. Indeed, introns often affect the expression of genes at several levels including mRNA export, stability, and translation efficiency (Le Hir et al. 2003). However, the loss of most of the ancestral introns in some of the chromalveolate lineages indicates that if such ancestral functions of introns exist, they are not unconditionally essential.

The trend toward an upward revision of inferred ancestral intron densities is seen in recent reconstructions (Carmel, Wolf, et al. 2007; Csuros et al. 2007) compared with previous analyses, even those performed with methods that might be prone to statistical bias (Roy and Gilbert 2005b). Conceivably, given that the current collection of eukaryotic genomes (improved as it is) hardly can be considered representative of each supergroup, even the estimates in this work are conservative.

Another important factor is the number of orthologous genes included in the data set as this determines the number of intron sites. Given the large loss and gain rate variation between lineages, several thousand sites are necessary to produce accurate rate estimates. In addition, with too few intron sites, models with rate variation cannot be used because there is insufficient information to partition the sites into rate categories. For instance, when only half of the intron sites contained in the present data set are analyzed, a constant-rate model has almost as much statistical support as a two-loss-rates’ model, and a model complexity penalty (e.g., the Bayesian Information Criterion) will outvote the rate variation model (data not shown). An inevitable complication is that the requirement for a large number of intron sites forces one to include ambiguous entries in the data table. This ambiguity between intron absence and presence in homologous intron sites can be caused by ortholog misidentification, a genuine lack of an ortholog in one or more species due to lineage-specific gene loss, or uncertainties in multiple alignments. Accordingly, a correction for missing data was employed in the present analysis (Materials and Methods). The missing data problem notwithstanding, analysis of a data set that includes sufficiently large numbers of species, genes, and sites is crucial for reaching robust inferences on ancestral gene structures. A case in point is the recent work of Nguyen et al. (2007) in which 162 orthologous genes from 9 alveolate species were analyzed, yielding an estimate of the ancestral alveolate intron density close to that in Tetrahymena. This appears to be a substantial underestimate, likely, caused by an inadequate choice of the outgroup (only one nonalveolate species, human, was used for comparison), combined with scant taxon sampling (no chromists and only one ciliate), and a small data set to which only a constant-rate model could be applied, as a result of the imposed requirements of completely resolved orthologous gene sets and intron sites.

The inference of very high intron densities for the chromalveolate and, particularly, alveolate ancestral forms is generally compatible with other recent inferences of intron-rich eukaryotic ancestors (Csuros 2005; Nguyen et al. 2005; Roy and Gilbert 2005b; Carmel, Wolf, et al. 2007; Csuros et al. 2007). However, in more specific terms, these findings appear unexpected inasmuch as, so far, no particularly intron-rich unicellular eukaryotes have been identified. The (relative) paucity of introns in the genomes of unicellular organisms has been interpreted from the standpoint of a general population-genetic theory according to which intensive purifying selection in the large populations of these organisms prevents retention of a large number of introns (Lynch and Conery 2003; Lynch 2006). The present results strongly suggest that unicellular eukaryotes with very high intron densities did exist in the remote past. This conclusion seems to indicate that early stages of eukaryotic evolution, beginning with the eukaryogenesis itself (Martin and Koonin 2006), involved major bottlenecks during which extensive intron gain had occurred. The present reconstruction shows that the high intron density in the common ancestor of the chromalveolate supergroup is, essentially, the heritage of eukaryogenesis that is thought to have involved massive invasion of group II introns into the emerging nuclear genome, possibly from the mitochondrial endosymbiont (Lynch and Richardson 2002; Lambowitz and Zimmerly, 2004; Martin and Koonin 2006). The emergence of chromalveolates appears to be connected with a secondary endosymbiosis (Cavalier-Smith 1999; Cavalier-Smith 2004; Archibald 2005); however, this event did not seem to bring about another wave of intron invasion. By contrast, major intron gain is inferred to have occurred at the onset of the alveolate group (figs. 1 and 3), presumably, as a result of yet another population bottleneck. The rest of the chromalveolate evolution, including the origin of heterokonts, was apparently dominated by intron loss, presumably, following independent increases in the effective population size in each lineage. To a large extent, this extensive elimination of introns might have been mediated by retrotransposon activity as suggested by Roy and Penny (Roy and Penny 2007a). In future genome analyses, it would be of interest to investigate other correlates of intensive purifying selection in these lineages of chromalveolates, such as the extent of gene loss.

Conclusions

The results of this work indicate that ancestral forms in a eukaryotic supergroup that consists exclusively of unicellular and relatively intron-poor organisms were, in all likelihood, extremely intron rich — possibly, more so than modern multicellular eukaryotes with the most complex genomes. Given the extensive lineage-specific intron loss that apparently dominated the evolution of chromalveolates, this conclusion could be reached only by analyzing a large set of orthologous genes from a representative set of species. As shown here and elsewhere, the ancestors of plantae and unikonts are also estimated to have been intron rich although, paradoxically, somewhat less so than the chromalveolate and alveolate ancestors. For the remaining two eukaryotic supergroups that include only protists, Rhizaria and Excavates, there are currently no sufficiently intron-rich genomes to perform similar reconstructions. When such genomes become available, it will become possible to obtain a reasonably complete scenario of early evolution of eukaryotic gene structure.

This research was supported by the Intramural Research Program of the NIH (National Library of Medicine, National Center for Biotechnology Information) and by a research grant from the National Sciences and Engineering Research Council of Canada.

References

et al. ,

(28 co-authors)

.

The new higher level classification of eukaryotes with emphasis on the taxonomy of protists

,

J Eukaryot Microbiol

,

2005

, vol.

52

(pg.

399

-

451

)

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs

,

Nucleic Acids Res

,

1997

, vol.

25

(pg.

3389

-

3402

)

Jumping genes and shrinking genomes—probing the evolution of eukaryotic photosynthesis with genomics

,

IUBMB Life

,

2005

, vol.

57

(pg.

539

-

547

)

Introns and their positions affect the translational activity of mRNA in plant cells

,

EMBO Rep

,

2001

, vol.

2

(pg.

394

-

398

)

Patterns of intron gain and conservation in eukaryotic genes

,

BMC Evol Biol

,

2007

, vol.

7

pg.

192

Three distinct modes of intron dynamics in the evolution of eukaryotes

,

Genome Res

,

2007

, vol.

17

(pg.

1034

-

1044

)

Principles of protein and lipid targeting in secondary symbiogenesis: euglenoid, dinoflagellate, and sporozoan plastid origins and the eukaryote family tree

,

J Eukaryot Microbiol

,

1999

, vol.

46

(pg.

347

-

366

)

Chromalveolate diversity and cell megaevolution: interplay of membranes, genomes and cytoskeleton

,

Organelles, Genomes and Eukaryotic Evolution

,

2004

An Evolutionary Synthesis in the Age of Genomics (Systematics Association Special Volume No. 68). Boca Raton, FL: CRC Press. pp. 75–108

Intron distribution in ancient paralogs supports random insertion and not random loss

,

J Mol Evol

,

1997

, vol.

44

(pg.

573

-

584

)

Complex spliceosomal organization ancestral to extant eukaryotes

,

Mol Biol Evol

,

2005

, vol.

22

(pg.

1053

-

1066

)

Likely scenarios of intron evolution. Comparative genomics

,

Lect Notes Comput Sci

,

2005

, vol.

3678

(pg.

47

-

60

)

In search of lost introns

,

Bioinformatics

,

2007

, vol.

23

(pg.

i87

-

96

)

Intron-exon structures of eukaryotic model organisms

,

Nucleic Acids Res

,

1999

, vol.

27

(pg.

3219

-

3228

)

Genes in pieces: were they ever together?

,

Nature

,

1978

, vol.

272

(pg.

581

-

582

)

MUSCLE: multiple sequence alignment with high accuracy and high throughput

,

Nucleic Acids Res

,

2004

, vol.

32

(pg.

1792

-

1797

)

Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids

,

Mol Biol Evol

,

2001

, vol.

18

(pg.

418

-

426

)

Re-examining alveolate evolution using multiple protein molecular phylogenies

,

J Eukaryot Microbiol

,

2002

, vol.

49

(pg.

30

-

37

)

Large-scale comparison of intron positions among animal, plant, and fungal genes

,

Proc Natl Acad Sci USA

,

2002

, vol.

99

(pg.

16128

-

16133

)

Phylogenies from restriction sites: a maximum likelihood approach

,

Evolution

,

1992

, vol.

46

(pg.

159

-

173

)

Why genes in pieces?

,

Nature

,

1978

, vol.

271

pg.

501

Origin of genes

,

Proc Natl Acad Sci USA

,

1997

, vol.

94

(pg.

7698

-

7703

)

On the ancient nature of introns

,

Gene

,

1993

, vol.

135

(pg.

137

-

144

)

Nucleus-encoded, plastid-targeted glyceraldehyde-3-phosphate dehydrogenase (GAPDH) indicates a single origin for chromalveolate plastids

,

Mol Biol Evol

,

2003

, vol.

20

(pg.

1730

-

1735

)

On the monophyly of chromalveolates using a six-protein phylogeny of eukaryotes

,

Int J Syst Evol Microbiol

,

2005

, vol.

55

(pg.

487

-

496

)

The biology of intron gain and loss

,

Trends Genet

,

2006

, vol.

22

(pg.

16

-

22

)

Genomics. Deep questions in the tree of life

,

Science

,

2007

, vol.

317

(pg.

1875

-

1876

)

The tree of eukaryotes

,

Trends Ecol Evol

,

2005

, vol.

20

(pg.

670

-

676

)

The origin of introns and their role in eukaryogenesis: a compromise solution to the introns-early versus introns-late debate?

,

Biol Direct

,

2006

, vol.

1

pg.

22

Mobile group II introns

,

Annu Rev Genet

,

2004

, vol.

38

(pg.

1

-

35

)

How introns influence and enhance eukaryotic gene expression

,

Trends Biochem Sci

,

2003

, vol.

28

(pg.

215

-

220

)

The recent origins of spliceosomal introns revisited

,

Curr Opin Genet Dev

,

1998

, vol.

8

(pg.

637

-

648

)

Seven newly discovered intron positions in the triose-phosphate isomerase gene: evidence for the introns-late theory

,

Proc Natl Acad Sci USA

,

1995

, vol.

92

(pg.

8507

-

8511

)

The origins of eukaryotic gene structure

,

Mol Biol Evol

,

2006

, vol.

23

(pg.

450

-

468

)

The origins of genome complexity

,

Science

,

2003

, vol.

302

(pg.

1401

-

1404

)

The evolution of spliceosomal introns

,

Curr Opin Genet Dev

,

2002

, vol.

12

(pg.

701

-

710

)

et al.

(25 co-authors)

CDD: a conserved domain database for interactive domain family analysis

,

Nucleic Acids Res

,

2007

, vol.

35

(pg.

D237

-

D240

)

Introns and the origin of nucleus-cytosol compartmentalization

,

Nature

,

2006

, vol.

440

(pg.

41

-

45

)

Introns: evolution and function

,

Curr Opin Genet Dev

,

1994

, vol.

4

(pg.

823

-

831

)

Eukaryotic intron loss

,

Science

,

2003

, vol.

300

pg.

1393

Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method

,

Mol Biol Evol

,

2002

, vol.

19

(pg.

8

-

13

)

New maximum likelihood estimators for eukaryotic intron evolution

,

PLoS Comput Biol

,

2005

, vol.

1

pg.

e79

The evolution of spliceosomal introns in alveolates

,

Mol Biol Evol

,

2007

, vol.

24

(pg.

1093

-

1096

)

A spliceosomal intron in Giardia lamblia

,

Proc Natl Acad Sci USA

,

2002

, vol.

99

(pg.

3701

-

3705

)

Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution

,

Curr Biol

,

2003

, vol.

13

(pg.

1512

-

1517

)

The effect of intron location on intron-mediated enhancement of gene expression in Arabidopsis

,

Plant J

,

2004

, vol.

40

(pg.

744

-

751

)

Intron-rich ancestors

,

Trends Genet

,

2006

, vol.

22

(pg.

468

-

471

)

Complex early genes

,

Proc Natl Acad Sci USA

,

2005b

, vol.

102

(pg.

1986

-

1991

)

Rates of intron loss and gain: implications for early eukaryotic evolution

,

Proc Natl Acad Sci USA

,

2005a

, vol.

102

(pg.

5773

-

5778

)

The evolution of spliceosomal introns: patterns, puzzles and progress

,

Nat Rev Genet

,

2006

, vol.

7

(pg.

211

-

221

)

Very little intron loss/gain in Plasmodium: intron loss/gain mutation rates and intron number

,

Genome Res

,

2006

, vol.

16

(pg.

750

-

756

)

Large-scale intron conservation and order-of-magnitude variation in intron loss/gain rates in apicomplexan evolution

,

Genome Res

,

2006

, vol.

16

(pg.

1270

-

1275

)

A very high fraction of unique intron positions in the intron-rich diatom Thalassiosira pseudonana indicates widespread intron gain

,

Mol Biol Evol

,

2007b

, vol.

24

(pg.

1447

-

1457

)

Widespread intron loss suggests retrotransposon activity in ancient apicomplexans

,

Mol Biol Evol

,

2007a

, vol.

24

(pg.

1926

-

1933

)

The neighbor-joining method: a new method for reconstructing phylogenetic trees

,

Mol Biol Evol

,

1987

, vol.

4

(pg.

406

-

425

)

Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements

,

Nucleic Acids Res

,

2001

, vol.

29

(pg.

2994

-

3005

)

Estimating the dimensions of a model

,

Ann Stat

,

1978

, vol.

6

(pg.

461

-

464

)

Eukaryotic evolution: early origin of canonical introns

,

Nature

,

2002

, vol.

419

pg.

270

Scoredist: a simple and robust protein sequence distance estimator

,

BMC Bioinformatics

,

2005

, vol.

6

pg.

108

Recovering a tree from the leaf colourations it generates under a Markov model

,

Appl Math Lett

,

1994

, vol.

7

(pg.

19

-

24

)

Testing the exon theory of genes: the evidence from protein structure

,

Science

,

1994

, vol.

265

(pg.

202

-

207

)

A note on the neighbor-joining algorithm of Saitou and Nei

,

Mol Biol Evol

,

1988

, vol.

5

(pg.

729

-

731

)

A glimpse of a putative pre-intron phase of eukaryotic evolution

,

Trends Genet

,

2007

, vol.

23

(pg.

105

-

108

)

Conservation versus parallel gains in intron evolution

,

Nucleic Acids Res

,

2005

, vol.

33

(pg.

1741

-

1748

)

(17 co-authors)

The COG database: an updated version includes eukaryotes

,

BMC Bioinformatics

,

2003

, vol.

4

pg.

41

Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis

,

Proc Natl Acad Sci USA

,

2005

, vol.

102

(pg.

4430

-

4435

)

PAML 4: phylogenetic analysis by maximum likelihood

,

Mol Biol Evol

,

2007

, vol.

24

(pg.

1586

-

1591

)

Intronic microRNAs

,

Biochem Biophys Res Commun

,

2005

, vol.

326

(pg.

515

-

520

)

Author notes

Aoife McLysaght, Associate Editor

© 2008 The Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data

Citations

Views

Altmetric

Metrics

Total Views 1,350

1,061 Pageviews

289 PDF Downloads

Since 11/1/2016

Month: Total Views:
November 2016 1
December 2016 1
January 2017 2
February 2017 7
March 2017 11
April 2017 9
May 2017 8
June 2017 6
July 2017 11
August 2017 6
September 2017 10
October 2017 10
November 2017 8
December 2017 26
January 2018 17
February 2018 19
March 2018 25
April 2018 20
May 2018 13
June 2018 16
July 2018 26
August 2018 12
September 2018 25
October 2018 21
November 2018 12
December 2018 15
January 2019 13
February 2019 14
March 2019 19
April 2019 34
May 2019 14
June 2019 17
July 2019 14
August 2019 18
September 2019 22
October 2019 24
November 2019 20
December 2019 12
January 2020 18
February 2020 15
March 2020 10
April 2020 8
May 2020 8
June 2020 16
July 2020 71
August 2020 14
September 2020 8
October 2020 30
November 2020 13
December 2020 16
January 2021 9
February 2021 6
March 2021 11
April 2021 14
May 2021 14
June 2021 8
July 2021 7
August 2021 9
September 2021 9
October 2021 7
November 2021 21
December 2021 7
January 2022 8
February 2022 10
March 2022 4
April 2022 10
May 2022 12
June 2022 7
July 2022 15
August 2022 13
September 2022 14
October 2022 9
November 2022 5
December 2022 30
January 2023 11
February 2023 16
March 2023 17
April 2023 16
May 2023 16
June 2023 12
July 2023 5
August 2023 13
September 2023 4
October 2023 8
November 2023 16
December 2023 15
January 2024 14
February 2024 14
March 2024 15
April 2024 13
May 2024 13
June 2024 27
July 2024 22
August 2024 13
September 2024 16
October 2024 10

Citations

45 Web of Science

×

Email alerts

Email alerts

Citing articles via

More from Oxford Academic