RNA-based gene duplication: mechanistic and evolutionary insights

Author manuscript; available in PMC: 2013 Jun 24.

Published in final edited form as: Nat Rev Genet. 2009 Jan;10(1):19–31. doi: 10.1038/nrg2487

Abstract

Gene copies that stem from the mRNAs of parental source genes have long been viewed as evolutionary dead-ends with little biological relevance. Here we review a range of recent studies that have unveiled a significant number of functional retroposed gene copies in mammalian and also some non-mammalian genomes (in particular that of the fruitfly). These studies not only revealed previously unknown mechanisms for the emergence of new genes and their functions but also provided fascinating general insights into molecular and evolutionary processes that have shaped genomes. For example, analyses of chromosomal gene movement patterns via RNA-based gene duplication have shed new light on the evolutionary origin and biology of our sex chromosomes.


The process of the “birth” of a new gene has fascinated biologists for a long time1,2, not least because new genes are thought to contribute to the origin of adaptive evolutionary novelties and thus lineage- or species-specific phenotypic traits1,3. A major mechanism underlying the formation of new genes is gene duplication2. Traditionally, only DNA-mediated duplication mechanisms (i.e. duplication of chromosomal segments containing genes) have been considered and widely studied in this context (reviewed e.g. in refs 4,5), although gene copies originating through an alternative mechanism - the reverse-transcription of mRNA intermediates - have been described since the early 1980s6-8. These intronless retroposed gene copies were long dismissed a priori as “dead-on-arrival” (refs 9-12) and routinely classified as processed pseudogenes13 due to their expected lack of regulatory elements and the presence of mutations, such as premature stop codons, in many copies. Indeed, they were mainly considered a nuisance and a confounding factor in transcription surveys because of their often high sequence similarity with parental source genes.

However, after some anecdotal findings of functional retroposed genes since the late 1980s (e.g. ref. 14), an unexpectedly large number of functional retrogenes have recently been discovered - mainly in mammals and fruitflies (e.g. refs 15-19). These studies revealed that retrogenes often evolved functional roles in the male germline (e.g. refs 16,17), while other intriguing retrogene functions - e.g. in anti-viral defense20, in hormone-pheromone metabolism21,22, in the brain23, or in courtship behaviors24 - have also been postulated. More fundamentally, retrogene analyses have uncovered novel mechanisms with respect to how new genes may arise (e.g. the recruitment of regulatory elements) and obtain new functions (e.g. through gene fusion and adaptive evolution). Finally, retroposed gene copies have served as unique genomic markers, increasing our understanding of various genomic processes, ranging from the detection of extinct transcripts25 to the origin of our sex chromosomes17. All of these findings were only possible thanks to the growing number of complete genome sequences and were achieved by targeted cross-disciplinary approaches, which involved evolutionary analysis, mining of available large-scale expression data, and molecular/genomics experiments.

This review aims to cover the most exciting insights obtained from the study of RNA-based gene duplication, focusing on functionally relevant aspects of protein-coding retrogenes. Given that the process of retroduplication is most abundant and/or best studied in mammals and fruitflies, we will focus our discussion on these organisms. Specifically, after briefly introducing the process of retroduplication, we first discuss the abundance of retrocopies and functional retrogenes in mammals and Drosophila. We then proceed with a discussion on how retrocopies may become transcribed and functional, which is followed by an overview of novel mechanisms underlying the emergence of new gene functions that were uncovered in detailed surveys of young retrogenes. We then discuss a major functional role of retrogenes in the male germline, which is related to the biology and evolution of X chromosomes. Finally, we round off the review with a discussion of other general insights pertaining to mammalian genome evolution obtained from global retrocopy surveys and some concluding remarks on potential future research directions.

Mechanisms of retroposition

To be heritable and hence of evolutionary relevance, retroposition/retroduplication (these terms are used interchangeably) needs to occur in the germline (or during early embryonic stages). Thus, retroposition requires an enzymatic machinery that not only can reverse-transcribe and integrate fully processed cDNA copies of mRNAs from parental source genes into the genome but that is also active in the germline. The fact that retroposition relies on the duplication through an mRNA intermediate also implies that only genes expressed in the germline can be duplicated via this mechanism.

The key retroduplication enzyme, reverse transcriptase, appears to generally stem from different types of retrotransposable elements, depending on the organism. In mammals, long interspersed nuclear elements (LINEs) seem to provide the enzymes necessary for retroposition. These retrotransposable elements encode a reverse transcriptase with endonucleolytic activity that can recognize any polyadenylated mRNA26,27. Esnault et al. and Wei et al. demonstrated that the L1 subfamily of LINEs can generate processed genes28,29, indicating that L1 retrotransposon activity has generated retroposed gene copies in mammals. The process of retroposition (including the hallmarks of retroposed gene copies) is detailed in Figure 1.

Figure 1. Mechanism of gene retroposition.

(A) Gene retroposition is initiated with the transcription of a parental gene by RNA polymerase II and (B) further processing of its RNA (splicing and polyadenylation), which produces a mature mRNA. (C) Gene retroposition is mediated by the L1 endonuclease domain (pink hourglass) that creates a first nick (yellow star) at the genomic site of insertion at the TTAAAA target sequence. (D) This nick enables the priming of the reverse transcription (by the L1 reverse transcription domain; pink oval shape), which uses the parental mRNA as template. (E) Second strand nick generation (precise mechanism not known). (F) Second DNA strand synthesis (precise mechanism not known). (G) Complementary DNA synthesis in overhang regions created by the two nicks, which creates a duplication of the sequence flanking the target sequence, which is one of the molecular signatures of gene retroposition, in addition to the lack of introns and the presence of a poly-A tail (the direct repeats and the poly-A tail degenerate over time and are therefore usually only detectable in recent retrocopies). The illustration is based on findings described in references 26-28.
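Two of the molecular signatures named in the legend, the poly-A tail and the flanking direct repeats, lend themselves to a simple sequence check. The following is only a minimal sketch under stated assumptions: the function names, the exact-match criterion, and the `max_len` cutoff are hypothetical choices for illustration and are not taken from the studies cited here. Real screens would have to tolerate mismatches, because both signatures degenerate over time.

```python
def polya_tail_length(seq: str) -> int:
    """Length of the uninterrupted run of A's at the 3' end of seq."""
    seq = seq.upper()
    return len(seq) - len(seq.rstrip("A"))


def longest_flanking_repeat(upstream: str, downstream: str, max_len: int = 30) -> str:
    """Longest exact string that ends the upstream flank and starts the downstream flank.

    Assumption: `upstream` ends directly at the 5' boundary of the insert and
    `downstream` starts directly after its poly-A tail, so a shared string at
    these positions is a candidate direct repeat. Only exact matches are found.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    # Try the longest candidate first and shrink until the two flanks agree.
    for k in range(min(max_len, len(upstream), len(downstream)), 0, -1):
        if upstream[-k:] == downstream[:k]:
            return upstream[-k:]
    return ""
```

A candidate with a long poly-A run and a shared flanking string would merely be consistent with recent retroposition; the lack of introns relative to the parental gene still has to be established separately by alignment.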

Retrotransposable element-encoded enzymes are likely also responsible for retroposition in Drosophila10,30 and some plants31,32, which carry various retrotransposons with reverse transcriptase activity, although the retroposition machinery has not been studied in detail in these organisms to date. The paucity of retrocopies in non-mammalian vertebrates is likely explained by the lack of retrotransposons with reverse transcriptases that can process standard mRNAs. For example, bird genomes contain a relatively large number of CR1 LINE elements33, but CR1 reverse transcriptases cannot recognize polyadenylated mRNAs (due to their specificity towards a different target sequence) and are thus incapable of promoting retroposition of mRNAs from genes in the genome34. The small number of RNA-based gene copies in birds34 seems to have been mediated by retroviral mechanisms35.

Rates of retrocopy and retrogene formation

Given that retrocopies are particularly abundant in mammals11,17-19,36 (due to the high activity of L1 LINE elements), we first discuss the rates of retrocopy and functional retrogene formation in mammals and then in Drosophila. Thousands of retrocopies have been identified in several placental mammal (eutherian) genomes11,17,18,36. This suggests a high rate of retrocopy formation during the evolution of this mammalian lineage. However, the rate of retroposition has not been constant, with periods of very high and low activity11,37,38, likely due to the fluctuating activity of L1 elements (see also BOX 1). Recently, approximately 2000 retrocopies were identified in the opossum genome17, suggesting a similarly high retroposition rate in metatherians (marsupials). Only a few retrocopies (on the order of 50) seem to be present in the platypus genome (Soumillon et al., unpublished), consistent with the paucity of L1 elements in monotremes39, the most basal mammalian lineage.

Box 1. Retrocopies as genomic archives.

Generally, retrocopies may serve as useful genomic markers of transcript activity during evolution. For example (as indicated in the ‘Rates of retrocopy and retrogene formation’ section), as retroposition is mediated by LINE elements, the rate of retrocopy generation (which may be calculated on the basis of the divergence of retrocopies and parental genes at synonymous sites) can be used to explore the activity of LINE retrotransposons during evolution.
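As a rough illustration of this dating idea, the sketch below converts synonymous divergence between a retrocopy and its parental gene into an approximate age and then bins the ages to look for bursts of retroposition. It assumes a simple molecular clock in which both copies accumulate synonymous substitutions at the same constant rate, so that age is approximately Ks / (2 x rate). The function names and any rate passed in are hypothetical placeholders, not values from the studies cited here.

```python
from collections import Counter


def retrocopy_age_my(ks: float, rate_per_site_per_my: float) -> float:
    """Approximate age in million years from synonymous divergence (Ks).

    Assumes a constant clock: divergence accumulates along both the parental
    and the retrocopy lineage, hence the factor of 2.
    """
    if ks < 0 or rate_per_site_per_my <= 0:
        raise ValueError("ks must be >= 0 and the rate must be > 0")
    return ks / (2 * rate_per_site_per_my)


def ages_by_bin(ages_my, bin_width: float = 5.0) -> dict:
    """Count retrocopies per age bin; keys are the lower bin edges (in My)."""
    counts = Counter(int(age // bin_width) * bin_width for age in ages_my)
    return dict(sorted(counts.items()))
```

Peaks in the resulting histogram would be candidate periods of elevated LINE activity, with the caveat that saturation of synonymous sites and rate variation among lineages blur old events.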

Moreover, given that the probability of retroposition of a gene is expected to mainly depend on the abundance of its transcripts in the germline and/or the early embryo, the number of retrocopies should reflect parental gene activity during these stages11,12. Consistently, well-known housekeeping genes and/or genes with high germline/early embryo expression levels have produced many retrocopies11,12,105. Thus, retrocopies could serve as unique markers to shed light on the tissue origin of retroposition by correlating parental gene expression during different male/female germline or early embryonic stages with the abundance of their retrocopy offspring in the genome. The better the correlation observed in such an analysis, the more retrocopies would have emerged in a given germline/embryonic cell type.
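The correlation analysis proposed here could be sketched as a rank correlation between parental expression levels in one cell type and the number of retrocopies each parental gene has produced. The sketch below uses Spearman's coefficient computed from scratch. The choice of Spearman rather than another measure, the function names, and the input layout (two equally long lists, one entry per parental gene) are assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(expression, retrocopy_counts):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    if len(expression) != len(retrocopy_counts) or len(expression) < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    rx, ry = average_ranks(expression), average_ranks(retrocopy_counts)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Computing this coefficient separately for each germline or embryonic expression data set and comparing the values would be one way to rank candidate cell types of origin.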

Finally, the fact that retrocopies reflect their parental transcript structures has been exploited to detect previously unannotated or extinct, “fossil” transcripts25,106. For example, in a recent study, the authors reconstructed ancestral transcripts present in the common human-chimpanzee ancestor based on retrocopy sequences and inferred potential exon gains and losses in humans/chimpanzees based on their analysis106.

It was long assumed that retroposed gene copies represent, by and large, non-functional retropseudogenes due to their presumed lack of expression potential10,13, although individual studies have revealed instances of functional retrogenes since the late eighties14. But how many retrocopies have evolved into bona fide genes? Different types of evidence can be used to support functionality of retrocopies. Given the wealth of genomic data, rather straightforward approaches to support retrogene functionality are based on evolutionary analyses that screen for signatures of selection. For example, the (selective) preservation of intact open reading frames (ORFs) between distant17,18 or several closely related species37 provides statistically significant and convincing evidence for non-neutral evolution of retrocopies and therefore their functionality. Furthermore, comparisons of the rates of functionally relevant (amino acid changing) substitutions and neutral changes (silent substitutions) in retrogene coding regions can be used to detect non-neutral evolution, indicative of functional constraint (e.g. refs 23,37). In addition to such evolutionary approaches, molecular signs of functionality may be sought, such as evidence for transcription, which can often be readily obtained. But on its own, this does not suffice to support functionality of individual genes, as non-functional DNA (including retropseudogenes18) might be transcribed as well. Evidence for translation, that is, the presence of a protein (e.g. detected with specific antibodies) coupled with analysis of cellular phenotypes provides strong evidence of retrogene functionality. Ideally, the in vivo function of a retrogene is demonstrated, either by showing the association of retrogene mutations with disease40-42, or by the targeted disruption of retrogenes in animal models24,43,44. 
However, given that solid experimental evidence for the functionality of retrocopies is currently hard to obtain on a larger scale, the estimates of overall rates of functional retrogene formation discussed in the following have usually been obtained based on evolutionary/statistical analyses.
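Two of the evolutionary screens described above, the preservation of an intact ORF and the comparison of amino acid changing with silent changes, can be illustrated with a short sketch. It is deliberately crude: it assumes gap-free, in-frame coding sequences, uses the standard genetic code, and tallies differing codons rather than estimating substitution rates per site, so it is not a substitute for the rate-based methods used in the cited studies. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def has_intact_orf(cds: str) -> bool:
    """True if cds starts with ATG, ends with a stop, and has no earlier stop."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    protein = [CODON_TABLE.get(cds[i:i + 3]) for i in range(0, len(cds), 3)]
    # Unrecognised codons (e.g. containing N) count as not intact.
    return None not in protein and protein[-1] == "*" and "*" not in protein[:-1]


def count_codon_differences(parent: str, retrocopy: str):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Assumes both sequences are aligned without gaps and contain only A/C/G/T;
    any other character raises KeyError.
    """
    if len(parent) != len(retrocopy) or len(parent) % 3 != 0:
        raise ValueError("sequences must be equally long and a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(parent), 3):
        a, b = parent[i:i + 3].upper(), retrocopy[i:i + 3].upper()
        if a == b:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

A retrocopy with an intact ORF and a marked excess of synonymous over nonsynonymous differences would be a candidate for functional constraint, to be confirmed with a proper rate-based test.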

Vinckenbosch et al. estimated the number of functional retrogenes present in the human genome by comparing transcription levels of intact retrocopies to those of retropseudogenes, which reflect the transcriptional background noise in the genome18. The authors showed that more than a thousand retrocopies show evidence of being transcribed18, with intact retrocopies being transcribed to a much greater extent than retropseudogenes. On the basis of this observation the authors then conservatively estimated that at least 120 retrocopies are likely to represent functional genes. Based on an assessment of selective constraint on primate retrocopies, Marques et al. estimated the rate of functional retrogene formation in primates37. They estimated that, on average, at least one functional retrogene per million years emerged on the primate lineage leading to humans37.

In Drosophila, where the first retroposed gene copies were described in the early 1990s, a similar rate of functional retrogene formation was estimated15,45. Evidence of selective constraint suggests that about 90-100 retrogenes in this invertebrate lineage are functional15,16,46. However, the total number of retrocopies in this genus is much lower than that in mammals, which seems mainly to be due to the paucity of retropseudogenes in the Drosophila genome9,47 (because of the extremely short half-life of unconstrained DNA in this genus9), rather than a low rate of retroposition.

Sources of regulatory elements

The observation that a significant number of retrocopies have evolved into bona fide genes raises the question how retrocopies can be expressed in their new genomic location. To become expressed at a significant level and in a meaningful way (e.g. in tissues where it can exert a selectively beneficial function), a new gene needs to obtain a core promoter and probably other elements (e.g. enhancers) that regulate its expression. In the following, we discuss various mechanisms through which the acquisition of promoters and other regulatory elements may occur.

Generally, retrocopies may profit from pre-existing regulatory elements in their vicinity for their expression. For example, a straightforward way for a retrocopy to obtain transcription potential would be to directly hitchhike on the regulatory machinery of other genes. Indeed, a number of cases have been described where retrocopies are located in an intron of a host gene, being transcribed in the form of a fusion transcript together with host gene exons18,41,48,49 (Fig. 2A). In mammals, retrocopies are often transcribed together only with 5′ untranslated (UTR) exons of the host gene, as “splice variants”, thus potentially avoiding interference with host gene functions18. In general, transcribed retrocopies tend to be close to other genes, suggesting that their transcription may be facilitated by the open chromatin and/or regulatory elements of nearby genes18 (Fig. 2B). The latter possibility is supported by observations that retrogenes may be transcribed from bi-directional CpG-rich promoters of genes in their proximity (Fablet et al., manuscript in preparation). The sometimes substantial distances between the retrocopy insertion site and these promoters are usually spanned by new 5′ untranslated exon/intron structures that arose during the process of promoter acquisition18.

Figure 2. Source of retrogene promoters.

The figure illustrates various scenarios that lead to the transcription of retroposed gene copies. (A) Retrocopies may insert into intronic sequences of host genes. The evolution and/or presence of splicing signals enable these copies to be integrated into new splice variants of their host gene. Depending on the localization of these new splice sites, these variants result in either non-coding fusion transcripts (where the entire open reading frame derives from the retrocopy) or coding sequence fusions (the coding region of the retrocopy is fused to that of the host gene). (B) The insertion of retrocopies into actively transcribed regions with an open chromatin structure facilitates their transcription, due to the increased accessibility for the transcriptional machinery. The presence of enhancer elements from neighboring genes and weak transcription promoting sequences (not previously associated with genes) can further strengthen their transcriptional activity. (C) Recruitment of distant promoters in the genomic neighborhood via the acquisition of a new untranslated exon/intron structure. (D) Recruitment of promoters from retrotransposons or CpG proto-promoters. (E) Inheritance of parental promoters through alternative transcriptional start site usage of the parental gene. (F) De novo promoter evolution in the 5′ flanking region of the insertion site by single nucleotide substitutions.

In a similar way (i.e. via the acquisition of new 5′ UTR structures), retrocopies may also become transcribed from distant CpG-enriched sequences (which often have inherent capacity to promote transcription50) not previously associated with other genes (Fig. 2C; Fablet et al., manuscript in preparation). These distant CpG “proto-promoter” elements may have been optimized by natural selection once associated with a functional retrogene. Similarly, distant promoters from retrotransposable elements have been “captured” by retrocopies for their transcription via the acquisition of new 5′ untranslated exon/intron structures (Fablet et al., manuscript in preparation). In addition, retrotransposons51 (or potentially CpG island proto-promoters) immediately upstream of retrogene insertion sites may also be used directly (Fig. 2D).

Until recently, it was thought that retrocopies are quite unlikely to directly inherit parental promoters (hence the common expectation that they are unlikely to evolve into functional genes), although instances of parental promoter inheritance had been found52-54. However, a recent study suggests that retrocopies might nevertheless rather frequently inherit basic promoters directly from their parental source genes55. Often, these parental genes are transcribed from CpG promoters, which usually have multiple transcriptional start sites56 (TSS). If a retrocopy stems from a parental transcript with a TSS located relatively far upstream, the mRNA that gave rise to the retrocopy may carry downstream promoter sequences and TSSs with sufficient capacity to promote transcription (Fig. 2E). The frequent inheritance of CpG promoters might also help to explain why a significant number of retrogenes evolved paternally or maternally imprinted expression57,58 (Table 1).

Table 1.

Representative retrogenes in mammals and fruitflies.

| Genes | Phylogenetic distribution | Features (chromosomal origin / structure / type of selection / function) | References |
|---|---|---|---|
| Primates | | | |
| GLUD2 | Hominoids | Into X, positive selection, subcellular adaptation, adaptation to (neurotransmitter) glutamate metabolism | 23,67 |
| CDC14Bretro | Hominoids | Positive selection, subcellular adaptation, derived from cell cycle gene, brain/testis-specific expression | 37,65 |
| c1orf37-dup | Humans | Positive selection, transmembrane protein | 66 |
| PGAM3 | Old World primates | Positive selection, phosphoglycerate mutase | 64 |
| TRIM5-CypA gene | Macaque lineage | Chimeric gene, retrovirus restriction, CypA portion derives from retroposition | 72-74 |
| TRIM5-CypA gene | New World monkeys | Chimeric gene, retrovirus restriction, CypA portion derives from retroposition | 20 |
| PIP5K1A-PSMD4 retrogene | Hominoids | Chimeric gene, positive selection, subcellular change, fusion retrogene; stems from chimeric transcript of two adjacent parental genes | 75 |
| TAF1L, KIF4B | Old World primates | X-derived | 37,101 |
| RBMXL1 | Old World primates | X-derived, chimeric gene, fusion to host gene UTR | |
| Utp14c | Primates | X-derived, chimeric gene, evidence for it to be required for male fertility, fusion to host gene UTR | 40 |
| Rodents | | | |
| Utp14b | Rodents | X-derived, chimeric gene, required for male fertility, fusion to host gene UTR exon | 41,42 |
| U2af1-rs1 | Rodents | X-derived, paternally imprinted | 57 |
| PMSE2b | Mouse* | Inserted into a LINE1 which drives its transcription | 51 |
| Mammals | | | |
| Cstf2t | All Mammals | X-derived, chimeric gene, required for male fertility, crucial for proper polyadenylation in meiosis/post-meiosis | 43 |
| HNRNPGT | Therians | X-derived, required for male fertility | 44 |
| Pgk2 | Eutherians | X-derived, promoter inherited from parent, acquisition of a testis-specific enhancer, first described X-derived retrogene | 14,60 |
| Inpp5f, Nap1l5, Mcts2 | Eutherians | X-derived, paternally imprinted, located in introns of host genes | 57 |
| KLF14 | Eutherians | Maternally imprinted, accelerated evolution on the human lineage | 58 |
| USP26 | Eutherians | Into X, among the 5 most positively selected genes in human-chimp comparison | 102 |
| Drosophila | | | |
| jingwei (jgw) | D. yakuba, santomea and teissieri | Chimeric gene, positive selection, retrocopy-encoded ADH domain evolved new substrate (alcohol) specificity | 21,48 |
| Sphinx (spx) | D. melanogaster | Chimeric gene, positive selection, retrocopy evolved into non-coding RNA gene that promotes male-female courtship | 24,49 |
| Adh-Twain | D. subobscura, guanche and madeirensis | Chimeric gene, positive selection, putative functional adaptation to new substrate specificity | 103 |
| mojoless (mjl) | Drosophila genus | X-derived, required for male fertility | 104 |
| Dntf-2r | D. melanogaster subgroup | Substitutions in an upstream proto-promoter element appear to have provided this gene with a new, testis-specific promoter | |

In Drosophila, the source of transcription potential of retrogenes is somewhat more elusive. While, similarly to mammals, host gene fusions have occurred in this genus (e.g. refs 48,49) and retrogene transcription may be facilitated through the transcriptional activity of genes in their vicinity15, some other mechanisms described for mammals, such as parental promoter inheritance or retrotransposon-driven transcription, have not yet been detected in fruitflies. Instead, small substitutional changes in pre-existing upstream sequences of retrogene insertion sites that occurred under the influence of natural selection have been postulated to play a role in the formation of basic Drosophila retrogene promoters15,59 (Fig. 2F).

We note that the various mechanisms described here that may endow retrogenes with regulatory elements probably often provide only the basic means for the initial transcription of retrocopies, while more sophisticated regulatory elements may evolve with time (see e.g. the mammalian Pgk2 retrogene; Table 1; refs 52,60).

The evolution of new functions from retrogenes

DNA versus RNA-based duplication

The fundamental differences between the two major duplication mechanisms – segmental duplication (reviewed e.g. in refs 4,5) and retroposition – have significant consequences for the respective evolutionary fates of generated gene copies and their analysis. Segmental duplication regularly produces daughter copies that inherit the genetic features – exons/introns and regulatory elements – of the ancestral gene, whereas retroduplicate copies usually lack introns and are less likely to have strong regulatory elements upon their emergence. Therefore, segmental duplication is more likely to yield expressed daughter copies than the retroduplication process. At the same time, segmental duplicates are likely to exhibit very similar expression patterns in their early evolution, which may often imply that one copy is initially functionally redundant, and the increased gene dose might even be deleterious (although increased gene dosage may sometimes be beneficial and thus selectively preserved). By contrast, retroduplicate copies often need to recruit regulatory elements to become transcribed (see section above). This also means, however, that retrocopies that do become transcribed are probably more prone to evolve new expression patterns and - as a consequence - novel functional roles than gene copies arising from segmental duplication.

A further fundamental difference between the two duplication mechanisms is related to the relationship between the two duplicate members of the pair. The clear directionality in the retroduplication process (often not discernible for segmental duplications) facilitates studies pertaining to the origin of new gene functions, since parental genes usually maintain the ancestral gene function (although there are interesting exceptions to this rule, ref. 61), while new functions usually are acquired by the intronless daughter retrogene copies. It also renders the detection and analysis of young duplication events, which are particularly informative for the study of new gene functions (see below), straightforward. Recent segmental duplicates, on the other hand, are not easily distinguishable and more difficult to study, as they are, for example, frequently collapsed into a single locus in standard genome assemblies due to their high sequence and structural similarities.

Finally, retroduplication usually produces gene copies on chromosomes different from that of the parental gene copy, while segmental duplications are less likely to involve different chromosomes (although the rate of inter- vs. intrachromosomal segmental duplication differs between lineages, refs 45,62,63). Thus, retroduplication represents the ideal “vehicle” for interchromosomal gene “movements”, the directions of which are also easily determined based on the inherent directionality of the process (see below for a detailed discussion of retrogene movement studies).

Nevertheless, due to the abundance of functional segmental duplicates in nearly all studied genomes, numerous studies of segmental duplication have yielded many fundamental insights and established general concepts regarding the emergence of new gene functions (reviewed in detail in e.g. refs 4,5).

However, due to the particular features of retroposed gene copies outlined above, the analysis of retroduplication has provided additional insights with respect to the functional evolution of new genes not previously described for segmental duplicates. In particular the analysis of young retrogenes has provided novel insights into mechanisms underlying the evolution of new genes, as the changes in sequence that occurred during their early evolution are usually still traceable using evolutionary approaches1. In mammals, the study of young retrogenes has mainly focused on primate cases. Systematic surveys and individual studies led to the discovery of several young retrogenes that emerged recently on the primate lineage leading to humans23,37,64-66. For some of these, positively selected substitutions could be tied to functional change and adaptation23,65,67 (Table 1).

Emergence of new cell compartment-specific functions

Further analysis of these recently emerged retrogenes uncovered a novel mechanism underlying the emergence of new gene function. These analyses showed that new gene functions can arise through changes in the localization of encoded proteins in the cell, a process collectively termed subcellular adaptation65,67,68. The following two examples led to the finding of subcellular adaptation and demonstrate two ways by which this process might occur (Fig. 3).

Figure 3. Subcellular adaptation of proteins encoded by new duplicate genes.

(A) Illustration of 2 scenarios for the evolution of duplicated genes (red and green) and their products. Each gene and its encoded protein are represented with one color. Distinct protein shapes indicate distinct functions. Three different protein localizations (cytosolic, endoplasmic reticulum, or secreted proteins) are indicated in a schematic cell. Positively selected substitutions responsible for subcellular changes or changes in protein function are indicated (arrows). See main text for references and further details. (B) Adaptive evolution of two primate specific retrogenes (GLUD2 left, CDC14Bretro right). Phylogenetic trees indicate retroduplication events. Periods of adaptive evolution and reconstructed subcellular localizations are indicated. Microscopy images display representative subcellular phenotypes for the indicated branches. Markers on the left: protein localization (green), nuclear DNA (blue), and microtubules (red). Yellow signals indicate an overlap of the protein with microtubules. Markers on the right: protein localization (green) and mitochondria (red).

The study of the GLUD2 retrogene exemplifies one form of subcellular adaptation (“sublocalization”, ref. 68) in which the protein encoded by the new gene becomes more specifically targeted to one or several of the ancestral cellular compartments. GLUD2 (Table 1) emerged in the common ancestor of humans and apes 18-25 Mya by retroposition from its parental gene GLUD1, which encodes an enzyme that degrades glutamate69. The GLUD2-encoded enzyme evolved unique biochemical properties soon after the duplication event by virtue of two key amino acid substitutions that were fixed as a result of positive selection23. These changes were suggested to reflect the functional adaptation of GLUD2 to the metabolism of neurotransmitter glutamate in the brain70. A further study of GLUD2 uncovered another level of functional adaptation. Rosso et al. showed that whereas the ancestral glutamate dehydrogenase enzyme localizes to mitochondria and the cytoplasm, GLUD2 became specifically targeted to one of these compartments, the mitochondrion, due to a single, positively selected substitution in its N-terminal targeting sequence67. This event likely contributed to the adaptation of GLUD2 to a function in the glutamate metabolism of the brain and other tissues. Thus, GLUD2 represents an example of rapid change in subcellular localization and function of a new protein that has been driven by natural selection65,67,68 (Fig. 3).

The analysis of another ape-specific retrogene, CDC14Bretro, revealed that proteins encoded by new genes can completely relocalize to new, previously unoccupied cellular niches during evolution under the influence of natural selection, reflecting a variant form of subcellular adaptation that was termed subcellular relocalization or neolocalization68,71. CDC14Bretro stems from a splice variant of the CDC14B cell cycle gene65 (Table 1) and encodes a protein that became specifically expressed in the adult/fetal brain and testes soon after its emergence in the common human and ape ancestor. It then completely relocalized in the cell due to intense positive selection in the common African ape ancestor ~7-12 Mya, shifting from the ancestral association with microtubules (which it stabilized) to a localization and function on the endoplasmic reticulum (Fig. 3).

Notably, a recent global survey of yeast duplicate proteins, prompted by these retrogene studies, showed that subcellular adaptation appears to be widespread, being involved in the evolutionary fate of at least 30% of duplicates68. Thus, in conclusion, the analysis of young retrogenes led to the finding that in addition to changes in gene expression and/or the biochemical function of the protein5 (through neo- or subfunctionalization), rapid and selectively driven subcellular adaptation by either “neolocalization” (CDC14Bretro) or “sublocalization” (GLUD2) represents a common, previously little considered mechanism underlying the emergence of new gene function (Fig. 3).

Gene fusion and domain shuffling

Another way by which new gene functions can arise is through gene fusion, which is defined as the fusion of two previously separate source genes into a single transcription unit1. Gene fusion may occur through various mechanisms (including DNA-based recombination events) and can lead to the juxtaposition of exons encoding functional protein domains from different genes, in which case it represents a form of exon or domain shuffling1.

Fusions of retroposed gene copies to genes into which they insert have yielded new genes with important functions. Detailed studies of such fusion genes uncovered surprising aspects of new gene formation such as the recurrent juxtaposition of genes with complementary functions, as in the case of the TRIM5-CypA fusion gene (Fig. 4). A retroposed copy of the CypA gene, whose encoded protein potently binds retroviral capsids, was shown to have integrated independently into the antiviral defense gene TRIM5 in a New World monkey20 (Fig. 4A) and an Old World monkey72-74 (Fig. 4B). In both cases, the retrocopy-encoded CypA protein replaced and functionally substituted the original capsid-binding domain (B30.2) from TRIM5. The newly emerged TRIM5-CypA fusion protein more efficiently restricts HIV-1 and other retroviruses in these species20,72-74. The TRIM5-CypA gene fusion represents a striking case of domain shuffling and convergent evolution. The at first glance seemingly unlikely multiple independent insertions of CypA retrocopies into the same gene were probably facilitated by a rather high retroposition rate of the CypA gene (due to its high expression in the germline). Rare TRIM5-CypA fusions were then likely driven to fixation during the evolution of the monkey lineages by strong selective pressures, because potent TRIM5 variants can provide a high degree of resistance to lethal and common diseases caused by various retroviruses73.

Figure 4. Origin of TRIM5-CypA gene fusions in macaques and owl monkeys.

(A) Retroposition of CypA into an intron of the TRIM5 gene from owl monkeys and the resulting fusion gene are shown (similar to the process displayed in Fig. 2A). (B) An independent retroposition of CypA into the UTR of TRIM5 in macaques is shown, also resulting in a new TRIM5-CypA fusion gene. Please refer to Fig. 2 for the colour code and to the main text for details.

Recent studies revealed that fusion genes can also arise through the co-retroposition of adjacent parental source genes. Akiva et al. identified a recent retroposed gene (PIPSL) on human chromosome 10 that stems from a fusion transcript of two parental genes (PIP5K1A and PSMD4) that reside adjacently on chromosome 1 (ref. 75). Babushok et al. then showed that the gene was exclusively expressed in testes in humans and chimpanzees76. But, curiously, although PIPSL was apparently shaped by strong positive selection - suggesting functionality and adaptive evolution of the encoded protein - this fusion gene appeared to be post-transcriptionally repressed. However, in a recent follow-up analysis, we (manuscript submitted) obtained evolutionary and experimental support for the functionality of this gene in hominoids. Given the abundance of intergenic splicing in mammals75,77, we speculate that co-retroposition of adjacent genes might potentially be responsible for the origination of other chimeric retrogenes.

Analysis of chimeric genes in Drosophila demonstrated how gene fusion via retroposition can generate raw material for the evolution of new gene functions under the influence of positive Darwinian selection. The gene jingwei (jgw), which represents the first chimeric gene involving retroposition described in any species48, originated by the insertion of a retrocopy of the Alcohol dehydrogenase gene (Adh) into the yande gene48 (Table 1). The functional evolution of jgw was recently unveiled using a biochemical approach21,22, which revealed that the JGW protein was shaped by positive selection (in particular the ADH domain) and apparently evolved a role in hormone/pheromone biosynthesis or degradation processes.

The Drosophila sphinx (spx) gene49 (Table 1) illustrates a mechanism for how RNA genes with important new functions can arise, a process that is as yet poorly understood. Sphinx emerged within the last 2-3 million years and derives from a retroposed ATP synthase gene that fused to exons located in the vicinity of the insertion site. Notably, the retroposed gene copy lost its protein coding capacity (accumulating nonsense mutations) and spx subsequently evolved into a non-coding RNA-gene under the influence of positive selection. Dai et al. knocked out the spx gene in D. melanogaster24. The phenotype of these spx knockout flies – increased male-male courtship behaviour relative to wild type Drosophila – suggests that spx represents the first recently emerged gene for which a behavioral phenotype could be identified.

Retrogene functions in testes and sex chromosome evolution

In the following, we will discuss global surveys of retroposition in mammals and fruitflies, which have shown that retrogenes often evolved functions in the testes and that the formation and preservation of many of these genes is closely linked to the biology and selective forces (imposed by the male germline) that have shaped X chromosomes ever since their emergence. Dating of the origin of these retrogenes also allowed a reassessment of the age of mammalian sex chromosomes.

Expression in testes

Numerous retrogene studies in both mammals and fruitflies revealed an overall propensity of retrogenes to be expressed in testes (refs 16,18,37,46,48 and references therein). A combination of a testis expression bias and natural selection was postulated to explain this observation17,37. In meiotic and post-meiotic spermatogenic cells the autosomal chromosomes appear to be in a state of hypertranscription due to various modifications of the chromatin (reviewed in ref. 78). This hypertranscription state was suggested to allow transcription of DNA that is usually not transcribed and therefore might have facilitated transcription of retrocopies37 but also of other types of duplicates79 in testis during their early evolution. A subset of these retrocopies subsequently obtained beneficial functions in testis and evolved into bona fide genes (see further discussion below). Natural selection then further enhanced their promoters (and other regulatory elements), which led to a stronger and more refined testis expression pattern among the functional retrogenes.

An alternative and not mutually exclusive hypothesis is based on the notion that retrocopies might preferentially insert into open, actively transcribed chromatin80. Given that retroduplication occurs in the germline, they might therefore predominantly insert into or close to germline-expressed genes, which would facilitate retrocopy transcription in the germline. However, in Drosophila, this hypothesis appears to explain testis expression of only some retrogenes (several retrogenes are located in regions with many testis-expressed genes, ref. 81). In mammals, this insertion bias scenario remains to be explored.

Retrogenes out of the X

As pointed out above, the retroduplication process readily produces gene copies on chromosomes different from that of the parental gene copy. Global genomic surveys of such gene “movements” revealed an intriguing pattern that was observed both in mammals17-19 and Drosophila16: a disproportionately large number of parental genes on the X chromosome have given rise to functional retrogene copies on autosomes16,19 (Fig. 5A). For mammals, it was shown that these autosomal retrogenes are specifically expressed in testis – during and after the meiotic stages of spermatogenesis – whereas their X-linked parents (usually broadly expressed housekeeping genes) are transcriptionally silenced during these stages (Fig. 5A), due to male meiotic sex chromosome inactivation (MSCI) (ref. 17 and studies reviewed in ref. 82).
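The notion of a "disproportionately large number" of X-linked parental genes can be illustrated with a simple one-sided binomial test. The following is a minimal sketch only: the counts and the assumed X-linked share of potential parental genes are hypothetical placeholders, not values from the surveys cited above, and the published analyses may rely on different null models (for example, ones that also account for the chromosome of insertion).

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical, illustrative counts (NOT taken from the surveys cited above).
n_retrogenes = 100       # functional autosomal retrogenes with a known parent
n_x_parents = 15         # retrogenes whose parental gene lies on the X
x_gene_fraction = 0.04   # assumed share of potential parental genes on the X

# Number of X-linked parents expected if parents were drawn at random.
expected_x_parents = n_retrogenes * x_gene_fraction

# Probability of observing at least this many X-linked parents by chance.
p_value = binomial_upper_tail(n_x_parents, n_retrogenes, x_gene_fraction)

print(f"expected X-linked parents: {expected_x_parents:.1f}")
print(f"observed X-linked parents: {n_x_parents}")
print(f"one-sided binomial p-value: {p_value:.2e}")
```

A small p-value under this null model would indicate that X-linked genes act as parents more often than their share of the genome predicts; it does not by itself distinguish a mutational bias from selectively driven fixation.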

Figure 5. Retrogenes, MSCI, and the emergence of mammalian sex chromosomes.

(A, upper part) Illustration of the retroposition of an X-linked parental gene to an autosome. (A, lower part) Illustration of the expression of X-linked parental genes and their autosomal retrogene copies before (in spermatogonial cells), during (spermatocytes), and after (spermatids) the process of meiotic sex chromosome inactivation (MSCI). (B) The evolutionary onset for the selectively driven out of X retroduplication process and MSCI, as well as the inferred origin of therian (eutherians/placental mammals and metatherians/marsupials) sex chromosomes. See main text for further explanations.

Importantly, these mammalian X-derived retrogenes are significantly more frequently and more specifically expressed during and post meiosis than other retrogenes17 (which also tend to be expressed in testes – see subsection above). This substantiates the hypothesis that retrogenes that stem from the X have been fixed during evolution and shaped by natural selection to compensate for parental (housekeeping) gene silencing during and after MSCI17,19,83. This compensation hypothesis has also been functionally supported by studies that showed that loss of function of retrogenes with X-linked progenitors leads to severe defects of male meiotic functions in mice41-44 and probably humans40. It is worth pointing out that, curiously, the potential mechanistic biases favoring expression in meiotic/post-meiotic cells (see subsection above) allow X-derived retrogenes to be expressed precisely where needed to compensate for the silencing of their parents. Thus, together with the fact that the retroduplication process readily moves genes between chromosomes, this means that retrogenes – rather than DNA-based duplicates – may easily evolve into functional autosomal substitutes of their X-linked parental genes during the late stages of spermatogenesis.

Although it was recently suggested that the major cause for the out-of-X movement in Drosophila might be different from that in mammals84, a recent study suggests that MSCI may occur in Drosophila (ref. 85). Therefore, MSCI may be the main force responsible for the preferential fixation of X-derived retrogenes with meiotic/post-meiotic expression in fruitflies as well. In addition, similarly to mammals, retrogene-parental gene expression patterns also seem to be complementary during meiosis in Drosophila46.

The origin of mammalian sex chromosomes

A recent survey of young primate retrogenes showed that the out-of-X movement of retrogenes is ongoing37, which suggests that gene export from the X continues to be selectively beneficial. But when did this process begin during evolution? A systematic dating analysis using representative genomes from the three major mammalian lineages recently revealed that although retrogenes were generated ever since the common ancestor of all mammals, selectively driven retrogene export from the X only started later, on the eutherian and marsupial lineages, respectively17 (Fig. 5B). Given that MSCI is the likely selective force driving genes off the X, this observation suggested that MSCI emerged – rather late – in the common ancestor of eutherians and marsupials, that is, well after their separation from the monotreme lineage17 (Fig. 5B).

Moreover, these observations led to a reassessment of the age of our sex chromosomes, which evolved from an ancestral pair of autosomes86,87. Given that MSCI probably reflects the spread of the recombination barrier between the X and Y chromosomes during their evolution17,88, Potrzebowski et al. concluded that these chromosomes originated (probably late) in the common ancestor of eutherians and marsupials and not in the common ancestor of all mammals, and are therefore much younger than previously thought17 (Fig. 5B). This view is supported by the recent analysis of the platypus genome, which revealed that monotreme sex chromosomes share homology only with bird and not with therian (eutherian/marsupial) sex chromosomes39,89,90.

Retroposition into the X

Curiously, retrogenes are not only exported from the X but are also preferentially imported into this chromosome in mammals (ref. 19). There seems to exist a slight mechanistic bias that favors the insertion and/or retention of retrocopies on the X (ref. 19). Although the cause of this bias remains unclear, the excess of retropseudogenes on the X is consistent with the accumulation of other non-functional retro-elements (LINEs) on the X chromosome in this lineage91. In addition, however, a strong selective force - the precise nature of which remains to be identified - apparently led to the preferential fixation of bona fide retrogenes on the X (ref. 19). We finally note that no increased fixation rate of retrogenes on the X is observed in Drosophila16,92. This may reflect differences in the biology of sex chromosomes between mammals and fruitflies, but the precise reasons for this discrepancy need to be clarified.

Retrocopies and gene structure evolution

Studies of the process of retroposition have not only shed light on the origin of new genes as discussed above, but have also provided other general insights pertaining to the evolution of mammalian genomes. We discuss these findings in the following subsections and in BOX 1, which highlights how retrocopies reflect aspects of transcriptome evolution.

Retrocopies and intron loss

One way by which retrocopies have shaped mammalian genes is by mediating the loss of introns. Intron gains are rare events during evolution, while intron loss appears to be more frequent93. In mammals, for example, not a single case of intron gain has been documented, whereas more than 100 intron losses have been reported94. Interestingly, these intron losses appear to have been mediated by recombination of the gene displaying intron loss with the reverse-transcribed, processed mRNA molecule (cDNA) of the same gene94,95. There are several lines of evidence supporting this hypothesis, including the always precise loss of the intronic sequence (the alternative mechanism – DNA deletion – would often result in imprecise intron loss), the fact that intron loss usually affects genes that are highly expressed in the germline (thus producing many processed cDNAs that may recombine with the source gene), and the preferential loss of introns towards the 3′ end of the genes94,96 (reflecting that reverse-transcription begins at the 3′ end of transcripts; thus incomplete 3′ cDNAs can recombine with the source gene, leading to 3′ intron loss).
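The logic behind the 3′ bias of intron loss can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions, not a description of the analyses in the cited studies: a gene is represented as labelled exons and introns, the cDNA is assumed to cover only the 3′-most exons (because reverse-transcription starts at the 3′ end and may terminate early), and recombination is assumed to replace the covered region with the intronless cDNA sequence. The function name and the segment labels are hypothetical.

```python
def gene_after_cdna_recombination(exons, introns, n_exons_covered):
    """Return the gene structure after recombination with a partial cDNA.

    exons            -- exon labels in 5'-to-3' order
    introns          -- intron labels; introns[i] lies between exons[i] and exons[i + 1]
    n_exons_covered  -- number of 3'-most exons spanned by the (possibly incomplete) cDNA
    """
    if len(introns) != len(exons) - 1:
        raise ValueError("expected exactly one intron between each pair of exons")
    if not 1 <= n_exons_covered <= len(exons):
        raise ValueError("the cDNA must cover between 1 and all exons")

    # Index of the first (5'-most) exon that is covered by the cDNA.
    first_covered = len(exons) - n_exons_covered

    segments = []
    for idx, exon in enumerate(exons):
        segments.append(exon)
        # An intron survives only if the exon upstream of it lies outside the
        # cDNA-covered region; introns flanked by two covered exons are
        # removed precisely, because the cDNA carries no intronic sequence.
        if idx < len(introns) and idx < first_covered:
            segments.append(introns[idx])
    return segments


exons = ["E1", "E2", "E3", "E4"]
introns = ["i1", "i2", "i3"]

# Print the resulting structure for every possible cDNA length.
for covered in range(1, len(exons) + 1):
    structure = gene_after_cdna_recombination(exons, introns, covered)
    print(covered, "-".join(structure))
```

In this model, the shorter the cDNA, the fewer introns are lost, and the ones that are lost always lie at the 3′ end. Every lost intron is also removed exactly at its boundaries, which is the pattern the text contrasts with imprecise loss through genomic deletion.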

Retrogenes and splicing constraints

Retrogenes have also helped to support the novel hypothesis that the preservation of splicing signals constrains protein evolution. Specifically, a recent study suggested that the selective pressures on splice signals (enhancers/silencers) near exon boundaries significantly reduce the rate of protein evolution97. The rate of protein evolution of retrogenes is highest near the sequences where intron-exon junctions previously resided in the parental genes that gave rise to the retrogenes. Therefore, splicing sequence constraints may have hampered the evolution of multi-exon gene encoded proteins, thus potentially preventing functional optimization of proteins. It will be interesting to test whether retrogenes have evolved more efficient and/or adapted proteins compared to their intron-containing parents due to the relaxation of splicing constraints.

Conclusions

Messenger RNA-derived duplicates were long thought to be doomed to pseudogenization and decay. As outlined in this review, however, a significant number of retroposed gene copies have escaped this evolutionary fate and have evolved into bona fide genes. Retroduplicate genes are probably still much less likely to become functional compared to “normal” DNA duplicates due to their peculiar properties, which include the frequent lack of strong regulatory elements upon their emergence. On the other hand, due to these properties, retrogenes often evolved in unique ways, being much more prone to evolve new expression patterns, new genomic locations, and new functions than DNA duplicates. Thus, individual and global surveys of retrogenes (using a variety of evolutionary, genomics, and molecular tools) have unearthed previously unknown molecular mechanisms pertaining to the origin of new genes (e.g. promoter recruitment, subcellular adaptation of encoded proteins), and have provided unexpected and unique insights into genome evolution (e.g. the origin and evolution of our sex chromosomes).

In spite of these recent advances in the RNA-based duplication field, much remains to be done. To date, only relatively few young retrogenes have been pinpointed and even fewer studies (most of them discussed in this review) have attempted to characterize the functional evolution of young retrogenes, thus going beyond mere descriptions of evolutionary signatures. Future work should therefore first aim to identify more young functional retrogenes. Such studies are challenging (at least in mammals), due to the difficulty in assessing their selective preservation, but will benefit from the steadily increasing number of available complete genomes in primates. Notably, very recent functional hominoid retrocopies might soon be identified based on the astounding number of human genomes that are expected to be completed using recently developed ultra-high-throughput sequencing technologies98. New cases of young retrogenes should then be subjected to in-depth analyses of their functional evolution, using combinations of evolutionary analysis with molecular, cellular, and in vivo experiments (e.g. transgenic mice carrying primate-specific genes, or knockout studies in Drosophila). Ultimately, such studies are likely to uncover additional modes underlying the evolution of new gene function and provide a more global view of the contribution of retrogenes to cellular or organismal phenotypes.

It will also be interesting to screen for retrogenes in genomes from other organisms for which complete genomes are becoming available and to study their chromosomal localization patterns, evolution, and functions. For example, a recent study discovered a surprisingly large number of functional retrogenes with interesting properties in the rice genome32 (a large proportion of them fused to other genes), an unexpected finding, given that the retroposition activity in plants was traditionally thought to be low.

Finally, we believe that retrocopies generally still represent a relatively untapped resource and are likely to reveal further unpredicted and fascinating aspects, which may even open up new fields of research. For example, very recently it was found that mammalian retropseudogenes appear to frequently encode small interfering RNAs, important for the regulation of their parental source genes99,100. Thus, even retropseudogenes do not necessarily represent evolutionary dead-ends but may provide the raw material for functionally important evolutionary innovations.

Acknowledgements

We apologize to colleagues whose work could not be discussed or cited due to space constraints and/or the focus of this review. We thank the members of the H.K. and M.L. laboratories for helpful discussions. This work was supported by funds from the Swiss National Science Foundation, the European Research Council (STREP: 140404), and an EMBO Young Investigator Grant (to H.K.), as well as the NIH (R01GM078070-01A1) (to M.L.).

Glossary

RETROPOSITION

A mechanism that creates duplicate gene copies in new genomic positions through the reverse-transcription of mRNAs from source genes (also known as RNA-based duplication, retroduplication).

PARENTAL GENE

Source of the mRNA that gives rise to a retroposed gene copy.

RETROCOPY

Gene copy that results from the process of retroposition (also termed retroposed gene copy, retroduplicate copy).

RETROGENE

Expressed and functional retrocopy (usually with an intact open reading frame consistent with that of the parental gene).

RETROPSEUDOGENE

Non-functional retrocopy, which usually carries frameshift-causing insertions/deletions and/or premature stop codons that preclude gene function.

L1 ELEMENTS

Members of the long interspersed nuclear element (LINE) family of repeats, which provide the enzymatic machinery necessary for the process of retroposition.

NEW GENE

A gene that originated recently during evolution.

SUBCELLULAR ADAPTATION

A process by which a (duplicate) gene product evolves a new localization in the cell or localizes more specifically to one of the ancestral compartments under the influence of positive Darwinian selection.

GENE FUSION

The fusion of adjacent genes into a single transcription unit (termed chimeric gene or fusion gene).

DOMAIN SHUFFLING

Juxtaposition of one or more exons from two different genes that encode functional protein domains.

MSCI

Meiotic sex chromosome inactivation – the transcriptional silencing of the X and Y chromosomes during the meiotic phase of spermatogenesis.

References