LINE-1 Retrotransposition Activity in Human Genomes

Cell. Author manuscript; available in PMC 2011 Jan 2.

Published in final edited form as:

PMCID: PMC3013285

NIHMSID: NIHMS256394

Christine R. Beck,1,* Pamela Collier,4 Catriona Macfarlane,4 Maika Malig,5 Jeffrey M. Kidd,5 Evan E. Eichler,5 Richard M. Badge,4 and John V. Moran1,2,3,*

1 Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI

2 Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI

3 Howard Hughes Medical Institute, University of Michigan Medical School, Ann Arbor, MI

4 Department of Genetics, University of Leicester, Leicester, UK

5 Department of Genome Sciences and Howard Hughes Medical Institute, University of Washington, Seattle, WA

Summary

Long Interspersed Element-1 (LINE-1 or L1) sequences comprise the bulk of retrotransposition activity in the human genome; however, the abundance of highly active or ‘hot’ L1s in the human population remains largely unexplored. Here, we used a fosmid-based, paired-end DNA sequencing strategy to identify 68 full-length L1s which are differentially present among individuals but are absent from the human genome reference sequence. The majority of these L1s were highly active in a cultured cell retrotransposition assay. Genotyping 26 elements revealed that two L1s are only found in Africa and that two more are absent from the H952 subset of the Human Genome Diversity Panel. Therefore, these results suggest that ‘hot’ L1s are more abundant in the human population than previously appreciated, and that ongoing L1 retrotransposition continues to be a major source of inter-individual genetic variation.

Introduction

L1s comprise ~17% of human DNA and have been an instrumental force in shaping genome architecture (Lander et al., 2001). Most L1s are molecular fossils that cannot move (retrotranspose) to new genomic locations (Grimaldi and Singer, 1983; Lander et al., 2001). However, a small number of human-specific L1 (L1Hs) elements remain retrotransposition-competent (Badge et al., 2003; Brouha et al., 2003; Sassaman et al., 1997). On occasion, their retrotransposition has resulted in sporadic cases of human disease (reviewed in Babushok and Kazazian, 2007; Kazazian et al., 1988).

During the past fifteen years, computational, molecular biological, and genomic approaches have been used to identify and characterize L1Hs elements (Badge et al., 2003; Boissinot et al., 2000; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Moran et al., 1996; Myers et al., 2002; Ovchinnikov et al., 2001; Sheen et al., 2000; Xing et al., 2009). Several themes have emerged from these studies. First, L1Hs elements can be stratified into several subfamilies (pre-Ta, Ta-0, Ta-1, Ta1-d, Ta1-nd) based upon the presence of diagnostic sequence variants contained within their 5′ and 3′ untranslated regions (UTRs) (Boissinot et al., 2000; Skowronski et al., 1988; Smit et al., 1995). Second, many L1Hs elements are dimorphic in that they are differentially present in individual genomes and/or are present in an individual, but absent from the haploid Human Genome Reference sequence (HGR) (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Myers et al., 2002; Xing et al., 2009). Third, it has been estimated that the average human genome contains ~80–100 active (retrotransposition-competent) L1Hs elements, and that only a small number of highly active L1Hs elements (‘hot’ L1s) account for the bulk of retrotransposition activity in the HGR (Brouha et al., 2003). Those studies, as well as recent efforts to identify insertion, deletion, and inversion polymorphisms (structural variants) in humans (Kidd et al., 2008; Korbel et al., 2007; Tuzun et al., 2005; Xing et al., 2009) indicate that ongoing L1 retrotransposition contributes to inter-individual genetic variation.

Here, we employed a fosmid-based, paired-end DNA resource to identify full-length L1Hs elements in the genomes of six individuals of diverse geographic origin. Over half (37/68) of the newly identified L1s were ‘hot’ for retrotransposition when examined in a cultured cell assay (Moran et al., 1996). Genotyping a subset of these L1s further revealed that some are likely restricted to Africans, whereas others are absent from the Human Genome Diversity Panel (HGDP) (Cann et al., 2002) suggesting that they are present at very low allele frequencies.

Results

An experimental strategy to identify full-length human specific L1s

To identify novel, full-length L1s in the genomes of geographically diverse individuals, we exploited a fosmid-based, paired-end DNA sequencing strategy that previously was used to identify structural variants in human DNA (Kidd et al., 2008; Tuzun et al., 2005). Fragments of genomic DNA approximately 40kb in size were individually cloned using fosmid vectors (see Extended Experimental Procedures). Sequence reads were obtained from both ends of each insert (paired-end sequences) and compared to the HGR. End-sequences from genomic fragments that do not differ significantly in size from the HGR will map ~40kb away from each other. In contrast, paired-end sequences derived from genomic fragments containing a full-length, dimorphic ~6kb L1Hs element will be separated by ~34kb when mapped to the HGR (Figure 1) (Tuzun et al., 2005). In general, the predicted variants were required to be supported by two fosmid clones containing putative insertions from the same individual. The size cutoffs used in our screening protocols are biased to allow the identification of full-length or near full-length L1 insertion polymorphisms, but not severely 5′ truncated L1 sequences, which are replication-deficient (Table 1). Through this scheme, we should be able to identify the bulk of full-length L1s in an individual genome that are dimorphic when compared to the HGR.
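The mapping logic described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline of Kidd et al. (2008): the function name and the example spans are hypothetical, and the three-standard-deviation cutoff is an assumption inferred from Table 1, where the detection limit of most libraries equals three times the insert-size S.D.

```python
def classify_fosmid(mapped_span_kb, library_mean_kb, library_sd_kb, n_sd=3.0):
    """Classify a fosmid by the distance between its end sequences on the reference.

    A span much shorter than the library mean implies extra sequence in the
    clone (an insertion relative to the reference); a much longer span
    implies missing sequence (a deletion relative to the reference).
    """
    limit_kb = n_sd * library_sd_kb  # assumed detection limit (3 x S.D.)
    difference_kb = library_mean_kb - mapped_span_kb
    if difference_kb >= limit_kb:
        return "insertion"
    if difference_kb <= -limit_kb:
        return "deletion"
    return "concordant"


# Library ABC10 from Table 1: mean insert 41 kb, S.D. 1.84 kb.
# The mapped spans below are hypothetical values chosen for illustration.
print(classify_fosmid(35.0, 41.0, 1.84))  # clone ~6 kb longer than its mapped span
print(classify_fosmid(40.2, 41.0, 1.84))  # clone close to the library mean
```

Under this assumption, a library with a wider insert-size distribution needs a larger size difference before a clone is flagged, which is why the detection limit varies among libraries in Table 1.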

Figure 1

A strategy for identifying dimorphic L1Hs elements in individual human genomes

In silico comparison of the fosmid end sequences (red squares) from individual genomic libraries (blue horizontal line) and the HGR (pink horizontal line) enables the detection of fosmids that may contain insertions or deletions with respect to the HGR (see dashed lines). Insertion fosmids were screened by allele specific oligonucleotide hybridization to detect characters that are present in the 5′ UTR of newer L1 elements (one discriminating character utilized, a deletion of the G residue at bp 74 in recent L1s, is indicated in maroon). Putative L1Hs-containing fosmids were analyzed by Southern blotting with a 5′ UTR probe (blue arrow). A representative digest and Southern blot is shown. The ~6kb band is diagnostic for the full-length L1. The additional hybridizing band (~1.3kb band liberated from the L1 5′ flank in this Southern blot example) serves to distinguish individual fosmids. ATLAS and/or DNA sequencing confirmed the presence of a dimorphic, full-length L1Hs insertion.

Table 1

Summary of data for the 6 libraries

Column 1: library identifiers. Column 2: Coriell identifier of individuals analyzed. Column 3: population of origin for individuals in the HapMap study. Column 4: the average insert size of each individual library (in kb). Column 5: the standard deviation in insert size of each individual library. Column 6: the detection limit for the size of insertions in each library. For ABC9, a lower threshold was applied than that used previously (Kidd et al., 2008). Column 7: the number of elements found in each library that are absent from the HGR. Column 8: the number of elements from column 7 that are not completely annotated in dbRIP (Wang et al., 2006). Column 9: the number of elements from column 7 that were active in retrotransposition assays. Column 10: elements from column 9 that retrotransposed at levels >10% of L1.3, a known active element. Column 11: the number of the HGR ‘hot’ elements that were present in each individual (Brouha et al., 2003).

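Columns 1–6: Individual/Library Data. Columns 7–11: LINE-1 Data.

| Library ID | Coriell ID | Population | Library Mean in silico Insert Size (kb) | S.D. (kb) | Detection Limit (kb) | Dimorphic Elements | Novel (not in dbRIP) | Active | Hot | HGR ‘hot’ Elements |
|---|---|---|---|---|---|---|---|---|---|---|
| G248 | NA15510 | N/A | 39.89 | 2.75 | 8.25 | 5 | 5 | 4 | 4 | 2 |
| ABC9 | NA18956 | Japan | 39.51 | 2.0¶ | 4.52¶ | 16 | 16 | 9 | 8 | 2 |
| ABC10 | NA19240 | Yoruba* | 41 | 1.84 | 5.52 | 20 | 18 | 11 | 9 | 2 |
| ABC11 | NA18555 | China | 40.03 | 1.77 | 5.31 | 13 | 12 | 9 | 8 | 2 |
| ABC12 | NA12878 | CEPH* | 39.75 | 1.4 | 4.2 | 8 | 7 | 4 | 3 | 2 |
| ABC13 | NA19129 | Yoruba* | 39.29 | 1.77 | 5.31 | 7 | 7 | 6 | 5 | 2 |
| Total | | | | | | 69/68* | 65 | 43 | 37 | |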

Fosmids fulfilling the above mapping criterion were subjected to a series of screens (Figure 1). First, allele-specific oligonucleotide hybridization using probes directed against diagnostic sequences in the L1Hs 5′ UTR identified insertion fosmids that contain putative dimorphic L1Hs elements (Boissinot et al., 2000; Tuzun et al., 2005). Second, Southern blotting with a probe directed against the 5′ UTR of L1.3 (Accession# L19088) enabled the identification of fosmids that contained putative full-length L1Hs elements (Dombroski et al., 1993; Sassaman et al., 1997). Third, a suppression PCR-based method (ATLAS) (Badge et al., 2003) and/or direct sequencing was used to verify the presence of a full-length (or near full-length) L1Hs element in the fosmid. Finally, genomic sequences flanking the 5′ and 3′ ends of the newly identified L1Hs elements were used as probes in BLAT searches (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start) (Kent, 2002) to confirm that the L1 was absent from the HGR (NCBI build 36.1/hg18). Flanking sequences also were used to determine whether any of the L1Hs elements were present in a database of known polymorphic retrotransposon insertions (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 2006). Two additional L1Hs elements were identified through direct sequencing of the fosmids (#1-2-1 and 10-2-1).

Identification of full-length L1Hs elements from geographically diverse individuals

We first conducted a pilot study to examine a fosmid library from a female individual (G248; NA15510) for full-length L1Hs insertions (Table 1) (Tuzun et al., 2005). Despite the fact that this library was optimized for identifying ~8kb insertion polymorphisms as part of the Human Genome Structural Variation project (HGSV) (Kidd et al., 2008; Tuzun et al., 2005), we were able to identify five novel L1Hs elements using our screening protocol (Table 1).

The above data provided ‘proof of principle’ that our strategy was effective for identifying full-length, dimorphic L1Hs elements. Thus, we next screened fosmid libraries from five females representing four distinct geographic populations that were studied as part of the HapMap project (one Japanese (NA18956), one Chinese (NA18555), one Western European CEPH (NA12878), and two Yoruban individuals (NA19240, NA19129)) (Consortium, 2005; Kidd et al., 2008). Size cutoffs allowed detection of insertion polymorphisms as small as ~4.2–5.5kb and enabled the identification of an additional 64 L1Hs elements (Table 1) (Kidd et al., 2008). As our strategy is biased toward finding novel, full-length L1s, we generally observed a decrease in the number of L1Hs elements identified in each successive library screen (e.g., ABC13 was the last library analyzed and contained relatively few novel L1Hs elements). In total, we identified 69 L1Hs elements that were absent from the HGR, one of which was identified in two different individuals (#4-1 and 5-77, respectively). This element also was completely annotated in dbRIP, unlike 65 of the distinct 68 L1s identified in this study (Table 1). The number of elements discovered at each stage of the analysis is detailed in the Extended Experimental Procedures.

Many of the newly identified L1Hs elements are ‘hot’ for retrotransposition

We next tested whether the L1Hs elements identified in our screens were active for retrotransposition in cultured cells. Sixty-seven elements were cloned into a pBluescript-based and/or a pCEP4-based L1 expression vector that contained an mneoI retrotransposition indicator cassette in its 3′ UTR (#2-42 was refractory to cloning; details in Experimental Procedures) (Freeman et al., 1994; Moran et al., 1996). The pBluescript-based L1 constructs lack an exogenous promoter; thus, L1 expression is driven from its native 5′ UTR. Elements isolated from libraries ABC11–13 were assayed in this context. L1s isolated from the G248, ABC9, and ABC10 libraries were assayed in pCEP4 (CMV+/5′UTR+) and/or pBluescript (5′UTR+) based contexts. The resultant plasmids were transfected into HeLa cells and successful retrotransposition events were detected as G418-resistant foci (Figure 2a) (Moran et al., 1996). Retrotransposition activities are reported relative to L1.3, and ‘hot’ refers to an L1 that jumps at >10% of L1.3 (see Table S1). Notably, 22 elements yielded similar retrotransposition efficiencies relative to L1.3 when tested in either a CMV+/5′UTR+ or a 5′UTR+ context (data not shown). Since the subcloning procedure does not involve PCR, we truly are testing the retrotransposition capability of each of the identified L1Hs elements in our screen.
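The ‘hot’ criterion can be expressed as a short calculation. The sketch below assumes that activity is scored as a count of G418-resistant foci and normalized to the L1.3 control from the same experiment; the function names and colony counts are hypothetical, and any correction for transfection efficiency is deliberately left out.

```python
def relative_activity(colonies, l13_colonies):
    """Return retrotransposition activity as a percentage of the L1.3 control."""
    if l13_colonies <= 0:
        raise ValueError("the L1.3 control must yield at least one colony")
    return 100.0 * colonies / l13_colonies


def is_hot(colonies, l13_colonies, hot_threshold_pct=10.0):
    """'Hot' is defined in the text as activity above 10% of L1.3."""
    return relative_activity(colonies, l13_colonies) > hot_threshold_pct


# Hypothetical colony counts, for illustration only.
l13 = 200
for name, count in [("element_A", 150), ("element_B", 12), ("element_C", 0)]:
    pct = relative_activity(count, l13)
    label = "hot" if is_hot(count, l13) else "not hot"
    print(f"{name}: {pct:.1f}% of L1.3 ({label})")
```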

Figure 2

L1Hs activity in 6 human genomes

(a) Cloning strategy: All but one L1Hs element were cloned directly from fosmids using _Acc_I sites in their 5′ UTR and 3′ UTRs, respectively (red vertical lines; see Extended Experimental Procedures). The L1s then were ligated into vectors that either contain or lack a CMV promoter (black rectangle). Both vectors contain the mneoI retrotransposition indicator cassette (light blue) in the L1 3′ UTR. This cassette allows for detection of retrotransposition events in a cell culture retrotransposition assay. SD=splice donor. SA=splice acceptor. Active elements confer G418 resistance to HeLa cells, whereas defective elements, as illustrated by the RT mutant control (RT- L1), do not. (b) Representative G418-resistant foci for the 20 elements from the Yoruban library, ABC10: Nine of these elements were highly active (large suns to the left of assay image), and two more retained a low level of activity (small suns). One element (#3-5, red box) is a ‘hot’ pre-Ta L1 (#3-5 was tested in a pBluescript backbone (5′UTR+); all others were tested in a pCEP4 (CMV+/5′UTR+)) backbone (Extended Experimental Procedures). Table S1 displays retrotransposition efficiencies for each L1 identified in this study. Figure S1 provides details on the EN-deficient element #3-24. (c) The 68 distinct L1Hs elements identified in this study and their positions in the genome: Red vertical lines and text represent ‘hot’ or highly active elements. Orange vertical lines with black text represent low-level activity elements. Blue vertical lines with black text represent dead or inactive elements. The black line indicates the one untested element (#2-42). Ideograms were adapted from UCSC genome browser: http://genome.ucsc.edu (Kent et al., 2002).

Each individual contained between three and nine ‘hot’ L1s in their genome, and 55% (37/67) of the L1Hs elements tested were ‘hot’ for retrotransposition (Figures 2a and 2b; Table 1). These 37 ‘hot’ L1Hs elements represent an approximately 4-fold increase in the number of ‘hot’ L1s identified in previous studies (Badge et al., 2003; Brouha et al., 2002; Brouha et al., 2003; Kimberland et al., 1999; Lander et al., 2001; Sassaman et al., 1997). Examination of the 3′ UTR sequences of the 68 L1s uncovered six elements that contain an ACG in place of the Ta subfamily diagnostic ACA characters. These elements are termed ‘pre-Ta’ and represent an older L1 subfamily (Boissinot et al., 2000; Brouha et al., 2003; Kazazian et al., 1988; Lander et al., 2001; Myers et al., 2002; Skowronski et al., 1988). Two pre-Ta L1s (#3-5 and 5-55) were ‘hot’ for retrotransposition (Figure 2b; Table S1). These data agree with previous studies, which showed that a de novo insertion of a pre-Ta L1 into the Factor VIII gene resulted in a sporadic case of hemophilia A (Kazazian et al., 1988).

Hallmarks and insertion locations of L1s identified in this study

We next sequenced each L1Hs element in its entirety and compared these data to fosmid sequences previously deposited in GenBank (Kidd et al., 2008). We annotated each L1 for hallmarks of retrotransposition as well as their chromosomal environment (Table S2). In general, the L1Hs elements were flanked by target-site duplications that ranged from 6 to 20bp, inserted into an L1 endonuclease consensus cleavage sequence (Cost and Boeke, 1998; Feng et al., 1996; Morrish et al., 2002), and their 3′ ends had either homopolymeric poly (A) tails that ranged from ~8–41bp in size or interrupted poly (A) tails/3′ transductions ranging from ~18bp to 1,105bp in length (Table S2) (Goodier et al., 2000; Holmes et al., 1994; Moran et al., 1999; Pickeral et al., 2000).

A subset of the elements (~32/68) contained an additional 1–14bp of untemplated nucleotides at their 5′ ends, termed 5′ end heterogeneity (Athanikar et al., 2004; Lavie et al., 2004). Five of these L1s have an extra G at their 5′ ends, and one has three extra Gs when compared to a ‘hot’ L1Hs consensus sequence (Brouha et al., 2003). These extra nucleotides potentially could result either from a terminal transferase activity associated with the L1 reverse transcriptase, or reverse transcription of the 7-methylguanosine cap at the 5′ end of L1 RNA (Boeke, 2003; Gilbert et al., 2005; Symer et al., 2002). The majority of elements identified were full-length; however, we also found 7 elements (e.g., #1-5 and 2-30) that were truncated within their 5′ UTR. These data, along with the fact that the fosmid libraries provided ~4–5 fold coverage of each haplotype from the 6 individuals (Kidd et al., 2008), indicate that our screening procedure identified the majority of the full-length L1s in these genomes.

The 68 L1Hs elements were dispersed throughout the genome. We did not identify L1Hs elements on chromosomes 16 or 19 (Figure 2c); however, this result probably reflects our small sample size rather than a systematic bias against their ability to insert on these chromosomes (Lander et al., 2001). Consistent with this interpretation, we previously were able to detect the insertion of engineered L1s into chromosomes 16 and 19 of HeLa cells (Gilbert et al., 2005).

Approximately 32% (22/68) of L1Hs elements were present in the introns of known RefSeq genes (http://www.ncbi.nlm.nih.gov/RefSeq/), and mutations in several of these genes are implicated in human genetic disorders (Table S3). Thirteen L1 insertions were in the anti-sense orientation (i.e., were transcribed in the opposite orientation to the gene), whereas 9 L1 insertions were in the same transcriptional orientation as the gene. Since ~26–38% of the genome is spanned by genes (Venter et al., 2001), the data suggest that the L1s have inserted randomly with respect to gene content, which is in agreement with previous studies (Gilbert et al., 2005; Gilbert et al., 2002; Ovchinnikov et al., 2001; Symer et al., 2002).
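The comparison between the observed genic fraction and the fraction of the genome spanned by genes can be made explicit with an exact binomial calculation. The sketch below is an illustration added for clarity and is not an analysis reported in the study; it treats each of the 68 insertions as an independent trial, which is an assumption, and tests the observed count against both ends of the ~26–38% range.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_p(k, n, p):
    """Exact two-sided p-value: sum the probabilities of all outcomes
    that are no more likely than the observed one."""
    observed = binomial_pmf(k, n, p)
    return sum(
        prob
        for prob in (binomial_pmf(i, n, p) for i in range(n + 1))
        if prob <= observed * (1 + 1e-9)  # small tolerance for float ties
    )


genic, total = 22, 68  # intronic insertions / all insertions (from the text)
print(f"observed genic fraction: {genic / total:.3f}")
for gene_fraction in (0.26, 0.38):  # range cited from Venter et al., 2001
    p_value = two_sided_binomial_p(genic, total, gene_fraction)
    print(f"expected fraction {gene_fraction}: two-sided p = {p_value:.2f}")
```

A p-value well above conventional significance thresholds at both ends of the range is compatible with random insertion with respect to gene content, although a sample of 68 insertions has limited power to detect a modest bias.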

Our sequencing studies uncovered several expected trends and some unexpected results. All 37 ‘hot’ L1 elements and the 6 low-level activity elements had two intact open reading frames (ORFs). A consensus sequence derived from these 37 ‘hot’ L1s was identical at the amino acid level to a previously derived consensus (Brouha et al., 2003).

Inactive elements generally had frameshift (5/24) or chain-terminating nonsense mutations (9/24) in at least one of the L1 ORFs. However, 10 of these low-level activity or inactive elements contained two intact open reading frames. One L1 (#3-24) contained an S228P missense mutation within the endonuclease (EN) domain of ORF2p (Feng et al., 1996; Weichenrieder et al., 2004). Though L1s containing EN mutations are unable to retrotranspose in HeLa cells, they can retrotranspose in Chinese Hamster Ovary (CHO) cells deficient in the non-homologous end-joining (NHEJ) pathway of DNA repair, presumably by parasitizing a free 3′ OH group to initiate target-primed reverse-transcription (TPRT) (Morrish et al., 2007; Morrish et al., 2002). Interestingly, although #3-24 is inactive in NHEJ-proficient cell lines, the L1 retrotransposed at roughly 60% of the efficiency of the wild-type control, L1.3, in NHEJ-deficient CHO cells (Morrish et al., 2002). Introducing the S228P change into L1.3 (Sassaman et al., 1997) also allowed efficient EN-independent retrotransposition, indicating that this mutation is largely responsible for the inactivity of #3-24 in HeLa cells (Figure S1).

Analysis of genomic sequences flanking the 68 L1Hs elements revealed a number of interesting findings. The poly (A) tails of 25 L1s were interrupted or contained 3′ transductions (Goodier et al., 2000; Holmes et al., 1994; Moran et al., 1999; Pickeral et al., 2000), 17 of which clustered into ‘subfamilies’ of L1Hs elements. In one case, we identified an L1 (#2-1) as the likely source element for one of these ‘subfamilies’. For #1-3, 3-31, and 1-5, these transductions/interrupted poly (A) tails were identical to those in L1Hs elements that have caused disease-producing mutations (e.g., L1RP, LRE3) (Brouha et al., 2002; Kimberland et al., 1999). In other cases, the transductions denote examples of recently amplified subfamilies (Goodier et al., 2000; Lander et al., 2001; Pickeral et al., 2000).

Examining the 5′ genomic flanks showed that the retrotransposition of a full-length L1 from the ABC9 genomic library (#2-24) that integrated on chromosome 10 was accompanied by ~250bp of an Alu element which maps to chromosome 16. The Alu sequence is in the opposite transcriptional orientation to the L1, 13bp of unmapped sequence separates the elements, and the whole insertion was flanked by target site duplications (TSDs) (Figure S2). Thus, though most of the full-length L1Hs elements identified here have amplified by canonical retrotransposition, recombination and/or replication-mediated repair processes may facilitate the integration of some elements (Gilbert et al., 2005; Gilbert et al., 2002; Symer et al., 2002). Additionally, our screen allowed us to resolve possible sequence anomalies in the HGR. For example, one fosmid that lacks a dimorphic L1Hs element (#6-105) actually contains two L1s (a PA2 and pre-Ta element) that likely were collapsed into a harlequin element during the HGR assembly (Figure S2).

Finally, the data also enabled us to examine allelic heterogeneity associated with L1Hs elements. For example, one L1 (#5-70) was present in the HGR, but contained a stop codon in ORF2 and was not tested for activity (Brouha et al., 2003). Interestingly, #5-70 retrotransposed at ~8% of the level of L1.3, further illustrating how allelic heterogeneity can impact retrotransposon activity (Lutz et al., 2003; Seleme et al., 2006).

Allele frequencies of genotyped elements

The 68 L1Hs elements identified here are dimorphic with respect to presence; thus, we tested if a subset of these L1s represented population-restricted or potentially private alleles. To address this question, we first compiled existing genotyping data (Badge et al., 2003; Myers et al., 2002; Xing et al., 2009). Additional genotyping then was conducted on a subset of the L1s discovered here (26 in total; see Supplemental Information for selection criteria). The 26 L1s first were genotyped in a CEPH panel of 129 unrelated individuals. Nine L1s absent from the CEPH panel then were genotyped in a Zimbabwean panel of 72 unrelated individuals. Finally, if the element was absent from both panels, it was genotyped on the H952 subset of the HGDP consisting of ~1050 individuals from ~51 worldwide populations (Figure 3a and Table S4) (Cann et al., 2002; Rosenberg, 2006).
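The tiered genotyping scheme and the resulting allele frequencies can be summarized in code. This is a minimal sketch, assuming diploid genotypes scored as homozygous present, heterozygous, or absent; the function names and the genotype counts in the usage example are hypothetical, while the panel order and the CEPH panel size (129 individuals) come from the text.

```python
def allele_frequency(n_homozygous, n_heterozygous, n_individuals):
    """Insertion allele frequency in a panel of diploid individuals."""
    if n_individuals <= 0:
        raise ValueError("panel must contain at least one individual")
    return (2 * n_homozygous + n_heterozygous) / (2 * n_individuals)


def panels_to_genotype(present_in_ceph, present_in_zimbabwean=None):
    """Return the panels an element is typed on under the tiered scheme:
    CEPH first, the Zimbabwean panel only if absent from CEPH, and the
    HGDP H952 subset only if absent from both."""
    panels = ["CEPH"]
    if present_in_ceph:
        return panels
    panels.append("Zimbabwean")
    if present_in_zimbabwean:
        return panels
    panels.append("HGDP H952")
    return panels


# Hypothetical example: 3 heterozygous carriers among 129 CEPH individuals.
print(round(allele_frequency(0, 3, 129), 4))
print(panels_to_genotype(present_in_ceph=False, present_in_zimbabwean=False))
```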

Figure 3

Allele frequencies of L1Hs alleles in the population

(a) Genotyping assays: L1s were queried in panels of individuals for their absence (solid grey lines), or presence (red line). Genotyping of 26 elements in the three panels allowed the discovery of population restricted or potentially ‘private’ L1Hs elements. The expected amplicon sizes are diagrammed for element #3-24. (b) Pedigrees showing the inheritance of two elements typed in the ABC10 trio: Genotyping gels show the heritability of #3-31 (African specific) and #3-24 (absent from the HGDP). E and F at the top of the gel image indicate PCR results for empty and filled sites. M, F, and C at the bottom of the image indicate lanes for the mother, father, and child of the trio. (c) Example data sheet for the G248 element #1-5: Empty site: insertion site in the HGR. EN cleavage site: the endonucleolytic cleavage site used by L1 EN to initiate retrotransposition. pA length: the approximate L1 poly (A) tail length; 3′ transductions and interrupted poly (A) tails also are annotated. TSD length: the length of the target site duplication flanking the L1Hs element (underlined lettering). Table S2 contains data sheets for each L1 in this study. Table S3 contains L1Hs insertion locations with respect to genes. Figure S2 displays a non-canonical L1Hs insertion and documents a possible sequence anomaly in the HGR.

Two elements (#3-5 and 3-31) genotyped on the HGDP exist at very low allele frequencies and were found only in Africans. Two other L1Hs elements (#1-5 and 3-24) were absent from the HGDP (Table S4). Element #3-24 (the S228P mutant described above) was found in the ABC10 Yoruban library. Further genotyping of the ABC10 trio revealed that the L1Hs element containing the mutation was present in this individual’s mother (but not her father), excluding a de novo origin (Figure 3b). The other putatively ‘private’ L1Hs element was from G248 (#1-5), so we could not examine its segregation in a trio. Interestingly, this L1 insertion occurred into an intron of the ABCA1 gene (Figure 3c); mutations in ABCA1 have been associated with Tangier disease and low serum HDL levels (Frikke-Schmidt, 2009).

The total number of active L1Hs elements present in ABC13

To estimate the total number of active L1s in one individual, we carried out in silico genotyping of the 68 L1Hs elements in ABC13, the last library examined in our subtractive scheme. We identified 20 regions containing distinct L1 insertions identified in the first 5 individuals that corresponded to insertion fosmids in the ABC13 HGSV track (http://hgsv.washington.edu/) of the UCSC genome browser (Figure 4a, Table S4) (Kent et al., 2002; Kidd et al., 2008). PCR genotyping confirmed that ABC13 contained 18 of these 20 elements (Figure 4b), and was homozygous with respect to presence for three of the elements. This result suggests that in silico genotyping could be used as a screening tool to identify L1Hs elements present at low allele frequencies in the population (Table S4).
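At its core, in silico genotyping of this kind is an interval-overlap query: does an insertion fosmid from the individual of interest span a site where an L1 was found in someone else? The sketch below illustrates the idea under that assumption. It does not reproduce the HGSV track or its file format, and all names and coordinates are hypothetical.

```python
def overlaps(site, interval):
    """True if an insertion site lies on the same chromosome and inside
    the reference interval spanned by a fosmid's end sequences."""
    chrom, position = site
    f_chrom, start, end = interval
    return chrom == f_chrom and start <= position <= end


def in_silico_genotype(known_sites, insertion_fosmids):
    """Map each known L1 site to the insertion fosmids that span it."""
    calls = {}
    for name, site in known_sites.items():
        hits = [f for f in insertion_fosmids if overlaps(site, f)]
        if hits:
            calls[name] = hits
    return calls


# Hypothetical L1 sites found in other individuals: name -> (chrom, position).
known_sites = {"L1_a": ("chr4", 1_250_000), "L1_b": ("chr10", 8_400_000)}
# Hypothetical insertion fosmids from one library: (chrom, start, end).
insertion_fosmids = [("chr4", 1_230_000, 1_264_000), ("chr7", 500_000, 534_000)]

for name, hits in in_silico_genotype(known_sites, insertion_fosmids).items():
    print(name, "is a candidate; supporting fosmids:", hits)
```

Because 18 of the 20 predicted elements were confirmed by PCR in this study, calls produced this way are best treated as candidates that still require experimental validation.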

Figure 4

An estimate of the number of active L1Hs elements in an individual (ABC13) genome

(a) In silico genotyping: The last library in our study, ABC13, was examined in silico (see text) for the presence of insertion fosmids mapping to the location of L1Hs elements found in other individuals. Element 3-17 is used as an example. All blue lines represent insertion fosmids in the genomes of the 8 individuals on the HGSV track (http://hgsv.washington.edu/) of the UCSC genome browser (http://genome.ucsc.edu) (Kent et al., 2002). The ABC7, 8, and 14 libraries were not investigated in this study. (b) PCR validation: The elements identified in silico were genotyped using the scheme shown in Figure 3 to validate the predictions from the HGSV track of the UCSC browser. Element 3-17 is used to illustrate the genotyping. ABC10 and ABC13 are heterozygous with respect to the L1Hs insertion. ABC11 lacks the L1Hs insertion. Table S4 displays genotyping results for all elements in this study.

Adding the 18 L1Hs elements identified by in silico genotyping to the seven novel L1Hs elements identified in the ABC13 genome through our fosmid screens revealed that this individual contains 25/68 L1Hs elements identified in this study. Additional genotyping revealed that this individual contains 2 of the ‘hot’ L1s characterized in a previous study (Table 1) (Brouha et al., 2003). Combining these numbers with our retrotransposition data indicates that the ABC13 genome contains 14 potentially ‘hot’ L1Hs elements, and that at least 3 of these elements are present in a homozygous state.

Estimates of L1 age

Our data suggest that, on average, the 68 L1Hs elements identified here are present at lower allele frequencies, are more active, and may be evolutionarily younger than those in previous studies (Brouha et al., 2003). To test this hypothesis, we derived maximum likelihood estimates for the ages of Ta-1 L1Hs elements in our dataset and that of Brouha et al. (Brouha et al., 2003; Marchani et al., 2009). This analysis revealed that the Ta-1 L1Hs elements identified here are significantly younger (1.0 MY; 95% C.I. 0.98–1.01 MY) than those reported previously (2.01 MY, 95% C.I. 2.00–2.02 MY, Marchani et al., 2009; 1.73 MY, 95% C.I. 1.69–1.77 MY, Brouha et al., 2003).

The maximum likelihood estimated age (Marchani et al., 2009) (1.0 MY) of the L1s reported here differs significantly from that calculated using the ad hoc method, which uses sequence divergence within subfamilies of elements to determine age (Carroll et al., 2001) (1.18 MY old). These two methods are known to be respectively robust (the maximum likelihood method) and sensitive (the ad hoc method) to the presence of multiple active lineages in the dataset (i.e. departures from the master gene model of L1 evolution) (Cordaux et al., 2004). The difference in these two estimates may indicate that members of multiple active L1Hs subfamilies are present in our dataset, and suggests that the true age of the L1s may be younger than either calculation suggests. Indeed, the above data are consistent with the hypothesis that the HGR is strongly biased in favor of older, fixed L1Hs elements.

We next used a neighbor joining approach, rooted with an intact chimpanzee L1 element, to generate a phylogenetic tree of the 68 full-length L1Hs elements (Figure 5, see Extended Experimental Procedures). As predicted, pre-Ta elements were located near the root of the tree. Interestingly, two known (L1RP & LRE3) and five other currently amplifying ‘subfamilies’ clustered together on the tree (Figure 5; see groups of colored elements), even though the interrupted poly (A) tail/transduction sequences themselves were excluded from the sequence alignments.

Figure 5. Phylogenetic tree of the L1Hs elements identified in this study

The tree is a single neighbor-joining tree (with branch lengths corrected using the Kimura 2 parameter model of nucleotide substitution) with 68 full-length elements from our study. The numbers at particular nodes indicate the number of times that node was observed in 1000 bootstrap replicates of the dataset. Only bootstrap values exceeding 70% are shown. The brackets at the right side indicate previously described ‘transduction subfamilies’ (L1RP (labeled RP in the Figure) & LRE3) and distinct L1Hs ‘subfamilies’ currently capable of amplifying in human genomes (I–V) (Goodier et al., 2000; Pickeral et al., 2000). Those subfamilies are highlighted in the same color to show their clustering on the tree. Retrotransposition activity (% relative to L1.3) as well as allele frequency (e.g., AF= 0.012), if determined, is appended to the sequence identifiers. Element #11-17 contains ACG characters in its 3′ UTR, which are diagnostic for pre-Ta L1s; however, the element clusters with the Ta0 subfamily. The tree and age estimates use sequences indicated in the Supplemental Information.

Discussion

We have developed a systematic process to identify novel, dimorphic, active L1Hs elements in genomes of individuals from diverse geographic populations. Many of the newly identified L1Hs elements exist at low allele frequencies in the population and four L1Hs elements represent ‘rare’ alleles, three of which appear to be restricted to Africans. Sequence-based age estimates further reveal that these L1Hs elements appear to be, on average, evolutionarily younger than those identified in previous studies (Brouha et al., 2003; Marchani et al., 2009). These data are consistent with the notion that full-length active L1s are systematically underrepresented in available genome reference sequences (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Sassaman et al., 1997; Sheen et al., 2000; Xing et al., 2009).

Our study has underscored the effectiveness of fosmid paired-end libraries in the discovery of novel, active L1Hs elements. Though a number of technologies have been developed to identify polymorphic L1s (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Moran et al., 1996; Myers et al., 2002; Sheen et al., 2000; Xing et al., 2009), the approach described here is not reliant upon PCR fidelity, readily allowing the identification of active L1Hs elements and making sequencing of genomic flanking sequences, poly (A) tails, and L1-mediated transductions relatively straightforward. Thus, we predict that the fosmid-based approach likely will be superior to second-generation, low-coverage genome sequencing methodologies (e.g., many individual genomes characterized in the 1000 genomes project; http://www.1000genomes.org/page.php) for comprehensively identifying and characterizing ‘rare’ L1 alleles in individual genomes. Indeed, recently published genome sequences highlight the difficulties in detecting and unambiguously mapping highly repetitive insertions (relative to a reference genome), including L1Hs elements (Bentley et al., 2008; McKernan et al., 2009; Wang et al., 2008; Wheeler et al., 2008).

Our analysis revealed that many active L1s cluster in small ‘subfamilies’. In the strictest sense, these data argue against a master gene model (Deininger et al., 1992) and instead support a model in which multiple active source L1Hs elements (including members of both the pre-Ta and Ta-subfamilies) are currently retrotransposing in modern human genomes (Cordaux et al., 2004). We cannot formally exclude a ‘stealth’ model, where L1s in unfavorable expression contexts sometimes give rise to new retrotransposition-competent source elements that can be expressed from a more favorable genomic context (Han et al., 2005). However, the most parsimonious explanation of our data is that multiple source L1Hs elements and subfamilies with limited ‘life-spans’ exist in the genome. We posit that ‘hot’ L1Hs elements must give rise to new, active progeny at a faster rate than they are inactivated by cellular mutational processes (see Figure 6 for model); this can lead to a scenario where small numbers of currently active L1Hs lineages may out-compete older L1s for limiting reagents, such as host factors (Boissinot and Furano, 2001). This competition scenario both supports and extends current lineage succession models and could potentially explain the monophyletic history of L1s and the appearance of a replication-dominant L1Hs subfamily (Boissinot et al., 2000; Cordaux et al., 2004; Seleme et al., 2006).

Figure 6. Multiple source loci model for continued L1Hs activity

An element (source locus) that is both active and in a conducive genomic environment can retrotranspose. Shown here is an example of a progenitor element that can be associated with subsequent members of a ‘family’ through the use of interrupted poly (A) tails and/or 3′ transduced sequence (3′ red arrow and line). Distinct elements are marked by distinguishing TSDs specific for their new integration site (different colored horizontal arrows). There are many of these ‘families’ active in human genomes, such as L1RP, LRE3, and the 5 ‘families’ noted in Figure 5. Although host processes (lightning bolt) may inactivate some older elements, some of their descendants may retain the ability to retrotranspose and could harbor the 3′ transduction/interrupted poly (A) tail.

Our data set is still relatively small, and it remains difficult to estimate the actual number of ‘hot’ L1s in the extant population. However, our ability to readily identify rare ‘hot’ L1s in the genomes of geographically diverse individuals strongly suggests that these highly active L1Hs elements are more abundant in the population than previously appreciated. The active L1Hs elements identified here also have the potential to impact modern human genomes by retrotransposing flanking genomic sequences to new chromosomal locations and by serving as substrates for non-allelic homologous recombination (reviewed in Cordaux and Batzer, 2009; Moran et al., 1999). The proteins encoded by these L1s also may promote the retrotransposition of Alu elements and non-coding RNAs (Bennett et al., 2008; Dewannieux et al., 2003; Garcia-Perez et al., 2007). Indeed, our data support the hypothesis that ‘hot’ L1s are actively retrotransposing in modern-day human genomes and suggest that some of the L1 alleles identified here could serve as source elements for disease-producing L1 insertions.

Experimental Procedures

Creation of Fosmid Libraries and Identification of Insertion-Containing Fosmids

Genomic DNA from the 6 individuals was obtained from transformed lymphoblastoid cell lines (available from the Coriell Cell Repository). The DNA was hydrodynamically sheared, end-repaired, size selected for 40kb fragments by pulsed field gel electrophoresis, and ligated into fosmid vectors (Donahue and Ebling, 2007). Agencourt Biosciences Corporation constructed all libraries, with the exception of the G248 library, which was constructed as part of the human genome project finishing effort. From each library, approximately 1 million individual cloned fragments were arrayed into 384-well plates. End-sequence pairs were obtained from both ends of each DNA fragment using standard capillary sequencing and were mapped back to the HGR. Insertion-containing fosmids were identified as the subset of fosmids containing an apparent insert that was ~3 standard deviations smaller than the library mean (Kidd et al., 2008; Tuzun et al., 2005).
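The in silico selection step can be illustrated with a short sketch. A fosmid that carries sequence absent from the HGR has end-sequences that map closer together on the reference than expected, so its apparent insert size falls below the library distribution. The function name, the dictionary input, and the use of a simple sample mean and standard deviation are assumptions made for illustration; this is not the pipeline of Kidd et al. (2008) or Tuzun et al. (2005).

```python
from statistics import mean, stdev


def flag_insertion_fosmids(spans, n_sd=3.0):
    """Return IDs of clones whose apparent insert size is unusually small.

    spans: dict mapping a clone ID to the distance (bp) between its two
           end-sequences when mapped to the reference (hypothetical input).
    n_sd:  number of standard deviations below the library mean used as
           the cutoff (the text uses ~3).
    """
    values = list(spans.values())
    mu = mean(values)
    sd = stdev(values)
    cutoff = mu - n_sd * sd
    # A clone flagged here spans less reference sequence than expected,
    # which is consistent with extra DNA (an insertion) in the clone.
    return sorted(cid for cid, span in spans.items() if span < cutoff)
```

In practice the library mean and standard deviation would be estimated from roughly a million clones per library, so a single insertion-containing clone has little influence on the cutoff.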

Screening of Fosmid Clones for LINE-1 Insertions

Insertion-containing fosmids identified in silico were screened for L1Hs elements in the following manner. First, all insertion fosmids were subjected to allele-specific oligonucleotide hybridization to identify characters in the 5′ UTRs of newer L1 subfamilies (Badge et al., 2003; Boissinot et al., 2000). This protocol was adapted from ‘hybridization of bacterial DNA on filters’ (Sambrook, 1989). Fosmid DNAs were prepared according to the Very Low-Copy Plasmid/Cosmid Purification protocol for the Qiagen-tip 100 Midi prep kit (Qiagen). Those DNAs were subjected to Southern blotting followed by ATLAS (Badge et al., 2003) and/or direct sequencing to identify L1Hs elements that were absent from the HGR. Sequences flanking the L1Hs elements then were used as probes in BLAT searches at the UCSC genome browser (http://genome.ucsc.edu/) to determine the insertion site in the HGR (Kent, 2002; Kent et al., 2002). Detailed protocols for each step of the screening process, as well as the number of fosmids positive at each stage of the analysis, can be found in the Extended Experimental Procedures.

Cloning of L1s

In general, L1Hs elements were cloned directly from insertion-containing fosmids by digestion with _Acc_I (Sassaman et al., 1997). The restricted DNA was separated on a 0.8% agarose gel, and the ~6kb L1-containing restriction fragment was cloned into an L1 expression vector. This method captures the vast majority of the L1Hs sequence, leaving only the first ~35bp and last ~50bp of the original L1 5′ and 3′ UTRs present in the cloning vector, respectively. One element, #2-42, was refractory to this cloning procedure, as it contains a polymorphism near the 3′ end of ORF2 that creates an additional _Acc_I site. The PDH L1.3 mutant was generated by site-directed mutagenesis. Each L1Hs element was sequenced in its entirety. Detailed protocols for the creation of each construct are included in the Extended Experimental Procedures.
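Whether an element is suitable for this cloning strategy depends on where AccI sites fall in its sequence. The sketch below scans a sequence for AccI sites and reports the fragment sizes of a complete digest. It assumes the commonly cited AccI recognition sequence GTMKAC (M = A or C, K = G or T) with cleavage after the GT; the function names are hypothetical and are not taken from the authors' protocols.

```python
import re

# Assumed AccI recognition sequence: GT^MKAC (M = A/C, K = G/T).
# The degenerate site is its own reverse complement, so scanning one
# strand is sufficient.
ACCI_SITE = re.compile(r"GT[AC][GT]AC")


def acci_sites(seq):
    """Return the 0-based start position of every AccI site in seq."""
    return [m.start() for m in ACCI_SITE.finditer(seq.upper())]


def fragment_lengths(seq):
    """Fragment lengths from a complete AccI digest of a linear sequence."""
    cuts = [pos + 2 for pos in acci_sites(seq)]  # cut after the leading GT
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

An element with an additional internal site, as described for #2-42, would yield a shorter L1-containing fragment than the expected ~6kb band.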

L1 Retrotransposition Assays

We used a modification of a transient transfection protocol to conduct retrotransposition assays in HeLa and CHO cells (Moran et al., 1996; Morrish et al., 2002; Wei et al., 2000). Briefly, cells in 6-well dishes were transfected using the Fugene 6 agent (Roche) with 1μg of plasmid (containing the indicator cassette) per well. Cells were fed with media ~24 hours post plating, and daily from 72 hours with media plus either 400μg/mL G418 or 10μg/mL blasticidin. Fourteen days post transfection, cells were fixed and stained with 0.1% crystal violet. Colonies were counted in the appropriate wells, and these counts were normalized to GFP transfection efficiency. Detailed protocols for culture and assay conditions are found in the Extended Experimental Procedures.
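The normalization can be sketched as follows. The exact calculation is given in the Extended Experimental Procedures, so the division by the GFP-positive fraction and the function names below are assumptions chosen to show the general idea, with activity then expressed as a percentage of the L1.3 reference element as in Figure 5.

```python
def normalized_activity(colonies, gfp_fraction):
    """Colony count corrected for transfection efficiency.

    colonies:     number of drug-resistant colonies counted in a well.
    gfp_fraction: fraction of GFP-positive cells in a parallel
                  transfection (0 < gfp_fraction <= 1); assumed input.
    """
    if not 0 < gfp_fraction <= 1:
        raise ValueError("gfp_fraction must be in the interval (0, 1]")
    return colonies / gfp_fraction


def percent_of_reference(test_activity, reference_activity):
    """Express a normalized activity as a percentage of the reference."""
    if reference_activity <= 0:
        raise ValueError("reference_activity must be positive")
    return 100.0 * test_activity / reference_activity
```

For example, 100 colonies at a GFP-positive fraction of 0.5 and a reference of 400 colonies at the same efficiency would correspond to 25% of the reference activity.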

Genotyping and Panels

The genomic locations of L1Hs insertions were compared to a database of human retrotransposon insertion polymorphisms (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 2006). PCR genotyping assays were designed for a subset of L1Hs elements that were not completely annotated in dbRIP. Genotyping initially was conducted on a CEPH panel of 129 unrelated individuals of Northern European ancestry. If an L1Hs element was absent from the CEPH panel, it was genotyped on a panel containing genomic DNAs from 72 unrelated Zimbabwean individuals. Finally, if an L1Hs element was absent from both genotyping panels, it was genotyped on the H952 subset (Rosenberg, 2006) of the HGDP (Cann et al., 2002) (see Figure 3b). In silico genotyping was conducted using the HGSV track of the UCSC genome browser (Kent et al., 2002; Kidd et al., 2008). Details about these analyses are in the Extended Experimental Procedures.
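Allele frequencies such as those appended to the sequence identifiers in Figure 5 follow from diploid genotype counts. A minimal sketch is given below, assuming each individual is scored as carrying 0, 1, or 2 copies of the insertion allele and that failed assays are recorded as None; the function name and input encoding are illustrative only.

```python
def insertion_allele_frequency(genotypes):
    """Frequency of the insertion allele in a panel of diploid individuals.

    genotypes: iterable of insertion-allele counts per individual
               (0 = empty site homozygote, 1 = heterozygote,
               2 = insertion homozygote, None = assay failed).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no successfully genotyped individuals")
    # Each called individual contributes two chromosomes.
    return sum(called) / (2 * len(called))
```

An element absent from a panel gives a frequency of 0 in that panel, which is the condition that triggers genotyping on the next panel in the scheme described above.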

Estimation of L1 Element Age

Sequences of the 69 full-length L1 elements were classified into subfamilies using the L1Xplorer analysis website (Penzkofer et al., 2005). Ta-1, Ta-0 and Non-Canonical (NC) (Brouha et al., 2003) elements were separately aligned using Muscle 3.52 (Edgar, 2004) on the Phylemon web server (http://phylemon.bioinfo.cipf.es/cgi-bin/home.cgi) (Tarraga et al., 2007). Raw alignments were manually refined to remove all indels, all variable CpG sites and the L1 polypurine tract using Jalview (Waterhouse et al., 2009). Maximum likelihood estimates of the age (T) of each group, the sampling variance of T, and its 95% confidence intervals were calculated using the mleT script (Marchani et al., 2009) running under Matlab 7.2 -2007a (The Mathworks Inc., Natick, MA). The subroutine CountMutations (Marchani et al., 2009) was also utilized to calculate the number of substitutions in the datasets to enable the “ad hoc” subfamily age estimation method (Marchani et al., 2009).
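The idea behind a divergence-based age estimate can be sketched as follows: substitutions are counted against the subfamily consensus, converted to a per-site divergence, and divided by a substitution rate. This is a simplified illustration and not the mleT or CountMutations code of Marchani et al. (2009); the function names are hypothetical, the input is assumed to be a gap-free alignment (indels already removed, as described above), and the substitution rate is a parameter the user must supply.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of equal-length, gap-free sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def count_substitutions(aligned):
    """Total number of differences between each sequence and the consensus."""
    cons = consensus(aligned)
    return sum(a != c for seq in aligned for a, c in zip(seq, cons))


def divergence_age(aligned, rate_per_site_per_year):
    """Age estimate (years): mean divergence from consensus / assumed rate."""
    n_sites = len(aligned[0])
    divergence = count_substitutions(aligned) / (len(aligned) * n_sites)
    return divergence / rate_per_site_per_year
```

Because this calculation treats all elements as descendants of a single source sequence, it is sensitive to the presence of several active lineages in the dataset, which is the limitation of the ad hoc method noted in the Results.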

Phylogenetic Tree

The sequences of the 69 elements were aligned as described above. Raw alignments were manually refined using Jalview (Waterhouse et al., 2009) to remove large indels and truncated elements; this led to the exclusion of #6-113 due to a large 5′ UTR deletion.

A single Neighbor Joining tree of the 68 remaining full-length elements was constructed using the PHYLIP package (Felsenstein, 1989). Branch lengths were corrected using the Kimura 2 parameter model (Kimura, 1980). To assess the reliability of the phylogeny, 1000 bootstrapped re-samples of the multiple alignment were made using the seqboot program of the PHYLIP package (Felsenstein, 1989). The neighbor joining tree derived from the full dataset was manually annotated with bootstrap values using Dendroscope (Huson et al., 2007) (Figure 5). Only bifurcations that occurred in more than 70% of bootstrap re-samples are labeled.
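Two of the calculations underlying this procedure can be illustrated briefly: the Kimura 2 parameter distance between a pair of aligned sequences, and the column resampling that generates one bootstrap replicate. The sketch below is independent of the PHYLIP programs used in the study, and the function names are assumptions. The distance follows the standard formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions.

```python
import math
import random

PURINES = {"A", "G"}


def k2p_distance(s1, s2):
    """Kimura 2 parameter distance between two aligned sequences."""
    pairs = [(a, b) for a, b in zip(s1.upper(), s2.upper())
             if a in "ACGT" and b in "ACGT"]  # skip gaps and ambiguity codes
    n = len(pairs)
    # Transitions stay within purines or within pyrimidines;
    # transversions switch between the two classes.
    ts = sum(a != b and (a in PURINES) == (b in PURINES) for a, b in pairs)
    tv = sum(a != b and (a in PURINES) != (b in PURINES) for a, b in pairs)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(aligned[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in cols) for seq in aligned]
```

A tree is rebuilt from each replicate, and the support for a node is the number of replicates in which the same grouping reappears; with 1000 replicates, the 70% threshold used for Figure 5 corresponds to more than 700 replicates.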

Acknowledgments

We thank Prof. Sir Alec Jeffreys FRS for access to CEPH and Zimbabwean DNA samples, and Prof. Mark Jobling for access to HGDP DNA samples. We thank Dr. Elizabeth Marchani for advice on maximum likelihood age estimates and Dr. José Luis Garcia-Perez for plasmid JJ105/L1.3. We thank Dr. Garcia-Perez and members of the Moran lab for helpful comments. C.R.B. was supported in part by NIH training grants T32GM7544 & T32000040. J.M.K. was supported by a National Science Foundation Graduate Research Fellowship. Work in the laboratory of E.E.E. was supported by grant HG004120. P.C. and C.M. were supported by a Wellcome Trust Project Grant (075163/Z/04/Z) to R.M.B. and Prof. Sir Alec Jeffreys, FRS. J.V.M. is supported by NIH grants GM066695 and GM060518. The University of Michigan Cancer Center Support Grant (5P30CA46592) helped defray sequencing costs incurred in this study. J.V.M. and E.E.E. are Investigators of the Howard Hughes Medical Institute.

Footnotes

Accession Numbers

Accession numbers for all elements are tabulated in the Supplemental Information. Two L1Hs elements (Accession Numbers (#1-5) GU477636 and (#6-102) GU477637) were recently posted in GenBank.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References