LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons

Shujun Ou et al. Plant Physiol. 2018 Feb.

Abstract

Long terminal repeat retrotransposons (LTR-RTs) are prevalent in plant genomes. The identification of LTR-RTs is critical for achieving high-quality gene annotation. Based on the well-conserved structure, multiple programs were developed for the de novo identification of LTR-RTs; however, these programs are associated with low specificity and high false discovery rates. Here, we report LTR_retriever, a multithreading-empowered Perl program that identifies LTR-RTs and generates high-quality LTR libraries from genomic sequences. LTR_retriever demonstrated significant improvements by achieving high levels of sensitivity (91%), specificity (97%), accuracy (96%), and precision (90%) in rice (Oryza sativa). LTR_retriever is also compatible with long sequencing reads. With 40k self-corrected PacBio reads equivalent to 4.5× genome coverage in Arabidopsis (Arabidopsis thaliana), the constructed LTR library showed excellent sensitivity and specificity. In addition to canonical LTR-RTs with 5'-TG…CA-3' termini, LTR_retriever also identifies noncanonical LTR-RTs (non-TGCA), which have been largely ignored in genome-wide studies. We identified seven types of noncanonical LTRs from 42 out of 50 plant genomes. The majority of noncanonical LTRs are Copia elements whose LTRs are four times shorter than those of other Copia elements, which may be a result of their target specificity. Strikingly, non-TGCA Copia elements are often located in genic regions and preferentially insert near or within genes, indicating their impact on the evolution of genes and their potential as mutagenesis tools.

© 2018 American Society of Plant Biologists. All Rights Reserved.

Figures

Figure 1.

The structure of LTR-RTs, their derivatives, and false positives. A, The structure of an intact LTR-RT with LTR (navy pentagons), a pair of dinucleotide palindromic motifs flanking each LTR (magenta triangles), the internal region including protein-coding sequences for gag, pol, and env (green boxes), and a 5-bp target site duplication (TSD) flanking the element (gray boxes). B, A truncated LTR-RT with missing structural components. C, A solo LTR. D, A nested LTR-RT with another LTR-RT inserted into its coding region. E, A false LTR-RT detected due to two adjacent non-LTRs (gray boxes). The counterfeit also features a direct repeat (blue pentagons) but usually has extended sequence similarity on one or both sides of the LTR (orange and brown boxes). Regions a to d are extracted and analyzed by LTR_retriever.
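The structural hallmarks named in this caption (TG…CA termini on each LTR and a 5-bp TSD flanking the element) can be illustrated with a minimal Python sketch. It is not LTR_retriever's actual logic: the function names, the exact-match comparison, and the toy sequences are all assumptions made for illustration, and a real tool would need to tolerate mismatches and noncanonical motifs.

```python
def has_canonical_termini(ltr: str) -> bool:
    """Return True if an LTR sequence starts with TG and ends with CA.

    Hypothetical helper: exact, case-insensitive match only.
    """
    seq = ltr.upper()
    return len(seq) >= 4 and seq.startswith("TG") and seq.endswith("CA")


def has_tsd(left_flank: str, right_flank: str, tsd_len: int = 5) -> bool:
    """Return True if the bases immediately upstream of the element
    equal the bases immediately downstream (a target site duplication).

    left_flank ends where the element begins; right_flank starts where
    the element ends. Exact matching is an assumption of this sketch.
    """
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


def looks_intact_canonical(left_flank: str, ltr5: str, ltr3: str,
                           right_flank: str) -> bool:
    """Combine both checks for a candidate with two LTRs and flanks."""
    return (has_canonical_termini(ltr5)
            and has_canonical_termini(ltr3)
            and has_tsd(left_flank, right_flank))


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(looks_intact_canonical("AAGTCAG", "TGTTAACA", "TGTTAACA", "GTCAGTT"))
```

A candidate such as panel E, where two adjacent repeats mimic a direct repeat, would typically fail the TSD check because the flanking bases on each side are unrelated.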

Figure 2.

Workflow of LTR_retriever. Modules 1 to 8 are indicated in parentheses. *, Optional; supply the -notrunc parameter to deactivate this step. **, Optional; requires -nonTGCA [extra_input_file] to activate this module. ***, Optional; supply the -noanno parameter to deactivate this step.

Figure 3.

Comparison of the performance of LTR-RT recovery programs on the rice genome. LTR libraries of the rice genome were constructed using LTR_STRUC, MGEScan-LTR, LTR_finder, LTRharvest, and LTR_retriever and were then used to identify LTR sequences in the genome using RepeatMasker. Identified candidate sequences were compared with whole-genome LTR sequences recognized by the manually curated standard library. The genomic sizes (bp) of true positives, false positives, true negatives, and false negatives were used to calculate sensitivity, specificity, accuracy, and precision. *, The analysis used optimized parameters (see “Materials and Methods”), while the remaining analyses used default parameters. The output of optimized LTRharvest was used as input for LTR_retriever. Parameters of LTR identity (-similar), alignment seed length (-seed), and TSD search range (-vic) in LTRharvest were optimized based on the sensitivity and FDR of LTR-RT recovery in rice and further applied to other search programs.
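The four metrics in this caption can be computed from base-pair counts as in the short Python sketch below. It assumes the standard confusion-matrix definitions, with FDR taken as 1 − precision; the class and function names are hypothetical, and the numbers in the usage example are toy values, not results from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionBp:
    """Genomic sizes (bp) in each confusion-matrix category."""
    tp: int  # LTR sequence in the standard annotation, also found by the test library
    fp: int  # called LTR by the test library but not by the standard
    tn: int  # non-LTR in both annotations
    fn: int  # LTR in the standard annotation, missed by the test library


def performance(c: ConfusionBp) -> dict:
    """Standard definitions (assumed, not quoted from the paper)."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp),
        "fdr": c.fp / (c.tp + c.fp),  # equals 1 - precision
    }


if __name__ == "__main__":
    # Toy counts chosen for easy arithmetic; not data from the study.
    toy = ConfusionBp(tp=90, fp=10, tn=290, fn=10)
    for name, value in performance(toy).items():
        print(f"{name}: {value:.3f}")
```

Counting in base pairs rather than in numbers of elements means that long elements weigh more heavily than short ones, which matches the caption's description of using genomic size.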

Figure 4.

Direct library construction using self-corrected PacBio reads. A, Identification of intact LTR elements and construction of libraries using the Arabidopsis Ler-0 genome and 20k to 180k self-corrected PacBio reads. B, The performance of custom LTR libraries compared with that from the Arabidopsis reference (Columbia-0) genome.

Figure 5.

Characterization of noncanonical Copia elements in plants. A, Non-TGCA Copia is older than canonical Copia. B, Non-TGCA Copia has a lower ratio of solo LTR to complete LTR, indicating ineffective exclusion for this type of LTR element. C, Non-TGCA Copia elements are associated predominantly with nonrepetitive flanking sequences. D, Non-TGCA Copia elements are located closer to genes than canonical Copia elements. Blue lines represent non-TGCA (noncanonical) Copia elements, and orange lines represent TGCA (canonical) Copia elements. All analyses were based on 50 plant genomes.
