Exploiting single-molecule transcript sequencing for eukaryotic gene prediction
André E Minoche et al. Genome Biol. 2015.
Abstract
We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings, together with mRNA-seq noise reduction of the assisting Illumina reads, result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource for obtaining comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.
Figures
Fig. 1
Identification of full-insert cDNA sequences in SMRT sequencing data. Colors refer to the different types of sequences that can be encountered within the read data, that is, 5’ and 3’ cDNA synthesis primers, the PacBio SMRT library preparation adapter, and cDNA sequences consisting of 5’ UTR, open reading frame (ORF), 3’ UTR, and poly(A) tail. Initially, reads were subclassified into two groups: SMRT reads consisting of several subreads (left) or individual subreads (right). Reads from both groups were error-corrected and used to identify full-length cDNA sequences
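The read layout described in this caption can be illustrated with a minimal sketch. It assumes that a read counts as full-insert when both cDNA synthesis primers and a poly(A) tail are found, that reads are already error-corrected and oriented so that exact string matching is sufficient, and that the primer sequences and the minimum tail length below are hypothetical placeholders. None of these details are taken from the paper.

```python
# Hypothetical primer sequences and threshold, for illustration only.
FIVE_PRIME_PRIMER = "AAGCAGTGGTATCAACGCAGAGTAC"
THREE_PRIME_PRIMER = "GTACTCTGCGTTGATACCACTGCTT"
MIN_POLY_A = 15  # assumed minimum length of the poly(A) tail


def find_poly_a_start(seq, min_len=MIN_POLY_A):
    """Return the index where the trailing run of 'A' starts, or None if it is too short."""
    i = len(seq)
    while i > 0 and seq[i - 1] == "A":
        i -= 1
    return i if len(seq) - i >= min_len else None


def extract_full_insert(read):
    """Return the cDNA (5' UTR + ORF + 3' UTR) of a full-insert read, else None.

    A read qualifies here only if the 5' primer, a poly(A) tail, and the
    3' primer occur in that order (exact matches, one orientation).
    """
    start = read.find(FIVE_PRIME_PRIMER)
    end = read.rfind(THREE_PRIME_PRIMER)
    if start == -1 or end == -1:
        return None
    insert_start = start + len(FIVE_PRIME_PRIMER)
    if end < insert_start:
        return None
    insert = read[insert_start:end]
    tail_start = find_poly_a_start(insert)
    if tail_start is None:
        return None
    return insert[:tail_start]
```

A real pipeline would need approximate primer matching and handling of both strand orientations; the sketch only shows the order in which the sequence elements are expected.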
Fig. 2
Transcript length distribution. a Length distribution of 29,831 transcript models supported by evidence previously annotated in the RefBeet-1.1 assembly [13]. b Length distribution of SMRT CCS representing full-length transcripts. c Length distribution of transcripts annotated in RefBeet-1.1 that were matched by CCS representing full-length transcripts
Fig. 3
Alignment of full-insert SMRT sequences to identify reliable gene structures. Multiple independent SMRT reads derived from the same gene were used to (a) confirm genes previously predicted using AUGUSTUS default parameters and to (b) identify new gene models without prior annotation. Gene predictions were considered validated if all aligning SMRT sequences indicated the same intron boundaries. For new gene models, the most abundant isoform per locus supported by at least two reads was reported. c Prediction artefact caused by intronic transposable elements and the corrected prediction in BeetSet-2. Numbers next to gene names indicate the percentage of predicted gene features supported by expression evidence
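The two rules stated in this caption can be sketched as follows. This is a simplified illustration, not the authors' scripts: it assumes each alignment is given as a sorted list of exon (start, end) coordinates, that every read covers the whole gene, and that "the same intron boundaries" means agreement with the intron chain of the predicted model. All function names are hypothetical.

```python
from collections import Counter


def intron_chain(exons):
    """Return the introns implied by a sorted exon list as (exon end, next exon start) pairs."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def is_validated(model_exons, read_alignments):
    """True if at least one read aligns and every read shows the model's intron boundaries."""
    if not read_alignments:
        return False
    model = intron_chain(model_exons)
    return all(intron_chain(read) == model for read in read_alignments)


def most_abundant_isoform(read_alignments, min_reads=2):
    """Return the most frequent intron chain at a locus if at least min_reads support it."""
    counts = Counter(intron_chain(read) for read in read_alignments)
    if not counts:
        return None
    chain, support = counts.most_common(1)[0]
    return chain if support >= min_reads else None
```

Comparing intron chains rather than full exon coordinates tolerates differences in transcript start and end positions, which vary between reads even when the splicing pattern is identical.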
Fig. 4
Gene model validation. An initial gene set was calculated based on the sugar beet reference genome (RefBeet-1.2) and publicly available gene expression data [5, 13] using AUGUSTUS default parameters. Genes from the initial gene set were validated using PacBio SMRT sequences and by manual curation. Additional gene models were determined solely from SMRT full-insert sequences. The latter were included to train the parameter set for the final BeetSet-2 gene prediction
Fig. 5
mRNA-seq coverage of sugar beet genes. Each dot represents one sugar beet gene. x-axis: mRNA-seq coverage as in the annotation based on the RefBeet-1.1 assembly; y-axis: mRNA-seq coverage for BeetSet-2 genes. The mRNA-seq data used in the RefBeet-1.1 annotation consisted chiefly of Illumina reads from genotype KWS2320, plus reads from other accessions (total amount: 616.3 million reads). The mRNA-seq data used to generate BeetSet-2 included KWS2320 reads plus isogenic reads from plants grown under stress conditions and their controls (total amount: 923.8 million reads). The overall mRNA-seq coverage increased in BeetSet-2, which improved the prediction of lowly expressed genes
Fig. 6
Workflow of our analyses to improve eukaryotic gene predictions, including the scripts that are part of this publication (highlighted orange). Input and output data are highlighted in bold lettering
References
- Stevens P. Angiosperm Phylogeny Website. 2012. Available at: http://www.mobot.org/MOBOT/research/APweb/.