Exploiting single-molecule transcript sequencing for eukaryotic gene prediction

André E Minoche et al. Genome Biol. 2015.

Abstract

We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings, together with mRNA-seq noise reduction of the assisting Illumina reads, result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource for obtaining comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.

Figures

Fig. 1

Identification of full-insert cDNA sequences in SMRT sequencing data. Colors refer to the different types of sequences that can be encountered within the read data, that is 5’ and 3’ cDNA synthesis primers, PacBio SMRT library preparation adapter, and cDNA sequences consisting of 5’ UTR, open reading frame (ORF), 3’ UTR, and poly(A) tail. Initially, reads were subclassified into two groups: SMRT reads consisting of several subreads (left) or individual subreads (right). Reads from both groups were error-corrected and used to identify full-length cDNA sequences
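
The read layout described in this caption can be illustrated with a minimal sketch. It assumes that a full-insert read carries the 5’ primer near its start and a poly(A) tail directly followed by the 3’ primer near its end. The primer sequences, the `min_poly_a` and `window` thresholds, and the function name are hypothetical placeholders, not values from the paper. The sketch also uses exact string matching on sense-oriented reads, whereas real data would likely need error-tolerant matching and handling of both strands.

```python
# Hypothetical placeholder primers; the actual cDNA synthesis primer
# sequences are not given in the caption.
PRIMER_5P = "GCTAGCTTGACC"
PRIMER_3P = "CCGTAAGGTCAG"  # as it would appear at the 3' end of the read


def is_full_insert(read, primer5=PRIMER_5P, primer3=PRIMER_3P,
                   min_poly_a=15, window=100):
    """Return True if the read looks like a full-insert cDNA sequence.

    Assumed criteria (illustrative only):
      * the 5' primer occurs within the first `window` bases, and
      * the 3' primer occurs within the last `window` bases and is
        immediately preceded by at least `min_poly_a` adenines.
    """
    head = read[:window]
    tail = read[-window:]
    if primer5 not in head:
        return False
    p3 = tail.rfind(primer3)  # last occurrence of the 3' primer
    if p3 < 0:
        return False
    # The bases directly upstream of the 3' primer must be poly(A).
    return tail[:p3].endswith("A" * min_poly_a)


# Toy reads: 5' primer + short ORF + poly(A) + 3' primer
orf = "ATGGCTTCTGGAAAATAA"
full_read = PRIMER_5P + orf + "A" * 20 + PRIMER_3P
truncated_read = orf + "A" * 20 + PRIMER_3P  # 5' primer missing
```

With these toy reads, `is_full_insert(full_read)` is true and `is_full_insert(truncated_read)` is false, because the second read lacks the 5’ primer.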

Fig. 2

Transcript length distribution. a Length distribution of 29,831 transcript models supported by evidence previously annotated in the RefBeet-1.1 assembly [13]. b Length distribution of SMRT CCS representing full-length transcripts. c Length distribution of transcripts annotated in RefBeet-1.1 that were matched by CCS representing full-length transcripts

Fig. 3

Alignment of full-insert SMRT sequences to identify reliable gene structures. Multiple independent SMRT reads derived from the same gene were used to (a) confirm genes previously predicted using AUGUSTUS default parameters and to (b) identify new gene models without prior annotation. Gene predictions were considered as validated if all aligning SMRT sequences indicated the same intron boundaries. For new gene models the most abundant isoform per locus supported by at least two reads was reported. c Prediction artefact through intronic transposable elements and corrected prediction in BeetSet-2. Numbers next to gene names indicate the percentage of predicted gene features supported by expression evidence
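
The two rules in this caption can be sketched as follows, assuming each aligned SMRT read has already been reduced to its intron chain, a list of `(start, end)` coordinates. The function names and the data representation are hypothetical. Requiring the reads to match the predicted introns exactly is one possible reading of "the same intron boundaries", not necessarily the authors' implementation.

```python
from collections import Counter


def is_validated(predicted_introns, read_intron_chains):
    """A prediction counts as validated if at least one read aligns and
    every aligning read shows exactly the predicted intron boundaries."""
    if not read_intron_chains:
        return False
    predicted = tuple(predicted_introns)
    return all(tuple(chain) == predicted for chain in read_intron_chains)


def most_abundant_isoform(read_intron_chains, min_reads=2):
    """For a locus without prior annotation, return the intron chain
    seen in the most reads, or None if fewer than `min_reads` reads
    support it. Ties are resolved by first occurrence (an arbitrary
    choice made for this sketch)."""
    counts = Counter(tuple(chain) for chain in read_intron_chains)
    if not counts:
        return None
    chain, n_reads = counts.most_common(1)[0]
    return chain if n_reads >= min_reads else None


# Toy locus: three reads agree on two introns, one read skips an exon.
reads = [
    [(100, 200), (300, 400)],
    [(100, 200), (300, 400)],
    [(100, 200), (300, 400)],
    [(100, 400)],
]
```

For this toy locus, `is_validated([(100, 200), (300, 400)], reads)` is false because one read disagrees, while `most_abundant_isoform(reads)` returns the two-intron chain supported by three reads.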

Fig. 4

Gene model validation. An initial gene set was calculated based on the sugar beet reference genome (RefBeet-1.2) and publicly available gene expression data [5, 13] using AUGUSTUS default parameters. Genes from the initial gene set were validated using PacBio SMRT sequences and by manual curation. Additional gene models were determined solely from SMRT full-insert sequences. The latter were included to train the parameter set for the final BeetSet-2 gene prediction

Fig. 5

mRNA-seq coverage of sugar beet genes. Each dot represents one sugar beet gene. x-axis: mRNA-seq coverage as in the annotation based on the RefBeet-1.1 assembly; y-axis: mRNA-seq coverage for BeetSet-2 genes. The mRNA-seq data used in the RefBeet-1.1 annotation consisted chiefly of Illumina reads from genotype KWS2320, plus reads from other accessions (total amount: 616.3 million reads). The mRNA-seq data used to generate BeetSet-2 included KWS2320 reads plus isogenic reads from plants grown under stress conditions and their controls (total amount: 923.8 million reads). The overall mRNA-seq coverage increased in BeetSet-2, which improved the prediction of lowly expressed genes

Fig. 6

Workflow of our analyses to improve eukaryotic gene predictions, including the scripts that are part of this publication (highlighted orange). Input and output data are highlighted in bold lettering

References

    1. Coghlan A, Fiedler TJ, McKay SJ, Flicek P, Harris TW, Blasiar D, et al. nGASP—the nematode genome annotation assessment project. BMC Bioinformatics. 2008;9:549. doi: 10.1186/1471-2105-9-549.
    2. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–8. doi: 10.1126/science.1162986.
    3. Sharon D, Tilgner H, Grubert F, Snyder M. A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 2013;31:1009–14. doi: 10.1038/nbt.2705.
    4. Stevens P. Angiosperm Phylogeny Website. 2012. Available at: http://www.mobot.org/MOBOT/research/APweb/.
    5. Herwig R, Schulz B, Weisshaar B, Hennig S, Steinfath M, Drungowski M, et al. Construction of a “unigene” cDNA clone set by oligonucleotide fingerprinting allows access to 25 000 potential sugar beet genes. Plant J Cell Mol Biol. 2002;32:845–57. doi: 10.1046/j.1365-313X.2002.01457.x.
