Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data
G3 (Bethesda). 2015 Jun 24;5(8):1721-36.
doi: 10.1534/g3.115.018929.
Beverley B Matthews, Gilberto Dos Santos, Madeline A Crosby, David B Emmert, Susan E St Pierre, L Sian Gramates, Pinglei Zhou, Andrew J Schroeder, Kathleen Falls, Victor Strelets, Susan M Russo, William M Gelbart; FlyBase Consortium
- PMID: 26109357
- PMCID: PMC4528329
- DOI: 10.1534/g3.115.018929
Beverley B Matthews et al. G3 (Bethesda). 2015.
Abstract
We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15-18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.
Keywords: alternative splice; exon junction; lncRNA; transcription start site; transcriptome.
Copyright © 2015 Matthews et al.
Figures
Figure 1
Changes to the FlyBase gene model annotation set in the era of high-throughput data. The gene model annotation set of FlyBase version FB2010_01 (R5.24), the last version to predate FlyBase incorporation of high-throughput data, was compared to that of FlyBase version FB2014_06 (R6.03) to determine the degree of change over the course of 26 annotation updates. Gene model annotations common to both sets (white) were identified. Gene model annotations specific to R6.03 were then examined to identify those derived from R5.24-specific gene model annotations through gene merge/split or reclassification (yellow). The remaining R5.24-specific annotations were classified as “withdrawn” (red), and the R6.03-specific annotations were classified as “new” (blue). The number of gene models within each category of status change is shown for protein-coding genes (A). For 13,003 of the 13,112 protein-coding genes common to both R5.24 and R6.03 (excluding nine complex cases), we also examined the degree to which the number of associated transcripts changed: a measure of gene model complexity. For these genes, the numbers of associated transcripts that persisted (white), were deleted (red), or were added (blue) at some point between R5.24 and R6.03 are shown (B). Changes to the set of annotated pseudogenes (C) and non-coding RNA genes (D) are also shown.
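The classification described in this caption can be sketched as a set comparison. This is a minimal illustration, not the FlyBase pipeline: the function name, the `derived_from` mapping, and the gene IDs in the example are all hypothetical.

```python
def classify_annotations(old_ids, new_ids, derived_from):
    """Sort annotations from two releases into the four caption categories.

    old_ids, new_ids: iterables of annotation IDs for the older and newer release.
    derived_from: hypothetical mapping from a newer-release ID to the set of
    older-release IDs it descends from (merge, split, or reclassification).
    """
    old_ids, new_ids = set(old_ids), set(new_ids)
    common = old_ids & new_ids          # present in both releases
    old_only = old_ids - common         # specific to the older release
    new_only = new_ids - common         # specific to the newer release

    # Newer-release annotations that trace back to an older-release-specific one.
    derived = {n for n in new_only if derived_from.get(n, set()) & old_only}

    # Older-release annotations accounted for by a merge/split/reclassification.
    consumed = set()
    for n in derived:
        consumed |= derived_from[n] & old_only

    return {
        "common": common,
        "derived": derived,
        "withdrawn": old_only - consumed,   # remaining older-release-specific
        "new": new_only - derived,          # remaining newer-release-specific
    }


# Hypothetical IDs: g2 and g3 were merged into g5; g4 was dropped; g6 is novel.
result = classify_annotations(
    old_ids=["g1", "g2", "g3", "g4"],
    new_ids=["g1", "g5", "g6"],
    derived_from={"g5": {"g2", "g3"}},
)
```

With these inputs, g1 is common, g5 is derived, g4 is withdrawn, and g6 is new.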
Figure 2
The Klp54D gene model was split into two genes. A GBrowse1 view of the Klp54D gene model as it existed in R5.30 (A). The gene (blue) and transcript (orange) annotations were based primarily on gene prediction (yellow). On the basis of high-throughput data, this gene model was split in R5.36 to give Klp54D and CG43324, as shown in an updated GBrowse2 view of this same region, as it exists in R6.03 (B). Below the transcript annotations, modENCODE RNA-Seq exon junctions (blue), aligned cDNA evidence (green), and modENCODE RNA-Seq coverage data for 30 developmental stages spanning early embryogenesis to adulthood are shown from top to bottom. The RNA-Seq expression data show that CG43324 is expressed at a much higher level and in more stages than Klp54D. There is also no RNA-Seq exon junction connecting the two genes. In addition, the annotated 5′ end of CG43324 is supported by RAMPAGE TSS data (not shown). More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 3
Alternative transcription start site and 3′ end for CG31717. A GBrowse2 view of CG31717, as it exists in R6.03, depicting (from top to bottom) modENCODE embryonic transcription start site evidence, FlyBase gene and transcript annotations, aligned cDNA evidence, modENCODE RNA-Seq junctions, and modENCODE stranded RNA-Seq expression profiles for CNS tissues (larval, pupal, and adult head samples) and gonadal tissues (testis, accessory gland, virgin female ovary, and mated female ovary); plus strand signal is shown above the minus strand signal for each RNA-Seq track. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 4
New long non-coding RNA genes are supported by RNA-Seq data. A GBrowse2 view for a region containing four recently annotated lncRNA genes is shown (R6.03). CR43132 is supported by RNA-Seq junction and expression data. CR45523, CR45524, and CR45526 are supported by RNA-Seq expression data only; they were identified in a genome-wide scan for intergenic regions with RPKM values of 3 or more. The transcript polarity is determined from the stranded “Gonads and male accessory glands” RNA-Seq expression tracks. CR45523, CR45524, and CR45526 show expression primarily in male testis (red RNA-Seq signal), a pattern common to many of the newly annotated ncRNA genes. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
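As a rough illustration of the RPKM cutoff mentioned in this caption, the sketch below applies the standard RPKM definition (reads per kilobase of region per million mapped reads) and keeps regions at or above 3. The caption does not say how the scan was implemented; the function names, region labels, and read counts here are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def regions_above_threshold(regions, total_mapped_reads, threshold=3.0):
    """Return names of regions whose RPKM meets the threshold.

    regions: list of (name, read_count, length_bp) tuples (hypothetical layout).
    threshold: 3.0 matches the "RPKM values of 3 or more" criterion in the caption.
    """
    return [
        name
        for name, read_count, length_bp in regions
        if rpkm(read_count, length_bp, total_mapped_reads) >= threshold
    ]


# Made-up intergenic regions in a library of 20 million mapped reads.
candidates = [
    ("region_a", 150, 1000),  # 150 reads over 1 kb
    ("region_b", 30, 2000),   # 30 reads over 2 kb
]
selected = regions_above_threshold(candidates, total_mapped_reads=20_000_000)
```

Here region_a has an RPKM of 7.5 and is kept, while region_b has an RPKM of 0.75 and is filtered out.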
Figure 5
New ncRNA gene CR45161 is antisense to fln. CR45161 is a newly annotated antisense gene supported by RNA-Seq expression and junction data. Although it might be mistaken for background transcription in the unstranded “Developmental stage” RNA-Seq expression tracks, its strong transcription on the positive strand is obvious in the stranded “CNS and adult head” RNA-Seq track. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Figure 6
A subset of possible AnxB9 transcript isoforms has been annotated. RNA-Seq junction and expression data predict eight alternative splice donors from three different leading 5′ exons, of which four have been used in annotations. Low-frequency junctions have not been annotated. Alternative splicing in the last intron leads to three different protein isoforms. A low-frequency junction at the 3′ end of the gene has also been excluded. Twelve different transcript isoforms are possible using the annotated junctions (32 are possible with all junctions), but only a subset of the possible combinations has been annotated. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
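The isoform totals in this caption are consistent with a simple product of independent choices at the two ends of the gene. The sketch below assumes, as an interpretation not stated in the caption, that each 5′ splice-donor choice can pair with each 3′ alternative, and that the excluded low-frequency 3′ junction would add a fourth 3′ alternative. All labels are hypothetical.

```python
from itertools import product


def count_isoforms(five_prime_choices, three_prime_choices):
    """Count transcripts formed by pairing every 5' choice with every 3' choice."""
    return sum(1 for _ in product(five_prime_choices, three_prime_choices))


# Annotated junctions: 4 of the 8 splice donors, 3 last-intron alternatives.
annotated_donors = ["donor_1", "donor_2", "donor_3", "donor_4"]
annotated_3prime = ["last_intron_a", "last_intron_b", "last_intron_c"]
annotated_total = count_isoforms(annotated_donors, annotated_3prime)

# All junctions: 8 donors, plus the excluded low-frequency 3' junction
# (assumed here to behave as a fourth 3' alternative).
all_donors = annotated_donors + ["donor_5", "donor_6", "donor_7", "donor_8"]
all_3prime = annotated_3prime + ["low_frequency_3prime"]
all_total = count_isoforms(all_donors, all_3prime)
```

Under these assumptions, `annotated_total` is 4 × 3 = 12 and `all_total` is 8 × 4 = 32, matching the figures quoted in the caption.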
Figure 7
The two nonoverlapping protein isoforms of klar. A GBrowse2 view of klar is shown, as it exists in R6.03, with nonoverlapping isoforms highlighted in yellow (klar-RC and -RI do not overlap klar-RD and -RH). The C-terminus of the longer, "upstream" isoforms (klar-RD and -RH) is sufficient for targeting proteins to lipid droplets, whereas the "KASH" domain present in the "downstream" isoforms (klar-RC and -RI) is sufficient for targeting to the nuclear envelope (Guo et al. 2005). The "upstream" nonoverlapping isoform is necessary for proper lipid droplet targeting in the embryo. While the KASH domain is necessary for nuclear migration in the embryo and retina, this function is associated with the "full-length" KASH-containing isoforms. The short KASH-containing isoform, which lacks motor interaction domains, is expressed (Western blot, immunofluorescence) and is apparently enriched in nurse cells but is not sufficient to rescue nuclear migration in the retina. See Figure 2 and Figure 3 for GBrowse track descriptions. More information on data presented in GBrowse may be found at http://flybase.org/wiki/FlyBase:GBrowse_Tracks.
Grants and funding
- U41 HG00739/HG/NHGRI NIH HHS/United States
- U41 HG000739/HG/NHGRI NIH HHS/United States
- G1000968/MRC_/Medical Research Council/United Kingdom
- P41 HG000739/HG/NHGRI NIH HHS/United States