GENCODE: the reference human genome annotation for The ENCODE Project

Genome Res. 2012 Sep;22(9):1760-74.

doi: 10.1101/gr.135350.111.

Jennifer Harrow, Adam Frankish, Jose M Gonzalez, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen L Aken, Daniel Barrell, Amonida Zadissa, Stephen Searle, If Barnes, Alexandra Bignell, Veronika Boychenko, Toby Hunt, Mike Kay, Gaurab Mukherjee, Jeena Rajan, Gloria Despacio-Reyes, Gary Saunders, Charles Steward, Rachel Harte, Michael Lin, Cédric Howald, Andrea Tanzer, Thomas Derrien, Jacqueline Chrast, Nathalie Walters, Suganthi Balasubramanian, Baikang Pei, Michael Tress, Jose Manuel Rodriguez, Iakes Ezkurdia, Jeltje van Baren, Michael Brent, David Haussler, Manolis Kellis, Alfonso Valencia, Alexandre Reymond, Mark Gerstein, Roderic Guigó, Tim J Hubbard

Jennifer Harrow et al. Genome Res. 2012 Sep.

Abstract

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternatively spliced transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available, with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

Figures

Figure 1.

The GENCODE pipeline. This schematic diagram shows the flow of data between the groups of the GENCODE Consortium. Manual annotation is central to the process but relies on specialized prediction pipelines to provide hints to first-pass annotation and quality control (QC) for completed annotation. Automated annotation supplements manual annotation, the two being merged to produce the GENCODE data set and also to apply QC to the completed annotation. A subset of annotated gene models is subject to experimental validation. The Annotrack tracking system contains data from all groups and is used to highlight differences, coordinate QC, and track outcomes.

Figure 2.

Analysis of exon number of protein-coding and noncoding RNA transcripts. The numbers of exons for each individual transcript annotated at protein-coding and lncRNA loci are plotted for GENCODE 3c (red lines) and GENCODE 7 (blue lines). For each release, darker lines indicate protein-coding transcripts, and lighter lines indicate lncRNA transcripts. The 5′ and 3′ UTR exons of protein-coding transcripts are included.

Figure 3.

A schematic showing the structural annotation of different pseudogene biotypes. The schematic diagram illustrates the categorization of GENCODE pseudogenes on the basis of their origin. Processed pseudogenes are derived by a retrotransposition event and unprocessed pseudogenes by a gene duplication event, in both cases followed by the gain of a disabling mutation. Both processed and unprocessed pseudogenes can retain or gain transcriptional activity, which is reflected in the transcribed_processed_pseudogene and transcribed_unprocessed_pseudogene classifications. Polymorphic pseudogenes contain a disabling mutation in the reference genome but are known to be coding in other individuals, while unitary pseudogenes have functional protein-coding orthologs in other species (we have used mouse as a reference) but contain a fixed disabling mutation in human.
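
The categorization in this caption can be read as a small decision procedure. The following is only an illustrative sketch: the function name, the `origin` labels, and the order in which the checks are applied are assumptions made for the example and are not taken from the GENCODE pipeline. Only the returned biotype names follow the classes named in the caption.

```python
def classify_pseudogene(origin: str,
                        transcribed: bool = False,
                        coding_in_other_individuals: bool = False) -> str:
    """Map a pseudogene's origin to a biotype label (hypothetical helper).

    origin: one of "retrotransposition", "duplication", or "ortholog_loss".
            These three labels are invented for this sketch.
    """
    # Assumption: polymorphic status is checked first, regardless of origin.
    if coding_in_other_individuals:
        return "polymorphic_pseudogene"
    # Functional ortholog elsewhere, fixed disabling mutation in human.
    if origin == "ortholog_loss":
        return "unitary_pseudogene"
    if origin == "retrotransposition":
        base = "processed_pseudogene"
    elif origin == "duplication":
        base = "unprocessed_pseudogene"
    else:
        raise ValueError(f"unknown origin: {origin!r}")
    # Retained or gained transcriptional activity adds the "transcribed_" prefix.
    return f"transcribed_{base}" if transcribed else base
```

For instance, under these assumptions a transcribed locus that arose by retrotransposition would be labeled `transcribed_processed_pseudogene`.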

Figure 4.

Analysis of GENCODE annotation in releases 3c through 7. (A) The content of GENCODE releases 3c to 7 at the locus level for four broad biotypes: protein-coding, pseudogene, long noncoding RNA (lncRNA), and small RNA (sRNA). The yellow section of each column indicates the proportion of loci classified as Level 1 (validated), the blue part as Level 2 (manually annotated), and the red part as Level 3 (automatically annotated). (B) The content of GENCODE releases 3c to 7 at the level of the individual transcript. Again, the yellow section of each column indicates the proportion of transcripts classified as Level 1, the blue part as Level 2, and the red part as Level 3.

Figure 5.

Comparison of polyA features annotated across all chromosomes. The mean number of polyA features (sites plus signals) for all protein-coding loci is plotted for every chromosome for GENCODE 3c (red columns) and 7 (blue columns).

Figure 6.

Examining the length of 5′ and 3′ UTRs between GENCODE 3c and 7. (A) The length of 5′ UTR sequence (in 50-bp bins) for each protein-coding transcript. 5′ UTR annotation from GENCODE 3c (red) and 7 (blue). (*) A cutoff was made at 949 bases; longer 5′ UTRs do exist. (B) The length of 3′ UTR sequence (in 250-bp bins) for each protein-coding transcript. 3′ UTR annotation from GENCODE 3c (red) and 7 (blue).

Figure 7.

(A) Comparing different publicly available gene sets. The protein-coding content of five major publicly available gene sets (GENCODE, AceView, consensus coding sequence [CCDS], RefSeq, and UCSC) was compared at the level of total gene number, total transcript number, and mean transcripts per locus. (Blue) GENCODE data; (orange) AceView; (yellow) CCDS; (green) RefSeq; (red) UCSC. The lncRNA content of three of these gene sets (GENCODE, RefSeq, and UCSC) was also compared at the level of total gene number, total transcript number, and mean transcripts per locus. Again, GENCODE data are shown in blue, RefSeq in green, and UCSC in red. (B) Overlap between GENCODE, RefSeq, and UCSC at the transcript and CDS levels. Both protein-coding and lncRNA transcripts of all data sets were compared at the transcript level. Two transcripts were considered to match if all their exon junction coordinates were identical in the case of multi-exonic transcripts, or if their transcript coordinates were the same for mono-exonic transcripts. Similarly, the CDSs of two protein-coding transcripts matched when the CDS boundaries and the encompassed exon junctions were identical. Numbers in the intersections involving GENCODE are specific to this data set; otherwise they correspond to any of the other data sets.
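
The transcript-level matching rule in panel B can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the published comparison: the `Transcript` class, its field names, and the requirement that chromosome and strand agree are assumptions added for the example, and exons are taken to be `(start, end)` pairs in a single coordinate convention.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) in one consistent coordinate system


@dataclass
class Transcript:  # hypothetical container, not a GENCODE data structure
    chrom: str
    strand: str
    exons: List[Exon]


def junctions(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return (end of exon i, start of exon i+1) for consecutive exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def transcripts_match(a: Transcript, b: Transcript) -> bool:
    # Assumption: transcripts must lie on the same chromosome and strand.
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return False
    # Mono-exonic case: the transcript coordinates themselves must be the same.
    if len(a.exons) == 1 and len(b.exons) == 1:
        return a.exons[0] == b.exons[0]
    # Multi-exonic case: every exon junction must be identical; the outer
    # transcript ends (first exon start, last exon end) are allowed to differ.
    return len(a.exons) > 1 and junctions(a.exons) == junctions(b.exons)
```

Under this rule two multi-exonic models that differ only in the start of the first exon or the end of the last exon still count as the same transcript, whereas two mono-exonic models match only when their coordinates agree exactly.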

Figure 8.

Quality of evidence used to support automatically, manually, and merged annotated transcripts. The level of supporting evidence for automatic only (A), manual only (B), and merged (C) annotated transcripts is shown for each chromosome. (Yellow) The proportion of models with good support; (dark blue) those supported by suspect mRNAs from libraries with known problems with quality; (light green) those with multiple EST support; (orange) those with support from a single EST; (red) those supported by ESTs from suspect libraries; (pale blue) those lacking good support. The number of transcripts across all chromosomes represented in A is 23,855; B, 89,669; and C, 22,535.

Figure 9.

Accessing the GENCODE gene set through UCSC and Ensembl. (A) The composite of screenshots from the UCSC browser shows GENCODE gene annotation displayed in the basic and comprehensive display mode, along with the GENCODE pseudogenes, CCDS models, and a subset of histone modification tracks, DNaseI hypersensitivity clusters, and transcription factor binding site tracks. (B) The configuration display where the user can filter on biotype, annotation method, and transcript type (C). (D) The transcript page in UCSC where the different identifications and version of the transcript can be seen, as well as the evidence used to build the transcript. From the page, the user can click on the Ensembl identification and immediately jump to the Ensembl gene view page (E) and see an overview of the different transcripts in the locus as well as which is a CCDS.
