Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions
- France Denoeud1,8,
- Philipp Kapranov2,8,
- Catherine Ucla3,
- Adam Frankish4,
- Robert Castelo1,
- Jorg Drenkow2,
- Julien Lagarde1,
- Tyler Alioto5,
- Caroline Manzano3,
- Jacqueline Chrast6,
- Sujit Dike2,
- Carine Wyss3,
- Charlotte N. Henrichsen6,
- Nancy Holroyd4,
- Mark C. Dickson7,
- Ruth Taylor4,
- Zahra Hance4,
- Sylvain Foissac5,
- Richard M. Myers7,
- Jane Rogers4,
- Tim Hubbard4,
- Jennifer Harrow4,
- Roderic Guigó1,5,
- Thomas R. Gingeras2,
- Stylianos E. Antonarakis3, and
- Alexandre Reymond3,6,9
- 1 Grup de Recerca en Informàtica Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain;
- 2 Affymetrix, Inc., Santa Clara, California 95051, USA;
- 3 Department of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland;
- 4 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1HH, United Kingdom;
- 5 Center for Genomic Regulation, 08003 Barcelona, Catalonia, Spain;
- 6 Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland;
- 7 Department of Genetics, Stanford Human Genome Center, Stanford University School of Medicine, Stanford, California 94305-5120, USA
- 8 These authors contributed equally to this work.
Abstract
This report presents a systematic empirical annotation of transcript products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA Elements (ENCODE) pilot project, using a combination of 5′ rapid amplification of cDNA ends (RACE) and high-density resolution tiling arrays. We identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5′ distal to the annotated 5′ terminus and internal to the annotated gene bounds, for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequence away from the main portion of the coding transcript and often overlap with the upstream-annotated gene(s). Notably, at least 20% of the resultant novel transcripts have changes in their open reading frames (ORFs), most of them fusing ORFs of adjacent transcripts. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of very minor splice forms. These results have significant implications for (1) our current understanding of the architecture of protein-coding genes; (2) our views on the locations of regulatory regions in the genome; and (3) the interpretation of sequence polymorphisms mapping to regions hitherto considered to be “noncoding,” ultimately relating to the identification of disease-related sequence alterations.
Footnotes
9 Corresponding author. E-mail alexandre.reymond{at}unil.ch; fax 00 41 21 692 3965.
[Supplemental material is available online at www.genome.org. The sequence data from this study have been submitted to DDBJ/GenBank/EMBL under accession numbers DQ655905-DQ656069 and EF070113-EF070122.]
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5660607
- Received June 19, 2006.
- Accepted January 22, 2007.
Freely available online through the Genome Research Open Access option.
Copyright © 2007, Cold Spring Harbor Laboratory Press