Review
Identifying protein-coding genes in genomic sequences
Jennifer Harrow et al. Genome Biol. 2009.
Abstract
The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.
Figures
Figure 1
Gene-finding strategies. Given a genome DNA sequence, information on the location of genes and transcripts can be obtained from different sources: conservation with one or more informant genomes (1); intrinsic signals involved in gene specification, such as start and stop codons and splice sites (2); the statistical properties of coding sequences (3); and, most importantly, known transcript sequences (either full-length cDNAs or partial ESTs) and protein sequences (4). Over the past two decades, a plethora of programs and strategies has been developed to combine these sources of information to obtain reliable gene predictions. The 'intrinsic' evidence from sequence signals and statistical bias can be combined (using a variety of frameworks often related to hidden Markov models [59]) to produce gene predictions (6). These programs are often referred to as ab initio or de novo gene finders. They are the programs of choice in the absence of known transcript or protein sequences or phylogenetically related genomes. If related genome sequences are available, the intrinsic information can be combined with patterns of genomic sequence conservation using programs often referred to as comparative (or dual- or multi-genome) gene finders (5). With these programs, maximum resolution is achieved when the compared genomes are at a phylogenetic distance such that there is maximum separation between the conservation in coding and noncoding regions. To increase resolution, programs have been developed that use multiple informant genomes. The most sophisticated use an underlying phylogenetic tree to appropriately weight sequence conservation depending on evolutionary distance. If cDNA and EST sequences are available, these often take priority over other sources of information.
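As a rough illustration of the 'intrinsic signals' mentioned above, the following Python sketch scans the forward strand for open reading frames bounded by a start codon and an in-frame stop codon. It is a toy under stated assumptions, not how any real gene finder works: the function name `find_orfs` and the `min_codons` parameter are hypothetical, and splice sites, the reverse strand, and coding statistics are all ignored.

```python
# Toy illustration only: real ab initio gene finders model splice sites and
# coding statistics (often with hidden Markov models); this just pairs
# start codons with the next in-frame stop codon on the forward strand.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the 0-based index of the ATG, `end` is exclusive and
    includes the stop codon. `min_codons` (a hypothetical parameter) is
    the minimum number of codons before the stop, start codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after the stop codon
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

Because eukaryotic genes are interrupted by introns, a scan like this would miss most real exon structures; it only makes concrete what 'start and stop codons' contribute as evidence.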
The initial map of the transcript or protein sequences onto the genome, which can be obtained using a variety of tools, including sequence-similarity searches, is refined using more sophisticated 'splice alignment' algorithms, whose explicit splice-site models allow more precise alignment across gaps corresponding to introns (8). Alternatively, cDNA and protein information can be fed into an ab initio gene-finder algorithm to give information on the exons included in the prediction (7). Often, cDNA and protein evidence is only partial; in such cases, the initial reliable gene and transcript set may be extended with more hypothetical models derived from ab initio or comparative gene finders, or from the genome mapping of cDNA and protein sequences from other species. Pipelines have been derived that automate this multi-step process (9). More recently, programs have been developed that combine the output of many individual gene finders (10). The underlying assumption in these 'combiners' is that consensus across programs increases the likelihood of the predictions. Thus, predictions are weighted according to the particular features of the program producing them. The most general frameworks allow the integration of a great variety of types of predictions - not only gene predictions, but also predictions of individual sites and exons. Despite all the developments in computational gene finding, the most reliable and complete gene annotations are still obtained after the initial alignments of cDNA and proteins onto the genome sequence are inspected manually to establish the exon boundaries of genes and transcripts (11). This is the task carried out by the HAVANA team at the Sanger Institute. The initial manual annotation can be refined even further by subsequent experimental verification of those transcript models lacking sufficiently strong evidence, as in the GENCODE project (12). 
Examples of gene-prediction programs (with references and URLs) corresponding to each strategy outlined here are provided in Additional data file 1.
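The 'combiner' idea described in the caption (weighting predictions by the program that produced them and favouring consensus) can be sketched as a per-base weighted vote. This is a minimal sketch under assumptions, not the method of any specific combiner: the function name `combine_predictions`, the program names, the weights, and the `threshold` parameter are all hypothetical.

```python
def combine_predictions(predictions, weights, length, threshold=0.5):
    """Weighted per-base consensus over exon predictions.

    predictions: dict mapping a program name to a list of half-open
                 (start, end) intervals it predicts as coding.
    weights:     dict mapping each program name to a non-negative weight
                 (hypothetical values; real combiners estimate these).
    Returns the intervals whose weighted support fraction >= threshold.
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    support = [0.0] * length
    for name, intervals in predictions.items():
        covered = set()  # count each program at most once per base
        for start, end in intervals:
            covered.update(range(max(0, start), min(length, end)))
        for pos in covered:
            support[pos] += weights[name]

    consensus = []
    run_start = None
    for pos in range(length + 1):  # the extra step closes a trailing run
        inside = pos < length and support[pos] / total >= threshold
        if inside and run_start is None:
            run_start = pos
        elif not inside and run_start is not None:
            consensus.append((run_start, pos))
            run_start = None
    return consensus


# Hypothetical outputs from three gene finders on a 50-base region.
preds = {
    "finder_a": [(10, 20)],
    "finder_b": [(12, 22)],
    "finder_c": [(30, 40)],
}
weights = {"finder_a": 2.0, "finder_b": 1.0, "finder_c": 1.0}
print(combine_predictions(preds, weights, 50))  # [(10, 20)]
```

Voting per base is the simplest choice; the more general frameworks mentioned in the caption also integrate predictions of individual sites and exons, which this sketch does not attempt.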
Figure 2
Ensembl browser. The ContigView page of the Ensembl browser showing the SPAG4 gene locus on chromosome 20 within the ENCODE region ENr333. (a) The green transcript represents the CCDS coding region agreed on by the CCDS consortium. (b) The blue transcripts are the Vega transcripts, which are manually annotated by the HAVANA group and are a mixture of coding (solid blue) and noncoding (blue outline) transcripts. (c) Finally, the gold transcript represents the coding transcript on which the HAVANA and Ensembl annotations agree.
Similar articles
- Easy Access to and Applications of the Sequences of All Protein-Coding Genes of All Sequenced Mouse Strains.
Timmermans S, Libert C. Trends Genet. 2018 Dec;34(12):899-902. doi: 10.1016/j.tig.2018.08.007. Epub 2018 Sep 19. PMID: 30243593
- Comparison of methods for genomic localization of gene trap sequences.
Harper CA, Huang CC, Stryke D, Kawamoto M, Ferrin TE, Babbitt PC. BMC Genomics. 2006 Sep 18;7:236. doi: 10.1186/1471-2164-7-236. PMID: 16982004 Free PMC article.
- In silico characterization of proteins: UniProt, InterPro and Integr8.
Mulder NJ, Kersey P, Pruess M, Apweiler R. Mol Biotechnol. 2008 Feb;38(2):165-77. doi: 10.1007/s12033-007-9003-x. Epub 2007 Oct 4. PMID: 18219596 Review.
- Current challenges in genome annotation through structural biology and bioinformatics.
Furnham N, de Beer TA, Thornton JM. Curr Opin Struct Biol. 2012 Oct;22(5):594-601. doi: 10.1016/j.sbi.2012.07.005. Epub 2012 Aug 9. PMID: 22884875 Review.
Cited by
- Generation and analysis of the expressed sequence tags from the mycelium of Ganoderma lucidum.
Huang YH, Wu HY, Wu KM, Liu TT, Liou RF, Tsai SF, Shiao MS, Ho LT, Tzean SS, Yang UC. PLoS One. 2013 May 2;8(5):e61127. doi: 10.1371/journal.pone.0061127. Print 2013. PMID: 23658685 Free PMC article.
- Conserved Genome Organization and Core Transcriptome of the Lactobacillus acidophilus Complex.
Crawley AB, Barrangou R. Front Microbiol. 2018 Aug 13;9:1834. doi: 10.3389/fmicb.2018.01834. eCollection 2018. PMID: 30150974 Free PMC article.
- Extension of human lncRNA transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq).
Lagarde J, Uszczynska-Ratajczak B, Santoyo-Lopez J, Gonzalez JM, Tapanari E, Mudge JM, Steward CA, Wilming L, Tanzer A, Howald C, Chrast J, Vela-Boza A, Rueda A, Lopez-Domingo FJ, Dopazo J, Reymond A, Guigó R, Harrow J. Nat Commun. 2016 Aug 17;7:12339. doi: 10.1038/ncomms12339. PMID: 27531712 Free PMC article.
- Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models.
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. BMC Genomics. 2019 Oct 17;20(1):753. doi: 10.1186/s12864-019-6064-8. PMID: 31623555 Free PMC article.
- Contrasting Patterns in the Evolution of Vertebrate MLX Interacting Protein (MLXIP) and MLX Interacting Protein-Like (MLXIPL) Genes.
Singh P, Irwin DM. PLoS One. 2016 Feb 24;11(2):e0149682. doi: 10.1371/journal.pone.0149682. eCollection 2016. PMID: 26910886 Free PMC article.
References
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007;316:1484–1488. doi: 10.1126/science.1138341.
- Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest AR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, et al. The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. doi: 10.1126/science.1112014.
Grants and funding
- 077198/WT_/Wellcome Trust/United Kingdom
- U01 HG003147/HG/NHGRI NIH HHS/United States
- U01 HG003150/HG/NHGRI NIH HHS/United States
- U54 HG004555/HG/NHGRI NIH HHS/United States