Identifying protein-coding genes in genomic sequences - PubMed

Review

Identifying protein-coding genes in genomic sequences

Jennifer Harrow et al. Genome Biol. 2009.

Abstract

The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.


Figures

Figure 1

Gene-finding strategies. Given a genome DNA sequence, information on the location of genes and transcripts can be obtained from different sources: conservation with one or more informant genomes (1); intrinsic signals involved in gene specification, such as start and stop codons and splice sites (2); the statistical properties of coding sequences (3); and, most importantly, known transcript sequences (either full-length cDNAs or partial ESTs) and protein sequences (4). Over the past two decades, a plethora of programs and strategies has been developed to combine these sources of information to obtain reliable gene predictions.

The 'intrinsic' evidence from sequence signals and statistical bias can be combined (using a variety of frameworks often related to hidden Markov models [59]), to produce gene predictions (6). These programs are often referred to as ab initio or de novo gene finders. They are the programs of choice in the absence of known transcript or protein sequences or phylogenetically related genomes.

If related genome sequences are available, the intrinsic information can be combined with patterns of genomic sequence conservation using programs often referred to as comparative (or dual- or multi-genome) gene finders (5). With these programs, maximum resolution is achieved when the compared genomes are at a phylogenetic distance such that there is maximum separation between the conservation in coding and noncoding regions. To increase resolution, programs have been developed that use multiple informant genomes. The most sophisticated use an underlying phylogenetic tree to appropriately weight sequence conservation depending on evolutionary distance.

If cDNA and EST sequences are available, these often take priority over other sources of information. The initial map of the transcript or protein sequences onto the genome, which can be obtained using a variety of tools, including sequence-similarity searches, is refined using more sophisticated 'splice alignment' algorithms, whose explicit splice-site models allow more precise alignment across gaps corresponding to introns (8). Alternatively, cDNA and protein information can be fed into an ab initio gene-finder algorithm to give information on the exons included in the prediction (7). Often, cDNA and protein evidence is only partial; in such cases, the initial reliable gene and transcript set may be extended with more hypothetical models derived from ab initio or comparative gene finders, or from the genome mapping of cDNA and protein sequences from other species. Pipelines have been derived that automate this multi-step process (9).

More recently, programs have been developed that combine the output of many individual gene finders (10). The underlying assumption in these 'combiners' is that consensus across programs increases the likelihood of the predictions. Thus, predictions are weighted according to the particular features of the program producing them. The most general frameworks allow the integration of a great variety of types of predictions - not only gene predictions, but also predictions of individual sites and exons.

Despite all the developments in computational gene finding, the most reliable and complete gene annotations are still obtained after the initial alignments of cDNA and proteins onto the genome sequence are inspected manually to establish the exon boundaries of genes and transcripts (11). This is the task carried out by the HAVANA team at the Sanger Institute. The initial manual annotation can be refined even further by subsequent experimental verification of those transcript models lacking sufficiently strong evidence, as in the GENCODE project (12).

Examples of gene-prediction programs (with references and URLs) corresponding to each strategy outlined here are provided in Additional data file 1.
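
The 'combiner' strategy described in this caption (10), in which predictions are weighted according to the program producing them, can be illustrated with a minimal sketch. This is not the method of any particular published combiner: the program names, weights and threshold below are hypothetical, exons are reduced to plain (start, end) intervals, and the vote is taken independently at each base.

```python
from collections import defaultdict


def combine_predictions(predictions, weights, threshold):
    """Per-base weighted vote over exon predictions.

    predictions: {program_name: [(start, end), ...]}, 1-based inclusive.
    weights:     {program_name: weight}, hypothetical trust in each program.
    threshold:   minimum summed weight for a base to be kept.
    Returns a sorted list of consensus (start, end) intervals.
    """
    score = defaultdict(float)
    for program, exons in predictions.items():
        # Collect covered bases first so overlapping exons from the
        # same program are not counted twice.
        covered = set()
        for start, end in exons:
            covered.update(range(start, end + 1))
        for pos in covered:
            score[pos] += weights[program]

    kept = sorted(pos for pos, s in score.items() if s >= threshold)

    # Merge runs of consecutive kept positions into intervals.
    consensus = []
    for pos in kept:
        if consensus and pos == consensus[-1][1] + 1:
            consensus[-1][1] = pos
        else:
            consensus.append([pos, pos])
    return [tuple(interval) for interval in consensus]


# Hypothetical output of three gene finders on the same sequence.
predictions = {
    "finder_a": [(100, 200), (300, 400)],
    "finder_b": [(110, 200), (500, 550)],
    "finder_c": [(100, 190), (300, 400)],
}
weights = {"finder_a": 0.5, "finder_b": 0.3, "finder_c": 0.2}

consensus_exons = combine_predictions(predictions, weights, threshold=0.6)
print(consensus_exons)
```

In this toy input the exon predicted only by `finder_b` is dropped because its weight alone falls below the threshold, while regions supported by at least two programs are retained. Real combiners typically also score splice sites, reading frame and gene structure, which this sketch ignores.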

Figure 2

Ensembl browser. The ContigView page of the Ensembl browser representing the SPAG4 gene locus on chromosome 20 within the ENCODE region ENr333. (a) The green transcript represents the CCDS coding region agreed on by the CCDS consortium. (b) The blue transcripts are the Vega transcripts, which are manually annotated by the HAVANA group and are a mixture of coding (solid blue) and noncoding (blue outline) transcripts. (c) Finally, the gold transcript represents the coding transcript on which the HAVANA and Ensembl annotations agree.

Cited by

A hidden human proteome encoded by 'non-coding' genes.
Lu S, Zhang J, Lian X, Sun L, Meng K, Chen Y, Sun Z, Yin X, Li Y, Zhao J, Wang T, Zhang G, He QY. Nucleic Acids Res. 2019 Sep 5;47(15):8111-8125. doi: 10.1093/nar/gkz646. PMID: 31340039. Free PMC article.
Nuclear translocation of spike mRNA and protein is a novel feature of SARS-CoV-2.
Sattar S, Kabat J, Jerome K, Feldmann F, Bailey K, Mehedi M. Front Microbiol. 2023 Jan 26;14:1073789. doi: 10.3389/fmicb.2023.1073789. eCollection 2023. PMID: 36778849. Free PMC article.
Annotations for all by all - the BioSapiens network.
Thornton J; BioSapiens Network. Genome Biol. 2009 Feb 10;10(2):401. doi: 10.1186/gb-2009-10-2-401. PMID: 19232072. Free PMC article.
Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors.
Nagy A, Szláma G, Szarka E, Trexler M, Bányai L, Patthy L. Genes (Basel). 2011 Jul 13;2(3):449-501. doi: 10.3390/genes2030449. PMID: 24710207. Free PMC article.
Putative extremely high rate of proteome innovation in lancelets might be explained by high rate of gene prediction errors.
Bányai L, Patthy L. Sci Rep. 2016 Aug 1;6:30700. doi: 10.1038/srep30700. PMID: 27476717. Free PMC article.

