Identifying protein-coding genes in genomic sequences

The genome sequence is an organism's blueprint: the set of instructions dictating its biological traits. The unfolding of these instructions is initiated by the transcription of the DNA into RNA sequences. According to the standard model, the majority of RNA sequences originate from protein-coding genes; that is, they are processed into messenger RNAs (mRNAs) which, after their export to the cytosol, are translated into proteins. While the importance of noncoding RNAs has come to the fore over the past ten years [1-5], proteins are still assumed to be the main functional and structural players in the cell. The delineation of the complete set of protein-coding genes and their alternative splice forms is, therefore, essential to the task of translating the information in the sequence of the genome into biologically relevant knowledge. This is not a trivial task, as illustrated by the fact that many years after the first drafts of the human genome sequence became available [6-8], uncertainty remains regarding the exact number of protein-coding genes [9], a number that might actually vary between individuals - and even between cells within the same individual - as extensive structural variation has been reported in the human genome [10-12].

Even the concept of a 'gene' is under revision. Genes have long been regarded as discrete entities located linearly along chromosomes, but recent investigations have demonstrated extensive transcriptional overlap between different genes. Specifically, genomic regions from otherwise distinct and apparently well characterized protein-coding loci (which may be very far apart in linear genomic space) often appear to combine to produce transcripts with the potential for encoding novel protein species [13, 14].

References

  1. Roma G, Cobellis G, Claudiani P, Maione F, Cruz P, Tripoli G, Sardiello M, Peluso I, Stupka E: A novel view of the transcriptome revealed from gene trapping in mouse embryonic stem cells. Genome Res. 2007, 17: 1051-1060. 10.1101/gr.5720807.
  2. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR: RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007, 316: 1484-1488. 10.1126/science.1138341.
  3. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest AR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, et al: The transcriptional landscape of the mammalian genome. Science. 2005, 309: 1559-1563. 10.1126/science.1112014.
  4. Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification of novel genes coding for small expressed RNAs. Science. 2001, 294: 853-858. 10.1126/science.1064921.
  5. Pheasant M, Mattick JS: Raising the estimate of functional human sequences. Genome Res. 2007, 17: 1245-1253. 10.1101/gr.6406307.
  6. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
  7. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, et al: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
  8. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature. 2004, 431: 931-945. 10.1038/nature03001.
  9. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin M, Kellis M, Lindblad-Toh K, Lander E: Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA. 2007, 104: 19428-19433. 10.1073/pnas.0709013104.
  10. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M: Large-scale copy number polymorphism in the human genome. Science. 2004, 305: 525-528. 10.1126/science.1098918.
  11. Iafrate A, Feuk L, Rivera M, Listewnik M, Donahoe P, Qi Y, Scherer S, Lee C: Detection of large-scale variation in the human genome. Nat Genet. 2004, 36: 949-951. 10.1038/ng1416.
  12. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tüzün E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, et al: Mapping and sequencing of structural variation from eight human genomes. Nature. 2008, 453: 56-64. 10.1038/nature06862.
  13. Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J, Dike S, Wyss C, Henrichsen CN, Holroyd N, Dickson MC, Taylor R, Hance Z, Foissac S, Myers RM, Rogers J, Hubbard T, Harrow J, Guigó R, Gingeras TR, Antonarakis SE, Reymond A: Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007, 17: 746-759. 10.1101/gr.5660607.
  14. Rozowsky JS, Newburger D, Sayward F, Wu J, Jordan G, Korbel JO, Nagalakshmi U, Yang J, Zheng D, Guigó R, Gingeras TR, Weissman S, Miller P, Snyder M, Gerstein MB: The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci. Genome Res. 2007, 17: 732-745. 10.1101/gr.5696007.
  15. Fickett JW: Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 1982, 10: 5303-5318. 10.1093/nar/10.17.5303.
  16. Brent MR, Guigó R: Recent advances in gene structure prediction. Curr Opin Struct Biol. 2004, 14: 264-272. 10.1016/j.sbi.2004.05.007.
  17. Mathé C, Sagot M-F, Schiex T, Rouzé P: Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002, 30: 4103-4117. 10.1093/nar/gkf543.
  18. Jones S: Prediction of genomic functional elements. Annu Rev Genomics Hum Genet. 2006, 7: 315-338. 10.1146/annurev.genom.7.080505.115745.
  19. Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Gräf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kähäri A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, et al: Ensembl 2008. Nucleic Acids Res. 2008, 36 (Database issue): D707-D714.
  20. Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protocols Bioinf. Chapter 1 (Unit 1.4):
  21. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2008, 36 (Database issue): D61-D65.
  22. Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI's LocusLink and RefSeq. Nucleic Acids Res. 2000, 28: 126-128. 10.1093/nar/28.1.126.
  23. Gnomon. [http://www.ncbi.nlm.nih.gov/projects/genome/guide/gnomon.shtml]
  24. Wilming LG, Gilbert JGR, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008, 36 (Database issue): D753-D760.
  25. Searle S, Gilbert J, Iyer V, Clamp M: The Otter annotation system. Genome Res. 2004, 14: 963-970. 10.1101/gr.1864804.
  26. CCDS. [http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi]
  27. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE, Guigó R: GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006, 7 (Suppl 1): S4.1-9. 10.1186/gb-2006-7-s1-s4.
  28. GENCODE. [http://genome.imim.es/gencode]
  29. ENCODE Project Consortium: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
  30. ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) project. Science. 2004, 306: 636-640. 10.1126/science.1105136.
  31. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006, 7 (Suppl 1): S2.1-31. 10.1186/gb-2006-7-s1-s2.
  32. Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L: Quality control of gene predictions. Modern Genome Annotation. Edited by: Frishman D, Valencia A. 2008, The Biosapiens Network, Springer
  33. Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PI, Albrecht M, Hegyi H, Giorgetti A, Raimondo D, Lagarde J, Laskowski RA, López G, Sadowski MI, Watson JD, Fariselli P, Rossi I, Nagy A, Kai W, Størling Z, Orsini M, Assenov Y, Blankenburg H, Huthmacher C, Ramírez F, Schlicker A, Denoeud F, Jones P, Kerrien S, et al: The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci USA. 2007, 104: 5495-5500. 10.1073/pnas.0700800104.
  34. Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Banyai L, Patthy L: Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinf. 2008, 9: 353. 10.1186/1471-2105-9-353.
  35. MisPred. [http://mispred.enzim.hu/]
  36. EPipe 1.0. [http://www.cbs.dtu.dk/services/EPipe-1.0]
  37. Ng P, Wei CL, Sung WK, Chiu KP, Lipovich L, Ang CC, Gupta S, Shahab A, Ridwan A, Wong CH, Liu ET, Ruan Y: Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat Methods. 2005, 2: 105-111. 10.1038/nmeth733.
  38. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SPA, Gingeras TR: Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002, 296: 916-919. 10.1126/science.1068597.
  39. Kapranov P, Drenkow J, Cheng J, Long J, Helt G, Dike S, Gingeras TR: Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 2005, 15: 987-997. 10.1101/gr.3455305.
  40. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 1344-1349. 10.1126/science.1158441.
  41. Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.
  42. Wilhelm B, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett C, Rogers J, Bahler J: Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008, 453: 1239-1243. 10.1038/nature07002.
  43. Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956-960. 10.1126/science.1160342.
  44. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR: Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008, 133: 523-536. 10.1016/j.cell.2008.03.029.
  45. Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods. 2008, 5: 613-619. 10.1038/nmeth.1223.
  46. Akiva P, Toporik A, Edelheit S, Peretz Y, Diber A, Shemesh R, Novik A, Sorek R: Transcription-mediated gene fusion in the human genome. Genome Res. 2006, 16: 30-36. 10.1101/gr.4137606.
  47. Parra G, Reymond A, Dabbouseh N, Dermitzakis ET, Castelo R, Thomson TM, Antonarakis SE, Guigó R: Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. 2006, 16: 37-44. 10.1101/gr.4145906.
  48. Vinckenbosch N, Dupanloup I, Kaessmann H: Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci USA. 2006, 103: 3220-3225. 10.1073/pnas.0511307103.
  49. Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A, Choo SW, Lu Y, Denoeud F, Antonarakis SE, Snyder M, Ruan Y, Wei CL, Gingeras TR, Guigó R, Harrow J, Gerstein MB: Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription, and evolution. Genome Res. 2007, 17: 839-851. 10.1101/gr.5586307.
  50. Marques AC, Dupanloup I, Vinckenbosch N, Reymond A, Kaessmann H: Emergence of young human genes after a burst of retroposition in primates. PLoS Biol. 2005, 3: e357. 10.1371/journal.pbio.0030357.
  51. Werner A, Schmutzler G, Carlile M, Miles C, Peters H: Expression profiling of antisense transcripts on DNA arrays. Physiol Genomics. 2007, 28: 294-300.
  52. Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, Drenkow J, Piccolboni A, Bekiranov S, Helt G, Tammana H, Gingeras TR: Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004, 14: 331-342. 10.1101/gr.2094104.
  53. Katayama S, Tomaru Y, Kasukawa T, Waki K, Nakanishi M, Nakamura M, Nishida H, Yap CC, Suzuki M, Kawai J, Suzuki H, Carninci P, Hayashizaki Y, Wells C, Frith M, Ravasi T, Pang KC, Hallinan J, Mattick J, Hume DA, Lipovich L, Batalov S, Engström PG, Mizuno Y, Faghihi MA, Sandelin A, Chalk AM, Mottagui-Tabar S, Liang Z, Lenhard B, et al: Antisense transcription in the mammalian transcriptome. Science. 2005, 309: 1564-1566. 10.1126/science.1112009.
  54. Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, Hertel J, Lindemeyer M, Reiche K, Tanzer A, Ucla C, Wyss C, Antonarakis SE, Denoeud F, Lagarde J, Drenkow J, Kapranov P, Gingeras TR, Guigó R, Snyder M, Gerstein MB, Reymond A, Hofacker IL, Stadler PF: Structured RNAs in the ENCODE selected regions of the human genome. Genome Res. 2007, 17: 852-864. 10.1101/gr.5650707.
  55. Washietl S, Hofacker IL, Lukasser M, Huttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol. 2005, 23: 1383-1390. 10.1038/nbt1144.
  56. Watanabe T, Totoki Y, Toyoda A, Kaneda M, Kuramochi-Miyagawa S, Obata Y, Chiba H, Kohara Y, Kono T, Nakano T, Surani MA, Sakaki Y, Sasaki H: Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes. Nature. 2008, 453: 539-543. 10.1038/nature06908.
  57. Borel C, Gagnebin M, Gehrig C, Kriventseva EV, Zdobnov EM, Antonarakis SE: Mapping of small RNAs in the human ENCODE regions. Am J Hum Genet. 2008, 82: 971-981. 10.1016/j.ajhg.2008.02.016.
  58. Unneberg P, Claverie J-M: Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data. PLoS ONE. 2007, 2: e254. 10.1371/journal.pone.0000254.
  59. Eddy S: What is a hidden Markov model?. Nat Biotechnol. 2004, 22: 1315-1316. 10.1038/nbt1004-1315.
  60. Djebali S, Kapranov P, Foissac S, Lagarde J, Reymond A, Ucla C, Wyss C, Drenkow J, Dumais E, Murray RR, Lin C, Szeto D, Denoeud F, Calvo M, Frankish A, Harrow J, Makrythanasis P, Vidal M, Salehi-Ashtiani K, Antonarakis SE, Gingeras TR, Guigó R: Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods. 2008, 5: 629-635. 10.1038/nmeth.1216.
