Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome

Nature Genetics volume 27, pages 337–340 (2001)

Abstract

The approach to annotating a genome critically affects the number and accuracy of genes identified in the genome sequence. Genome annotation based on stringent gene identification is prone to underestimate the complement of genes encoded in a genome. In contrast, over-prediction of putative genes followed by exhaustive computational sequence, motif and structural homology search will find rarely expressed, possibly unique, new genes at the risk of including non-functional genes. We developed a two-stage approach that combines the merits of stringent genome annotation with the benefits of over-prediction. First we identify plausible genes regardless of matches with EST, cDNA or protein sequences from the organism (stage 1). In the second stage, proteins predicted from the plausible genes are compared at the protein level with EST, cDNA and protein sequences, and protein structures from other organisms (stage 2). Remote but biologically meaningful protein sequence or structure homologies provide supporting evidence for genuine genes. The method, applied to the Drosophila melanogaster genome, validated 1,042 novel candidate genes after filtering 19,410 plausible genes, of which 12,124 matched the original 13,601 annotated genes (ref. 1). This annotation strategy is applicable to genomes of all organisms, including human.
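
The two-stage logic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the class names, the `matches_known_annotation` flag, the database labels and the E-value cutoff of 1e-5 are all hypothetical choices made for illustration, and the sketch assumes that a "novel" candidate is a plausible gene that does not match an existing annotation but has at least one supporting homology hit from another organism.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One homology match from stage 2 (labels are illustrative only)."""
    database: str   # e.g. "EST", "cDNA", "protein", "structure"
    evalue: float   # smaller means a stronger match


@dataclass
class PlausibleGene:
    """A stage-1 prediction, accepted regardless of same-organism evidence."""
    gene_id: str
    matches_known_annotation: bool
    hits: list = field(default_factory=list)


def has_homology_support(gene, max_evalue=1e-5):
    # Hypothetical cutoff: the source does not state a threshold.
    return any(h.evalue <= max_evalue for h in gene.hits)


def novel_candidates(plausible, max_evalue=1e-5):
    """Stage 2: keep unannotated predictions that have supporting homology."""
    return [
        g for g in plausible
        if not g.matches_known_annotation
        and has_homology_support(g, max_evalue)
    ]


# Toy input: four over-predicted genes with varying evidence.
genes = [
    PlausibleGene("g1", True, [Hit("protein", 1e-30)]),    # already annotated
    PlausibleGene("g2", False, [Hit("structure", 1e-8)]),  # novel, supported
    PlausibleGene("g3", False, [Hit("EST", 0.5)]),         # hit too weak
    PlausibleGene("g4", False),                            # no hits at all
]
result = [g.gene_id for g in novel_candidates(genes)]
print(result)
```

In this toy input only `g2` survives the filter, mirroring how over-prediction in stage 1 is pruned by homology evidence in stage 2.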

References

  1. Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000).
  2. Rubin, G.M. et al. A Drosophila complementary DNA resource. Science 287, 2222–2224 (2000).
  3. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
  4. Burge, C.B. & Karlin, S. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8, 346–354 (1998).
  5. Reese, M.G. et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10, 483–501 (2000).
  6. Boguski, M.S., Tolstoshev, C.M. & Bassett, D.E. Gene discovery in dbEST. Science 265, 1993–1994 (1994).
  7. Gaasterland, T. & Ragan, M.A. Constructing multigenome views of whole microbial genomes. Microb. Comp. Genomics 3, 177–192 (1998).
  8. Benson, D.A. et al. GenBank. Nucleic Acids Res. 27, 12–17 (1999).
  9. Bhat, T.N. et al. The PDB data uniformity project. Nucleic Acids Res. 29, 214–218 (2001).
  10. Deckert, G. et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392, 353–358 (1998).
  11. Gaasterland, T. et al. MAGPIE/EGRET annotation of the 2.9-Mb Drosophila melanogaster Adh region. Genome Res. 10, 502–510 (2000).
  12. Sánchez, R. & Sali, A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95, 13597–13602 (1998).
  13. Sánchez, R. & Sali, A. ModBase: a database of comparative protein structure models. Bioinformatics 15, 1060–1061 (1999).
  14. Sánchez, R. & Sali, A. Evaluation of comparative protein structure modeling by MODELLER-3. Proteins Suppl. 1, 50–58 (1997).
  15. Martí-Renom, M.A. et al. Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291–325 (2000).
  16. Reese, M.G., Kulp, D., Tammana, H. & Haussler, D. Genie—gene finding in Drosophila melanogaster. Genome Res. 10, 529–538 (2000).
  17. Strausberg, R.L., Feingold, E.A., Klausner, R.D. & Collins, F.S. The mammalian gene collection. Science 286, 455–457 (1999).
  18. Reboul, J. et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nature Genet. 27, 332–336 (2001).
  19. Burley, S.K. et al. Structural genomics: beyond the human genome project. Nature Genet. 23, 151–157 (1999).
  20. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
  21. Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
  22. Henikoff, J., Henikoff, S. & Pietrokovski, S. New features of the Blocks Database servers. Nucleic Acids Res. 27, 226–228 (1999).
  23. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215–219 (1999).
  24. Altschul, S.F. & Koonin, E.V. Iterated profile searches with PSI-BLAST—a tool for discovery in protein databases. Trends Biochem. Sci. 23, 444–447 (1998).
  25. Sali, A. & Blundell, T.L. Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).
  26. Bateman, A. et al. Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins. Nucleic Acids Res. 27, 260–262 (1999).

Acknowledgements

We thank S. Burley, M. Vidal, J. Sorge, J. Goncalves, M. Ashburner, S. Lewis, M. Young and U. Gaul for insights and comments. This work was partially supported by the Mathers, Sinsheimer and Mallinckrodt Foundations, National Cancer Institute Health grant R33CA84699, National Institutes of Health grant P50GM62529, and the National Science Foundation grant DBI-9984882.

Author information

Author notes

  1. Shuba Gopal and Mark Schroeder: These authors contributed equally to this work.

Authors and Affiliations

  1. Laboratories of Computational Genomics, The Rockefeller University, New York, New York, USA
    Shuba Gopal, Mark Schroeder, Ursula Pieper, Alexander Sczyrba, Gulriz Aytekin-Kurban, Stefan Bekiranov, J. Eduardo Fajardo & Terry Gaasterland
  2. Biophysics, The Rockefeller University, New York, New York, USA
    Ursula Pieper, Narayanan Eswar, Roberto Sanchez & Andrej Sali

Authors

  1. Shuba Gopal
  2. Mark Schroeder
  3. Ursula Pieper
  4. Alexander Sczyrba
  5. Gulriz Aytekin-Kurban
  6. Stefan Bekiranov
  7. J. Eduardo Fajardo
  8. Narayanan Eswar
  9. Roberto Sanchez
  10. Andrej Sali
  11. Terry Gaasterland

Corresponding author

Correspondence to Terry Gaasterland.

Cite this article

Gopal, S., Schroeder, M., Pieper, U. et al. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27, 337–340 (2001). https://doi.org/10.1038/85922
