Computational identification of promoters and first exons in the human genome

Nature Genetics volume 29, pages 412–417 (2001)

A Corrigendum to this article was published on 01 November 2002

Abstract

The identification of promoters and first exons has been one of the most difficult problems in gene-finding. We present a set of discriminant functions that can recognize structural and compositional features such as CpG islands, promoter regions and first splice-donor sites. We explain the implementation of the discriminant functions into a decision tree that constitutes a new program called FirstEF. By using different models to predict CpG-related and non-CpG-related first exons, we showed by cross-validation that the program could predict 86% of the first exons with 17% false positives. We also demonstrated the prediction accuracy of FirstEF at the genome level by applying it to the finished sequences of human chromosomes 21 and 22 as well as by comparing the predictions with the locations of the experimentally verified first exons. Finally, we present the analysis of the predicted first exons for all of the 24 chromosomes of the human genome.
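The split between CpG-related and non-CpG-related first exons can be illustrated with a minimal sketch. The abstract does not state how FirstEF scores CpG islands, so the code below is not the program's method. It assumes the widely used Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). The function names and the two model labels are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: routes a sequence window to one of two
# hypothetical models depending on whether it looks like a CpG island.
# Thresholds follow the Gardiner-Garden and Frommer criteria (an assumption;
# the abstract does not give FirstEF's own thresholds).


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def is_cpg_related(window: str, min_len: int = 200,
                   min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """True if the window satisfies all three CpG-island criteria."""
    return (len(window) >= min_len
            and gc_content(window) > min_gc
            and cpg_obs_exp(window) > min_oe)


def choose_model(window: str) -> str:
    """Pick a (hypothetical) model label for the window."""
    return "cpg_related" if is_cpg_related(window) else "non_cpg_related"


if __name__ == "__main__":
    print(choose_model("CG" * 100))  # GC-rich, CpG-dense window
    print(choose_model("AT" * 100))  # AT-only window
```

Using separate models in this way lets each one be trained on first exons whose sequence composition is more homogeneous, which is the motivation the abstract gives for the two-model design.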

Acknowledgements

This work was supported by grants to M.Q.Z. from the National Institutes of Health, and I.G. is also supported by a CSHL Association fellowship. We thank G. Chen for setting up the web interface to FirstEF, as well as N. Banerjee, K. Hermann, H. Herzel, M. Hoffman, D. Holste, W. Li, F. Lillo, M. Ronemus, R. Sachidanandam, K. Rateitschak, A. Schmitt and Z. Xuan for valuable discussions and comments on the manuscript.

Author information

Author notes

  1. Ramana V. Davuluri
    Present address: Human Cancer Genetics Program, The Ohio State University, 420 W. 12th Avenue, TMRF 524, Columbus, Ohio, 43210, USA

Authors and Affiliations

  1. Cold Spring Harbor Laboratory, Cold Spring Harbor, 11724, New York, USA
    Ramana V. Davuluri, Ivo Grosse & Michael Q. Zhang

Authors

  1. Ramana V. Davuluri
  2. Ivo Grosse
  3. Michael Q. Zhang

Corresponding author

Correspondence to Michael Q. Zhang.

About this article

Cite this article

Davuluri, R., Grosse, I. & Zhang, M. Computational identification of promoters and first exons in the human genome. Nat Genet 29, 412–417 (2001). https://doi.org/10.1038/ng780
