Computational identification of promoters and first exons in the human genome
- Article
- Published: 26 November 2001
Nature Genetics volume 29, pages 412–417 (2001)
A Corrigendum to this article was published on 01 November 2002
Abstract
The identification of promoters and first exons has been one of the most difficult problems in gene-finding. We present a set of discriminant functions that can recognize structural and compositional features such as CpG islands, promoter regions and first splice-donor sites. We explain the implementation of the discriminant functions into a decision tree that constitutes a new program called FirstEF. By using different models to predict CpG-related and non-CpG-related first exons, we showed by cross-validation that the program could predict 86% of the first exons with 17% false positives. We also demonstrated the prediction accuracy of FirstEF at the genome level by applying it to the finished sequences of human chromosomes 21 and 22 as well as by comparing the predictions with the locations of the experimentally verified first exons. Finally, we present the analysis of the predicted first exons for all of the 24 chromosomes of the human genome.
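To make the notion of a CpG island concrete, here is a minimal sketch of the widely used Gardiner-Garden and Frommer criteria (length of at least 200 bp, GC content above 50%, observed/expected CpG ratio above 0.6). It is an illustration only and is not FirstEF's discriminant function; the function names `cpg_stats` and `looks_like_cpg_island` and the default thresholds exposed as parameters are assumptions made for this example.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were placed independently: (#C * #G) / N
    expected = c * g / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the three threshold tests to one candidate window (hypothetical helper)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    window = "CG" * 100  # toy 200 bp window, not real genomic data
    print(cpg_stats(window), looks_like_cpg_island(window))
```

In practice such a test would be slid along a genomic sequence window by window; a classifier like the one described in the abstract would combine a CpG signal of this kind with promoter and splice-donor features rather than rely on it alone.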
Acknowledgements
This work was supported by grants to M.Q.Z. from the National Institutes of Health, and I.G. is also supported by a CSHL Association fellowship. We thank G. Chen for setting up the web interface to FirstEF, as well as N. Banerjee, K. Hermann, H. Herzel, M. Hoffman, D. Holste, W. Li, F. Lillo, M. Ronemus, R. Sachidanandam, K. Rateitschak, A. Schmitt and Z. Xuan for valuable discussions and comments on the manuscript.
Author information
Author notes
- Ramana V. Davuluri
Present address: Human Cancer Genetics Program, The Ohio State University, 420 W. 12th Avenue, TMRF 524, Columbus, Ohio, 43210, USA
Authors and Affiliations
- Cold Spring Harbor Laboratory, Cold Spring Harbor, 11724, New York, USA
Ramana V. Davuluri, Ivo Grosse & Michael Q. Zhang
Corresponding author
Correspondence to Michael Q. Zhang.
Cite this article
Davuluri, R., Grosse, I. & Zhang, M. Computational identification of promoters and first exons in the human genome. Nat Genet 29, 412–417 (2001). https://doi.org/10.1038/ng780
- Received: 03 July 2001
- Accepted: 19 October 2001
- Issue Date: December 2001
- DOI: https://doi.org/10.1038/ng780