GeneID in Drosophila - PubMed

GeneID in Drosophila

G Parra et al. Genome Res. 2000 Apr.

Abstract

GeneID is a program, designed with a hierarchical structure, that predicts genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from these sites. Each exon is scored as the sum of the scores of its defining sites plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons so as to maximize the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were derived for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
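The first two steps described above can be illustrated with a minimal sketch. It is not GeneID's implementation: the toy matrix `SITE_PROBS`, the uniform background, the pseudocount, and the function names are all assumptions made for illustration, and the Markov model here is a plain (non-periodic) order-k chain rather than the model trained in the paper.

```python
import math
from collections import defaultdict

# Hypothetical 4-column donor-site matrix: P(base | true site) per position.
# Real matrices would be estimated from annotated Drosophila genes.
SITE_PROBS = [
    {"A": 0.30, "C": 0.10, "G": 0.50, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},  # the G of GT
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},  # the T of GT
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]


def build_pwm(site_probs, background=0.25):
    """Turn per-position probabilities into log-odds weights (assumed uniform background)."""
    return [{b: math.log(p / background) for b, p in col.items()} for col in site_probs]


def scan_sites(seq, pwm, threshold=0.0):
    """Score every window of the sequence; keep (start, score) pairs above the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score > threshold:
            hits.append((start, score))
    return hits


def train_markov(seqs, order, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) with add-one style smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def coding_llr(seq, coding, noncoding, order):
    """Log-likelihood ratio of the coding model against the non-coding model."""
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) / noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


def exon_score(start_site_score, end_site_score, llr):
    """Exon score: sum of the two defining site scores plus the coding term."""
    return start_site_score + end_site_score + llr


if __name__ == "__main__":
    pwm = build_pwm(SITE_PROBS)
    print(scan_sites("CCGGTACC", pwm))
    coding = train_markov(["ATGGCCGCCGCCGCC"], order=2)
    noncoding = train_markov(["ATATATATATATAT"], order=2)
    print(coding_llr("GCCGCC", coding, noncoding, order=2))
```

Working in log-odds space is what lets the site scores and the coding term be added directly into a single exon score, as the abstract describes.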


Figures

Figure 1

Predictions obtained by GeneID in the region 462500–477500 from the Adh sequence, compared with the annotation in the standard std3 set. In a first step, GeneID identifies and scores all possible donor (blue) and acceptor (yellow) sites, start codons (green), and stop codons (red) using PWMs; the height of each spike is proportional to the site score. GeneID generated a total of 4704 sites along this 15,000-bp region; only the highest-scoring ones are displayed here. In a second step, GeneID builds all exons compatible with these sites. A total of 11,967 exons were built in this particular region (not displayed). Exons are scored as the sum of the scores of the defining sites plus the score of their coding potential, measured according to a Markov model of order 5. The coding potential is displayed along the DNA sequence (MM_score). Regions strong in red are more likely to be coding than regions strong in blue. From the set of predicted exons, the gene structure is generated, maximizing the sum of the scores of the assembled exons. Exons assembled in the predicted genes are drawn with heights proportional to their scores. A two-color code is used to indicate frame compatibility: two adjacent exons are frame compatible if the right half of the upstream exon (the remainder) matches the color of the left half of the downstream exon (the frame). Data are from the gff2ps program (available at http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html). The input GFF and the configuration files required for gff2ps to generate this diagram can be found at http://www1.imim.es/~gparra/GASP1
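The assembly step described in this caption can be sketched as a small dynamic program. This is an illustrative quadratic-time version under stated assumptions, not GeneID's own algorithm: the `Exon` class and `assemble` function are hypothetical names, and `frame` and `remainder` are treated as opaque labels that must be equal for two adjacent exons to be compatible, mirroring the color-matching rule in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int
    end: int
    score: float
    frame: int      # label of the left half in the figure (assumed encoding)
    remainder: int  # label of the right half in the figure (assumed encoding)


def assemble(exons):
    """Return (total_score, chain) for the best chain of non-overlapping,
    frame-compatible exons, maximizing the sum of exon scores."""
    ordered = sorted(exons, key=lambda e: e.end)
    if not ordered:
        return 0.0, []
    best = []  # best[i]: top score of any chain that ends at ordered[i]
    prev = []  # prev[i]: index of the preceding exon in that chain, or None
    for i, exon in enumerate(ordered):
        best_score, best_prev = exon.score, None  # chain made of this exon only
        for j in range(i):
            upstream = ordered[j]
            compatible = upstream.end < exon.start and upstream.remainder == exon.frame
            if compatible and best[j] + exon.score > best_score:
                best_score, best_prev = best[j] + exon.score, j
        best.append(best_score)
        prev.append(best_prev)
    # Trace back from the chain end with the highest total score.
    i = max(range(len(ordered)), key=best.__getitem__)
    total = best[i]
    chain = []
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [
        Exon(1, 100, 5.0, frame=0, remainder=1),
        Exon(50, 150, 9.0, frame=0, remainder=2),
        Exon(200, 300, 4.0, frame=1, remainder=0),
        Exon(210, 320, 6.0, frame=0, remainder=0),
        Exon(400, 500, 3.0, frame=0, remainder=0),
    ]
    total, chain = assemble(candidates)
    print(total, [(e.start, e.end) for e in chain])
```

In the toy input, the two highest-scoring exons are excluded from the winning chain because no compatible neighbor exists for them, which shows how frame compatibility constrains the maximization. With nearly 12,000 candidate exons in a 15,000-bp region, a quadratic loop like this one would be slow in practice, so a production tool would need a more efficient formulation.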

Comment in

A biologist's view of the Drosophila genome annotation assessment project.
Ashburner M. Genome Res. 2000 Apr;10(4):391-3. doi: 10.1101/gr.10.4.391. PMID: 10779478. Review.

Cited by

Scrutinizing the immune defence inventory of Camponotus floridanus applying total transcriptome sequencing.
Gupta SK, Kupper M, Ratzka C, Feldhaar H, Vilcinskas A, Gross R, Dandekar T, Förster F. BMC Genomics. 2015 Jul 22;16(1):540. doi: 10.1186/s12864-015-1748-1. PMID: 26198742. Free PMC article.
EfGD: the Erianthus fulvus genome database.
Qian Z, Li X, He L, Gu S, Shen Q, Rao X, Zhang R, Di Y, Xie L, Wang X, Chen S, Dong Y, Li F. Database (Oxford). 2022 Aug 31;2022:baac076. doi: 10.1093/database/baac076. PMID: 36043401. Free PMC article.
Genomics-driven discovery of the pneumocandin biosynthetic gene cluster in the fungus Glarea lozoyensis.
Chen L, Yue Q, Zhang X, Xiang M, Wang C, Li S, Che Y, Ortiz-López FJ, Bills GF, Liu X, An Z. BMC Genomics. 2013 May 20;14:339. doi: 10.1186/1471-2164-14-339. PMID: 23688303. Free PMC article.
Origin and adaptation to high altitude of Tibetan semi-wild wheat.
Guo W, Xin M, Wang Z, Yao Y, Hu Z, Song W, Yu K, Chen Y, Wang X, Guan P, Appels R, Peng H, Ni Z, Sun Q. Nat Commun. 2020 Oct 8;11(1):5085. doi: 10.1038/s41467-020-18738-5. PMID: 33033250. Free PMC article.
Challenges, Solutions, and Quality Metrics of Personal Genome Assembly in Advancing Precision Medicine.
Xiao W, Wu L, Yavas G, Simonyan V, Ning B, Hong H. Pharmaceutics. 2016 Apr 22;8(2):15. doi: 10.3390/pharmaceutics8020015. PMID: 27110816. Free PMC article. Review.

