Gene finding in novel genomes

Ian Korf. BMC Bioinformatics. 2004.

Abstract

Background: Computational gene prediction continues to be an important problem, especially for genomes with little experimental data.

Results: I introduce the SNAP gene finder which has been designed to be easily adaptable to a variety of genomes. In novel genomes without an appropriate gene finder, I demonstrate that employing a foreign gene finder can produce highly inaccurate results, and that the most compatible parameters may not come from the nearest phylogenetic neighbor. I find that foreign gene finders are more usefully employed to bootstrap parameter estimation and that the resulting parameters can be highly accurate.

Conclusion: Since gene prediction is sensitive to species-specific parameters, every genome needs a dedicated gene finder.

Figures

Figure 1

SNAP HMM state diagram. Each state of the HMM is represented by a shape, and transitions between states are represented by arrows. States include N: intergenic; Es: single-exon gene; Ei: initial exon; Et: terminal exon; E0–E2: exons in phases 0–2; I0–I2: introns in phases 0–2. A subscript of T, TA, or TG denotes the last bp or two bp of the intron; this is used to prevent in-frame stop codons across splice junctions.
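
The T, TA, and TG subscripts can be illustrated with a small check for stop codons assembled across a splice junction. This is a minimal sketch, not SNAP's implementation: it assumes the tracked bases are the partial codon still pending when the junction is crossed, and the function names `forbidden_continuations` and `junction_is_allowed` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def forbidden_continuations(pending):
    """Return the base strings that would complete a stop codon after `pending`.

    `pending` is the 1 or 2 bases of a codon already read before the intron
    (an assumption about what the T / TA / TG subscripts track).
    """
    if not 1 <= len(pending) <= 2:
        raise ValueError("pending must be 1 or 2 bases long")
    pending = pending.upper()
    return {stop[len(pending):] for stop in STOP_CODONS if stop.startswith(pending)}

def junction_is_allowed(pending, next_exon_start):
    """True if joining `pending` to the start of the next exon yields no stop codon."""
    needed = 3 - len(pending)
    return next_exon_start[:needed].upper() not in forbidden_continuations(pending)
```

Only the prefixes T, TA, and TG have any forbidden continuation, which is consistent with the caption listing exactly those three subscripts; every other pending prefix can be joined to any downstream bases.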

Figure 2

Codon frequency. The frequency of each degenerate codon is indicated in a species-specific color (At: Arabidopsis thaliana, Ce: Caenorhabditis elegans, Dm: Drosophila melanogaster, Os: Oryza sativa). Codons are grouped by their parent amino acid.
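
A codon frequency table of the kind the caption describes can be computed from a set of coding sequences. The sketch below is illustrative only: it assumes the standard genetic code, in-frame input sequences, and frequencies normalized within each amino acid; the paper's exact normalization is not stated here, and the function name `codon_usage` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds_list):
    """Return {amino_acid: {codon: relative frequency}} for in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skip codons containing N or other symbols
                counts[codon] += 1

    # Group the counts by the amino acid each codon encodes.
    by_aa = defaultdict(dict)
    for codon, aa in CODON_TABLE.items():
        by_aa[aa][codon] = counts[codon]

    # Normalize within each amino acid; amino acids never observed are omitted.
    usage = {}
    for aa, codon_counts in by_aa.items():
        total = sum(codon_counts.values())
        if total:
            usage[aa] = {c: n / total for c, n in codon_counts.items()}
    return usage
```

Running this separately on coding sequences from each species would give one table per genome, which is the comparison the figure draws.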

Figure 3

Pictograms of splice sites and translation start. The height of each letter is proportional to its frequency. At: Arabidopsis thaliana, Ce: Caenorhabditis elegans, Dm: Drosophila melanogaster, Os: Oryza sativa. (a) Splice acceptor site: canonical AG is at positions -2 and -1. (b) Splice donor site: canonical GT is at +1 and +2. (c) Translation start site: canonical ATG is at +1 to +3. (d) Splice acceptor site consensus derived from gene predictions in A. thaliana with C. elegans parameters.
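
The letter heights in such a pictogram correspond to per-position base frequencies across aligned sites. The following is a minimal sketch of that calculation, assuming equal-length, pre-aligned site sequences; the function name `position_frequencies` and the example sites are made up for illustration and are not data from the paper.

```python
from collections import Counter

def position_frequencies(sites):
    """Return one {base: frequency} dict per column of equal-length aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    columns = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sites)
        # Frequencies are taken over all sites; non-ACGT symbols lower the column total.
        columns.append({base: counts[base] / len(sites) for base in "ACGT"})
    return columns

# Hypothetical acceptor-like sites with the canonical AG in the middle two columns.
example = position_frequencies(["CAGG", "TAGG", "CAGA", "TAGG"])
```

In `example`, the two middle columns are invariant (A and G respectively), while the flanking columns are mixed, which is the pattern a pictogram renders as tall versus short letters.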
