Genomix: a method for combining gene-finders' predictions, which uses evolutionary conservation of sequence and intron-exon structure
Avril Coghlan et al. Bioinformatics. 2007.
Abstract
Motivation: Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.
Results: We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron-exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of approximately 1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.
Availability: Scripts and Supplementary Material can be found at http://www.sanger.ac.uk/Software/analysis/genomix
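The abstract reports exon-level specificity and sensitivity without defining them. The sketch below assumes the convention common in the gene-finding literature: a predicted exon counts as correct only if both of its boundaries match a confirmed exon exactly, sensitivity is the fraction of confirmed exons that were predicted correctly, and specificity is the fraction of predicted exons that are correct. The function name and the coordinate tuples are illustrative and are not taken from the Genomix scripts.

```python
def exon_level_accuracy(predicted, confirmed):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) tuples; a prediction is correct only when
    both coordinates match a confirmed exon exactly (assumed convention).
    """
    predicted, confirmed = set(predicted), set(confirmed)
    correct = predicted & confirmed
    sensitivity = len(correct) / len(confirmed) if confirmed else 0.0
    specificity = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Made-up coordinates: four confirmed exons, three predictions,
# one of which has a wrong start coordinate.
confirmed = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]
sn, sp = exon_level_accuracy(predicted, confirmed)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Under this definition, a combiner can raise specificity by discarding poorly conserved exons and raise sensitivity by recovering exons that only some gene-finders predict.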
Figures
Fig. 1
The predictions for the C.elegans ver-1 gene from different gene-finders form one ‘exon cluster’. The gene model shown at the top has been experimentally confirmed. Genomix aims to select the subset of predicted exons that are most likely to be correct, and join them into a frame-consistent gene structure. For ver-1, although none of the input gene-finders predict the correct gene structure, Genomix predicts the correct structure by selecting all the correct predicted exons (grey) and no incorrect predicted exons (black).
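To make "frame-consistent" concrete, here is a minimal sketch of a reading-frame check for a chain of coding exons. It assumes GFF-style phases (the number of bases at the start of an exon that complete a codon begun in the previous exon), forward-strand inclusive coordinates, and a first exon that starts at the start codon. The function names are hypothetical; the Genomix scripts may represent phase differently.

```python
def next_phase(length, phase):
    """Phase required of the following exon, given this exon's length and phase."""
    leftover = (length - phase) % 3      # bases of an unfinished codon at the exon end
    return (3 - leftover) % 3            # bases the next exon must supply


def is_frame_consistent(exons):
    """Check that (start, end, phase) exons join into one open reading frame."""
    expected = 0                         # assumption: first exon begins a codon
    for start, end, phase in sorted(exons):
        if phase != expected:
            return False
        expected = next_phase(end - start + 1, phase)
    return expected == 0                 # the last codon must be complete


# Made-up coordinates: 10 + 8 + 6 = 24 coding bases with compatible phases.
print(is_frame_consistent([(1, 10, 0), (21, 28, 2), (41, 46, 0)]))
```

A combiner that mixes exons from different gene-finders needs a check of this kind, because two exons that are each correct in isolation can still disagree about the reading frame at their junction.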
Fig. 2
Measuring exon conservation. (A) Four different gene-finders predict different coordinates for the fifth exon of C.elegans gene X. (B) Genomix calculates a score for each of the four predicted C.elegans exons, which reflects how well its sequence, intron–exon boundaries, phases and length are conserved relative to the C.briggsae exons in a matching exon cluster. Predicted C.elegans exon 5a has the highest conservation score, followed by 5b, then 5c and 5d.
Fig. 3
Using dynamic programming to select predicted exons. (A) To select the best subset of exons predicted in the C.elegans query exon cluster (here exon cluster 18196, which corresponds to the C.elegans ver-1 locus), dynamic programming is used to find the optimal alignment between the C.elegans exons and the exons in the top matching exon cluster (here C.remanei exon cluster 48051, which corresponds to the C.remanei ver-1 locus). (B) The solution of the dynamic programming algorithm is the optimal alignment between the predicted exons from the query C.elegans exon cluster and the predicted exons from the matching C.remanei exon cluster.
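The caption does not give the recurrence, so the following is only a sketch of one way such a selection could work: a chaining-style dynamic programme that pairs query exons with exons from the matching cluster, never uses two overlapping exons from the same cluster, and maximises the summed pair score. The scoring function `toy_score` is a hypothetical stand-in (length similarity plus phase agreement) for the conservation score described for Fig. 2, and all coordinates are made up.

```python
from typing import Callable, List, Tuple

Exon = Tuple[int, int, int]  # (start, end, phase), inclusive coordinates


def toy_score(q: Exon, t: Exon) -> float:
    """Hypothetical pair score: similar length and equal phase score higher."""
    q_len, t_len = q[1] - q[0] + 1, t[1] - t[0] + 1
    return min(q_len, t_len) / max(q_len, t_len) + (1.0 if q[2] == t[2] else 0.0)


def align_exons(query: List[Exon], target: List[Exon],
                score: Callable[[Exon, Exon], float] = toy_score):
    """Return (best total score, list of (query exon, target exon) pairs)."""
    query, target = sorted(query), sorted(target)
    n, m = len(query), len(target)
    best = [[0.0] * m for _ in range(n)]   # best chain ending with pair (i, j)
    back = [[None] * m for _ in range(n)]  # predecessor pair for traceback
    top, top_cell = 0.0, None
    for i in range(n):
        for j in range(m):
            prev_best, prev_cell = 0.0, None
            for a in range(i):
                if query[a][1] >= query[i][0]:
                    continue               # overlapping query exons cannot chain
                for b in range(j):
                    if target[b][1] >= target[j][0]:
                        continue           # same rule for the matching cluster
                    if best[a][b] > prev_best:
                        prev_best, prev_cell = best[a][b], (a, b)
            best[i][j] = prev_best + score(query[i], target[j])
            back[i][j] = prev_cell
            if best[i][j] > top:
                top, top_cell = best[i][j], (i, j)
    pairs, cell = [], top_cell
    while cell is not None:                # walk back along the best chain
        i, j = cell
        pairs.append((query[i], target[j]))
        cell = back[i][j]
    return top, pairs[::-1]


# Two competing predictions for each of two query exons.
query = [(100, 200, 0), (100, 230, 0), (300, 380, 2), (320, 380, 1)]
target = [(1000, 1100, 0), (1200, 1280, 2)]
total, chosen = align_exons(query, target)
print(total, [q for q, _ in chosen])
```

In this toy input the chain keeps the query exons whose length and phase agree with the matching cluster and drops the competing overlapping predictions, which mirrors the selection step the caption describes.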
Fig. 4
An alternative isoform of a WormBase curated gene that was suggested by Genomix. (A) WormBase release WS147 contained one confirmed transcript for gene T20D3.5, now known as T20D3.5a. Genomix suggested an alternative isoform T20D3.5b. The grey box indicates the extra upstream coding region of T20D3.5b that is missing from T20D3.5a. (B) A multiple alignment of T20D3.5a, T20D3.5b and their C.briggsae, C.remanei, Drosophila melanogaster and human orthologs. The alignment is truncated after the start of T20D3.5a, since T20D3.5a and T20D3.5b are identical after this point. The human and D.melanogaster orthologs were identified from TreeFam (Li et al., 2006). Here ‘human4’ is Ensembl gene ENSG00000189332. Gene predictions for the C.briggsae and C.remanei orthologs were made using Genomix.
References
- Ali KM, Pazzani MJ. Error reduction through learning multiple descriptions. Machine Learning. 1996;24:173–206.
- Brent MR. Genome annotation past, present and future: how to define an ORF at each locus. Genome Res. 2005;15:1777–1786.