Genomix: a method for combining gene-finders' predictions, which uses evolutionary conservation of sequence and intron-exon structure

Avril Coghlan et al. Bioinformatics. 2007.

Abstract

Motivation: Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.

Results: We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron-exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of approximately 1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.
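The exon-level figures quoted above can be illustrated with a minimal sketch. It assumes the convention that is common in gene-finder evaluation: a predicted exon is correct only if both boundaries match an annotated exon exactly, sensitivity is the fraction of annotated exons that were predicted, and specificity is the fraction of predicted exons that are correct. The abstract does not spell out its definitions, so the function name `exon_level_accuracy` and the coordinates below are hypothetical.

```python
def exon_level_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact-match exons.

    Exons are (start, end) tuples. A predicted exon counts as correct
    only if both coordinates match an annotated exon exactly
    (an assumed convention, not taken from the abstract).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sensitivity = correct / len(annotated) if annotated else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates: two of three predictions match exactly.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 400), (500, 640)]
sn, sp = exon_level_accuracy(predicted, annotated)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Under this convention, removing incorrect exons raises specificity, and recovering exons that the best single gene-finder missed raises sensitivity.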

Availability: Scripts and Supplementary Material can be found at http://www.sanger.ac.uk/Software/analysis/genomix

Figures

Fig. 1

The predictions for the C.elegans ver-1 gene from different gene-finders form one ‘exon cluster’. The gene model shown at the top has been experimentally confirmed. Genomix aims to select the subset of predicted exons that are most likely to be correct, and join them into a frame-consistent gene structure. For ver-1, although none of the input gene-finders predict the correct gene structure, Genomix predicts the correct structure by selecting all the correct predicted exons (grey) and no incorrect predicted exons (black).

Fig. 2

Measuring exon conservation. (A) Four different gene-finders predict different coordinates for the fifth exon of C.elegans gene X. (B) Genomix calculates a score for each of the four predicted C.elegans exons, which reflects how well its sequence, intron–exon boundaries, phases and length are conserved relative to the C.briggsae exons in a matching exon cluster. The predicted C.elegans exon 5a has the highest conservation score, followed by 5b, then 5c and 5d.
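A toy version of such a score may make the idea in Fig. 2 more concrete. This is only a sketch: the `Exon` class, the `conservation_score` function, the weights and the example values are all hypothetical, and the caption does not give the scoring function that Genomix actually uses. The sketch combines a precomputed sequence identity with phase agreement and a length ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    name: str
    length: int       # coding length in bp
    start_phase: int  # 0, 1 or 2
    end_phase: int    # 0, 1 or 2


def conservation_score(query, target, identity):
    """Toy score in [0, 1]; the weights are illustrative assumptions."""
    # Fraction of the two exon boundaries whose phase is conserved.
    phase_match = ((query.start_phase == target.start_phase)
                   + (query.end_phase == target.end_phase)) / 2
    # 1.0 when the lengths are equal, smaller as they diverge.
    length_ratio = (min(query.length, target.length)
                    / max(query.length, target.length))
    return 0.5 * identity + 0.25 * phase_match + 0.25 * length_ratio


# Hypothetical C.briggsae exon and three candidate C.elegans exons,
# each paired with an assumed sequence identity to the target.
target = Exon("briggsae_5", 150, 0, 0)
candidates = [
    (Exon("5a", 150, 0, 0), 0.90),
    (Exon("5b", 120, 0, 0), 0.85),
    (Exon("5c", 151, 0, 1), 0.80),
]
ranked = sorted(
    candidates,
    key=lambda pair: conservation_score(pair[0], target, pair[1]),
    reverse=True,
)
print([exon.name for exon, _ in ranked])
```

With these made-up values the candidate that matches the target in length and phase ranks first, which mirrors the ordering described in the caption.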

Fig. 3

Using dynamic programming to select predicted exons. (A) To select the best subset of exons predicted in the C.elegans query exon cluster (here exon cluster 18196, which corresponds to the C.elegans ver-1 locus), dynamic programming is used to find the optimal alignment between the C.elegans exons and the exons in the top matching exon cluster (here C.remanei exon cluster 48051, which corresponds to the C.remanei ver-1 locus). (B) The solution of the dynamic programming algorithm is the optimal alignment between the predicted exons from the query C.elegans exon cluster and the predicted exons from the matching C.remanei exon cluster.
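The kind of alignment described in Fig. 3 can be sketched with a standard Needleman-Wunsch style recurrence over two ordered exon lists. This is an illustration under stated assumptions, not the Genomix implementation: the function `align_exons`, the gap penalty, the exon labels and the pairwise scores are hypothetical, and the sketch ignores constraints such as reading-frame consistency between selected exons.

```python
def align_exons(query, target, score, gap=-0.2):
    """Globally align two ordered exon lists by dynamic programming.

    score(q, t) gives the similarity of a query and a target exon.
    Returns (total_score, pairs), where pairs holds the aligned
    (query_exon, target_exon) tuples in order.
    """
    n, m = len(query), len(target)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(query[i - 1], target[j - 1]),
                best[i - 1][j] + gap,   # query exon left unaligned
                best[i][j - 1] + gap,   # target exon left unaligned
            )
    # Trace back from the bottom-right cell to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score(query[i - 1], target[j - 1]):
            pairs.append((query[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Hypothetical pairwise conservation scores; unlisted pairs score -1.0.
scores = {("e1", "r1"): 0.9, ("e2a", "r2"): 0.8,
          ("e2b", "r2"): 0.4, ("e3", "r3"): 0.7}
total, pairs = align_exons(
    ["e1", "e2a", "e2b", "e3"], ["r1", "r2", "r3"],
    lambda q, t: scores.get((q, t), -1.0),
)
selected = [q for q, _ in pairs]
print(selected)
```

In this toy example the two competing predictions `e2a` and `e2b` cannot both align to `r2`, so the alignment keeps the better-scoring one and leaves the other unselected.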

Fig. 4

An alternative isoform of a WormBase curated gene that was suggested by Genomix. (A) WormBase release WS147 contained one confirmed transcript for gene T20D3.5, now known as T20D3.5a. Genomix suggested an alternative isoform T20D3.5b. The grey box indicates the extra upstream coding region of T20D3.5b that is missing from T20D3.5a. (B) A multiple alignment of T20D3.5a, T20D3.5b and their C.briggsae, C.remanei, Drosophila melanogaster and human orthologs. The alignment is truncated after the start of T20D3.5a, since T20D3.5a and T20D3.5b are identical after this point. The human and D.melanogaster orthologs were identified from TreeFam (Li et al., 2006). Here ‘human4’ is Ensembl gene ENSG00000189332. Gene predictions for the C.briggsae and C.remanei orthologs were made using Genomix.
