JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions

Jonathan E Allen et al. Genome Biol. 2006.

Abstract

Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Although a number of promising approaches have been investigated, no suite of tools has yet emerged that provides near-perfect sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.

Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary to accommodate human DNA in the pipeline. A similar level of effort may be required of us, and of others with similar pipelines, whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments in which various types of evidence and feature states were differentially included in our models, and the impact these variations had on predictive accuracy.

Conclusion: For the non-comparative gene finders, we found that adding model states to represent specific biological features did little to enhance predictive accuracy. For our evidence-based 'combiner' program, by contrast, incorporating additional evidence tracks tended to produce significant gains in accuracy for most evidence types. Together, these results suggest that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.

Figures

Figure 1

Accuracy as a function of training set size. Percentage of correct exons (F score) is shown on the _y_-axis and training set size in thousands is shown on the _x_-axis. Data points (N = 121) are shown in blue; the best fit function of the form y = a/(1 + b·e^(-cx+d)) is shown in red; a = 69.01, b = 0.0152, c = 0.0012, d = 2.09. The curve is effectively flat for values of x above 6,000 (not shown). The curves for nucleotide and gene level accuracies and for the second test set are of very similar shape. F = 2 × Sn × Sp/(Sn + Sp).
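The two formulas in this caption can be evaluated directly. The sketch below is only illustrative: it assumes the fitted form is y = a/(1 + b·e^(-cx+d)) with the whole term -cx+d in the exponent (one reading of the caption), and that x is given in the same units as the caption's _x_-axis. The function names `f_score` and `fitted_accuracy` are hypothetical and do not come from the paper.

```python
import math

# Fitted parameters quoted in the Figure 1 caption.
A, B, C, D = 69.01, 0.0152, 0.0012, 2.09


def f_score(sn, sp):
    """F = 2 * Sn * Sp / (Sn + Sp), the harmonic mean of Sn and Sp."""
    if sn + sp == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2.0 * sn * sp / (sn + sp)


def fitted_accuracy(x):
    """Assumed form of the best-fit curve: a / (1 + b * exp(-c*x + d))."""
    return A / (1.0 + B * math.exp(-C * x + D))


if __name__ == "__main__":
    print(f_score(0.8, 0.6))
    for x in (0, 1000, 3000, 6000):
        print(x, round(fitted_accuracy(x), 2))
```

Under this reading, the curve approaches its asymptote a as x grows, which agrees with the caption's remark that the fit is effectively flat above 6,000.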

Figure 2

The computational gene finding pipeline UMIAGS (University of Maryland Integrative Analysis of Gene Structure). The raw genomic sequence is shown as an input at left; gene structure predictions are emitted at right. Additional evidence tracks for the combiner program JIGSAW are shown entering from the bottom. See text for details. GHMM, generalized hidden Markov model.

Figure 3

HMM for predicting isochore boundaries. States are shown as large circles, with transitions indicated by directed arrows. Transition probabilities are omitted for clarity. Within each outer state is a GHMM profile. States represent isochores, or discrete ranges of G+C density: I = 0-43%, II = 43-51%, III = 51-57%, and IV = 57-100%.
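As a rough illustration of the four G+C ranges listed in this caption, the sketch below assigns a single sequence window to an isochore class from its G+C percentage. This is a simple threshold lookup, not the HMM the figure depicts, and it does not predict boundaries. The handling of values that fall exactly on a threshold (assigned to the higher class) is an assumption, as are the function names.

```python
def gc_percent(seq):
    """G+C percentage of a DNA string, ignoring non-ACGT symbols such as N."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def isochore_class(gc):
    """Map a G+C percentage to classes I-IV using the caption's ranges.

    Assumption: each range is closed below and open above, so 43.0
    falls in class II, 51.0 in class III and 57.0 in class IV.
    """
    for upper_bound, label in ((43.0, "I"), (51.0, "II"), (57.0, "III")):
        if gc < upper_bound:
            return label
    return "IV"


if __name__ == "__main__":
    window = "ATGCGCGCTTAACGGC"
    print(gc_percent(window), isochore_class(gc_percent(window)))
```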

Figure 4

State-transition diagram of the GHMM for GlimmerHMM. The dashed line in the middle separates the positive strand and negative strand portions of the model. Each state in the GHMM is implemented as a separate submodel, such as a weight array matrix or an interpolated Markov model (IMM).

Figure 5

State-transition diagram of the GHMM-based gene finder GeneZilla. Green states were differentially included for the feature-state experiments. Reverse-strand states have been omitted for brevity. A, acceptor site; AATAAA, polyadenylation signal (including ATTAAA); ATG, start codon; b, branch point; CAP, cap site; CpG, CpG island; D, donor site; E, exon; I, intron; N, intergenic; sigP, signal peptide; TATA, TATA box; TAG, stop codon (including TAA and TGA); UTR, untranslated region.

Figure 6

Parsing sequence S into three non-overlapping intervals _t_0, _t_1 and _t_2 with the state assignments _q_1, _q_2 and _q_3, respectively. Position k marks an index in S. The dashed box highlights the evidence overlapping the first interval from position _b_0 to _e_0.
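The idea of a parse and of evidence overlapping one interval can be sketched with a small data structure. Everything here is illustrative: the `Interval` type, the inclusive 0-based coordinates, the requirement that the intervals tile the sequence end to end, and the evidence labels are assumptions made for the example, not details taken from JIGSAW.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    begin: int   # first position, inclusive (assumed convention)
    end: int     # last position, inclusive (assumed convention)
    label: str   # state assignment or evidence source


def is_valid_parse(intervals: List[Interval], seq_len: int) -> bool:
    """True if the intervals are non-overlapping and tile 0..seq_len-1 in order."""
    expected_begin = 0
    for iv in intervals:
        if iv.begin != expected_begin or iv.end < iv.begin:
            return False
        expected_begin = iv.end + 1
    return expected_begin == seq_len


def overlapping(evidence: List[Interval], b: int, e: int) -> List[Interval]:
    """Evidence items that share at least one position with [b, e]."""
    return [ev for ev in evidence if ev.begin <= e and ev.end >= b]


if __name__ == "__main__":
    parse = [Interval(0, 9, "q1"), Interval(10, 24, "q2"), Interval(25, 29, "q3")]
    evidence = [Interval(5, 12, "est"), Interval(15, 20, "protein")]
    print(is_valid_parse(parse, 30))
    print(overlapping(evidence, parse[0].begin, parse[0].end))
```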

Figure 7

Training procedures for building JIGSAW prediction models. Feature vectors are collected from m examples and separated according to each of the six gene feature types. Decision trees are induced for each of the separated training sets, and their output is combined during the prediction procedure.
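The separation step described in this caption can be sketched as a simple grouping of labelled feature vectors by feature type, so that one model could later be trained per group. This is not JIGSAW's code: the tuple layout, the function name, and the feature-type strings in the usage example are hypothetical, and the decision tree induction itself is deliberately left out.

```python
from collections import defaultdict


def split_by_feature_type(examples):
    """Group (feature_type, vector, label) examples by feature type.

    Returns a dict mapping each feature type to a list of
    (vector, label) pairs, i.e. one training set per type.
    """
    groups = defaultdict(list)
    for feature_type, vector, label in examples:
        groups[feature_type].append((vector, label))
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical feature types and evidence scores, for illustration only.
    examples = [
        ("start", [0.9, 0.1], 1),
        ("donor", [0.2, 0.7], 0),
        ("start", [0.4, 0.4], 0),
    ]
    for feature_type, training_set in split_by_feature_type(examples).items():
        print(feature_type, len(training_set))
```

Each resulting training set would then be passed to whatever tree learner is in use, and the per-type outputs combined at prediction time as the caption describes.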
