JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions

Jonathan E Allen et al. Genome Biol. 2006.

Abstract

Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near-perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.

Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary to accommodate human DNA in our pipeline. We note that a similar level of effort may be required of ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states in our models, and the impact these variations had on predictive accuracy.

Conclusion: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.

Figures

Figure 1

Accuracy as a function of training set size. Percentage of correct exons (F score) is shown on the _y_-axis and training set size in thousands is shown on the _x_-axis. Data points (N = 121) are shown in blue; the best-fit function of the form $y = a/(1 + b e^{-cx+d})$ is shown in red, with a = 69.01, b = 0.0152, c = 0.0012, d = 2.09. The curve is effectively flat for values of x above 6,000 (not shown). The curves for nucleotide- and gene-level accuracies and for the second test set are of very similar shape. F = 2 × Sn × Sp/(Sn + Sp).
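The two formulas in this caption can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes the fitted curve in the caption reads as y = a/(1 + b·exp(−cx + d)), and the function names `f_score` and `fitted_accuracy` are hypothetical. The parameter values are the ones reported in the caption.

```python
import math

# Fitted parameters as reported in the Figure 1 caption.
A, B, C, D = 69.01, 0.0152, 0.0012, 2.09


def f_score(sn, sp):
    """F = 2 * Sn * Sp / (Sn + Sp), as defined in the caption."""
    if sn + sp == 0:
        return 0.0  # avoid division by zero when both terms are zero
    return 2.0 * sn * sp / (sn + sp)


def fitted_accuracy(x, a=A, b=B, c=C, d=D):
    """Assumed reading of the fitted curve: y = a / (1 + b * exp(-c*x + d))."""
    return a / (1.0 + b * math.exp(-c * x + d))


if __name__ == "__main__":
    # Under this reading the curve rises toward the asymptote a as x grows.
    for x in (0, 1000, 3000, 6000):
        print(x, round(fitted_accuracy(x), 2))
```

Under this reading the curve saturates at a, which would be consistent with the caption's remark that it is effectively flat for x above 6,000.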

Figure 2

The computational gene finding pipeline UMIAGS (University of Maryland Integrative Analysis of Gene Structure). The raw genomic sequence is shown as an input at left; gene structure predictions are emitted at right. Additional evidence tracks for the combiner program JIGSAW are shown entering from the bottom. See text for details. GHMM, generalized hidden Markov model.

Figure 3

HMM for predicting isochore boundaries. States are shown as large circles, with transitions indicated by directed arrows. Transition probabilities are omitted for clarity. Within each outer state is a GHMM profile. States represent isochores, or discrete ranges of G+C density: I = (0-43%), II = (43-51%), III = (51-57%), and IV = (57-100%).
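To make the four G+C ranges concrete, the sketch below assigns a single sequence window to an isochore class by fixed thresholds. This is a deliberately simpler technique than the HMM in the figure, which finds boundaries along the sequence. The names `gc_percent` and `isochore_class` are hypothetical, and treating each upper bound as exclusive is an assumption, since the caption does not say which class owns a boundary value.

```python
# Upper G+C bounds (percent) for isochores I-III, taken from the caption.
# Anything at or above the last bound is assigned to isochore IV.
ISOCHORE_BOUNDS = [(43.0, "I"), (51.0, "II"), (57.0, "III")]


def gc_percent(seq):
    """Return the G+C percentage of a DNA string, ignoring non-ACGT symbols."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def isochore_class(seq):
    """Map a sequence window to isochore I-IV by its G+C percentage."""
    pct = gc_percent(seq)
    for upper, label in ISOCHORE_BOUNDS:
        if pct < upper:  # assumption: upper bounds are exclusive
            return label
    return "IV"


if __name__ == "__main__":
    print(isochore_class("ATGCATGCAT"))
```

A threshold lookup like this classifies one window at a time. An HMM over the same four classes would additionally penalize rapid switching between neighboring windows.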

Figure 4

State-transition diagram of the GHMM for GlimmerHMM. The dashed line in the middle separates the positive-strand and negative-strand portions of the model. Each state in the GHMM is implemented as a separate submodel, such as a weight array matrix or an interpolated Markov model (IMM).

Figure 5

State-transition diagram of the GHMM-based gene finder GeneZilla. Green states were differentially included for the feature-state experiments. Reverse-strand states have been omitted for brevity. A, acceptor site; AATAAA, polyadenylation signal (including ATTAAA); ATG, start codon; b, branch point; CAP, cap site; CpG, CpG island; D, donor site; E, exon; I, intron; N, intergenic; sigP, signal peptide; TATA, TATA box; TAG, stop codon (including TAA and TGA); UTR, untranslated region.

Figure 6

Parsing sequence S into three non-overlapping intervals _t_0, _t_1 and _t_2 with the state assignments _q_1, _q_2 and _q_3, respectively. Position k marks an index in S. The dashed box highlights the evidence overlapping the first interval from position _b_0 to _e_0.

Figure 7

Training procedures for building JIGSAW prediction models. Feature vectors are collected from m examples and separated according to each of the six gene feature types. Decision trees are induced for each of the separated training sets, and their output is combined during the prediction procedure.
