JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions
Jonathan E Allen et al. Genome Biol. 2006.
Abstract
Background: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near-perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene-finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures.
Results: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy.
Conclusion: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research.
Figures
Figure 1
Accuracy as a function of training set size. Percentage of correct exons (F score) is shown on the _y_-axis and training set size in thousands is shown on the _x_-axis. Data points (N = 121) are shown in blue; the best fit function of the form y = a/(1 + b·e^(-cx+d)) is shown in red; a = 69.01, b = 0.0152, c = 0.0012, d = 2.09. The curve is effectively flat for values of x above 6,000 (not shown). The curves for nucleotide and gene level accuracies and for the second test set are of very similar shape. F = 2 × Sn × Sp/(Sn + Sp).
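As a small illustration of the two formulas in the Figure 1 caption, the sketch below computes the F score from sensitivity and specificity and evaluates the fitted curve with the quoted parameters. It assumes the exponent of the fitted function covers the whole term -cx+d, which is one reading of the caption and not confirmed here; the function names are hypothetical and not taken from the paper.

```python
import math


def f_score(sn: float, sp: float) -> float:
    """F = 2 * Sn * Sp / (Sn + Sp), as defined in the caption."""
    if sn + sp == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2.0 * sn * sp / (sn + sp)


# Fitted parameters quoted in the Figure 1 caption.
A, B, C, D = 69.01, 0.0152, 0.0012, 2.09


def fitted_exon_f(x: float) -> float:
    """Assumed form y = a / (1 + b * exp(-c*x + d)).

    The grouping of the exponent is an assumption, not a confirmed detail.
    """
    return A / (1.0 + B * math.exp(-C * x + D))


if __name__ == "__main__":
    print(f_score(0.8, 0.6))
    for x in (0, 1000, 3000, 6000):
        print(x, round(fitted_exon_f(x), 2))
```

Under this reading the curve rises toward the asymptote a and changes very little beyond x = 6,000, which matches the caption's remark that it is effectively flat there.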
Figure 2
The computational gene finding pipeline UMIAGS (University of Maryland Integrative Analysis of Gene Structure). The raw genomic sequence is shown as an input at left; gene structure predictions are emitted at right. Additional evidence tracks for the combiner program JIGSAW are shown entering from the bottom. See text for details. GHMM, generalized hidden Markov model.
Figure 3
HMM for predicting isochore boundaries. States are shown as large circles, with transitions indicated by directed arrows. Transition probabilities are omitted for clarity. Within each outer state is a GHMM profile. States represent isochores, or discrete ranges of G+C density: I = (0-43%), II = (43-51%), III = (51-57%), and IV = (57-100%).
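The G+C ranges in the Figure 3 caption can be illustrated with a simple lookup. This is only a sketch of the binning, not the HMM the figure describes: it assigns a class from the G+C percentage of a single sequence window. It assumes each range includes its lower bound and excludes its upper bound (with 100% falling in class IV), since the caption does not say how boundary values are handled; the function names are hypothetical.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


def isochore_class(gc: float) -> str:
    """Map a G+C percentage to the isochore classes I to IV from the caption.

    Boundary handling (lower bound inclusive) is an assumption.
    """
    if not 0.0 <= gc <= 100.0:
        raise ValueError("G+C percentage must be between 0 and 100")
    if gc < 43.0:
        return "I"
    if gc < 51.0:
        return "II"
    if gc < 57.0:
        return "III"
    return "IV"


if __name__ == "__main__":
    window = "ATGCGCGCTTAAGGCCATAT"
    pct = gc_percent(window)
    print(pct, isochore_class(pct))
```

The model in the figure predicts isochore boundaries along a sequence, so a per-window threshold like this would at most serve as a rough baseline for comparison.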
Figure 4
State-transition diagram of the GHMM for GlimmerHMM. The dashed line in the middle separates the positive strand and negative strand portions of the model. Each state in the GHMM is implemented as a separate submodel, such as a weight array matrix or an interpolated Markov model (IMM).
Figure 5
State-transition diagram of the GHMM-based gene finder GeneZilla. Green states were differentially included for the feature-state experiments. Reverse-strand states have been omitted for brevity. A, acceptor site; AATAAA, polyadenylation signal (including ATTAAA); ATG, start codon; b, branch point; CAP, cap site; CpG, CpG island; D, donor site; E, exon; I, intron; N, intergenic; sigP, signal peptide; TATA, TATA box; TAG, stop codon (including TAA and TGA); UTR, untranslated region.
Figure 6
Parsing sequence S into three non-overlapping intervals t_0, t_1 and t_2 with the state assignments q_1, q_2 and q_3, respectively. Position k marks an index in S. The dashed box highlights the evidence overlapping the first interval from position b_0 to e_0.
Figure 7
Training procedures for building JIGSAW prediction models. Feature vectors are collected from m examples and separated according to each of the six gene feature types. Decision trees are induced for each of the separated training sets, and their output is combined during the prediction procedure.
Grants and funding
- R01 LM006845-07/LM/NLM NIH HHS/United States
- R01 LM007938/LM/NLM NIH HHS/United States
- R01-LM007938/LM/NLM NIH HHS/United States
- R01 LM006845/LM/NLM NIH HHS/United States
- R01 LM006845-07S1/LM/NLM NIH HHS/United States
- R01-LM06845/LM/NLM NIH HHS/United States