FragGeneScan: predicting genes in short and error-prone reads

Mina Rho et al. Nucleic Acids Res. 2010 Nov.

Abstract

The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and those recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method, FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding regions in short reads. The performance of FragGeneScan was comparable to that of Glimmer and MetaGene for complete genomes. For short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence databases.

Figures

Figure 1.

The HMM of FragGeneScan with seven super-states. The super-states are denoted as seven shaded boxes representing gene regions (i), start codons (ii) and stop codons (iii) for both the forward (i–iii) and backward (v–vii) strands, and non-coding regions (iv). The states for gene regions (i and vii) consist of six consecutive match states represented by diamonds, insertion states by triangles and deletion states by squares, which collectively correspond to a six-periodic inhomogeneous HMM.
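
The caption describes the model's state layout but not how a read is decoded. As a minimal sketch of how an HMM can label each position of a read, the Python below runs a standard log-space Viterbi decoder over a toy two-state model. The two states, all probabilities, and the example read are hypothetical and chosen only for illustration. The seven super-states and the match, insertion and deletion states of the figure are not reproduced, and this excerpt does not state which decoding algorithm FragGeneScan itself uses.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for `obs` under a first-order HMM."""
    if not obs:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: "coding" favours G/C, "noncoding" favours A/T.
states = ["coding", "noncoding"]
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
    "noncoding": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
}

read = "AAAAGGGGGGAAAA"  # hypothetical read
path = viterbi(read, states, start_p, trans_p, emit_p)
print("".join("C" if s == "coding" else "N" for s in path))  # prints NNNNCCCCCCNNNN
```

Working in log space avoids numerical underflow on longer reads. A model like the one in the figure would replace the two toy states with the full set of super-states and their internal states, which is where frameshift-causing insertions and deletions could be absorbed.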

Figure 2.

Gene prediction performance in simulated reads of 100 bases without sequencing error (a) and (b), and with 1% sequencing error (c) and (d). The _x_-axis denotes the source genomes from which the short reads were simulated: 1. E. coli; 2. H. pylori; 3. B. subtilis; 4. B. aphidicola; 5. C. tepidum; 6. B. pseudomallei; 7. W. endosymbiont; 8. C. jeikeium; 9. P. marinus. The _y_-axis denotes sensitivity in (a) and (c), and specificity in (b) and (d).

Figure 3.

Gene prediction performance in simulated reads of 400 bases without sequencing error (a) and (b), and with 1% sequencing error (c) and (d). The _x_-axis denotes the source genomes from which the short reads were simulated: 1. E. coli; 2. H. pylori; 3. B. subtilis; 4. B. aphidicola; 5. C. tepidum; 6. B. pseudomallei; 7. W. endosymbiont; 8. C. jeikeium; 9. P. marinus. The _y_-axis denotes sensitivity in (a) and (c), and specificity in (b) and (d).
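
The captions for Figures 2 and 3 report sensitivity and specificity without defining them in this excerpt. The sketch below assumes the convention common in gene-finding work, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. what is elsewhere called precision. Whether the paper uses exactly these definitions is an assumption, and the counts in the example are hypothetical.

```python
def sensitivity_specificity(tp, fp, fn):
    """Compute (sensitivity, specificity) from prediction counts.

    Assumed definitions (gene-finding convention, not confirmed by the excerpt):
      sensitivity = TP / (TP + FN)  -> fraction of true genes that were predicted
      specificity = TP / (TP + FP)  -> fraction of predictions that are true genes
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("sensitivity or specificity is undefined for these counts")
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical counts for one simulated read set.
sens, spec = sensitivity_specificity(tp=80, fp=20, fn=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under these assumed definitions, a predictor can raise sensitivity simply by predicting more genes, which is why the two measures are reported side by side.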

Figure 4.

Examples of fragmented genes that contain frameshift sequencing errors: a gene predicted from a read simulated from the E. coli genome starting at position 4 578 113 (a), and a gene predicted from a metagenomic read from the TS28 dataset (b). The alignments of the nucleotide sequences are partially shown for clarity. The dotted lines connect the regions of nucleotides (with sequencing errors fixed) and the amino acid(s) that they encode. The alignment between the predicted protein from the metagenomic read and its homolog identified in the IMG protein database is also shown.
