Self-identification of protein-coding regions in microbial genomes

S Audic et al. Proc Natl Acad Sci U S A. 1998.

Abstract

A new method for predicting protein-coding regions in microbial genomic DNA sequences is presented. It uses an ab initio iterative Markov modeling procedure to automatically partition genomic sequences into three subsets, shown to correspond to coding, coding on the opposite strand, and noncoding segments. In contrast to current methods, such as GENEMARK [Borodovsky, M. & McIninch, J. D. (1993) Comput. Chem. 17, 123-133], no training set or prior knowledge of the statistical properties of the studied genome is required. This new method tolerates error rates of 1-2% and can process unassembled sequences. It is thus ideal for the analysis of genome survey and/or fragmented sequence data from uncharacterized microorganisms. The method was validated on 10 complete bacterial genomes (from four major phylogenetic lineages). The results show that protein-coding regions can be identified with an accuracy of up to 90% with a totally automated and objective procedure.
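
The abstract does not spell out the iteration itself, so the following is only a minimal sketch of one way such a self-training partition could work, not the authors' implementation. It assumes non-overlapping fixed-size windows, a random initial assignment to three classes, homogeneous order-k Markov models with add-one smoothing, and reassignment of each window to the model that scores it highest. The names `train`, `score`, and `partition`, the default parameters, and the stopping rule are all hypothetical.

```python
import math
import random
from collections import defaultdict

ALPHABET = "ACGT"


def train(windows, order):
    """Estimate order-k transition log-probabilities from a set of windows.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    contexts from producing log(0).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for w in windows:
        for i in range(order, len(w)):
            counts[w[i - order:i]][w[i]] += 1

    def logp(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + len(ALPHABET)
        return math.log((ctx.get(base, 0) + 1) / total)

    return logp


def score(window, logp, order):
    """Log-likelihood of a window under one Markov model."""
    return sum(logp(window[i - order:i], window[i])
               for i in range(order, len(window)))


def partition(sequence, window=100, order=2, n_classes=3, max_iter=50, seed=0):
    """Iteratively split a sequence into n_classes self-consistent subsets."""
    rng = random.Random(seed)
    windows = [sequence[i:i + window]
               for i in range(0, len(sequence) - window + 1, window)]
    # Hypothetical starting point: a random assignment of windows to classes.
    labels = [rng.randrange(n_classes) for _ in windows]
    for _ in range(max_iter):
        # Retrain one model per class from the current assignment.
        models = [train([w for w, l in zip(windows, labels) if l == c], order)
                  for c in range(n_classes)]
        # Reassign every window to its best-scoring model.
        new_labels = [max(range(n_classes),
                          key=lambda c: score(w, models[c], order))
                      for w in windows]
        if new_labels == labels:  # assignment is stable: stop
            break
        labels = new_labels
    return windows, labels
```

The sketch returns only anonymous class labels (0, 1, 2). Deciding which class corresponds to coding, reverse-coding, or noncoding sequence is a separate step that it does not attempt.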

Figures

Figure 1

Convergence of the iterative homogeneous Markov modeling. The numbers of nucleotides correctly assigned as “coding” or “reverse coding” are plotted to follow the convergence of the iterative procedure. (A) Influence of the Markov chain order. (B) Influence of the window size. (C) Influence of the simulated error rate. (D) Specificity of the recognition of coding (+) and reverse coding (o) segments for 10 genomes of different G+C content. Mj, M. jannaschii; Mg, M. genitalium; Mp, M. pneumoniae; Hi, H. influenzae; Hp, H. pylori; Bs, B. subtilis; Mt, M. thermoautotrophicum; Syn, Synechocystis sp.; Af, A. fulgidus; Ec, E. coli. The discrepancies between the recognition of coding and reverse-coding regions in the Mg and Mp genomes indicate an actual strand asymmetry.
