Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier

R Sandberg et al. Genome Res. 2001 Aug.

Abstract

Bacterial genomes have diverged during evolution, resulting in clear-cut differences in their nucleotide composition, such as their GC content. The analysis of complete sequences of bacterial genomes also reveals the presence of nonrandom sequence variation, manifest in the frequency profile of specific short oligonucleotides. These frequency profiles constitute highly specific genomic signatures. Based on these differences in oligonucleotide frequency between bacterial genomes, we investigated the possibility of predicting the genome of origin for a specific genomic sequence. To this end, we developed a naïve Bayesian classifier and systematically analyzed 28 eubacterial and archaeal genomes. We found that sequences as short as 400 bases could be correctly classified with an accuracy of 85%. We then applied the classifier to the identification of horizontal gene transfer events in whole-genome sequences and demonstrated the validity of our approach by correctly predicting the transfer of both the superoxide dismutase (sodC) and the bioC gene from Haemophilus influenzae to Neisseria meningitidis, correctly identifying both the donor and recipient species. We believe that this classification methodology could be a valuable tool in biodiversity studies.

Figures

Figure 1

Outline of the Bayesian classifier. (A) For a given motif length (in this figure, two base pairs), the occurrence of all overlapping motifs for each genome is recorded in the motif occurrence profile (B). The motif occurrence profile for each genome is then transformed to a motif frequency profile (C) by dividing each motif occurrence by the total number of motifs in that genome. (D) A sequence, S, of arbitrary length and consisting of j overlapping motifs is taken at random from any of the genomes. (E) The probability of obtaining the motif distribution present in sequence S is calculated separately for each genome and motif. For example, the probability of obtaining motif i in E. coli, P(Mi:Ge), is estimated by the frequency of that motif in the E. coli genome, calculated in (C). The probability of obtaining the motif distribution present in sequence S is then estimated as the product of the individual probabilities of obtaining each motif (E). The classifier predicts the most probable genomic origin (F), that is, the genome with the highest probability P(S:G).
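The steps (A) through (F) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy genomes, the add-one pseudocount for unseen motifs, and the use of log probabilities (to avoid numerical underflow when many small frequencies are multiplied) are all choices made here and are not taken from the paper.

```python
import math
from collections import Counter

def motif_counts(seq, k):
    """(A, B) Count all overlapping motifs of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def motif_log_freqs(genome, k, pseudocount=1.0):
    """(C) Log motif frequency profile for one genome.

    The pseudocount is an assumption of this sketch: it keeps motifs
    that never occur in the genome from getting probability zero.
    """
    counts = motif_counts(genome, k)
    total = sum(counts.values()) + pseudocount * 4 ** k
    profile = {m: math.log((c + pseudocount) / total) for m, c in counts.items()}
    unseen = math.log(pseudocount / total)
    return profile, unseen

def classify(seq, profiles, k):
    """(D to F) Return the genome with the highest log P(S:G).

    Summing log frequencies is equivalent to multiplying the
    individual motif probabilities.
    """
    best_name, best_score = None, -math.inf
    for name, (profile, unseen) in profiles.items():
        score = sum(profile.get(seq[i:i + k], unseen)
                    for i in range(len(seq) - k + 1))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical toy "genomes" with very different composition.
genomes = {
    "at_rich": "ATATTTAATATAATTA" * 20,
    "gc_rich": "GCGGCCGCGCCGGGCC" * 20,
}
profiles = {name: motif_log_freqs(g, 2) for name, g in genomes.items()}
print(classify("TTATAATA", profiles, 2))
```

With real genomes the same structure applies; only the motif length and the sequences change.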

Figure 2

Visualizing the genomic signature concept. Principal components analysis (PCA) was performed on the motif frequencies of 25 genomic sequences from each eubacterial and archaeal genome. The sequences are mapped into a three-dimensional PCA-space spanned by three principal components (here components 4, 5, and 6). Each sequence was randomly chosen and had a length of 1000 bp. Closely related microorganisms cluster together in PCA-space, shown here for Pyrococcus strains pabyssi and horikoshii and for Helicobacter pylori strains 26695 and J99. For clarity, arrows indicate each distinct genome cluster when similar colors were used to plot the sequences from different eubacterial and archaeal genomes. The figure was plotted using Spotfire (Spotfire Inc.).

Figure 3

Dependence of classification accuracy upon motif and sequence length. Motif lengths ranging from one to nine base pairs were evaluated for classification accuracy. A default (“random guess”) classifier is also shown. For each motif length, six different sequence lengths were tested (35, 60, 100, 200, 400, and 1000 bp). The classification accuracy in percent is represented on the _y_-axis as the arithmetic mean over the independent genome results. One hundred sequences were randomly picked from each genome for each sequence length, and the classification accuracy was calculated as the number of correct predictions divided by the total number of predictions for each genome and test run. On the _x_-axis, the different sequence lengths are shown. Training was performed on 90% of the genome sequences (“training set”), and the remaining 10% (“test set”) was used to evaluate classification accuracy.
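The evaluation procedure in this caption can be sketched as follows. The sketch is illustrative only: the caption does not say how the 10% test set was selected, so holding out the contiguous tail of each genome is an assumption, as are the helper names, the add-one smoothing, and the randomly generated toy genomes.

```python
import math
import random
from collections import Counter

def train(genome, k):
    """Log motif frequencies with add-one smoothing (an assumption)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values()) + 4 ** k
    profile = {m: math.log((c + 1) / total) for m, c in counts.items()}
    return profile, math.log(1 / total)

def classify(seq, models, k):
    """Pick the genome whose motif profile gives seq the highest score."""
    def score(model):
        profile, unseen = model
        return sum(profile.get(seq[i:i + k], unseen)
                   for i in range(len(seq) - k + 1))
    return max(models, key=lambda name: score(models[name]))

def evaluate(genomes, k, seq_len, n_samples, rng, holdout=0.1):
    """Train on the first (1 - holdout) of each genome, test on the rest.

    Returns the arithmetic mean of the per-genome accuracies, where each
    accuracy is correct predictions divided by total predictions.
    """
    models, test_regions = {}, {}
    for name, genome in genomes.items():
        cut = int(len(genome) * (1 - holdout))
        models[name] = train(genome[:cut], k)
        test_regions[name] = genome[cut:]
    per_genome = []
    for name, region in test_regions.items():
        correct = 0
        for _ in range(n_samples):
            start = rng.randrange(len(region) - seq_len + 1)
            if classify(region[start:start + seq_len], models, k) == name:
                correct += 1
        per_genome.append(correct / n_samples)
    return sum(per_genome) / len(per_genome)

# Hypothetical toy genomes with skewed base composition (A, C, G, T weights).
rng = random.Random(0)
genomes = {
    "at_rich": "".join(rng.choices("ACGT", weights=[4, 1, 1, 4], k=5000)),
    "gc_rich": "".join(rng.choices("ACGT", weights=[1, 4, 4, 1], k=5000)),
}
accuracy = evaluate(genomes, k=2, seq_len=100, n_samples=100, rng=rng)
print(f"mean accuracy: {accuracy:.2%}")
```

Repeating the call over several values of `k` and `seq_len` would reproduce the shape of the experiment, though toy genomes this dissimilar are far easier to separate than real ones.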

Figure 4

Lack-of-knowledge experiments. The percentage of each genome excluded from the training phase was systematically increased and the classification accuracy was monitored. The percentage of genome excluded when training the classifier ranged from 5% to 90%. The classification accuracy in percent is represented as the arithmetic mean over all genomes and sampled sequences and is plotted on the _y_-axis. For each genome, we sampled 100 random sequences for each sequence length, resulting in 2500 predictions for each plotted value. Different sequence lengths (35, 60, 100, 200, 400, and 1000 bp) are plotted on the _x_-axis. Classification was based on nine-nucleotide motifs.

Figure 5

Classification of closely related microorganisms. Classification accuracy is shown for different strains of the same species. The classification accuracy in percent is represented on the _y_-axis as the mean ratio of correct predictions to total predictions for each genome and test run. The _x_-axis represents the different sequence lengths (35, 60, 100, 200, 400, and 1000) in base pairs. We sampled 100 genomic sequences for each genome and sequence length.

Figure 6

Identification of putative horizontal gene transfer events. Identification of horizontal gene transfer events in the Neisseria meningitidis genome, exemplified by (A) SodC (putative horizontally transferred gene [Kroll et al. 1998]) and (B) a conserved hypothetical protein. Nucleotide coordinates indicate the positions of the genes in the genome. Results are of “sliding window” classification using 500-bp windows with 250-bp overlap. Black, unfilled windows were classified as “being of N. meningitidis origin,” and solid black windows as “being of H. influenzae origin.” Haemophilus Uptake Sequences (HmUS) positions are indicated by small arrows. The numbers of perfect matches to the 29-nucleotide consensus are shown in parentheses. BLASTX results show similar positions in H. influenzae.
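The “sliding window” classification described in this caption can be sketched as below, using 500-bp windows that advance by 250 bp so that consecutive windows overlap by 250 bp. Everything beyond those two numbers is an assumption of this sketch: the helper names, the motif length, the add-one smoothing, and the artificial "host" and "donor" sequences, which merely stand in for a recipient and a donor genome.

```python
import math
from collections import Counter

def train(genome, k):
    """Log motif frequencies with add-one smoothing (an assumption)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values()) + 4 ** k
    profile = {m: math.log((c + 1) / total) for m, c in counts.items()}
    return profile, math.log(1 / total)

def classify(seq, models, k):
    """Pick the genome whose motif profile gives seq the highest score."""
    def score(model):
        profile, unseen = model
        return sum(profile.get(seq[i:i + k], unseen)
                   for i in range(len(seq) - k + 1))
    return max(models, key=lambda name: score(models[name]))

def sliding_windows(length, window=500, overlap=250):
    """Yield (start, end) coordinates of overlapping windows."""
    step = window - overlap
    for start in range(0, length - window + 1, step):
        yield start, start + window

def scan(seq, models, k, window=500, overlap=250):
    """Classify every window and return (start, end, predicted origin)."""
    return [(s, e, classify(seq[s:e], models, k))
            for s, e in sliding_windows(len(seq), window, overlap)]

# Hypothetical toy data: a 500-bp "donor" insert inside "host" sequence.
host_genome = "ATTAAATATTTAATAT" * 200
donor_genome = "GCCGGGCGCCCGGCGC" * 200
models = {"host": train(host_genome, 2), "donor": train(donor_genome, 2)}
query = host_genome[:1000] + donor_genome[:500] + host_genome[:1000]

results = scan(query, models, 2)
for start, end, origin in results:
    print(start, end, origin)
```

Windows that straddle the boundary between host and donor sequence contain a mixture of both signatures, so their labels are less reliable than those of windows lying entirely within one region.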

References

    1. Deich RA, Smith HO. Mechanism of homospecific DNA uptake in Haemophilus influenzae transformation. Mol Gen Genet. 1980;177:369–374.
    2. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: Characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999;16:1391–1399.
    3. Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learn. 1997;29:103–130.
    4. Doolittle WF. Phylogenetic classification and the universal tree. Science. 1999;284:2124–2129.
    5. Durbin R, Eddy S, Krogh A, Mitchison G. Biological sequence analysis. Cambridge: Cambridge University Press; 1998.