E-Predict: a computational strategy for species identification based on observed DNA microarray hybridization patterns

Anatoly Urisman et al. Genome Biol. 2005.

Abstract

DNA microarrays may be used to identify microbial species present in environmental and clinical samples. However, automated tools for reliable species identification based on observed microarray hybridization patterns are lacking. We present an algorithm, E-Predict, for microarray-based species identification. E-Predict compares observed hybridization patterns with theoretical energy profiles representing different species. We demonstrate the application of the algorithm to viral detection in a set of clinical samples and discuss its relevance to other metagenomic applications.

Figures

Figure 1

E-Predict algorithm. (a) Nucleic acid from an environmental or clinical sample is labeled and hybridized to a species detection microarray. The resulting hybridization pattern is compared with a set of theoretical hybridization energy profiles computed for every species of interest. Energy profiles attaining statistically significant comparison scores suggest the presence of the corresponding species in the sample. (b) Observed hybridization intensities are represented by a row vector x, where each intensity value corresponds to an oligonucleotide on the microarray. Theoretical hybridization energy profiles form a matrix of energy values, Y, where each row represents a profile, and each column corresponds to an oligonucleotide in x. A suitable similarity metric function compares x with each row of Y to produce a column vector of similarity scores, s. Statistical significance of the individual scores in s is estimated to produce the output column vector of probabilities, P, where each probability value corresponds to a profile in Y.
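The comparison step in panel (b) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the uncentered Pearson correlation is used only as one example of a "suitable similarity metric", and the function names and toy numbers are hypothetical.

```python
import math

def uncentered_pearson(x, y):
    """Uncentered Pearson correlation: dot(x, y) / (|x| * |y|).

    Returns 0.0 when either vector has zero length (an assumption made
    here to avoid division by zero).
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def similarity_scores(x, Y, metric=uncentered_pearson):
    """Compare intensity vector x with every row (profile) of energy matrix Y.

    Returns the score vector s, one score per row of Y.
    """
    for row in Y:
        if len(row) != len(x):
            raise ValueError("each profile needs one value per oligonucleotide")
    return [metric(x, row) for row in Y]

# Toy data (hypothetical): four oligonucleotides, two species profiles.
x = [0.9, 0.1, 0.8, 0.0]          # observed intensities
Y = [
    [1.0, 0.0, 1.0, 0.0],         # profile 0
    [0.0, 1.0, 0.0, 1.0],         # profile 1
]
s = similarity_scores(x, Y)       # profile 0 scores far higher than profile 1
```

Turning s into the probability vector P needs a model of the scores expected when a species is absent, which this sketch does not attempt.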

Figure 2

Evaluation of normalization and similarity metric parameters. A training set of 32 microarrays was used to evaluate all nonequivalent combinations of intensity and energy vector normalization (N, none; Q, quadratic; S, sum; U, unit-vector) and similarity metric (DP, dot product; ED, similarity based on Euclidean distance; PC, Pearson correlation; SR, Spearman rank correlation; UP, uncentered Pearson correlation) parameters. For each combination of parameters, intrafamily and interfamily separations were calculated for each microarray as the score of the virus profile matching the virus present in the sample minus the score of the best scoring nonmatch profile from the same or a different virus family (top and bottom panels, respectively), normalized by the range of all scores on that microarray. Bars represent the mean, and error bars represent the standard deviation (±) of separation values from all microarrays. The best performing combinations are shown in order of increasing performance (calculated as the product of the intrafamily and interfamily separation means divided by the corresponding standard deviations).
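The separation and performance measures described in this caption can be written out as a short Python sketch. Several details are assumptions: the caption does not define the quadratic normalization, so only the sum and unit-vector variants appear; the performance formula is read here as (mean/SD of intrafamily separations) times (mean/SD of interfamily separations); and the sample standard deviation is used. All function names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def sum_normalize(v):
    """Scale a vector so its elements sum to 1 (unchanged if the sum is 0)."""
    total = sum(v)
    return [a / total for a in v] if total else list(v)

def unit_normalize(v):
    """Scale a vector to unit Euclidean length (unchanged if the norm is 0)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)

def separation(scores, match, nonmatches):
    """(match score - best nonmatch score) / range of all scores on one array.

    scores: dict mapping profile name -> similarity score for one microarray.
    nonmatches: names of the competing profiles (same family for the
    intrafamily separation, other families for the interfamily one).
    """
    score_range = max(scores.values()) - min(scores.values())
    if score_range == 0:
        return 0.0
    best_nonmatch = max(scores[name] for name in nonmatches)
    return (scores[match] - best_nonmatch) / score_range

def performance(intra, inter):
    """Assumed reading of the caption: product of the mean/SD ratios."""
    return (mean(intra) / stdev(intra)) * (mean(inter) / stdev(inter))

# Toy scores for one hypothetical RSV array.
scores = {"RSV": 0.9, "HMPV": 0.4, "FluA": 0.1}
intra = separation(scores, "RSV", ["HMPV"])   # about 0.625
inter = separation(scores, "RSV", ["FluA"])   # about 1.0
```

Dividing by the score range puts separations from different arrays on a common scale, so they can be averaged across the training set.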

Figure 3

Estimation of significance of individual similarity scores. Probabilities associated with the similarity scores of nine representative virus profiles obtained for the 15 HeLa, 10 respiratory syncytial virus (RSV), and seven influenza A virus (FluA) microarrays from the training dataset are shown in the top, center, and bottom panels, respectively. Each circle represents one microarray, and vertical 'jitter' is used to resolve individual circles. Probabilities for virus profiles from seven diverse virus families are included with each microarray set: herpes simplex virus (HSV)1; human T-lymphotropic virus (HTLV)1; severe acute respiratory syndrome coronavirus (SARS CoV); human rhinovirus B (HRV)B; FluA; human RSV; and human papillomavirus (HPV)18. Red circles represent match profiles and black circles represent nonmatch interfamily profiles. Two intrafamily nonmatch profiles are also included; these differ among the three microarray sets. The most closely related intrafamily profiles are represented by purple circles: HPV45, human metapneumovirus (HMPV), and influenza B virus (FluB). More distant intrafamily profiles are shown in blue: HPV37, mumps virus (MuV), and influenza C virus (FluC). The inset in each panel shows a normalized histogram (density) of the empirical distribution of log-transformed similarity scores for a match profile (black curve) and the corresponding normal fit representing true negative scores (green curve). Inset red bars depict observed log-transformed similarity scores corresponding to the match profile probabilities (red circles).
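The idea behind the insets, a normal fit to log-transformed true negative scores, can be illustrated with a small Python sketch. It assumes a set of known-negative scores for one profile is already available and fits the normal distribution by plain mean and sample standard deviation; how the fit was actually obtained is not described in this caption. The function name and numbers are hypothetical.

```python
import math
from statistics import mean, stdev

def score_probability(observed, null_scores):
    """Upper-tail probability of an observed score under a log-normal null.

    null_scores: similarity scores for the same profile from arrays where
    the virus is assumed to be absent (all must be positive).
    """
    if observed <= 0 or any(s <= 0 for s in null_scores):
        raise ValueError("scores must be positive to take logarithms")
    logs = [math.log(s) for s in null_scores]
    mu, sigma = mean(logs), stdev(logs)
    if sigma == 0:
        raise ValueError("null scores must not all be identical")
    z = (math.log(observed) - mu) / sigma
    # P(Z >= z) for a standard normal; erfc stays accurate far into the tail.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Toy null scores for one profile (hypothetical values).
null_scores = [0.010, 0.012, 0.015, 0.011, 0.013, 0.009, 0.014]
p_strong = score_probability(0.2, null_scores)     # far above the null: tiny p
p_typical = score_probability(0.012, null_scores)  # typical null score: large p
```

A small probability says only that the score is unlikely for an absent virus; choosing a cutoff is a separate decision.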

Figure 4

Human rhinovirus (HRV) serotype discrimination using E-Predict similarity scores. (a) Culture samples of 22 distinct HRV serotypes were separately hybridized to the microarray. E-Predict similarity scores were obtained for all virus profiles in the energy matrix and clustered using average linkage hierarchical clustering with Pearson correlation as the similarity metric. Only virus profiles for which similarity scores could be calculated in all 22 experiments were included in the clustering. Both microarrays (rows) and virus profiles (columns) were clustered. (b) Published nucleotide sequences of the VP1 capsid protein from the 22 HRV serotypes were aligned using ClustalX. A phylogenetic tree based on the resulting alignment is shown.
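Average linkage clustering with a Pearson-based distance, as named in panel (a), can be sketched in plain Python. This is an illustrative assumption-laden version: distance is taken as 1 minus the Pearson correlation, only one axis (microarrays) is clustered, and the array labels and scores are invented toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation; 0.0 if either vector has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def average_linkage(rows):
    """Agglomerative clustering of labelled score vectors.

    rows: dict mapping label -> vector of similarity scores.
    Returns the merges in order as (members_a, members_b, distance), where
    the distance between clusters is the mean pairwise (1 - Pearson).
    """
    labels = list(rows)
    dist = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            dist[frozenset((a, b))] = 1.0 - pearson(rows[a], rows[b])

    def cluster_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        d, i, j = min(
            (cluster_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy score vectors for four hypothetical arrays over four virus profiles.
rows = {
    "array_A1": [0.9, 0.8, 0.1, 0.2],
    "array_A2": [0.8, 0.9, 0.2, 0.1],
    "array_B1": [0.1, 0.2, 0.9, 0.8],
    "array_B2": [0.2, 0.1, 0.8, 0.9],
}
merges = average_linkage(rows)  # the A arrays pair up, as do the B arrays
```

Clustering the columns as well would amount to running the same routine on the transposed score table.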

