Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers

Comparative Study

Zongzhi Liu et al. Nucleic Acids Res. 2008 Oct.

Abstract

The recent introduction of massively parallel pyrosequencers allows rapid, inexpensive analysis of microbial community composition using 16S ribosomal RNA (rRNA) sequences. However, a major challenge is to design a workflow that assigns taxonomic information to each read accurately and rapidly, so that the composition of each community can be linked back to the likely ecological roles played by members of each species, genus, family or phylum. Here, we use three large 16S rRNA datasets to test whether taxonomic information based on the full-length sequences can be recaptured by short reads that simulate the pyrosequencer outputs. We find that different taxonomic assignment methods vary radically in their ability to recapture the taxonomic information in full-length 16S rRNA sequences: most methods are sensitive to the region of the 16S rRNA gene that is targeted for sequencing, but many combinations of methods and rRNA regions produce consistent and accurate results. To process large datasets of partial 16S rRNA sequences obtained from surveys of various microbial communities, including those from human body habitats, we recommend the use of Greengenes or RDP classifier with fragments of at least 250 bases, starting from one of the primers R357, R534, R798, F343 or F517.
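
The idea of simulating pyrosequencer reads from full-length sequences can be illustrated with a minimal sketch. It assumes that a primer name such as `F343` or `R357` encodes a read direction (F or R) followed by a position, and that the position can be used directly as a 1-based coordinate on the sequence at hand; in practice the positions presumably refer to a reference numbering and would have to be mapped through an alignment. The function names `parse_primer` and `clip_read` are hypothetical and are not taken from the paper.

```python
def parse_primer(name):
    """Split a primer name such as 'R357' into (direction, position).

    Assumption: the first letter is the direction and the rest is an integer.
    """
    direction, position = name[0].upper(), int(name[1:])
    if direction not in ("F", "R"):
        raise ValueError(f"unknown primer direction in {name!r}")
    return direction, position


def clip_read(seq, primer, length):
    """Return the region of `seq` covered by a simulated read, or None.

    A forward primer yields `length` bases starting at its position; a
    reverse primer yields `length` bases ending at its position. The region
    is returned on the forward strand (no reverse complement is taken).
    None is returned when the read would extend past either end of `seq`.
    """
    direction, position = parse_primer(primer)
    if direction == "F":
        start, end = position - 1, position - 1 + length
    else:
        start, end = position - length, position
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]
```

For example, `clip_read(full_length_seq, "F343", 250)` would return the 250 bases beginning at coordinate 343 under these assumptions, and `None` for any sequence too short to contain that fragment.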

Figures

Figure 1.

Overview of different methods for taxonomy assignment (see text for details).

Figure 2.

‘Leave-one-out’ evaluations of full-length sequences from Bergey's Manual. The _x_-axis shows recovery (i.e. fraction of sequences given their correct assignment). The _y_-axis shows coverage (i.e. fraction of sequences for which an assignment could be made using each method). Each line represents the assignments of a chosen method at different ranks. Each colored point represents a rank (blue to red correspond to levels from domain to genus). Gray arrows indicate the effect of including or excluding sequences that are the sole representative of their genera. (a) BLAST methods. See the text for ‘nearest neighbors’, ‘more neighbors’, ‘common lineage’ and ‘major lineage’. (b) Tree-based methods followed by Fitch parsimony assignment. ‘NAST’ and ‘NAST_Kimura’ are phylogenetic tree-based methods that build the relaxed NJ tree from NAST alignments. With ‘NAST_Kimura’, a Kimura adjustment was applied to the distance matrix before tree building. ‘3-mer’ and ‘5-mer’ are multimer clustering tree-based methods that build the relaxed NJ tree from a Bray–Curtis distance matrix obtained from the multimer (3-mer or 5-mer) count matrix.
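
As a rough illustration of the multimer step named in this caption, the sketch below counts overlapping k-mers in each sequence and computes a pairwise Bray–Curtis dissimilarity matrix from those counts. It covers only the distance matrix; the relaxed NJ tree building is not shown. The function names and the choice to count overlapping k-mers are assumptions, not details confirmed by the source.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers (e.g. k=3 or k=5) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two k-mer count vectors.

    Defined as sum(|a_i - b_i|) / sum(a_i + b_i); 0 means identical
    counts, 1 means no shared k-mers.
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    diff = sum(abs(a[key] - b[key]) for key in set(a) | set(b))
    return diff / total


def distance_matrix(seqs, k):
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(counts)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = bray_curtis(counts[i], counts[j])
    return matrix
```

Because the counts do not depend on an alignment, a matrix like this can be computed directly from unaligned reads, which is the practical appeal of a multimer-based approach.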

Figure 3.

‘Leave-one-out’ recoveries at the genus level for clipped sequences from Bergey's Manual. The _x_-axis shows the primer and the length of the read. The _y_-axis shows recovery (a) or coverage (b) for each method. Recovery and coverage are defined as in Figure 2. Each method is represented as a line. ‘BLAST’, BLAST method using ‘common lineage for more neighbors’; ‘NAST-Fitch’, ‘NAST-FitchAndBack’, and ‘NAST-LCA’: these phylogenetic tree-based methods build trees from NAST alignments with Kimura correction, followed by Fitch parsimony, Fitch parsimony with back-propagation, and the last common ancestor algorithm, respectively. ‘5-mer-Fitch’: multimer clustering tree-based method using the Fitch parsimony algorithm (the same as ‘5-mer’ in Figure 2).
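
The notion of a common lineage or last common ancestor can be sketched as the deepest taxonomic prefix shared by a set of reference lineages. This is a simplified illustration that works on plain lineage lists (domain first); the methods in the caption operate on BLAST neighbors or on a phylogenetic tree, and how they select the sequences to compare is not reproduced here. The function name `common_lineage` is hypothetical.

```python
def common_lineage(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages.

    Each lineage is a list of taxon names ordered from domain downward.
    An empty input, or lineages that differ at the first rank, give [].
    """
    if not lineages:
        return []
    shared = []
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            shared.append(taxa_at_rank[0])
        else:
            break
    return shared
```

A query whose neighbors agree down to the family but not the genus would, under this scheme, receive an assignment at family level and count as unassigned at genus level.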

Figure 4.

Recoveries and coverage at the genus level (a and b) and phylum level (c and d) for each of the three datasets: the Guerrero Negro microbial mat, the mouse gut and the human gut. The legend for the series in the first panel applies to all panels. Each line represents the performance (recovery or coverage) of one method on one dataset. The _x_-axis represents primer name and sequence lengths. Apart from the coverage of ‘ORI_seqs’, which is the fraction of full-length sequences with an assignment at a certain rank, recovery and coverage are measured relative to the results of the full-length sequence. Missing data points are for reads that extend past the length of the near full-length amplicons used for this study. Recovery and coverage are defined as in Figure 2.
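
A minimal sketch of how recovery and coverage might be computed when the full-length result serves as the reference is given below. Assignments are dictionaries from sequence ID to a taxon name at one rank, with `None` marking a sequence without an assignment. The choice of denominator for recovery (only sequences whose full-length version received an assignment) is an assumption made for illustration; the source does not spell it out on this page.

```python
def coverage(assignments):
    """Fraction of sequences for which any assignment was made."""
    if not assignments:
        return 0.0
    assigned = sum(1 for taxon in assignments.values() if taxon is not None)
    return assigned / len(assignments)


def recovery(read_assignments, reference_assignments):
    """Fraction of reference-assigned sequences whose read agrees.

    Assumption: sequences with no reference assignment are excluded
    from the denominator.
    """
    reference = {seq_id: taxon
                 for seq_id, taxon in reference_assignments.items()
                 if taxon is not None}
    if not reference:
        return 0.0
    matches = sum(1 for seq_id, taxon in reference.items()
                  if read_assignments.get(seq_id) == taxon)
    return matches / len(reference)
```

Evaluating both functions once per rank, primer and read length would yield the kind of series plotted in the panels described above.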

Figure 5.

Compositions at the phylum level for each of the three datasets: (a) Guerrero Negro mat, (b) human gut and (c) mouse gut, using a range of different methods (separate subpanels within each group). The _x_-axis of each graph shows the region sequenced. The _y_-axis shows abundance as a fraction of the total number of sequences in the community. The legend shows colors for phyla (consistent across graphs).

Figure 6.

Comparison of recoveries and coverage using ARB and either the group name or Fitch parsimony criteria for grouping sequences. The _x_-axis of each graph shows the region of the gene encompassed by the sequence (all 100-base clipped sequences). The _y_-axis plots either coverage or recovery, defined as in Figure 2. Results are shown for (a) family, (b) class and (c) phylum. (d) Compositions at the phylum level obtained using the Group Name method for the combined dataset (i.e. Guerrero Negro mat, mouse gut and human gut).

Figure 7.

Run time performance of the different methods as a function of the number and length of sequences. The _x_-axis plots the number of sequences, the _y_-axis time (in seconds). The legend shows colors for length of sequence. The error bars represent SDs from 10 replicates.
