Evaluation of 16S rDNA-based community profiling for human microbiome research

Jumpstart Consortium Human Microbiome Project Data Generation Working Group. PLoS One. 2012.

Abstract

The Human Microbiome Project will establish a reference data set for analysis of the microbiome of healthy adults by surveying multiple body sites from 300 people and generating data from over 12,000 samples. To characterize these samples, the participating sequencing centers evaluated and adopted 16S rDNA community profiling protocols for ABI 3730 and 454 FLX Titanium sequencing. In the course of establishing protocols, we examined the performance and error characteristics of each technology, and the relationship of sequence error to the utility of 16S rDNA regions for classification- and OTU-based analysis of community structure. The data production protocols used for this work are those used by the participating centers to produce 16S rDNA sequence for the Human Microbiome Project. Thus, these results can be informative for interpreting the large body of clinical 16S rDNA data produced for this project.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of amplicons and reads generated for both 3730 and 454 sequencing.

On a schematic representation of the 16S rDNA gene, the known variable regions and the primers used in this study are indicated. Positions and numbering are based on the Escherichia coli reference sequence. The amplicons generated by each primer set are marked in red, and sequencing directions and expected lengths are indicated in orange for 3730 and green for 454.

Figure 2. Minor differences in classifiability as measured by the RDP Classifier and a BLASTn-based approach.

The left panel shows classification based on BLASTn against reference sequences of the MC members. A sequence is classified if it has >95% global sequence identity with one of the reference sequences and >90% of the read is contained in the alignable region. Results are shown as a heatmap depicting the frequency values, using a binary logarithm scale. The middle heatmap illustrates frequency values of taxa identified using the RDP classification tool, applying an 80% confidence cutoff. The right panel shows the difference between RDP- and BLASTn-based classification, with a heatmap representing the ratio of observed genus-level frequency (RDP) over expected genus-level frequency (BLASTn) for each of the MC members, using a binary logarithm scale.
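The BLASTn classification rule in this caption can be sketched as a simple two-threshold filter. This is only an illustration, not the authors' pipeline: the hit records, their field names, and the tie-breaking by highest identity are all assumptions made for the example.

```python
from typing import Optional

def classify_read(hits: list[dict]) -> Optional[str]:
    """Return the reference name of the best qualifying hit, or None.

    Each hit is a hypothetical dict with keys:
      'reference'      - name of the MC reference sequence
      'identity_pct'   - global sequence identity in percent
      'read_fraction'  - fraction of the read inside the alignable region
    """
    # Caption thresholds: >95% identity and >90% of the read aligned.
    qualifying = [
        h for h in hits
        if h["identity_pct"] > 95.0 and h["read_fraction"] > 0.90
    ]
    if not qualifying:
        return None  # the read stays unclassified
    # Assumption: when several references qualify, keep the most similar one.
    best = max(qualifying, key=lambda h: h["identity_pct"])
    return best["reference"]
```

A read whose best hit fails either threshold is treated as unclassified, which is how the two conditions in the caption combine.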

Figure 3. Mock community-based accuracy of community representation compared across technology and 16S window.

The MC was sequenced by different centers on both 3730 and 454 platforms. Each sequencing trial is represented as a column. For 3730 sequencing of the V1–V9 window, amplicons derived from a common amplification protocol were sequenced with short capillaries (1), long capillaries (2), and three reads per clone (3). 454 sequencing was performed by four centers (A, B, C, and D) with three 16S windows (V1–V3, V3–V5, and V6–V9). (A) The observed genus-level frequency data over expected genus-level frequency ratio for each of the MC members is shown as a heatmap using a binary logarithm scale. The expected frequency ratio is based on the whole genome coverages inferred from mapped Illumina WGS reads to the MC reference genome sequences. Genera with observed frequencies differing more than four-fold from expected are marked with + or – for over- or under-representation, respectively. (B) The fraction of misclassified (0.1% of the total combined data set) and unclassified (4.6% of the total combined data set) sequences displayed as a frequency heatmap. The frequency values are depicted as a binary logarithm scale.
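The heatmap values and the +/– markers in panel A follow from a binary logarithm of the observed-to-expected ratio. A minimal sketch is given below; the function names and the handling of a zero observed frequency are assumptions for illustration, not taken from the paper.

```python
import math

def log2_ratio(observed: float, expected: float) -> float:
    """Binary logarithm of observed over expected frequency."""
    if expected <= 0:
        raise ValueError("expected frequency must be positive")
    if observed == 0:
        # Assumption: a genus that was never observed maps to -infinity.
        return float("-inf")
    return math.log2(observed / expected)

def representation_flag(observed: float, expected: float, fold: float = 4.0) -> str:
    """Return '+' or '-' when the ratio differs more than `fold` from expected."""
    ratio = log2_ratio(observed, expected)
    threshold = math.log2(fold)  # four-fold corresponds to 2 on the log2 scale
    if ratio > threshold:
        return "+"
    if ratio < -threshold:
        return "-"
    return ""
```

On the log2 scale, the four-fold cutoff is symmetric: values above 2 are over-represented and values below -2 are under-represented.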

Figure 4. Deviation from expected in the 16S based Mock Community member representation can partially be explained by primer mismatch, not by %GC differences.

The 20 bacterial organisms of the Mock Community are represented by their corresponding genus (n = 18) along the bottom of the figure and across the four panels (DNA from Candida albicans was included in this mock community but is not shown here). (A) The distribution of reads over the 18 genera: the expected frequencies (grey) in the community, determined by whole genome shotgun (WGS) sequencing and classified by mapping to reference genomes using BWA, and the observed frequencies determined by 454 reads (red) or 3730 sequences (blue), classified by BLASTn. Error bars indicate standard error from technical replicates. (B) Deviations from expected frequencies, calculated by subtracting the expected % from the observed %. (C) The average %GC is shown for each organism's 16S genes and for its whole genome. (D) The lowest percent mismatch between the primers used in the production protocols (Protocols S1 and S2) and any 16S gene copy is shown for each organism; primers are grouped by sequencing technology and 16S window.
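The "lowest percent mismatch" in the last panel can be illustrated with a small sketch. It assumes that each 16S gene copy's primer-binding site has already been extracted and is the same length as the primer (no gaps), and that degenerate primer bases are matched using IUPAC codes. The sequences in the test are made up for the example and are not the primers from Protocols S1 and S2.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def percent_mismatch(primer: str, site: str) -> float:
    """Percent of primer positions not matched by the binding site."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have equal length")
    mismatches = sum(
        1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p]
    )
    return 100.0 * mismatches / len(primer)

def lowest_percent_mismatch(primer: str, sites: list[str]) -> float:
    """Lowest mismatch over all 16S gene copies of one organism."""
    return min(percent_mismatch(primer, s) for s in sites)
```

Taking the minimum over gene copies reflects the caption's wording: an organism counts as well matched if any one of its 16S copies matches the primer.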

Figure 5. Illustration of how flawed taxonomic schemes and sequence quality can result in incorrect classifications.

The phylogenetic trees were created starting from the full-length reference sequences that were used to train RDP’s taxonomic scheme version 5 for Pseudomonas and Azomonas (A), and Neisseriaceae (B), respectively. These sequences were clustered into 3% OTUs with mothur and representatives for each OTU were selected for building a tree with FastTree. The number of sequences belonging to each OTU is indicated in brackets. (C) Scatter density plots of percent low-quality (QV<20) bases per read versus read length are shown for the misclassified reads (red) compared to their correctly classified counterparts (blue).
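The per-read quantity plotted in panel C, the percentage of bases with a quality value below 20, might be computed as in the following sketch. The function name and the input format (a list of integer quality values per read) are assumptions for illustration.

```python
def percent_low_quality(quality_values: list[int], threshold: int = 20) -> float:
    """Percent of bases in one read whose quality value is below `threshold`."""
    if not quality_values:
        raise ValueError("read has no quality values")
    low = sum(1 for q in quality_values if q < threshold)
    return 100.0 * low / len(quality_values)
```

Pairing this value with `len(quality_values)` for each read gives one point of the scatter density plot.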

Figure 6. 454 sequences have a higher error rate, mainly resulting from an increased insertion and deletion rate.

(A) For all the quality- and chimera-filtered 3730 and 454 sequences generated for the MC sample, an alignment-based estimation of errors, including insertions, deletions, and substitutions, was performed. For each of the different sequencing approaches, the cumulative frequency distribution of the percent error per sequence is shown for assembled 3730 sequences generated with short capillaries (green), long capillaries (red), and three reads per clone (yellow), and 454 reads spanning the variable regions V1–V3 (light blue), V3–V5 (dark blue), and V6–V9 (fuchsia). A vertical line at 1% was added as a visual aid marking the upper limit of an acceptable error threshold. (B) Boxplots show the average percentage of errors per read, per sequencing approach and per error type, including substitutions, insertions, and deletions. Outliers are not shown.
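As a rough sketch of the quantities in panel A, the percent error per sequence and its cumulative frequency distribution could be computed as below. The caption does not state the denominator; using the aligned length is an assumption here, as are the function names.

```python
def percent_error(substitutions: int, insertions: int, deletions: int,
                  aligned_length: int) -> float:
    """Percent error of one sequence from its alignment to the reference.

    Assumption: errors are normalized by the aligned length.
    """
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * (substitutions + insertions + deletions) / aligned_length

def cumulative_fraction(errors: list[float], thresholds: list[float]) -> list[float]:
    """Fraction of sequences with percent error at or below each threshold."""
    n = len(errors)
    return [sum(1 for e in errors if e <= t) / n for t in thresholds]
```

Evaluating `cumulative_fraction` at a threshold of 1.0 gives the share of sequences that fall on or to the left of the 1% line in the plot.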

Figure 7. Error by position profiles indicate hotspots for error.

To visualize where sequencing errors were concentrated along the length of the 16S sequence for each sequencing technology, a root mean square deviation (RMSD) plot was generated for (A) 3730 sequence and (B) 454 read data. The RMSD plot is a graphical representation of the differences in nucleotide distribution between a reference sequence and the samples of interest, for each position along the length of the reference. This figure shows the results for Neisseria meningitidis specifically, but is representative of the profiles observed for the other strains in the MC.
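One way to read the RMSD description is as a per-position comparison between the nucleotide frequencies observed in the reads and the single base found in the reference. The sketch below follows that reading. The exact formula used by the authors is not given in the caption, so the alphabet (four bases plus a gap symbol), the one-hot treatment of the reference, and the function name are all assumptions.

```python
import math

ALPHABET = "ACGT-"  # assumption: four bases plus a gap symbol

def rmsd_by_position(reference: str, aligned_reads: list[str]) -> list[float]:
    """RMSD between read nucleotide frequencies and the reference, per position.

    All reads must already be aligned to the reference and have its length.
    """
    if not aligned_reads:
        raise ValueError("no reads given")
    if any(len(r) != len(reference) for r in aligned_reads):
        raise ValueError("reads must be aligned to the reference length")
    profile = []
    for i, ref_base in enumerate(reference):
        column = [read[i] for read in aligned_reads]
        squared = 0.0
        for base in ALPHABET:
            freq = column.count(base) / len(column)
            expected = 1.0 if base == ref_base else 0.0
            squared += (freq - expected) ** 2
        profile.append(math.sqrt(squared / len(ALPHABET)))
    return profile
```

Positions where every read agrees with the reference score 0, so peaks in the resulting profile correspond to the error hotspots described in the caption.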

Figure 8. Taxonomic utility of 16S sequence data varies by technology, 16S window, and sample type.

The fraction of successfully classified 3730 and 454 sequences obtained from the MC (A) and clinical samples representing four major body regions (B) is plotted at different taxonomic levels from genus to phylum. Classification was performed on quality and chimera-filtered sequences and considered to be successful if the RDP Classifier result had a confidence score above 80%. In panel B, 454 results include only window V3–V5.
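The fraction of successfully classified sequences per taxonomic level can be sketched as follows. The input format, one dict of rank-to-confidence per sequence with confidences on a 0 to 1 scale, is a hypothetical stand-in for parsed RDP Classifier output rather than the tool's actual output format.

```python
RANKS = ["phylum", "class", "order", "family", "genus"]

def fraction_classified(assignments: list[dict], cutoff: float = 0.80) -> dict:
    """Fraction of sequences whose confidence exceeds `cutoff` at each rank.

    A rank missing from a sequence's dict is treated as unclassified.
    """
    total = len(assignments)
    if total == 0:
        raise ValueError("no sequences given")
    return {
        rank: sum(1 for a in assignments if a.get(rank, 0.0) > cutoff) / total
        for rank in RANKS
    }
```

Because confidence generally drops toward finer ranks, the returned fractions would be expected to decrease from phylum to genus, which is the trend the figure plots.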

Figure 9. The Lachnospiraceae 16S diversity observed in stool samples is greater than from known reference resources.

A phylogenetic tree was constructed with 16S sequences from RDP’s training set (light blue, n = 34), publicly available genomes from human isolates (green, n = 26), publicly available HMP genomes (dark blue, n = 44), and sequences from aggregate stool samples that either could be classified at the genus level (dark grey, n = 63) or remained unclassified at the genus level (light grey, n = 408).

Figure 10. Classifiability of 16S sequence data is differentially impacted by sequencing technology, taxonomic family and body region.

For each of the HMP body regions, the relationship between the average frequency of a given bacterial family (y-axis) versus the contribution of these families to the unclassifiability issue (x-axis) is plotted for (B) 3730 and (C) 454. Only window V3–V5 is presented in 454 results. Classification was performed on quality- and chimera-filtered sequences and classifications assigned only if the RDP Classifier result had a confidence score above 80%.

Figure 11. Improved estimation of community diversity after quality filtering and chimera checking, as evaluated by rarefaction analysis.

The number of observed OTUs in the MC is shown as a function of the number of 3730 (A) and 454 (B) sequences, before filtering (black), after quality filtering (green, 454 only), and after combined quality and chimera filtering (red). Rarefaction curves were generated using mothur, with an OTU defined at 97% similarity. (A) For 3730, separate lines show the rarefaction curves for the three different sequencing approaches. (B) For 454, rarefaction curves for the three 16S windows spanning the variable regions V1–V3, V3–V5, and V6–V9 are shown separately, and analysis was performed on a random 10,000-sequence subset from each sample.
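The idea behind a rarefaction curve can be illustrated with a small resampling sketch. This is not mothur's implementation. It assumes sequences have already been assigned to OTUs (for example at 97% similarity) and simply averages the number of distinct OTUs seen in random subsamples drawn without replacement. The function name, repeat count, and seed are illustrative choices.

```python
import random

def rarefaction_curve(otu_labels: list[str], depths: list[int],
                      repeats: int = 100, seed: int = 0) -> list[float]:
    """Mean number of distinct OTUs observed at each subsampling depth.

    `otu_labels` holds one OTU label per sequence.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for depth in depths:
        if depth > len(otu_labels):
            raise ValueError("depth exceeds the number of sequences")
        total = 0
        for _ in range(repeats):
            # Sample sequences without replacement and count distinct OTUs.
            total += len(set(rng.sample(otu_labels, depth)))
        curve.append(total / repeats)
    return curve
```

A curve that flattens as depth increases suggests most OTUs have been sampled; spurious OTUs from sequencing errors or chimeras keep the curve rising, which is why filtering lowers the estimates in the figure.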
