Flexible taxonomic assignment of ambiguous sequencing reads

José C Clemente et al. BMC Bioinformatics. 2011.

Abstract

Background: To characterize the diversity of bacterial populations in metagenomic studies, sequencing reads need to be accurately assigned to taxonomic units in a given reference taxonomy. Reads that cannot be reliably assigned to a unique leaf in the taxonomy (ambiguous reads) are typically assigned to the lowest common ancestor of the set of species that match it. This introduces a potentially severe error in the estimation of bacteria present in the sample due to false positives, since all species in the subtree rooted at the ancestor are implicitly assigned to the read even though many of them may not match it.
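The false-positive problem described above can be illustrated with a minimal sketch. The taxonomy, node names, and helper functions below are hypothetical and not taken from the paper; the naive path-intersection LCA is used only for clarity and is not the constant-time LCA query mentioned in the abstract.

```python
# Hypothetical toy taxonomy stored as child -> parent pointers (illustrative only).
PARENT = {
    "root": None,
    "genusA": "root", "genusB": "root",
    "s1": "genusA", "s2": "genusA",
    "s3": "genusB", "s4": "genusB",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lowest_common_ancestor(nodes):
    """Naive LCA: deepest node shared by all root paths."""
    paths = [path_to_root(n) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first node, the first shared ancestor is the LCA.
    return next(n for n in paths[0] if n in common)

def leaves_under(node):
    """All leaves (nodes that are nobody's parent) in the subtree rooted at `node`."""
    internal = set(PARENT.values())
    return {n for n in PARENT if n not in internal and node in path_to_root(n)}

hits = {"s1", "s3"}                      # species matching an ambiguous read
lca = lowest_common_ancestor(sorted(hits))
false_positives = leaves_under(lca) - hits
print(lca, sorted(false_positives))
```

In this toy case the read matches one species in each genus, so the LCA is the root and the two non-matching species are implicitly assigned as well.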

Results: We present a method that maps each read to a node in the taxonomy that minimizes a penalty score while balancing the relevance of precision and recall in the assignment through a parameter q. This mapping can be obtained in time linear in the number of matching sequences, because LCA queries to the reference taxonomy take constant time. When applied to six different metagenomic datasets, our algorithm produces different taxonomic distributions depending on whether coverage or precision is maximized. Including information on the quality of the reads reduces the number of unassigned reads but increases the number of ambiguous reads, stressing the relevance of our method. Finally, two measures of performance are described and results with a set of artificially generated datasets are discussed.
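The idea of choosing a node by minimizing a q-weighted penalty can be sketched as follows. The abstract does not give the penalty formula, so the form used here, (q·FN + (1 − q)·FP) / TP, is an assumption, as are the direction of q, the node names, and the tree topology. The brute-force scan over candidate nodes is for illustration only and is not the linear-time procedure the authors describe.

```python
def penalty(tp, fp, fn, q):
    """Assumed penalty: q weights false negatives, (1 - q) weights false positives."""
    if tp == 0:
        return float("inf")  # a node covering no hit is never chosen
    return (q * fn + (1 - q) * fp) / tp

def best_node(leaves_by_node, hits, q):
    """Return the candidate node with the smallest assumed penalty."""
    scores = {}
    for node, leaves in leaves_by_node.items():
        tp = len(leaves & hits)   # hits covered by the node
        fp = len(leaves - hits)   # non-hits implicitly assigned
        fn = len(hits - leaves)   # hits left out
        scores[node] = penalty(tp, fp, fn, q)
    return min(scores, key=scores.get)

# Hypothetical candidate nodes and the leaves below each of them.
leaves_by_node = {
    "root": {"s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"},
    "A": {"s1", "s2", "s3", "s4"},
    "j": {"s5", "s6", "s7", "s8"},
    "s1": {"s1"}, "s5": {"s5"}, "s6": {"s6"}, "s7": {"s7"},
}
hits = {"s1", "s5", "s6", "s7"}

for q in (0.0, 0.5, 1.0):
    print(q, best_node(leaves_by_node, hits, q))
```

Under these assumptions, q = 1 penalizes only missed hits and selects the root, q = 0 penalizes only false positives and selects a single matching leaf, and an intermediate value selects the internal node j.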

Conclusions: The assignment strategy for sequencing reads introduced in this paper is a versatile and quick method to study bacterial communities. The bacterial composition of the analyzed samples can vary significantly depending on how ambiguous reads are assigned, which is controlled by the value of the q parameter. Validation of our results on an artificial dataset confirms that a combination of values of q produces the most accurate results.

Figures

Figure 1

Sample Taxonomic Assignment of an Ambiguous Read. Assigning read R_i to the j-th node of T_i partitions the leaves of T_i into true positives, false positives, true negatives, and false negatives. In this example, the hits are M_i = {s_1, s_5, s_6, s_7}. If we let j be the blue circled node, we obtain TP_{i,j} = {s_5, s_6, s_7}, FP_{i,j} = {s_8}, TN_{i,j} = {s_2, s_3, s_4}, FN_{i,j} = {s_1}. True hit H = s_5 (defined in Section "Validation: Performance in ROC Space").
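The partition in this caption can be reproduced with plain set operations. In the sketch below, the full leaf set and the leaves under node j are inferred from the caption (as the union of the four sets, and as the true positives plus false positives, respectively); the function name `partition` is hypothetical.

```python
def partition(all_leaves, hits, under_node):
    """Split the leaves of the taxonomy relative to a candidate node."""
    tp = under_node & hits               # hits below the node
    fp = under_node - hits               # non-hits below the node
    fn = hits - under_node               # hits outside the node's subtree
    tn = all_leaves - under_node - hits  # everything else
    return tp, fp, tn, fn

all_leaves = {f"s{k}" for k in range(1, 9)}   # s1 .. s8
hits = {"s1", "s5", "s6", "s7"}
under_j = {"s5", "s6", "s7", "s8"}            # leaves below the circled node j

tp, fp, tn, fn = partition(all_leaves, hits, under_j)
precision = len(tp) / (len(tp) + len(fp))
recall = len(tp) / (len(tp) + len(fn))
print(sorted(tp), sorted(fp), sorted(tn), sorted(fn), precision, recall)
```

For this node, three of the four leaves below j are hits and three of the four hits lie below j, so precision and recall are both 0.75.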

Figure 2

Validation of Results in ROC Space. Distance in ROC space to the diagonal TPR_H = FPR_H. Points above the diagonal represent good predictions; the larger the distance, the better.
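As a small illustration of the quantity plotted here, the signed perpendicular distance from a point (FPR, TPR) to the diagonal TPR = FPR is (TPR − FPR) / √2. The function name below is hypothetical, and the sketch takes the two rates as given; how TPR_H and FPR_H are computed is defined in the paper, not in this caption.

```python
import math

def distance_to_diagonal(fpr, tpr):
    """Signed distance from (fpr, tpr) to the line tpr = fpr.

    Positive values lie above the diagonal, negative values below it.
    """
    return (tpr - fpr) / math.sqrt(2)

print(distance_to_diagonal(0.0, 1.0))   # perfect classifier, farthest above the diagonal
print(distance_to_diagonal(0.5, 0.5))   # on the diagonal
print(distance_to_diagonal(0.8, 0.2))   # below the diagonal
```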

Figure 3

Distribution of Sequencing Reads. Distribution of sequencing reads (number of hits and reads) ambiguously assigned with up to 2 mismatches to two or more of the 5,165 sequences in the reference bacterial taxonomy.

Figure 4

Distribution of Hits per Taxonomic Rank. Distribution of the number of hits (species with up to k mismatches) in ambiguous reads per taxonomic rank.

Figure 5

Distribution of Taxonomic Ranks in Metagenomic Datasets. Ambiguous reads assigned in the bacterial taxonomy at each taxonomic rank for q = 0, ..., 1. Color code: domain: purple; phylum: indigo; class: light blue; order: cyan; family: green; genus: yellow; species: red.

Figure 6

Distribution of Taxonomic Ranks in Simulated Datasets. Simulated ambiguous reads assigned in the bacterial taxonomy at each taxonomic rank for q = 0, ..., 1. Datasets were constructed from whole 16S rRNA sequence and from the V1-V2 hypervariable region. Color code: domain: purple; phylum: indigo; class: light blue; order: cyan; family: green; genus: yellow; species: red.
