Ultrafast clustering algorithms for metagenomic sequence analysis

Weizhong Li et al. Brief Bioinform. 2012 Nov.

Abstract

Rapid advances in high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and complexity of these sequence data pose tremendous challenges in data analysis. These challenges include, but are not limited to, ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics. A large redundant data set can be represented with a small non-redundant set, where each cluster is represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using a consensus from the sequences within clusters.
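To make the idea of replacing a redundant set with cluster representatives concrete, here is a minimal sketch of greedy incremental clustering. The abstract does not specify an algorithm, so the scheme, the function names `similarity` and `greedy_cluster`, the 0.8 threshold and the toy reads are all assumptions for illustration. The similarity score from Python's `difflib` is only a crude stand-in for an alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (not an alignment-based score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.8):
    """Greedy incremental clustering; longest sequences are processed first.

    Each sequence joins the first cluster whose representative is similar
    enough; otherwise it becomes the representative of a new cluster.
    Returns a list of clusters, each a list with the representative first.
    """
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if similarity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing representative was close enough.
            clusters.append([seq])
    return clusters


# Hypothetical toy reads: three near-identical sequences and one outlier.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
clusters = greedy_cluster(reads, threshold=0.8)
representatives = [c[0] for c in clusters]
```

In this toy case the four reads collapse to two representatives, which is the kind of non-redundant set the abstract refers to. A real tool would use a proper identity measure and indexing to avoid comparing every sequence against every representative.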

Figures

Figure 1:

Distribution of microbial diversity measured by NATs (NAT20, NAT50, NAT80 and NAT99) for 33 human gut samples. The _x_-axis is the NAT category. The _y_-axis is the NAT value. Samples are colored by sample type (obese, overweight or lean). The results show that obese samples have a lower average NAT50 than lean samples.

Figure 2:

Assembly performance of the filtered reads for metagenomic sample MH0006. The _x_-axis is the redundancy cutoff N. The length of the longest contig (kb) and the N50 (kb) are plotted against the left _y_-axis. The accuracy and genome coverage are plotted against the right _y_-axis. The assembly results for the original reads are at the far right, marked 'ALL' on the _x_-axis. The accuracy of contigs is the total length of correct contigs divided by the total length of all contigs. The genome coverage is the fraction of the reference genome covered by the correct contigs.
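The metrics named in this caption can be sketched in a few lines. The accuracy and genome coverage functions follow the definitions given in the caption; N50 uses its standard definition (the contig length at which the length-sorted contigs reach half of the total assembly length). The input formats are assumptions for illustration: contigs as `(length, is_correct)` pairs, and reference alignments of the correct contigs as half-open `(start, end)` intervals.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def accuracy(contigs):
    """Total length of correct contigs divided by total length of all contigs."""
    total = sum(length for length, _ in contigs)
    correct = sum(length for length, ok in contigs if ok)
    return correct / total if total else 0.0


def genome_coverage(intervals, genome_length):
    """Fraction of the reference covered by the union of half-open intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Hypothetical toy assembly (lengths in kb).
contigs = [(10, True), (8, True), (6, False), (4, True), (2, False)]
toy_n50 = n50([length for length, _ in contigs])
toy_accuracy = accuracy(contigs)
toy_coverage = genome_coverage([(0, 10), (5, 15), (20, 24)], genome_length=40)
```

Overlapping alignments are merged before summing so that a region of the reference covered by two contigs is counted only once.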

Figure 3:

Using NR query and NR reference database for metagenome annotation.

Figure 4:

Distribution of GOS and MetaHIT protein clusters. The _x_-axis is the cluster size X. The _y_-axis in the left panels is the number of clusters of size at least X; the _y_-axis in the right panels is the percentage of total sequences included in the clusters of size at least X. Graphs in (A) and (B) are for all GOS and MetaHIT sequences. Graphs in (C) and (D) are only for MetaHIT sequences, grouped by Known and Novel clusters. In addition, two separate lines are drawn for NR sequences (i.e. the 3 076 514 representative sequences clustered at 90% identity).
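The two quantities plotted on the _y_-axes can be computed directly from a list of cluster sizes. The following is a minimal sketch; the function name `cumulative_cluster_distribution` and the toy sizes are hypothetical and are not taken from the GOS or MetaHIT data.

```python
def cumulative_cluster_distribution(cluster_sizes, thresholds):
    """For each X, count clusters of size >= X and the share of sequences in them.

    Returns a list of (X, number_of_clusters, percent_of_sequences) tuples.
    """
    total = sum(cluster_sizes)
    rows = []
    for x in thresholds:
        kept = [size for size in cluster_sizes if size >= x]
        percent = 100.0 * sum(kept) / total if total else 0.0
        rows.append((x, len(kept), percent))
    return rows


# Hypothetical toy data: five clusters holding 19 sequences in total.
rows = cumulative_cluster_distribution([1, 1, 2, 5, 10], thresholds=[1, 2, 5, 10])
```

Both curves are non-increasing in X, and the percentage curve starts at 100% when X is 1 because every sequence belongs to a cluster of size at least 1.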
