Bambus 2: scaffolding metagenomes

Sergey Koren et al. Bioinformatics. 2011.

Abstract

Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources.

Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly.

Availability: Bambus 2 is open source and available from http://amos.sf.net.

Contact: mpop@umiacs.umd.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.

(a) The unitig graph representation of a single unitig, 3, having double the coverage of the surrounding unitigs. Solid black arrows correspond to reads comprising a unitig. (b) One of the possible resolutions of the graph presented in (a). This example places unitig 3 in two locations along a single genome. (c) A second of the possible resolutions of the graph presented in (a). This example places unitig 3 at the same location in two genomes (highlighted in different colors).

Fig. 2.

(a) A variant motif detected on the Sim3 dataset. The motif corresponds to a ferrochelatase gene in E. coli. There are two alternate versions of the gene within the E. coli K12 (2338) and E. coli O157:H7 (2034) genomes. (b) A CLUSTAL W (Thompson et al., 1994) alignment of a subset of the FASTA output from Bambus 2, with an edit region corresponding to (a).

Fig. 3.

The figure shows a subset of a bacterial assembly where nodes are connected if they share paired-end reads. The shaded node, 119, is a repeat that occurs on many shortest paths.
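Counting how often a node lies on shortest paths is what graph theory calls betweenness centrality. The sketch below illustrates that idea on a toy graph using Brandes' algorithm; it is not the Bambus 2 implementation. The graph, the node labels other than 119, and the choice of an unweighted, undirected graph are all assumptions made for illustration, and the caption gives no cutoff for calling a node a repeat.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to a list of neighbours. The result gives, for each
    node, the number of shortest paths between other node pairs that pass
    through it (fractional when several shortest paths tie).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical toy graph: node 119 links four otherwise unconnected nodes,
# so every shortest path between them runs through it.
toy = {119: [1, 2, 3, 4], 1: [119], 2: [119], 3: [119], 4: [119]}
scores = betweenness(toy)
candidate = max(scores, key=scores.get)  # node with the highest centrality
```

In this toy graph node 119 carries all six shortest paths between the four outer nodes, while the outer nodes carry none, so it stands out as the repeat candidate.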

Fig. 4.

Repeat detection comparison. Ideal repeat detection corresponds to the top-right corner of the graph, with 100% sensitivity and specificity. We vary the genome size estimate (a critical parameter in the procedure for detecting repeats) for CA, generating a curve for each dataset. The CA-met default is indicated by large shaded points. The Bambus 2 repeat detection is fully automated, generating a single point. As CA is designed for clonal organisms, only the default genome size estimate is used for B. suis. The gold standard is built from REPuter. All tests are run using the set of unitigs generated by CA-met. Sensitivity: $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$. Specificity: $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$.
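As a rough illustration of how one point on such a plot could be computed, the sketch below scores a set of predicted repeat unitigs against a gold-standard set. It assumes the standard definitions, sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the function name and the unitig identifiers are hypothetical and do not come from Bambus 2, CA or REPuter.

```python
def sensitivity_specificity(predicted, gold, all_unitigs):
    """Score predicted repeats against a gold standard (standard definitions)."""
    predicted, gold, all_unitigs = set(predicted), set(gold), set(all_unitigs)
    tp = len(predicted & gold)                # repeats correctly flagged
    fn = len(gold - predicted)                # repeats that were missed
    fp = len(predicted - gold)                # unique unitigs wrongly flagged
    tn = len(all_unitigs - predicted - gold)  # unique unitigs left alone
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Hypothetical example: ten unitigs, four true repeats, four predictions.
unitigs = range(10)
gold_repeats = {0, 1, 2, 3}
predicted_repeats = {0, 1, 2, 9}
sens, spec = sensitivity_specificity(predicted_repeats, gold_repeats, unitigs)
```

Here three of the four true repeats are found and one of the six unique unitigs is flagged in error, so sensitivity is 3/4 and specificity is 5/6.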

Fig. 5.

Assembly results for three simulated datasets. The y-axis represents the minimum number of scaffolds that add up to 1% of the genome size. Lower bars represent a better assembly. Bambus 2 produces large scaffolds for a wide range of coverage levels in our simulated datasets. Bambus 2 (CA-met) is Bambus 2 run using CA-met instead of using Minimus unitigs. We aligned the assembly (all contigs >2 kb) to the reference and counted coverage by reciprocal best matches over 95% identity. We use reciprocal best matches to avoid double counting Bambus 2 motifs that cover the same genomic region. We divide the number of scaffolds by the genome coverage and average the results, by genome, on all three simulated datasets to evaluate performance across varying coverage.
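The y-axis statistic can be read as a greedy count: take scaffolds from largest to smallest until their combined length reaches the target fraction of the genome. The sketch below shows that reading only. The function name, the scaffold lengths and the genome size are hypothetical, and the further normalization by genome coverage described in the caption is not included.

```python
def min_scaffolds_to_cover(scaffold_lengths, genome_size, fraction=0.01):
    """Smallest number of scaffolds whose lengths sum to fraction * genome_size.

    Returns None when all scaffolds together fall short of the target.
    """
    target = fraction * genome_size
    total = 0
    # Taking the longest scaffolds first minimizes the count.
    for count, length in enumerate(sorted(scaffold_lengths, reverse=True), start=1):
        total += length
        if total >= target:
            return count
    return None


# Hypothetical example: a 3 Mb genome, so 1% is 30 kb.
lengths = [5_000, 12_000, 3_000, 20_000]
needed = min_scaffolds_to_cover(lengths, genome_size=3_000_000)
too_short = min_scaffolds_to_cover([100], genome_size=1_000_000)
```

With these numbers the two largest scaffolds (20 kb and 12 kb) already exceed 30 kb, so the statistic is 2. A single 100 bp scaffold cannot reach 1% of a 1 Mb genome, so the function returns None.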

Fig. 6.

Assembly results for the acid mine metagenomic dataset. The y-axis represents the minimum number of scaffolds that add up to 1% of the genome size. Lower bars represent a better assembly. Bambus 2 produces larger scaffolds than CA-met in three of the five genomes. We calculated assembly statistics as in Figure 5. In three genomes, both CA and Bambus 2 produced slightly >100% coverage. This is due to redundancy within the MUMmer alignments.

References

    1. Altschul S, et al. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
    2. Arumugam M, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–180.
    3. Butler J, et al. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18:810–820.
    4. Dayarian A, et al. Sopra: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010;11:345.
    5. Eppley J, et al. Strainer: software for analysis of population variation in community genomic datasets. BMC Bioinformatics. 2007a;8:398.
