MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies

Manuel Stark et al. BMC Genomics. 2010.

Abstract

Background: Shotgun sequencing of environmental DNA is an essential technique for characterizing uncultivated microbes in situ. However, the taxonomic and functional assignment of the obtained sequence fragments remains a pressing problem.

Results: Existing algorithms are largely optimized for speed and coverage; in contrast, we present here a software framework that focuses on a restricted set of informative gene families, using Maximum Likelihood to assign these with the best possible accuracy. This framework ('MLTreeMap'; http://mltreemap.org/) uses raw nucleotide sequences as input, and includes hand-curated, extensible reference information.

Conclusions: We discuss how we validated our pipeline using complete genomes as well as simulated and actual environmental sequences.

Figures

Figure 1

MLTreeMap: Placing anonymous sequence fragments into reference phylogenies. Top: overview of the procedure. Informative marker genes (or fragments thereof) are automatically extracted from raw, un-annotated nucleotide sequence fragments, aligned to reference sequences, and then placed into externally provided gene trees using RAxML. Bottom: overview of the reference phylogenies currently available in MLTreeMap.

Figure 2

Leave-one-out validation: examples. Individual query genomes were cut into 1,000 bp fragments, which were then placed into reference trees from which the corresponding genomes (or entire clades) had been removed. Small circles show the resulting assignments. Placements become increasingly scattered and imprecise as increasingly deep reference information is removed. MLTreeMap is compared with two popular approaches; MEGAN, while the least accurate, applies to a much larger fraction of the reads in a given sample and thus achieves the best coverage. Definitions of test success: *Assignments are designated as correct when they are no more than two nodes away from the target position in the tree. **For MEGAN, assignments are designated as correct when they map to the target phylum.
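
The fragmentation step and the success criterion in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy tree, the choice to drop a trailing fragment shorter than 1,000 bp, and the reading of "two nodes away" as "at most two edges apart" are all assumptions made for illustration.

```python
from collections import deque


def fragment(sequence, size=1000):
    """Cut a sequence into consecutive, non-overlapping fragments of `size` bp.

    Assumption: a trailing piece shorter than `size` is discarded.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]


def node_distance(tree, start, goal):
    """Count the edges between two nodes of an undirected tree.

    `tree` is an adjacency dict (node -> list of neighbours); the search is a
    plain breadth-first traversal.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in tree[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    raise ValueError("nodes are not connected")


def is_correct(tree, placed, target, max_distance=2):
    """Assumed reading of the caption: correct if at most two edges from target."""
    return node_distance(tree, placed, target) <= max_distance


# Hypothetical toy reference tree: two internal nodes, four leaves.
TOY_TREE = {
    "root": ["n1", "n2"],
    "n1": ["root", "A", "B"],
    "n2": ["root", "C", "D"],
    "A": ["n1"], "B": ["n1"], "C": ["n2"], "D": ["n2"],
}

fragments = fragment("ACGT" * 600)           # a 2,400 bp toy "genome"
near = is_correct(TOY_TREE, "B", "A")        # sister leaf of the target
far = is_correct(TOY_TREE, "C", "A")         # leaf in the other clade
```

In this toy tree, a placement on the sister leaf of the target counts as correct, while a placement in the other clade does not.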

Figure 3

Systematic validation. MLTreeMap was tested on three types of input: fragmented genomes, simulated metagenomes, and real metagenomes. In all cases, the pipeline was run with default settings, using the extended reference phylogeny based on Ciccarelli et al. [47].

Figure 4

Functional characterization of metagenomes. A) Three published environmental sequence datasets were searched for instances of the RuBisCo and RuBisCo-like enzyme families using MLTreeMap. Colored spheres represent sequences mapping to a specific position in the tree; the area of each sphere indicates the relative number of sequences. The resulting placements are largely non-overlapping, suggesting that distinct functional RuBisCo classes are encountered or required at each environmental site. B) Several datasets, available at [69] and [70], were assessed with respect to two metabolic functions: CO2 fixation and nitrogen fixation. All counts were normalized with respect to sampling depth and are thus directly comparable.
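
Normalizing counts by sampling depth, as mentioned for panel B, can be sketched as follows. The caption does not state the unit or formula used, so the "hits per million sequenced base pairs" scale, the function name, and the example figures below are assumptions for illustration only, not values from the paper.

```python
def normalize_count(hits, depth_bp, per_bp=1_000_000):
    """Express a raw hit count as hits per `per_bp` sequenced base pairs.

    Assumption: sampling depth is measured as total sequenced base pairs.
    """
    if depth_bp <= 0:
        raise ValueError("sampling depth must be positive")
    return hits * per_bp / depth_bp


# Hypothetical datasets: (raw marker-gene hits, total sequenced bp).
datasets = {
    "site_a": (10, 2_000_000),
    "site_b": (3, 500_000),
}

normalized = {name: normalize_count(hits, depth)
              for name, (hits, depth) in datasets.items()}
```

With these made-up numbers, the dataset with fewer raw hits ends up with the higher normalized value because it was sampled less deeply, which is why depth normalization is needed before comparing sites.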

References

    1. Alain K, Querellou J. Cultivating the uncultured: limits, advances and future challenges. Extremophiles. 2009;13(4):583–594. doi: 10.1007/s00792-009-0261-3.
    2. Ferrari BC, Winsley T, Gillings M, Binnerup S. Cultivating previously uncultured soil bacteria using a soil substrate membrane system. Nat Protoc. 2008;3(8):1261–1269. doi: 10.1038/nprot.2008.102.
    3. Zengler K. Central role of the cell in microbial ecology. Microbiol Mol Biol Rev. 2009;73(4):712–729. doi: 10.1128/MMBR.00027-09.
    4. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;3(2):REVIEWS0003. doi: 10.1186/gb-2002-3-2-reviews0003.
    5. Liolios K, Chen IM, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz VM, Kyrpides NC. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2010. pp. D346–354.
