Current opportunities and challenges in microbial metagenome analysis—a bioinformatic perspective

Abstract

Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, the bulk of which are as yet uncultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such ‘ecosystems biology’ approaches have the potential to advance our understanding of environmental microbes to a new level, but they also pose challenges owing to increasing data complexity, in particular with respect to bioinformatic post-processing. This mini review addresses selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and will hopefully serve as a useful resource for microbial ecologists and bioinformaticians alike.

Keywords: 16S rRNA biodiversity, binning, bioinformatics, Genomic Standards Consortium, metagenomics, next-generation sequencing

INTRODUCTION

The development of techniques for sequencing deoxyribonucleic acid (DNA) from environmental samples was a crucial factor in the discovery of the exceptional degree of diversity among prokaryotes. In particular, techniques to obtain 16S ribosomal ribonucleic acid (rRNA) sequences from the environment, such as the early reverse transcriptase-based approaches [1] and the later polymerase chain reaction-based methods, have been cornerstones toward current large-scale studies of microbial biodiversity. More than 3 million 16S rRNA sequences of Bacteria and Archaea in release 111 of the SILVA database [2] constitute an impressive hallmark of microbial versatility. This number is already on the order of magnitude of the estimated few million microbial species for the entire ocean [3], yet it represents just a fraction of the diversity of soils, of which a single ton is believed to potentially harbor millions of species [3, 4]. The extent of 16S rRNA gene variation recently discovered among low-abundance species in the deep sea (the ‘rare biosphere’) [5–7] indicates that, with respect to microbial diversity, we have so far seen just the proverbial tip of the iceberg.

For a long time, microbial ecologists were mostly restricted to pure cultures of cultivable isolates to shed light on the diversity and functions of environmental microbes. Pure cultures allow the study of an isolate’s metabolism and of its gene repertoire by genome sequencing. Both provide valuable information for extrapolating the isolate’s ecophysiological role. The cultivability of environmental microbes often lies below 1% of the total bacteria [8], but depending on cultivation technique and habitat, much higher cultivation rates have been reported, for example up to 10% for a freshwater lake [9] and 23% for a marine tidal sediment [10]. Such successes notwithstanding, in almost all cases a major fraction of Bacteria and Archaea evades current cultivation approaches and thus conventional whole genome shotgun sequencing.

Solutions are to sequence either single microbial cells [11] or entire microbial communities—the latter is termed metagenomics [12, 13]. The classical metagenome approach involves cloning of environmental DNA into vectors with the help of ultra-competent bioengineered host strains. The resulting clone libraries are subsequently screened either for dedicated marker genes (sequence-driven approach) or for metabolic functions (function-driven approach) [14]. The function-driven approach is still paramount for screening enzymes with prospects in biotechnology (see [15] for a recent mini review), whereas in microbial ecology, increasing throughput (i.e. base pairs per run) and diminishing costs for DNA sequencing have rendered the sequence-driven approach largely obsolete. Nowadays, direct sequencing of environmental DNA (aka shotgun metagenomics) is commonly used to study the gene inventories of microbial communities. By combining the resulting metagenomic data with biodiversity data (e.g. from 16S rRNA gene amplicon sequencing; A. Klindworth et al., submitted for publication), in situ expression data (metatranscriptomics and metaproteomics) and environmental parameters, a new type of holistic ecosystem study has become feasible [16] (Figure 1). Similarly, metagenome data can be integrated with metabolome data [17]. Such integrative ‘ecosystems biology’ studies (e.g. [18, 19]) introduce a plethora of challenges with respect to experimental design and bioinformatic downstream processing. These involve considerations about the habitat, sampling strategy, sequencing technology, assembly, gene prediction, taxonomic classification and binning, biodiversity estimation, function prediction and analysis, data integration, interpretation and data deposition. This mini review aims to address some of these aspects and to complement more elaborate full reviews of the matter (e.g. [20]).

Figure 1. Scheme of the major stages of an integrative metagenomic ecosystems study on microbial ecology.

HABITAT

The biodiversity composition (richness and evenness) of a habitat has a profound impact on the quality of a metagenome. For metagenome analyses involving assembly (to generate longer genome fragments with multiple genes), habitats with few microbial species or an uneven population with a few dominating species are more promising targets than habitats with many species of even abundance. However, more important than the absolute number of species is their level of genomic coherence. Even seemingly ideal habitats with a stable composition of a few dominant species, for example microbial mats [21] or invertebrate bacterial symbioses [22], can be difficult to assemble when evolutionary micro-niche adaptations have led to large pan-genomes and thus to a low level of population clonality (see e.g. [23] for a discussion of sub-species fine-scale evolution and pan-genomes). In contrast, seemingly unsuitable habitats that harbor a multitude of species with dynamically changing compositions can yield long assemblies when the species that thrive and dominate are largely clonal. This effect is observed when a second round of sequencing and reassembly of an environmental sample breaks rather than elongates assemblies from the previous round. The reason is that in habitats with little clonality, more sequencing covers more genomic heterogeneity. This increases incongruities in putative assemblies, which causes assemblers to generate smaller but congruent assemblies rather than long assemblies with high levels of positional variability.

These common issues in metagenomics might be overcome either by switching to a longer-read sequencing technology or by sequencing more deeply to increase coverage (i.e. the number of calls of a base in a given DNA sequence, typically attained by sequencing multiple molecules containing the respective sequence), but either way this means increased costs, data volumes and complexity. A preceding biodiversity analysis can help to properly assess the required amount of sequencing, e.g. in the form of a full-length 16S rRNA clone library analysis (for fine-scale resolution) in conjunction with 16S rRNA gene tag analysis (for abundance estimation).
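The required sequencing effort can be roughly estimated from such biodiversity data. The following minimal sketch illustrates the arithmetic; the genome size, relative abundance and target coverage used here are hypothetical placeholder values, not recommendations.

```python
# Back-of-the-envelope estimate of the sequencing effort needed to reach a
# target per-genome coverage, given a relative abundance from a preceding
# 16S rRNA-based biodiversity survey. All numbers below are illustrative.

def required_bases(target_coverage, genome_size_bp, relative_abundance):
    """Total metagenome bases needed so that a taxon of the given relative
    abundance is covered at the target depth."""
    return target_coverage * genome_size_bp / relative_abundance

# Hypothetical community member: 4 Mbp genome at 2% relative abundance,
# aiming for the ~20-fold coverage suggested for assembly [38].
total_bp = required_bases(target_coverage=20, genome_size_bp=4_000_000,
                          relative_abundance=0.02)
print(f"~{total_bp / 1e9:.1f} Gbp of sequencing required")  # ~4.0 Gbp
```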

SAMPLING

When the study target is a specific uncultivable microbial species of low abundance, it is worthwhile to try to enrich the species after sampling. Sometimes favorable culture conditions can be found that result in co-cultures with a substantially enriched target species, or the species can be physically enriched, for example by methods such as fluorescence-activated cell sorting (e.g. [24, 25]) or by density gradient centrifugation (e.g. [26]). Subsequent multiple-displacement amplification and single-cell sequencing might then be a viable solution to obtain a draft genome. However, when the target is a habitat’s overall function, a representative sample must be studied, and data from single cells, enrichments or isolates—though valuable—are complementary. Sampling of microbes from environments usually involves a size selection (e.g. fractionating filtration) to minimize contamination by viruses or eukaryotes. Such reduction of a sample’s complexity introduces a bias in the community composition, for example by under-sampling particle-associated, filamentous, aggregate-forming or very small microbes. This is the reason why even deeply covered metagenomes mostly represent only a select fraction of a habitat’s microbial gene inventory. It can be reasonable to reduce the complexity of an environmental sample by enrichment or size selection, in particular when multiple different enrichments or samples with different size fractions are taken and their results are combined, but such effects need to be taken into account in experimental design and in the interpretation of the final data.

REPLICATION

It is good scientific practice to analyze true replicates of a sample and to assess whether observed differences within one sample are statistically meaningful. However, this is rarely done in microbial ecology [27]. One reason is that in many habitats it is almost impossible to take true replicates. For example, sediment cores that have been taken only centimeters apart might host slightly different microbial communities due to environmental patchiness. Similarly, water samples that were taken within a few minutes might differ because the sampled water was moving and not perfectly homogeneous. Hence, comparing such alleged replicates reveals little about methodological reproducibility. Our own comparisons of true 454/Roche pyrosequencing replicates have shown that library preparation and sequencing are highly reproducible, which is corroborated by a recent comparison of 454/Roche pyrosequencing and Illumina sequencing [28]. It is thus understandable that environmental biologists prefer to analyze more samples rather than to invest in replication, in particular in expensive large-scale projects. This, however, does not release scientists from assessing the reproducibility of their methods. Part of this can be addressed by pseudo-replication [29], such as sub-sample analysis and comparison of samples within time series [30], and by independent assessments of methodological reproducibility using representative test data sets.

SEQUENCING

Next-generation sequencing (NGS) has been nothing less than a paradigm shift for metagenomics. Not long ago, the classical clone-based metagenome approach in combination with Sanger sequencing usually allowed for obtaining only a few selected inserts, as sequencing was the limiting factor. NGS has obliterated the cloning step and its inherent problems and has made it possible to sequence environmental DNA directly. Initially, 454/Roche pyrosequencing was most widely used, because it generated substantially longer reads than competing platforms. Meanwhile, large-scale metagenome projects in particular make increasing use of the Illumina and, to a lesser extent, SOLiD platforms. Although the latter two still provide shorter reads than pyrosequencing, they offer a much higher throughput and hence coverage for the same price. The read length of 2 × 150 bp provided by the current Illumina GA IIx line of instruments basically matches that of the first-generation 454 Life Sciences GS20 instrument, and high coverage in conjunction with mate-pair libraries facilitates assembly and can compensate for the lack of read length. A recent comparative study on a freshwater lake planktonic community has shown that Illumina and 454 pyrosequencing lead to similar results with respect to assemblies and the covered taxonomic and functional repertoires [28]. In addition, protocols have been proposed that allow for obtaining longer ‘composite reads’ from short-read platforms [31].

It remains to be seen what impact newer sequencing platforms will have on the metagenomic field. The recently announced Ion Torrent Proton fits in between the 454/Roche and Illumina platforms in terms of read length and throughput at a seemingly competitive price. At the same time, single-molecule detection methods such as Pacific Biosciences’ PacBio RS and the recently announced Oxford Nanopore Technologies (ONT) GridION and MinION systems offer much longer read lengths, albeit at the expense of higher sequencing error rates (PacBio RS: ∼10–15% and ONT GridION: ∼4%). However, in contrast to the inherent systematic errors of other platforms, these errors are mostly random, and once these platforms mature, they could be reduced in an almost linear fashion by repeated sequencing of the exact same DNA molecule and by increased coverage. The specific value of PacBio and ONT sequencing is that they provide read lengths that are long enough to span multiple prokaryotic genes and thus are able to provide reliable genetic context. A combination of long- and short-read technologies therefore constitutes a particularly promising approach for future metagenomics that has the potential to significantly advance the field.

Read length, error rate and throughput/coverage of NGS technologies determine the resolution at which we can investigate the gene inventories of natural microbial communities, in much the same way as the magnification and aberrations of optical microscopes determine the resolution at which microbes can be observed directly. In this respect, advances in sequencing technologies will continue to shape the field of metagenomics and extend our possibilities to address habitats of increasing complexity.

ASSEMBLY

It is a non-trivial question whether to assemble a metagenome. An assembly yields larger genomic fragments that allow for the study of gene arrangements. Valuable functional knowledge can be deduced from gene neighborhoods, e.g. when a gene of unknown function always appears together with a gene whose function is well known [32, 33]. Large-scale systematic investigations of such gene syntenies across metagenomes have the potential to uncover as yet unknown functional couplings.

Assembly of sequences from metagenomic libraries can result in good draft or even complete genomes when the target species shows little intraspecies variation, but this usually requires a substantial amount of sequencing. For example, massive sequencing allowed Pelletier et al. [34] to obtain a good draft genome of ‘Candidatus Cloacamonas acidaminovorans’ from a wastewater anaerobic digester. Similarly, Hess et al. [35] were able to reconstruct 15 draft genomes by direct assembly of a cow rumen metagenome. Erkel et al. [36] could even obtain a complete genome of a methanogenic archaeon of the Rice Cluster I clade by direct assembly of a rice soil-derived metagenome, albeit not from a sample with natural diversity but from an enrichment. Similarly, Iverson et al. [37] succeeded in retrieving a complete genome of a marine group II euryarchaeon from an extensive sea surface water metagenome, even though the genome was represented by less than 2% of the reads. Spiking experiments of metagenomes with a pure culture isolate have suggested that a genome with little intraspecies variation can be retrieved from a metagenome when it is covered at least 20-fold [38].

Although assembly does yield longer sequences, it also bears the risk of creating chimeric contigs, in particular in habitats with closely related species or highly conserved sequences that occur across species (for example as a result of high transposase, phage and lateral gene transfer activities). Furthermore, assembly distorts abundance information, as overlapping sequences from an abundant species will be identified as belonging to the same genome and consequently joined. This leads to a relative underrepresentation of sequences of abundant species. Hence, gene frequencies are better compared on the basis of read representation rather than on the basis of assemblies. An alternative is to back-trace all reads that constitute a given contig (or gene), either by direct mapping of the reads onto the assemblies (Figure 1) or by extracting the respective information from the assembly ACE file.
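A minimal sketch of such read back-tracing is shown below: it counts mapped reads per contig from a SAM file so that relative abundances can be compared on the basis of reads rather than assemblies. The file name is a placeholder, and a production pipeline would rather use samtools or pysam on sorted, indexed BAM files.

```python
# Minimal sketch of recovering per-contig read counts from a SAM mapping file,
# so that gene or taxon frequencies can be compared on the basis of reads
# rather than assembled sequences.

from collections import Counter

def reads_per_contig(sam_path):
    counts = Counter()
    with open(sam_path) as sam:
        for line in sam:
            if line.startswith("@"):            # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, contig = int(fields[1]), fields[2]
            if contig == "*" or flag & 0x4:      # unmapped read
                continue
            if flag & 0x100 or flag & 0x800:     # ignore secondary/supplementary
                continue
            counts[contig] += 1
    return counts

counts = reads_per_contig("assembly_mapping.sam")   # placeholder file name
total = sum(counts.values())
for contig, n in counts.most_common(10):
    print(f"{contig}\t{n}\t{n / total:.4f}")         # count and relative fraction
```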

Assemblers yield similar results when the coverage is high, but our own experience indicates that at low coverage, the assembler and its settings can have a notable effect. Furthermore, conventional assemblers are mainly built to assemble all reads into a single sequence, whereas metagenomics strives to keep the sequences of different organisms separate. In addition, metagenome assemblers need to be more fault tolerant than genome assemblers to account for strain-level genomic heterogeneity, which in turn elevates the risk of chimeric assemblies. Dedicated metagenome assemblers that try to address these problems are Genovo [39], Meta-IDBA [40], MetaVelvet [41] and MAP [42]. The first three of these are intended for short-read data, whereas MAP also handles longer reads as they are produced by current 454 FLX+ pyrosequencers. All four assemblers are claimed to yield longer assemblies and a more faithful taxonomic representation than conventional assemblers. A dedicated stand-alone metagenome scaffolder that can be used to post-process the unitig graphs of other assemblers is BAMBUS2 [43].

One particular problem is that the increasing throughput of NGS platforms imposes challenges on assembly, in particular with respect to memory requirements. A current Illumina HiSeq2000 sequencer can generate 600 Gb in a single run, and even higher-throughput technologies are to be expected in the near future. As a result, metagenomics is currently experiencing a split between smaller, more targeted projects with assemblies and large-scale projects without assemblies. The trend in metagenomics toward tremendous data scales was anticipated even before second- and third-generation NGS platforms became available and has been termed ‘megagenomics’ [44]. Such megagenome projects, as for example the Human Microbiome [45] and Earth Microbiome [46, 47] Projects, require dedicated bioinformatic post-processing and data integration pipelines, some of which have yet to be developed.

GENE PREDICTION

Many conventional gene finders require longer stretches of sequence to discriminate coding from non-coding sequences. Furthermore, many gene finders require training sequences from a single species, which are subsequently used to build a species-specific gene prediction model. This is unsuitable for metagenomes, which constitute a mixture of sequences from different organisms and often comprise only a limited number of long contigs alongside mainly short assemblies and unassembled reads. Furthermore, partial genes lacking proper gene starts, stops or even both must be predicted. In addition, metagenomes (in particular at low coverage) are often riddled with frame shifts. This makes gene prediction for metagenomes a non-trivial task [48].
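The partial-gene problem can be illustrated with a naive six-frame scan for stop-to-stop open reading frames: stretches that run off either end of a short fragment lack a confirmed start and/or stop. This is a didactic sketch only, not a substitute for dedicated metagenome gene finders.

```python
# Naive six-frame ORF scan illustrating why short metagenomic fragments yield
# partial genes: stop-to-stop stretches that touch a fragment end may continue
# beyond the fragment and thus lack a proper start and/or stop.

STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_frame(seq, min_len=60):
    """Yield (start, end, touches_5prime_end, touches_3prime_end) per frame."""
    start = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if i - start >= min_len:
                yield start, i + 3, start == 0, False
            start = i + 3
    if len(seq) - start >= min_len:                 # runs off the 3' end
        yield start, len(seq), start == 0, True

def six_frame_orfs(read, min_len=60):
    for strand, seq in (("+", read), ("-", revcomp(read))):
        for frame in range(3):
            for s, e, p5, p3 in orfs_in_frame(seq[frame:], min_len):
                yield strand, frame, s + frame, e + frame, p5, p3

read = "ATG" + "GCA" * 40 + "TAA" + "GGC" * 10      # toy 156 bp fragment
for orf in six_frame_orfs(read):
    print(orf)
```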

Dedicated gene prediction programs have been developed for metagenomes, such as MetaGene [49], MetaGeneAnnotator [50], Orphelia [51, 52] and FragGeneScan [53]. All these programs have been built for short reads, but they follow different approaches (such as machine learning techniques and Markov models) and differ in the precision of ribosomal binding site detection (and thus correct start prediction) and in their tolerance for sequencing errors.

No matter how you look at it, the quality of gene predictions in microbial metagenome data sets is inferior to that of sequenced genomes. Combining multiple gene finders, screening intergenic regions for overlooked genes and using dedicated frameshift detectors [54, 55] are common strategies to overcome at least some of these limitations.

TAXONOMIC CLASSIFICATION AND BINNING

One of the key problems of current metagenomics is to assign the obtained sequences and their gene functions to dedicated taxa in the habitat. Phylogenetic marker genes are sparse and thus allow taxonomic assignment of only a minor portion of the sequences. Hence, other approaches are needed that can partition metagenomes into taxonomically distinct bins (taxobins) that provide taxon-specific gene inventories with ecologically indicative functions.

A number of such approaches have been developed that can be categorized into classification and binning approaches. Classification approaches assign taxonomies based on similarities between metagenomic sequences and sequences of known taxonomy. Binning approaches work intrinsically (i.e. without reference sequences) and cluster sequences based on compositional characteristics. In general, one can discriminate methods that operate on the level of protein sequences (gene-based classification), on the level of intrinsic DNA characteristics (signature-based binning/classification) and those that map DNA reads to reference sequences (mapping-based classification).

Gene-based classification

Gene-based classification requires that all full and partial protein-coding regions of the metagenomic sequences be translated into their corresponding protein sequences. There are two main approaches.

The first is to use conventional basic local alignment search tool (BLAST) searches [56] against protein databases such as the non-redundant NCBI database or UniProt [57] and to derive taxonomic information from the resulting hits. This can be done either by constructing a multiple sequence alignment from the best matching hits with subsequent phylogenetic reconstruction, as implemented in Phylogena [58], or it can be done directly based on the BLAST results. Of course, as BLAST is a heuristic for fast sequence database searches and not a phylogenetic algorithm per se, the top BLAST hit does not necessarily agree with the taxonomic affiliation of the gene in question [59]. However, it has been shown that post-processing a larger number of BLAST hits can reveal useful taxonomic assignments, for example by using consensus information as implemented in the lowest common ancestor algorithm of MEGAN [60, 61] or the Darkhorse [62] and Kirsten [19] algorithms.
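The consensus idea behind the lowest common ancestor approach can be sketched in a few lines: each retained BLAST hit contributes the lineage of its subject sequence, and the query is assigned to the deepest rank on which the hits (or a defined majority of them) agree. The lineages below are illustrative, and the majority threshold is a simplification of what tools such as MEGAN actually implement.

```python
# Minimal sketch of a lowest-common-ancestor (LCA) style consensus over the
# taxonomic lineages of BLAST hits. With min_fraction=1.0 this is a strict
# LCA; lower values give a majority-rule variant.

def lowest_common_ancestor(lineages, min_fraction=1.0):
    """Each lineage is a list ordered from domain down to lower ranks."""
    if not lineages:
        return []
    consensus = []
    for rank in range(min(len(l) for l in lineages)):
        taxa = [l[rank] for l in lineages]
        best = max(set(taxa), key=taxa.count)
        if taxa.count(best) / len(taxa) >= min_fraction:
            consensus.append(best)
        else:
            break
    return consensus

hits = [  # illustrative lineages of the best BLAST hits for one gene
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Alteromonadales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Vibrionales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))      # ['Bacteria', 'Proteobacteria']
```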

A second possibility is to infer taxonomic information from HMMer searches against Pfam models [63] as implemented in CARMA [64, 65] or Treephyler [66]. The principle of CARMA is to align a sequence hitting a Pfam model with the model’s curated seed alignment, construct a neighbor-joining tree from the alignment and use this tree to infer the sequence’s taxonomy. Treephyler follows a similar approach but uses speed-optimized Pfam domain prediction and treeing methods. Both approaches provide more accurate classifications than those based on BLAST, but they work for fewer sequences, as Pfam hits are less frequent than BLAST hits (typically ∼20% of the genes).

Signature-based binning/classification

DNA base compositional asymmetries carry a weak but detectable phylogenetic signal [67] that is most pronounced in the patterns of statistical over- and underrepresentation of tetra- to hexanucleotides [68]. Various algorithms have been used to discriminate this signal from the DNA-compositional background noise and to use it for taxonomic inference, e.g. simple [67, 69–71] and advanced Markov models such as interpolated context models (ICMs) [72], Bayesian classifiers [73] and machine-learning algorithms such as support vector machines (SVMs) [68], kernelized nearest-neighbor approaches [74] and self-organizing maps (SOMs) [75–81]. Also, weighted PCA-based [82], Spearman distance-based [83, 84] and Markov Chain Monte Carlo-based [85] assessments of oligomer counts have been used. As the information that DNA composition-based methods rely on is a function of sequence length, most of these methods deteriorate below 3–5 kb and perform poorly on sequences shorter than 1 kb. Nonetheless, methods have been developed for successful binning of short reads as they are produced by Illumina machines, e.g. AbundanceBin, which is an unsupervised l-tuple-abundance-based clustering method [86]. Recently, a signature-based method has been developed for fast taxonomic profiling of metagenomes that is independent of length and can be used with very short reads [87].
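The core of such signature-based methods, the DNA signature itself, is easy to compute. The sketch below derives tetranucleotide frequency vectors for two fragments and compares them by Pearson correlation; real binning tools add normalization against expected frequencies and more elaborate classifiers (SVMs, SOMs, ICMs), and the toy fragments used here are purely illustrative.

```python
# Sketch of a DNA signature computation: tetranucleotide frequency vectors
# for two sequence fragments, compared by Pearson correlation.

from itertools import product
import numpy as np

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]   # 256 tetramers
INDEX = {t: i for i, t in enumerate(TETRAMERS)}

def tetra_freqs(seq):
    counts = np.zeros(len(TETRAMERS))
    seq = seq.upper()
    for i in range(len(seq) - 3):
        tetramer = seq[i:i + 4]
        if tetramer in INDEX:               # skip windows with ambiguous bases
            counts[INDEX[tetramer]] += 1
    return counts / max(counts.sum(), 1)    # relative frequencies

def signature_correlation(seq_a, seq_b):
    a, b = tetra_freqs(seq_a), tetra_freqs(seq_b)
    return float(np.corrcoef(a, b)[0, 1])

# Toy fragments; in practice the signal only becomes reliable for fragments
# of several kilobases.
frag1 = "ATGGCGCGCATTT" * 300
frag2 = "ATGGCGCGCATTA" * 300
print(signature_correlation(frag1, frag2))
```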

Mapping-based classification

Sequenced genomes have been used as references with known taxonomies for read recruitment in metagenome studies [18]. This approach is particularly useful for habitats with species that have closely related sequenced relatives. A variant of this approach is to use habitat-specific sets of reference genomes for competitive metagenome read mapping [19, 88]. Such sets can be compiled using the EnvO-lite environmental ontology [89]. Low-quality repetitive reads should be excluded from the mapping using tools such as mreps [90], and phages should be masked to minimize misclassifications. The mapping itself can be done with tools such as SSAHA2 [91] or its successor SMALT (http://www.sanger.ac.uk/resources/software/smalt). The mapping information of the reads constituting a contig can subsequently be combined into a taxonomy consensus.
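Such a per-contig consensus can be as simple as a majority vote with an abstention threshold, as in the following sketch; the read-to-taxon assignments and the threshold are made-up examples.

```python
# Sketch of combining read-level mapping assignments into a per-contig
# taxonomy consensus by majority vote, with an abstention threshold to avoid
# over-confident calls.

from collections import Counter

def contig_consensus(read_taxa, min_fraction=0.6):
    """read_taxa: taxon labels of the reads that map to one contig."""
    counts = Counter(read_taxa)
    taxon, n = counts.most_common(1)[0]
    return taxon if n / len(read_taxa) >= min_fraction else "unresolved"

reads_on_contig = ["Rhodobacteraceae"] * 14 + ["Flavobacteriaceae"] * 3
print(contig_consensus(reads_on_contig))   # Rhodobacteraceae
```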

Combinatory classification

All aforementioned methods have specific advantages and disadvantages, and all are limited by the amount of information that can be retrieved from a sequence. Although protein-based methods tend to be more accurate than DNA-based methods, especially on shorter sequences or even reads, they can only classify sequences with existing homologues in public databases. Unfortunately, this is not the case for a large fraction of the genes within environmental microorganisms that are typically the focus of metagenomic studies. At least half of the genes of newly sequenced environmental microbes lack dedicated known functions, and a large proportion of these genes are hypothetical or conserved hypothetical genes that have either no or insufficient homologues with known taxonomic affiliation. This limitation does not apply to DNA-based methods, which, for their part, have other limitations. For example, methods such as ICMs, SVMs or SOMs need to be pre-trained, which is computationally expensive and must be done continually to keep pace with the fast-growing amount of new sequences. On the other hand, pre-training might lead to better prediction accuracy, especially when there is prior knowledge about a habitat’s biodiversity that allows restriction to a dedicated set of training sequences.

In general, DNA-based methods suffer much more than protein-based methods from a decrease in prediction accuracy when sequences get shorter, even though good classification accuracies have been reported for sequences of ∼100 bp [72, 86]. One must, however, critically reflect that these results have been obtained either with simulated metagenomes or with real metagenomes of rather low complexity that are not representative of many environmental settings. Signature-based classifications hence work best with sequences from habitats of low to medium diversity, where ideally longer assemblies can be obtained, or with habitats that feature species with a pronounced DNA composition bias.

Mapping-based classification is the most precise but is often hampered by the lack of suitable reference sequences to map to. As of this writing, 3171 genomes have been completed and 10 536 are ongoing according to the Genomes OnLine Database [92, 93]. Although this number seems impressive, entire clades of the microbial tree are not represented and others only poorly. However, this issue is becoming less and less limiting, as targeted sequencing of as yet unsequenced taxa, such as in the GEBA project [94], and large-scale metagenome projects, such as the Earth Microbiome Project [46, 47], start to deliver large quantities of microbial genomes at an increasing pace. Hence, read mapping to closely related reference genomes might become the main method for metagenome taxonomic classification in the not too distant future.

As of today, there is no standard for the taxonomic classification of metagenome sequences. Also, taxonomic sequence classification can be error prone, in particular for habitats with a complex diversity or high proportions of as yet barely characterized taxa (e.g. [88]). Rather than using a single method, a combination of individual methods is currently the most reasonable approach to partition metagenomes into taxobins. Such combinations have, for example, been implemented in PhymmBL, which combines ICMs and BLAST [72], and in CARMA3 [95], which combines the original CARMA approach with BLAST. In both cases, the combination has been shown to lead to increased classification accuracy. A combination of BLAST-, CARMA-, SOM- and 16S rRNA gene fragment-based classification termed ‘Taxometer’ was used in recent metagenome studies [19, 88]. Also, different binning methods have been successfully combined to improve accuracy [96]. Besides combining different methods, it has recently been shown that combining multiple related metagenomes in a joint analysis is a way to improve binning accuracy [97].

One interesting aspect of taxonomic sequence classification is that it allows extrapolation onto relative taxon abundances. Although abundance information is lost in the assembly process due to the merging of similar sequences, it can be recovered either from taxonomically classified reads or by back-mapping of reads onto taxonomically classified assemblies (Figure 1). It has been shown that relative abundances obtained this way can be close to quantitative cell-based abundance assessments by CARD-FISH [19].

Pre-assembly taxonomic classification and binning

Binning and taxonomic classification methods are typically applied after the assembly. However, these methods can also be used prior to assembly to partition reads into taxonomic bins, which has the potential to substantially reduce the complexity of metagenome assemblies. This strategy might be particularly useful when sequences from the habitat (e.g. fosmids) are already available that can serve as seeds in an iterative binning-assembly procedure.

BIODIVERSITY ESTIMATION BY 16S rRNA GENE ANALYSIS

About one in every few thousand genes in a metagenome data set is a 16S rRNA gene. With 454 pyrosequencing, this typically translates to ∼1000 reads per picotiter plate (∼1 million reads) that harbor partial 16S rRNA genes of sufficient length and quality for phylogenetic analysis. Depending on the length and region of the retrieved partial 16S rRNA gene sequence, phylogenetic analysis can reach varying taxonomic depths. However, since the introduction of the 454 FLX+ chemistry, a substantial fraction of the respective reads allows for a genus-level assignment, and this situation is expected to improve further with future increases in pyrosequencing read length. A limitation with pyrosequencing is that the number of obtained high-quality 16S rRNA genes might not be sufficient for a representative biodiversity estimation, particularly not for low-abundance taxa. Illumina does not have this problem due to its much higher throughput, but it is on the other hand plagued by its comparatively short reads, which can compromise the depth and quality of the taxonomic assignments.

Dedicated analysis frameworks [98] have been proposed for clustering such data into operational taxonomic units (OTUs). Representative sequences for OTUs can subsequently be mapped against a 16S rRNA reference tree for classification [2, 99]. The advantage of this method over 16S rRNA gene clone libraries is that no primers are involved and hence no primer bias exists (A. Klindworth et al., submitted for publication). The disadvantage, besides not obtaining full-length high-quality 16S rRNA gene sequences, is that different taxa harbor different numbers of rRNA operons, which can distort metagenomic 16S rRNA gene abundances. For example, some Planctomycetes feature large genomes but only a single disjoint rRNA operon [100], which would lead to an underestimation of their abundance in relation to average-sized genomes with more rRNA operons. These limitations notwithstanding, analysis of metagenomic partial 16S rRNA genes provides a direct way to assess a habitat’s biodiversity that, in the case of 454 FLX+, often provides a resolution down to the genus level. The resulting information is essential for identifying misclassifications in the taxonomic classification of other sequences (as outlined earlier) and for identifying taxa that were missed in the taxonomic classification process.
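If per-taxon rRNA operon copy numbers are known or can be approximated, the distortion can be partially corrected by dividing fragment counts by copy number. The sketch below illustrates this; the counts and copy numbers are invented placeholder values.

```python
# Sketch of correcting metagenomic 16S rRNA gene fragment counts for rRNA
# operon copy number, so that taxa with many operons are not overestimated.
# The copy numbers below are illustrative placeholders; real analyses would
# take them from genome annotations or dedicated copy-number resources.

fragment_counts = {"Planctomycetes": 12, "Gammaproteobacteria": 480,
                   "Flavobacteriia": 210}
operon_copies   = {"Planctomycetes": 1, "Gammaproteobacteria": 6,
                   "Flavobacteriia": 3}                    # assumed values

corrected = {taxon: n / operon_copies[taxon]
             for taxon, n in fragment_counts.items()}
total = sum(corrected.values())
for taxon, value in corrected.items():
    print(f"{taxon}\t{value / total:.2%}")                 # corrected shares
```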

FUNCTIONAL ANALYSIS

Analysis of metagenomes involves functional annotation of the predicted genes by database comparison searches. This typically includes protein BLAST searches against databases such as SWISSPROT, NCBI nr or KEGG [101], HMMer searches against the Pfam [102] and TIGRfam [103] databases, as well as predictions of tRNA [104] and rRNA [104] genes, signal peptides [105], transmembrane regions [106, 107], CRISPR repeats [108] and sub-cellular localization (e.g. using CoBaltDB [109], GNBSL [110], PSLpred [111], CELLO [112] or PSORT-B [113]). Also, dedicated databases are available for special functions, for example the CAZY [111, 114] and dbCAN (http://csbl.bmb.uga.edu/dbCAN/) databases for carbohydrate-active enzymes, the TSdb [115] and TCDB [116, 117] databases for transporters and the MetaBioMe [115] database for enzymes with biotechnological prospects. The resulting annotations are then used as a basis for functional data mining including metabolic reconstruction. Dedicated metagenome annotation systems have been developed to aid these tasks, e.g. WebMGA [118], IMG/M [119–121] and MG-RAST [61, 122, 123]. All three have expanded beyond mere annotation systems and continue to add useful features such as biodiversity analysis, taxonomic classification and metagenome comparisons.

For the latter, a number of dedicated comparison tools have been developed as well, including METAREP [124], STAMP [125], CoMet [126] and RAMMCAP [127]. METAREP and STAMP do not take sequences as input but already pre-processed data—tabulated annotations (such as gene ontology (GO) terms, enzyme commission numbers, Pfam hits and BLAST hits) in the case of METAREP and a contingency table of properties (for example exports from Metagenomics Rapid Annotation using Subsystems Technology (MG-RAST), Integrated Microbial Genomes (IMG)/M or CoMet) in the case of STAMP. Both tools feature various statistical tests and visualizations. METAREP is a web service developed by the J. Craig Venter Institute that can compare 20 or more metagenomes, whereas STAMP is a stand-alone software. The CoMet and RAMMCAP web servers, in contrast, do not require pre-computed data. CoMet takes sequence files as input, performs an Orphelia gene prediction, subsequently runs HMMer against the Pfam database, and follows this with multi-dimensional scaling and hierarchical clustering analyses of the Pfam hits and associated GO terms as well as visualization of the data. RAMMCAP takes raw reads as input, performs a six-frame open reading frame (ORF) prediction, clusters reads and ORFs, performs HMMer- and BLAST-based annotation and allows comparison of the data, e.g. via similarity matrices. RAMMCAP is part of the CAMERA data portal [128, 129], which currently comes closest to an integrative processing pipeline for metagenomes, with various tools for data retrieval, upload, querying and analysis.
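The kind of statistical comparison that such tools perform on contingency tables can be sketched for a single functional category: the example below tests whether one Pfam family is over-represented in one metagenome relative to another using Fisher's exact test. The counts are invented, and a real analysis would additionally correct for multiple testing across all families.

```python
# Sketch of a contingency-table comparison between two metagenomes: test
# whether a Pfam family is over-represented in sample A relative to sample B.

from scipy.stats import fisher_exact

def compare_family(hits_a, total_a, hits_b, total_b):
    table = [[hits_a, total_a - hits_a],
             [hits_b, total_b - hits_b]]
    odds_ratio, p_value = fisher_exact(table)
    return odds_ratio, p_value

# Hypothetical counts of genes hitting one Pfam family out of all annotated
# genes in two metagenomes.
odds, p = compare_family(hits_a=85, total_a=40_000, hits_b=31, total_b=38_000)
print(f"odds ratio {odds:.2f}, p = {p:.2e}")
```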

Although automatic in silico annotation is essential for metagenome analysis, one should not forget that a substantial proportion of such annotations is erroneous. Aside from well-studied pathways of the core metabolism, automatic annotations are also often unspecific, i.e. restricted to assigning general functions (e.g. lipase, oxidoreductase, alcohol dehydrogenase) without resolving the specific substrates and products involved. This reflects a fundamental lack of knowledge rather than a limitation of bioinformatic methods per se and can only be addressed by future high-throughput functional screening pipelines.

One of the intriguing aspects of metagenomics is that typically about half of the genes in a metagenome have as yet unknown functions. Hence, restricting metagenome analyses to genes with functional annotations equates to ignoring a large proportion of the genes. As a solution, it has been proposed to cluster and analyze metagenomic ORFs in a similar way as OTUs in biodiversity analyses. Such clusters have been termed operational protein families and can be analyzed, for example, with MG-DOTUR [130].
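The underlying clustering step can be sketched as a greedy grouping of predicted proteins against cluster representatives. In the sketch below, the difflib similarity ratio is a crude stand-in for proper alignment-based identity, and the sequences and threshold are illustrative only.

```python
# Sketch of grouping predicted proteins into 'operational protein families'
# by greedy clustering against cluster representatives, in the spirit of
# CD-HIT-style clustering. The difflib ratio merely approximates identity.

from difflib import SequenceMatcher

def greedy_cluster(proteins, threshold=0.7):
    clusters = []                                  # list of [representative_seq, member_names]
    for name, seq in proteins:
        for rep_seq, members in clusters:
            if SequenceMatcher(None, seq, rep_seq).ratio() >= threshold:
                members.append(name)
                break
        else:                                      # no sufficiently similar cluster found
            clusters.append([seq, [name]])
    return [members for _, members in clusters]

proteins = [("orf1", "MKLVVTGAAGQIG"), ("orf2", "MKLVVSGAAGQIG"),
            ("orf3", "MTEYKLVVVGAGG")]
print(greedy_cluster(proteins))                    # orf1 and orf2 cluster together
```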

AUTOMATION, STANDARDIZATION AND CONTEXTUAL DATA

Until recently, the capacity to sequence was the limiting factor for metagenome analysis. However, the continual increase in sequencing capacity and decline in costs have meanwhile turned post-metagenomic data analysis into the main bottleneck. Although progress in sequencing technologies still continues at an exponential pace, the individuals who analyze the data do not scale equally well. As a consequence, the cost of sequencing drops continuously, whereas the costs for bioinformatic data analysis go up [131]. This is still not recognized widely enough, as metagenome projects tend to suffer from insufficient resource allocation for data post-processing. This stresses the need for further development of semi-automated metagenome analysis tools that allow scientists to handle the wealth of data that recent metagenomics produces. Steps in metagenome analyses that can be automated should be automated to ensure quality, but this requires the establishment of commonly accepted data formats for metagenome sequences and their associated contextual (meta)data, as well as defined interfaces for data exchange and integration—a task that is tackled by the Genomic Standards Consortium (GSC, http://gensc.org) [132]. Contextual data are among the key factors for successful metagenome analyses, in particular when it comes to the interpretation of time series or biogeographic data. Contextual data are all the data that are associated with a metagenome, such as the habitat description (including geographic location and common physicochemical parameters) and the sampling procedure (including sampling time). The GSC has published standards for the minimum information about a metagenome sequence (MIMS) [133] as part of the minimum information about any sequence (MIxS) standards and checklist [134], which are supported by the International Nucleotide Sequence Databases Collaboration (INSDC). Similarly, standards have been devised in terms of data formats to ensure data inter-operability, such as the genomic contextual data markup language [135]. It is important that contextual data are collected and integrated into databases, because in the long run these data will allow the extraction of correlations between geography, time, prevailing environmental conditions and functions from metagenomic data that would otherwise never be uncovered [136].
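A minimal way to keep such contextual data machine-readable alongside the sequence data is a simple structured record, as sketched below. The field names only loosely follow the MIxS/MIMS checklists and are illustrative; a real submission would use the official checklist terms and controlled vocabularies such as EnvO.

```python
# Sketch of capturing contextual (meta)data for a metagenome sample in a
# simple machine-readable record. Keys and values are illustrative and only
# loosely follow the MIxS/MIMS checklists.

import json

sample_record = {
    "project_name": "example coastal time series",       # placeholder
    "investigation_type": "metagenome",
    "collection_date": "2012-04-12T10:30:00Z",
    "lat_lon": {"latitude": 54.18, "longitude": 7.90},    # assumed coordinates
    "env_biome": "marine biome",
    "env_material": "sea water",
    "depth_m": 1.0,
    "temperature_c": 8.4,
    "salinity_psu": 32.1,
    "seq_method": "454 GS FLX+ pyrosequencing",
}

with open("sample_contextual_data.json", "w") as fh:      # placeholder file name
    json.dump(sample_record, fh, indent=2)
```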

As of today, there is no comprehensive tool for metagenome analysis that incorporates all types of analysis (biodiversity analysis, taxobinning, functional annotation, metabolic reconstruction and sophisticated statistical comparisons). Consequently, scientists and bioinformaticians in this field need to operate various tools and to merge and interpret their results, which for larger data sets can be a daunting task. In terms of pipelines, MetAMOS [137] provides an integrated solution for the initial post-processing of mated read metagenome data that supports different assemblers, the BAMBUS2 scaffolder and various gene prediction, annotation and taxonomic classification tools. In terms of data integration, CAMERA has so far developed the most comprehensive infrastructure for holistic metagenome analyses, and further tools and pipelines are currently being developed within the GSC and Micro B3 project (http://www.microb3.eu/) frameworks.

DATA SUBMISSION

Progress in sequencing allows for metagenomes of ever increasing size. A full run on an Illumina HiSeq2000 sequencer not only produces 600 Gb of sequence, but the raw FASTQ data files are also multiple times as large. Although sequencing facilities send these data to their customers on terabyte-scale hard disc drives, such data volumes are no longer suitable for upload to data analysis servers or to the INSDC databases for submission, not even with fast user datagram protocol (UDP)-based transfer tools such as Aspera Connect [138]. This is a problem that is as yet unsolved. Also, the INSDC databases are currently not prepared for handling the quality issues of many metagenomes (pervasive frameshifts, automatically generated non-standard annotations and large numbers of partial genes) and their accessory data (such as lists of metagenomic 16S rRNA gene fragments including taxonomic classifications). It is clear that sequencing technologies currently evolve faster than bioinformatic infrastructures for post-genomic analysis are built. As mentioned before, this has been recognized, and efforts such as those of the GSC, Micro B3, CAMERA, MG-RAST and IMG/M are under way to define standards and develop pipelines for future metagenome data handling. However, it is the authors’ conviction that the INSDC databases ultimately have the mandate to maintain such tools and the associated infrastructure in the long run, and should do so. The European Bioinformatics Institute has recognized this and has recently made a submission and analysis pipeline for 454/pyrosequencing metagenome data available (https://www.ebi.ac.uk/metagenomics).

FUTURE PERSPECTIVES

The newest generation of sequencers, such as the PacBio RS, the Ion Torrent Proton or the ONT GridION/MinION, will continue to propel the field of metagenomics, and who knows whether at some point in the future technologies such as the conceptual IBM/Roche DNA transistor [139] will revolutionize the field again. On the one hand, development of affordable bench-top devices (454 Junior, Illumina MiSeq, Ion Torrent PGM and Proton) has led to a democratization of sequencing, and future devices such as the ONT MinION could even be used for metagenomic analyses directly in the field. On the other hand, the ever-growing throughput of NGS sequencers is making data analysis increasingly complex.

Although smaller and medium-sized metagenomes can be analyzed with the resources described so far, different infrastructures and bioinformatic pipelines are necessary for future large-scale projects. ‘Megagenome’ projects reach the size of many terabytes of sequence data (and beyond), and instead of moving these data around, it is reasonable that they reside at the sequencing institutions and that these institutions provide pipelines for remote data analysis. This implies that large-scale sequencing and large-scale computing have become inseparable. For example, the BGI (formerly Beijing Genomics Institute, http://en.genomics.cn) has planned an integrated national center for sample storage, sequencing, data storage and analysis. Monolithic data centers are one way to address this, but cloud computing services such as Amazon’s EC2/S3 can also be used as a viable and scalable alternative for large-scale metagenome data analysis [140], provided that the data can be transferred to the cloud and appropriate data security is guaranteed.

Metagenomics constitutes an invaluable tool for investigating complete microbial communities in situ, in particular when integrated with biodiversity, expression and contextual data (metadata). Continuous advancements in sequencing technologies not only allow for addressing more and more complex habitats but also impose growing demands on bioinformatic data post-processing. Not long ago, sampling and associated logistics, clone library construction and Sanger sequencing of a couple of inserts were the time-consuming steps in metagenomics. Nowadays, analyzing the wealth of data has become the bottleneck, in particular for larger metagenome projects. This stresses the importance of integrative bioinformatic software pipelines for metagenomics/megagenomics, something that we as scientists must support with all efforts to get the most out of metagenome data.

Key Points.

Funding

Funding was provided by the Max Planck Society.

ACKNOWLEDGEMENTS

The authors acknowledge Johannes Werner and Pelin Yilmaz for critical reading of the manuscript. Observations from the following publicly funded projects have been incorporated into this mini review: MIMAS (Microbial Interactions in Marine Systems; http://mimas-project.de), funded by the German Federal Ministry of Education and Research (BMBF), and MAMBA (Marine Metagenomics for New Biotechnological Applications; http://mamba.bangor.ac.uk/) and Micro B3 (Biodiversity, Bioinformatics, Biotechnology; http://www.microb3.eu/), both funded by the FP7 of the European Union.

Biographies

Hanno Teeling trained independently as a chemist and a biologist before specializing in bioinformatics for microbial genomics. He has worked in this field for 12 years and is currently a scientist at the Max Planck Institute for Marine Microbiology in Bremen.

Frank Oliver Glöckner entered the field of bioinformatics more than 15 years ago. He has specialized in tools and databases for microbial biodiversity analysis and microbial genomics. He is the head of the Microbial Genomics and Bioinformatics Research Group at the Max Planck Institute for Marine Microbiology and Professor of Bioinformatics at Jacobs University Bremen.

References