IMG/M: the integrated metagenome data management and comparative analysis system

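Victor M. Markowitz, I-Min A. Chen, Ken Chu, Ernest Szeto, Krishna Palaniappan, Yuri Grechkin, Anna Ratner, Biju Jacob, Amrita Pati, Marcel Huntemann, Konstantinos Liolios, Ioanna Pagani, Iain Anderson, Konstantinos Mavromatis, Natalia N. Ivanova, Nikos C. Kyrpides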

1Biological Data Management and Technology Center, Computational Research Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California, CA 94702 and 2Microbial Genomics and Metagenomics Program, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California, CA 94598, USA

*To whom correspondence should be addressed. Tel: +1 925 296 5718; Fax: +1 510 486-5812; Email: vmmarkowitz@lbl.gov


Received: 19 September 2011

Revision received: 12 October 2011

Accepted: 16 October 2011

Published: 15 November 2011


Victor M. Markowitz, I-Min A. Chen, Ken Chu, Ernest Szeto, Krishna Palaniappan, Yuri Grechkin, Anna Ratner, Biju Jacob, Amrita Pati, Marcel Huntemann, Konstantinos Liolios, Ioanna Pagani, Iain Anderson, Konstantinos Mavromatis, Natalia N. Ivanova, Nikos C. Kyrpides, IMG/M: the integrated metagenome data management and comparative analysis system, Nucleic Acids Research, Volume 40, Issue D1, 1 January 2012, Pages D123–D129, https://doi.org/10.1093/nar/gkr975

Abstract

The integrated microbial genomes and metagenomes (IMG/M) system provides support for comparative analysis of microbial community aggregate genomes (metagenomes) in a comprehensive integrated context. IMG/M integrates metagenome data sets with isolate microbial genomes from the IMG system. IMG/M's data content and analytical capabilities have been extended through regular updates since its first release in 2007. IMG/M is available at http://img.jgi.doe.gov/m. A companion IMG/M system provides support for annotation and expert review of unpublished metagenomic data sets (IMG/M ER: http://img.jgi.doe.gov/mer).

INTRODUCTION

The number of metagenome sequence data sets generated by various sequencing centers is rapidly increasing, with thousands of data sets already generated. Metagenome sequencing has evolved over the past several years from first generation Sanger (e.g. Applied Biosystems) platforms to second generation 454 Life Sciences Roche (e.g. GS FLX) and Illumina (e.g. GA II and HiSeq) platforms. While cheaper and faster, the new platforms produce shorter sequence fragments (reads). Short read size, higher complexity and inherent incompleteness make metagenome sequences difficult to assemble and annotate (1,2).

Assembled or unassembled metagenome data sets generated using 454 or Illumina platforms are processed by the IMG/M annotation pipeline (3) before inclusion into IMG/M. Unassembled reads undergo an additional quality control step that includes quality trimming, low-complexity region detection and masking, as well as removal of technical replicates. Subsequently, both assembled and unassembled sequences are annotated by the same pipeline, which detects CRISPR repeats (4), non-coding RNAs and protein-coding genes (coding sequences, CDSs). RNAs are predicted using tRNAscan-SE (5) for tRNAs and in-house developed HMM models for rRNAs (6,7,8), while the CDSs are identified using a combination of ab initio gene prediction tools: Prodigal (9), MetaGene (10), MetaGeneMark (11) and FragGeneScan (12). In addition, sequences in the range of 100–800 bp are compared to the IMG non-redundant protein database using BlastX in order to detect the CDSs missed by the ab initio tools. Conflicting gene predictions are consolidated using a weighted scheme based on the performance of each method on simulated data sets, with one final gene model generated for each region.
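The consolidation step can be pictured with a small sketch. The text does not describe the weighting scheme itself, so the following is only an illustration under stated assumptions: the per-tool weights are placeholder numbers, and the greedy "highest weight wins per overlapping region" rule is a simplification chosen for the example, not the pipeline's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    tool: str    # gene caller that produced the call
    start: int   # 1-based start coordinate on the sequence
    end: int     # inclusive end coordinate
    strand: str  # "+" or "-"


# Hypothetical weights. The text says weights reflect each method's
# performance on simulated data sets; these values are placeholders.
WEIGHTS = {"prodigal": 0.9, "metagenemark": 0.8, "metagene": 0.7, "fraggenescan": 0.6}


def overlaps(a: Prediction, b: Prediction) -> bool:
    """True when the two calls share at least one base."""
    return a.start <= b.end and b.start <= a.end


def consolidate(predictions):
    """Keep one gene model per region (assumed greedy rule): visit calls
    from highest to lowest weight and drop any call that overlaps a call
    that has already been accepted."""
    ranked = sorted(predictions, key=lambda p: (-WEIGHTS.get(p.tool, 0.0), p.start))
    accepted = []
    for call in ranked:
        if not any(overlaps(call, kept) for kept in accepted):
            accepted.append(call)
    return sorted(accepted, key=lambda p: p.start)


calls = [
    Prediction("metagene", 10, 300, "+"),
    Prediction("prodigal", 40, 300, "+"),
    Prediction("fraggenescan", 500, 800, "-"),
]
final_models = consolidate(calls)
```

Here the two overlapping calls on the first region collapse to the call from the tool with the larger weight, while the non-overlapping call on the second region is kept unchanged.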

Analysis of the aggregate genomes (metagenomes) of microbial communities (microbiomes) considers the questions of phylogenetic composition and functional or metabolic potential within individual microbiomes, as well as comparisons across microbiomes. IMG/M provides support for such analysis by integrating metagenome data sets with isolate microbial genomes from the integrated microbial genomes (IMG) system (13). Using NCBI’s RefSeq (14) as its main source of sequence data, IMG integrates draft and complete microbial genomes from all three domains of life with a large number of plasmids and viruses. Similar to IMG, IMG/M records the primary sequence information for isolate genomes and metagenomes, their organization in scaffolds and/or contigs, as well as computationally predicted protein-coding sequences and RNA-coding genes. Protein-coding genes are characterized in terms of additional annotations, such as conserved motifs and domains (15), signal peptides, transmembrane helices (16), pathways and orthology relationships, which may serve as an indication of their functions. These annotations are based on diverse data sources, such as Clusters of Orthologous Genes (COG) clusters and functional categories (17), Pfam (18), TIGRfam and TIGR role categories (19), InterPro domains (20) and KEGG (Kyoto Encyclopedia of Genes and Genomes) Ortholog terms and pathways (21).

We review below IMG/M's data content growth and analysis tool extensions since the last published report on IMG/M (22).

DATA CONTENT

Reference genome data

IMG is the source of IMG/M's reference isolate genomes. The current version of IMG/M is based on the content of IMG 3.4 (V.M. Markowitz et al., submitted publication) consisting of 6891 bacterial, archaeal, eukaryotic and viral genomes, as well as 1186 plasmids that did not come from a specific microbial genome sequencing project, with over 11.6 million protein coding genes.

Genomes generated as part of the Human Microbiome Project (HMP) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) are of particular importance to metagenome analysis. HMP has generated over 800 reference genomes from both cultured and uncultured bacteria with the goal of supporting the characterization of microbial communities found at multiple human body sites (23). The GEBA project aims at systematically filling the sequencing gaps along the bacterial and archaeal branches of the tree of life (24), with the number of sequenced GEBA genomes standing at 205 as of August 2011. While HMP reference genomes are included into IMG/M from RefSeq via IMG, GEBA genomes are included directly into IMG/M as soon as their annotation is completed at the Joint Genome Institute (JGI), before their release through GenBank and RefSeq.

Metagenome data

Unlike isolate genomes, which are included into IMG and then IMG/M from a public sequence data resource (RefSeq), metagenome data sets are first included into the ‘Expert Review’ version of IMG/M, IMG/M ER. IMG/M ER allows scientists to employ IMG/M's annotation pipeline as well as review and curate the functional annotation of metagenomes prior to their public release, in the context of IMG/M's reference genomes and public metagenomes. Genome and metagenome submissions are handled by the IMG/ER and IMG/M ER submission site, as illustrated in Figure 1(i).

Figure 1. Metagenome data set classification and metadata characterization. (i) Metagenome data sets are submitted for annotation and inclusion into IMG/M ER via the IMG/ER and IMG/M ER submission site. (ii) Metagenome data sets in IMG/M are organized using a hierarchical classification similar to the phylogenetic classification of isolate genomes. (iii) Metagenome data sets submitted for inclusion into IMG/M ER are associated with metadata characterizing the metagenome study, the associated metagenome sequencing project, environmental information, as well as (iv) sample and sequencing information.

First, the names and classification of metagenome data sets submitted for inclusion into IMG/M ER are curated in GOLD (25) following the five-tiered system previously proposed (26). This classification scheme underlies the organization of metagenome data sets in IMG/M, as illustrated in Figure 1(ii). Similar to the phylogenetic classification of isolate genomes, the classification of metagenomes is a critical element for conducting metagenome comparative analysis in a rapidly growing universe of metagenome data sets. Thus, all metagenome data sets are organized into three main ecosystem classes (environmental, host-associated and engineered), which are further divided into subclasses characterized by ecosystem category (e.g. aquatic, terrestrial, air for environmental metagenomes), ecosystem type (e.g. freshwater, marine), ecosystem subtype (e.g. groundwater, drinking water) and specific ecosystem (e.g. cave water, filtered water). Second, metagenome data sets submitted for inclusion into IMG/M ER are associated with comprehensive metadata attributes following the Genomic Standards Consortium guidelines (27), as illustrated in Figure 1(iii) and 1(iv). Note that enforcing metadata characterization before metagenome data sets are processed is the most effective way to capture such information.
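As an illustration of how a five-tiered classification can drive a browsable hierarchy, the sketch below nests samples under the five levels named above. The record layout, field names and sample names are assumptions made for this example; they are not the GOLD or IMG/M schema.

```python
from typing import NamedTuple


class Ecosystem(NamedTuple):
    """One classification path; field names are hypothetical."""
    ecosystem: str  # environmental, host-associated or engineered
    category: str   # e.g. aquatic, terrestrial, air
    eco_type: str   # e.g. freshwater, marine
    subtype: str    # e.g. groundwater, drinking water
    specific: str   # e.g. cave water, filtered water


def build_tree(samples):
    """Nest sample names under the five classification levels.

    `samples` maps a sample name to its Ecosystem path. The result is a
    dict of dicts; each leaf holds a "samples" list.
    """
    root = {}
    for name, path in samples.items():
        node = root
        for level in path:
            node = node.setdefault(level, {})
        node.setdefault("samples", []).append(name)
    return root


# Hypothetical samples classified with the example values from the text.
samples = {
    "sample A": Ecosystem("Environmental", "Aquatic", "Freshwater", "Groundwater", "Cave water"),
    "sample B": Ecosystem("Environmental", "Aquatic", "Freshwater", "Drinking water", "Filtered water"),
}
tree = build_tree(samples)
freshwater_subtypes = sorted(tree["Environmental"]["Aquatic"]["Freshwater"])
```

Because every data set carries a complete path, selecting all samples under any node (for example, all freshwater samples) is a simple tree walk.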

As of 3 October 2011, IMG/M ER contains about 870 metagenome data sets (samples) with over 163 million protein coding genes that are part of 27 engineered, 110 environmental and 90 host-associated metagenome studies. IMG/M contains the publicly available subset of IMG/M ER metagenome data sets consisting of 289 metagenome data sets with over 60 million protein coding genes, a 10-fold increase compared to August 2007 (22). These data sets are part of 14 engineered, 37 environmental and 32 host-associated studies.

An HMP-specific version of IMG/M contains 748 metagenome data sets generated as part of the HMP initiative by sequencing samples collected from various body sites (airways, gastrointestinal, oral, skin and urogenital), with a total of 80 million protein-coding genes (http://www.hmpdacc-resources.org/cgi-bin/imgm_hmp/).

DATA ANALYSIS

We briefly review below the IMG/M data analysis tools with emphasis on the support for new metagenome analysis tools developed since the last published report on IMG/M (22).

Data selection and exploration

Metagenomes, genomes, genes and functions can be selected in IMG/M using IMG specific browsers and search tools (15), with the organization of metagenomes using the hierarchical classification discussed above and illustrated in Figure 1 being specific to IMG/M. Metagenomes and genomes that result from search operations are displayed as lists from which they can be selected for inclusion into the ‘Genome Cart’. Genes and functions can be handled in a similar manner using the ‘Gene Cart’ and ‘Function Cart’, respectively.

Individual metagenomes can be explored using the ‘Metagenome Details’ page that provides a variety of tools for browsing, searching for the presence of specific genes or downloading metagenome data sets, as illustrated in Figure 2(i). This page also provides information (metadata) on the metagenome together with various statistics of interest, such as the number of genes that are associated with KEGG, COG, Pfam, InterPro or enzyme information.

Figure 2. Metagenome data exploration. (i) Microbiome samples, such as the Sediment microbial communities from Lake Washington for Methane and Nitrogen Cycle sample, can be examined using the ‘Microbiome Details’ page, which provides tools for browsing, searching or downloading the metagenome data. (ii) ‘Scaffold Cart’ allows selecting individual scaffolds or groups of scaffolds based on properties such as gene content. (iii) The ‘Phylogenetic Distribution of Genes’ provides an estimate of the phylogenetic composition of a metagenome sample based on the distribution of the best BLAST hits of the protein-coding genes in the sample. The result of ‘Phylogenetic Distribution of Genes’ can be displayed using (iv) the ‘Radial Phylogenetic Tree’ viewer or (v) in a tabular format consisting of a histogram with counts of the protein-coding genes in the sample that have best BLASTp hits to proteins of isolate genomes in each phylum or class with >90% identity (right column), 60–90% identity (middle column) and 30–60% identity (left column). (vi) The organization of genes by their assignment to COGs is displayed in a pie chart format.

One of the ‘Browse’ tools provided for metagenomes allows examining scaffolds and contigs, whereas a new ‘Scaffold Cart’ allows selecting individual scaffolds (rather than all the scaffolds/contigs of a metagenome) or groups of scaffolds based on their properties, such as gene or GC content, scaffold length and read depth, as illustrated in Figure 2(ii), and thus focusing the analysis on subsets of metagenome sequences. ‘Scaffold Cart’ provides tools for including the genes of one or several scaffolds into the ‘Gene Cart’, associating a name with selected scaffolds for further analysis, computing a function profile across selected scaffolds, and examining the phylogenetic distribution of genes for one or several scaffolds in the cart.
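Property-based scaffold selection of this kind amounts to filtering scaffold records on a few computed attributes. The sketch below assumes a hypothetical in-memory `Scaffold` record and made-up thresholds; it is not the ‘Scaffold Cart’ implementation.

```python
from dataclasses import dataclass


@dataclass
class Scaffold:
    """Hypothetical scaffold record used only for this example."""
    scaffold_id: str
    sequence: str
    gene_count: int
    read_depth: float


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty one)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def select_scaffolds(scaffolds, min_length=0, min_genes=0,
                     gc_range=(0.0, 1.0), min_depth=0.0):
    """Return the scaffolds that satisfy every requested property filter."""
    low, high = gc_range
    return [
        s for s in scaffolds
        if len(s.sequence) >= min_length
        and s.gene_count >= min_genes
        and low <= gc_content(s.sequence) <= high
        and s.read_depth >= min_depth
    ]


scaffolds = [
    Scaffold("s1", "GGCCGGCCAT", gene_count=3, read_depth=12.0),
    Scaffold("s2", "ATATATATGC", gene_count=1, read_depth=4.0),
]
high_gc = select_scaffolds(scaffolds, min_length=10, gc_range=(0.5, 1.0))
```

The selected subset can then be passed to downstream steps, such as collecting its genes or computing a function profile over it.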

The ‘Phylogenetic Distribution of Genes’, illustrated in Figure 2(iii), provides an estimate of the phylogenetic composition of a metagenome sample based on the distribution of the best BLAST hits of the protein-coding genes in the sample. The result of ‘Phylogenetic Distribution of Genes’ can be displayed using the ‘Radial Phylogenetic Tree’ viewer, as illustrated in Figure 2(iv), or in a tabular format consisting of a histogram, as illustrated in Figure 2(v), with counts of the protein-coding genes in the sample that have best BLASTp hits to proteins of isolate genomes in each phylum or class with >90% identity (right column), 60–90% identity (middle column) and 30–60% identity (left column). This tabular display can be adjusted by filtering out the phyla/classes with few or no hits. The higher the number of hits and the percent identity cutoff, the more likely it is that the sample contains close relatives of the sequenced isolate genomes from this phylum/class. The CDSs with best BLAST hits to a certain taxonomic lineage can be organized by their assignment to COGs, which in turn can be classified according to COG Functional Categories or COG Pathways. The latter can be displayed in a tabular or pie chart format, as illustrated in Figure 2(vi), thereby linking the functional complement of metagenomic proteins with their likely affiliations to different phyla/classes and indicating possible functional specialization within the community (functional guilds). Gene counts in the various display formats of the results are linked to the corresponding lists of genes, which can then be selected and added to ‘Gene Cart’ or analyzed through their ‘Gene Pages’.
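The tabular view described above can be reproduced from a list of best hits by counting genes per phylum and identity bin. In this sketch the input tuples are hypothetical, and the handling of the exact boundary values (60% and 90%) is an assumption, since the text only names the three ranges.

```python
from collections import Counter


def identity_bin(percent_identity: float):
    """Map a percent identity to one of the three display bins.

    Boundary handling is assumed: 90 falls in "60-90%", 60 in "60-90%",
    and hits below 30% identity are not counted.
    """
    if percent_identity > 90:
        return ">90%"
    if percent_identity >= 60:
        return "60-90%"
    if percent_identity >= 30:
        return "30-60%"
    return None


def phylo_distribution(best_hits):
    """Count genes per phylum and identity bin.

    `best_hits` is an iterable of (gene_id, phylum, percent_identity)
    tuples, one per gene, describing that gene's best BLASTp hit.
    """
    table = {}
    for _gene_id, phylum, percent_identity in best_hits:
        bin_label = identity_bin(percent_identity)
        if bin_label is None:
            continue
        table.setdefault(phylum, Counter())[bin_label] += 1
    return table


# Hypothetical best hits for five genes.
hits = [
    ("g1", "Proteobacteria", 95.2),
    ("g2", "Proteobacteria", 72.0),
    ("g3", "Firmicutes", 45.5),
    ("g4", "Firmicutes", 12.0),
    ("g5", "Proteobacteria", 99.0),
]
distribution = phylo_distribution(hits)
```

Filtering out phyla with few hits, as the display allows, is then a matter of dropping rows whose counts fall below a chosen threshold.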

The ‘Radial Phylogenetic Tree’ tool allows the comparison of up to five user-selected metagenomes in terms of their BLAST hits to isolate genomes in a color-coded hierarchical circular tree. The resulting tree image can show the hits at different taxonomic levels. Additional hit statistics for each genome can be accessed by hovering the mouse over the nodes of the tree. Finally, the genes in a metagenome sample can be viewed in the context of an individual reference isolate genome using the ‘Protein Recruitment Plot’, which displays the BLASTp hits of the metagenome genes against the genes of the reference genome, with the scaffold coordinates of the reference genome and the BLAST percent identities shown on the X- and Y-axis, respectively.

Comparative analysis

Comparative analysis tools are an extension of the analogous tools in IMG (15), and allow examining the gene content and functional capabilities of microbial communities. We discuss below in more detail the main metagenome-specific comparative analysis tools available under the ‘Compare Genomes’ main menu tab of IMG/M, as shown in Figure 3(i).

Figure 3. Abundance profile and function comparison tools. The ‘Abundance Profile Search’ allows finding protein families (COGs and Pfams) in metagenomes and isolate genomes based on their relative abundance, such as (ii) finding all Pfams in the Sediment microbial communities from Lake Washington (Aerobic with added nitrate, 13C SIP) sample, which are at least twice as abundant as in the Sediment microbial communities from Lake Washington (Aerobic without added nitrate, 13C SIP) sample and are at least twice less abundant than in Sediment microbial communities from Lake Washington (Aerobic without added nitrate, SIP additional fraction). (iii) The ‘Abundance Profile Search Results’ consist of a list of protein families that satisfy the search criteria together with the metagenomes or genomes involved in the comparison and their associated raw or normalized gene counts. (iv) The ‘Function Category Comparison’ tool allows comparing a metagenome data set with other metagenome data sets or reference genome data sets in terms of the relative abundance of functional categories (COG Pathway, KEGG Pathway, KEGG Pathway Category, Pfam Category and TIGRfam Role Categories). (v) The result of ‘Function Category Comparison’ lists for each function category, F, the number of genes and estimated gene copies in the target (query) metagenome associated with F and for each reference genome/metagenome the number of genes or estimated gene copies associated with F, as well as an assessment of statistical significance in terms of associated P-value and d-rank.

Metagenome samples can be compared in terms of their phylogenetic composition using a variant of the ‘Phylogenetic Distribution of Genes’ tool discussed above, which is extended to allow displaying side by side the phylogenetic distribution of best BLAST hits of protein-coding genes in multiple metagenomes. Two ‘Abundance Profile’ tools allow comparing the functional capabilities of metagenomes and genomes. The ‘Abundance Profile Overview’ tool provides a quick estimate of the functional capabilities of metagenomes in terms of the relative abundance of protein families (COGs and Pfams) and functional families (Enzymes) across selected metagenomes and isolate genomes. The result of this comparison is displayed either as a heat map or in a matrix format, with each column on the map/matrix corresponding to a genome or metagenome, and each row corresponding to a family. Users can ‘drill down’ by following links to lists of genes assigned to a particular family in a specific genome or metagenome.
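The matrix form of such an overview, with one row per family and one column per genome or metagenome, can be sketched as follows. The input layout (a list of family identifiers per data set) and the data set names are assumptions for illustration only.

```python
from collections import Counter


def abundance_matrix(assignments):
    """Build a family-by-dataset matrix of gene counts.

    `assignments` maps a data set name to a list with one family
    identifier per assigned gene. Returns (families, datasets, rows),
    where rows[i][j] is the number of genes of families[i] in datasets[j].
    """
    counts = {name: Counter(families) for name, families in assignments.items()}
    datasets = sorted(counts)
    families = sorted({family for c in counts.values() for family in c})
    rows = [[counts[d][family] for d in datasets] for family in families]
    return families, datasets, rows


# Hypothetical assignments for one metagenome and one isolate genome.
families, datasets, rows = abundance_matrix({
    "metaA": ["COG0001", "COG0001", "pfam00005"],
    "genomeB": ["pfam00005"],
})
```

A heat map is the same matrix with each cell colored by its value, and each cell can be linked back to the list of genes it counts.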

A new ‘Abundance Profile Search’ tool allows finding protein families (COGs and Pfams) in metagenomes and isolate genomes based on their relative abundance. The tool allows selecting the way the results will be displayed (using raw or normalized gene counts) and setting abundance cutoffs, as illustrated in Figure 3(ii). The ‘Abundance Profile Search Results’ consist of a list of protein families that satisfy the search criteria together with the metagenomes or genomes involved in the comparison and their associated raw or normalized gene counts, as illustrated in Figure 3(iii). Protein families can be selected and added to the ‘Function Cart’, while gene counts are linked to the corresponding lists of genes, which can be subsequently selected and added to the ‘Gene Cart’ for further analysis.
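A search of this kind can be expressed as a filter over per-family abundances. The sketch below is an assumption-laden illustration: the normalization (scaling counts by the data set's total gene count) and the fold-change rule are chosen for the example and are not taken from the IMG/M implementation.

```python
def normalize(counts, total_genes, scale=100_000):
    """Scale raw gene counts to counts per `scale` genes (assumed scheme)."""
    return {family: n * scale / total_genes for family, n in counts.items()}


def abundance_search(query, over=(), under=(), fold=2.0):
    """Return families whose abundance in `query` is at least `fold` times
    that in every profile in `over`, and at most 1/`fold` of that in every
    profile in `under`. Families missing from a profile count as zero."""
    hits = []
    for family, q in query.items():
        more = all(q >= fold * ref.get(family, 0.0) for ref in over)
        less = all(ref.get(family, 0.0) >= fold * q for ref in under)
        if more and less:
            hits.append(family)
    return sorted(hits)


# Hypothetical Pfam counts for three samples of 1000 genes each.
query = normalize({"pfam00001": 40, "pfam00002": 10}, total_genes=1000)
ref_over = normalize({"pfam00001": 10, "pfam00002": 10}, total_genes=1000)
ref_under = normalize({"pfam00001": 100, "pfam00002": 15}, total_genes=1000)
matches = abundance_search(query, over=[ref_over], under=[ref_under])
```

Working on normalized counts keeps the comparison meaningful when the compared data sets differ greatly in size; passing raw counts instead reproduces the unnormalized variant.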

The ‘Abundance Profile’ tools allow comparison of the functional capabilities of metagenomes without assigning statistical significance to the results. However, when metagenomes are compared to each other or to isolate genomes, statistical tests are needed for estimating the statistical significance of the observed differences. The ‘Function Comparison’ and ‘Function Category Comparison’ tools take into account the stochastic nature of metagenome data sets and test whether the differences in abundance can be ascribed to chance variation or not. These tools allow comparing a metagenome data set with other metagenome data sets or reference genome data sets in terms of the relative abundance of (i) protein families (COGs, Pfams and TIGRfams) and functional families (Enzymes) in the case of ‘Function Comparison’ or (ii) functional categories (COG Pathway, KEGG Pathway, KEGG Pathway Category, Pfam Category and TIGRfam subroles) in the case of ‘Function Category Comparison’, as illustrated in Figure 3(iv). The result of these comparisons lists for each function or function category, F, the number of genes or estimated gene copies in the target (query) metagenome associated with F and for each reference genome/metagenome the number of genes or estimated gene copies associated with F. These results include an assessment of statistical significance in terms of associated P-value and d-scores (for Function Comparison) or d-ranks (for Function Category Comparison), as illustrated in Figure 3(v).
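The text does not spell out the test statistic behind these tools. As a generic illustration of testing whether a difference in relative abundance could be due to chance, the sketch below uses a standard pooled two-proportion z-test with a normal approximation. This is a substituted textbook technique, not a confirmed description of how IMG/M computes its P-values, d-scores or d-ranks, and the gene counts are made up.

```python
import math


def two_proportion_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test.

    x1 of n1 genes in the query data set and x2 of n2 genes in the
    reference data set are associated with a function F. Returns
    (z, two_sided_p) under the normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        return 0.0, 1.0
    z = (p1 - p2) / standard_error
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 120 of 10000 genes versus 60 of 10000 genes.
z, p_value = two_proportion_test(120, 10_000, 60, 10_000)
```

When many functions are tested at once, the resulting P-values would normally need a multiple-testing correction before being interpreted.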

FUTURE PLANS

The current version of IMG/M (August 2011) contains 224 metagenome data sets (samples) that are part of 15 engineered, 36 environmental, and 34 host-associated projects (studies). These data sets can be analyzed in the context of 6891 bacterial, archaeal, eukaryotic and virus reference genomes. New metagenome data sets are continuously included into IMG/M from metagenome studies conducted at JGI and other institutes, while new reference isolate genomes are included from IMG on a regular basis.

Data sets from next generation sequencing technology platforms often contain millions of sequences, which makes storing and accessing the data in standard relational databases inefficient. As we expect exponential growth in the size of metagenome data sets generated by these platforms, we are devising new data management techniques for organizing metagenome data in support of effective analysis.

FUNDING

Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, US Department of Energy (Contract No. DE-AC02-05CH11231); National Energy Research Scientific Computing Center, Office of Science of the US Department of Energy (Contract No. DE-AC02-05CH11231); US National Institutes of Health Data Analysis and Coordination Center (Contract U01-HG004866). Funding for open access charge: University of California.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Shane Cannon of Lawrence Berkeley National Lab's National Energy Research Scientific Computing Center for his help in carrying out large-scale gene similarity computations for IMG/M. We thank Peter Williams, Henrik Nordberg, Roman Nikitin and Simon Minovitsky for their contribution to the development and maintenance of IMG/M. The work of JGI’s production, cloning, sequencing, assembly, finishing and annotation teams is an essential prerequisite for IMG. Eddy Rubin and James Bristow provided support, advice and encouragement throughout this project.

REFERENCES

1. et al. On the fidelity of processing metagenomic sequences using simulated dataset, Nat. Methods, 2007, vol. 4 (pg. 495-500)

2. A primer on metagenomics, PLoS Comput. Biol., 2010, vol. 6 pg. e1000667

3. The DOE-JGI Standard Operating Procedure for the Annotations of Metagenomes, Standards in Genomic Sciences, 2009, vol. 1 (pg. 63-67)

4. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats, BMC Bioinformatics, 2007, vol. 8 pg. 209

5. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence, Nucleic Acids Res., 1997, vol. 25 (pg. 955-964)

6. RNAmmer: consistent and rapid annotation of ribosomal RNA genes, Nucleic Acids Res., 2007, vol. 35 (pg. 3100-3108)

7. Rfam: annotating non-coding RNAs in complete genomes, Nucleic Acids Res., 2005, vol. 33 (pg. D121-D124)

8. Infernal 1.0: inference of RNA alignments, Bioinformatics, 2009, vol. 25 (pg. 1335-1337)

9. Prodigal: prokaryotic gene recognition and translation initiation site identification, BMC Bioinformatics, 2010, vol. 11 pg. 119

10. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences, Nucleic Acids Res., 2006, vol. 34 (pg. 5623-5630)

11. Ab initio gene identification in metagenomic sequences, Nucleic Acids Res., 2010, vol. 38 pg. e132

12. FragGeneScan: predicting genes in short and error-prone reads, Nucleic Acids Res., 2010, vol. 38 pg. e191

13. et al. The integrated microbial genomes (IMG) system: an expanding comparative analysis system, Nucleic Acids Res., 2010, vol. 38 (pg. D382-D390)

14. NCBI Reference Sequences: current status, policy and new initiatives, Nucleic Acids Res., 2009, vol. 37 (pg. D32-D36)

15. Evaluation of methods for the prediction of membrane spanning regions, Bioinformatics, 2001, vol. 17 (pg. 646-653)

16. Locating proteins in the cell using TargetP, SignalP, and related tools, Nat. Protocols, 2007, vol. 2 (pg. 953-971)

17. et al. The COG database: an updated version includes eukaryotes, BMC Bioinformatics, 2003, vol. 4 pg. 41

18. et al. The Pfam protein families database, Nucleic Acids Res., 2010, vol. 38 (pg. D211-D222)

19. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes, Nucleic Acids Res., 2007, vol. 35 (pg. D260-D264)

20. et al. InterPro: the integrative protein signature database, Nucleic Acids Res., 2005, vol. 37 (pg. D211-D215)

21. KEGG for representation and analysis of molecular networks involving diseases and drugs, Nucleic Acids Res., 2010, vol. 38 (pg. D355-D360)

22. et al. IMG/M: a data management and analysis system for metagenomes, Nucleic Acids Res., 2008, vol. 36 (pg. D534-D538)

23. The Human Microbiome Jumpstart Reference Strains Consortium. A catalog of reference genomes from the human microbiome, Science, 2010, vol. 328 (pg. 994-999)

24. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea, Nature, 2009, vol. 462 (pg. 1056-1060)

25. The genomes on line database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata, Nucleic Acids Res., 2010, vol. 38 (pg. D346-D354)

26. A call for standardized classification of metagenome projects, Environ. Microbiol., 2010, vol. 12 (pg. 1803-1805)

27. et al. Towards a richer description of our complete collection of genomes and metagenomes: the ‘Minimum Information about a Genome Sequence’ (MIGS) specification, Nat. Biotechnol., 2008, vol. 26 (pg. 541-547)

Published by Oxford University Press 2011.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
