A new genomic blueprint of the human gut microbiota

Alexandre Almeida et al. Nature. 2019 Apr.

Abstract

The composition of the human gut microbiota is linked to health and disease, but knowledge of individual microbial species is needed to decipher their biological roles. Despite extensive culturing and sequencing efforts, the complete bacterial repertoire of the human gut microbiota remains undefined. Here we identify 1,952 uncultured candidate bacterial species by reconstructing 92,143 metagenome-assembled genomes from 11,850 human gut microbiomes. These uncultured genomes substantially expand the known species repertoire of the collective human gut microbiota, with a 281% increase in phylogenetic diversity. Although the newly identified species are less prevalent in well-studied populations compared to reference isolate genomes, they improve classification of understudied African and South American samples by more than 200%. These candidate species encode hundreds of newly identified biosynthetic gene clusters and possess a distinctive functional capacity that might explain their elusive nature. Our work expands the known diversity of uncultured gut bacteria, which provides unprecedented resolution for taxonomic and functional characterization of the intestinal microbiota.

Conflict of interest statement

S.C.F., T.D.L. and R.D.F. are either employees of, or consultants to, Microbiotica Pty Ltd.

Figures

Fig. 1. Thousands of metagenome-assembled genomes do not match isolate genomes.

a, Left, near-complete (>90% completeness, <5% contamination) MAGs that matched the HR database (green; ≥95% average nucleotide identity over at least 60% of the genome) and those that could not be classified (grey). Right, expanded view of MAGs with an alignment fraction of at least 60%, coloured on the basis of the ANI in relation to the best matching HR genome. b, Number of near-complete MAGs matching HR (blue) and RefSeq (pink) alongside those that did not match any reference genome from either database.

Fig. 2. Taxonomy of the most prevalent uncultured gut bacterial species.

a, Taxonomic composition of the 1,952 UMGS, with ranks ordered from top to bottom by their increasing proportion among the UMGS collection. Only the five most frequently observed taxa are shown in the legend, with the remaining lineages grouped as ‘other classified taxa’. b, Top 20 most prevalent UMGS genomes across the 13,133 metagenomic datasets, inferred from the level of genome coverage, read depth and evenness. Each species is coloured according to class, with the predicted taxon indicated in brackets.

Fig. 3. Phylogeny of reference and uncultured human gut bacterial genomes.

a, Maximum-likelihood phylogenetic tree comprising the 553 genomes belonging to the HGR and the 1,952 genomes belonging to the UMGS. Clades are labelled according to genome type (HGR, near-complete or medium-quality UMGS) and the corresponding phylum is depicted in the first outer layer. Blue and red dots in the second layer denote genomes that were found in at least one sample from all six continents analysed (Africa, Asia, Europe, North America, South America and Oceania), or exclusively detected in non-European, non-North American samples, respectively. Green bars in the outermost layer represent the prevalence of the genome among the 13,133 metagenomic datasets. b, Level of increase in phylogenetic diversity provided by the UMGS, relative to the complete diversity per phylum (left) and represented as absolute total branch lengths (right). The number of HGR and UMGS genomes assigned to each phylum is depicted in brackets (HGR/UMGS).

Fig. 4. Geographical distribution of the samples and uncultured species.

a, Distribution of the number of samples (log-transformed) in which each HGR or UMGS genome present in at least one sample was found at a relative abundance above 0.01%. HGR genomes: n = 31 (Africa), n = 340 (Asia), n = 351 (Europe), n = 362 (North America), n = 86 (South America) and n = 129 (Oceania). UMGS genomes: n = 230 (Africa), n = 1,157 (Asia), n = 1,410 (Europe), n = 1,238 (North America), n = 482 (South America) and n = 287 (Oceania). b, Number of species found (abundance > 0.01%) in more than 20% of the samples from each geographical region. c, Percentage increase in the proportion of reads, partitioned by sample geographical location (Africa, n = 21; Asia, n = 1,447; Europe, n = 4,716; North America, n = 6,869; South America, n = 36; Oceania, n = 24), that were assigned to the HR, RefSeq and UMGS, in relation to HR and RefSeq alone. d, Accumulation curve depicting the number of UMGS detected as a function of the number of metagenomic samples per continent. Data points represent the average of ten bootstrap replicates. The curve of best fit generated from an asymptotic regression is represented for each geographical region. In a and c, box lengths represent the IQR of the data, and the whiskers the lowest and highest values within 1.5 times the IQR from the first and third quartiles, respectively.

Fig. 5. The uncultured species have a distinct functional capacity.

a, Principal component analysis (PCA) based on GPs of the HGR (n = 553 genomes) and the UMGS (n = 1,952 genomes) coloured by phylum. b, GO functions differentially abundant between the HGR and UMGS genomes from Actinobacteria, Firmicutes, Proteobacteria and Tenericutes. The five functions with the highest and lowest effect size of abundance difference with a false discovery rate (FDR) <5% are represented. A positive effect size denotes overrepresentation in the UMGS genomes. GO terms related to redox functions are highlighted in bold.

Extended Data Fig. 1. Metadata of the human gut datasets.

Percentage of the 13,133 metagenomic datasets according to location, health state and age group of the individual sampled, as depicted in the figure key.

Extended Data Fig. 2. CheckM quality assessment of bins.

a, Quality metrics estimated by CheckM for the 242,836 bins generated by MetaBAT. b, Number of bins recovered according to the level of genome completeness and contamination. QS = completeness – (5 × contamination).
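The quality score (QS) in this caption is simple arithmetic on the two CheckM estimates. The following is a minimal sketch, assuming both values are expressed as percentages; the function names are illustrative and do not come from the paper or from CheckM. The near-complete thresholds are the ones stated in the Fig. 1 caption (>90% completeness, <5% contamination).

```python
def quality_score(completeness: float, contamination: float) -> float:
    """QS = completeness - (5 x contamination), both given in percent."""
    return completeness - 5.0 * contamination


def is_near_complete(completeness: float, contamination: float) -> bool:
    """Near-complete as defined in the Fig. 1 caption: >90% complete, <5% contaminated."""
    return completeness > 90.0 and contamination < 5.0


if __name__ == "__main__":
    # Hypothetical bin, for illustration only.
    print(quality_score(92.3, 0.8), is_near_complete(92.3, 0.8))
```

Because contamination is weighted five times, a bin loses five QS points for every percentage point of estimated contamination.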

Extended Data Fig. 3. Technical reproducibility of MAGs.

a, MAGs resulting from the MetaWRAP pipeline (left, n = 9,552) and from a modified co-assembly approach (right, n = 4,404) compared to the original MAGs generated with SPAdes and MetaBAT for 1,000 random datasets. A good match was defined as ≥95% ANI over ≥60% of alignment fraction, whereas an excellent match indicates ≥98% ANI over ≥80% alignment. b, Proportion of MAGs generated with each pipeline (MetaWRAP and co-assembly) coloured by their level of match to the original set.
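The two match levels defined in this caption can be expressed as a small classifier. This is a sketch under the assumption that ANI and alignment fraction are both supplied as percentages; the function name and the "no match" label are hypothetical and not taken from the paper.

```python
def match_level(ani: float, alignment_fraction: float) -> str:
    """Classify a MAG-to-MAG comparison using the thresholds in the caption.

    excellent: >=98% ANI over >=80% alignment fraction
    good:      >=95% ANI over >=60% alignment fraction
    """
    if ani >= 98.0 and alignment_fraction >= 80.0:
        return "excellent"
    if ani >= 95.0 and alignment_fraction >= 60.0:
        return "good"
    return "no match"


if __name__ == "__main__":
    # Hypothetical comparisons, for illustration only.
    for ani, af in [(99.1, 85.0), (96.4, 72.0), (94.0, 90.0)]:
        print(ani, af, match_level(ani, af))
```

The stricter test runs first, so every excellent match would also satisfy the good-match criteria but is reported only once.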

Extended Data Fig. 4. Phylogenetic diversity of the human-specific isolate genomes.

Phylogenetic tree of the 2,468 HR genomes, labelled according to class, with the bar graphs in the outer layer depicting the log-transformed number of near-complete MAGs matching the corresponding genome.

Extended Data Fig. 5. Analysis of Mash similarity clusters.

Pearson correlation between the log-transformed number of MAGs and the corresponding number of distinct samples (a) or studies (b) per Mash cluster. Data points represent each of the 702 similarity groups (defined with a Mash distance <0.2). The coefficient of determination (R²) is depicted in each graph.
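For a Pearson correlation, the coefficient of determination is the square of the correlation coefficient r. The sketch below shows that calculation on log-transformed counts. The caption does not state the base of the logarithm or whether both axes were transformed, so log10 on both variables is an assumption here, and the input numbers are made up for illustration.

```python
import math
from typing import Sequence


def r_squared_log(x_counts: Sequence[float], y_counts: Sequence[float]) -> float:
    """Squared Pearson correlation between log10-transformed counts.

    Assumes strictly positive counts and at least two distinct values per axis.
    """
    xs = [math.log10(v) for v in x_counts]
    ys = [math.log10(v) for v in y_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    r = sxy / math.sqrt(sxx * syy)
    return r * r


if __name__ == "__main__":
    # Hypothetical per-cluster counts: MAGs versus distinct samples.
    mags = [3, 12, 40, 150, 900]
    samples = [2, 10, 35, 120, 700]
    print(round(r_squared_log(mags, samples), 3))
```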

Extended Data Fig. 6. Quality metrics of the metagenomic species.

a, Distribution of completeness (minimum: 55.5; Q1: 80.5; median: 92.3; Q3: 97.1; maximum: 100) and contamination levels (minimum: 0; Q1: 0.1; median: 0.8; Q3: 1.7; maximum: 4.1) estimated by CheckM for the 2,068 metagenomic species (MGS). b, Number of tRNAs coding for the 20 standard amino acids detected across the MGS genomes. c, MCC calculated for all the 2,068 MGS, based on the Mash clustering structure and an average amino acid identity threshold of 97%.

Extended Data Fig. 7. Defining genome presence and prevalence distribution.

a, b, Depth (a) and variation (b) penalty scores plotted against the level of genome coverage of the 1,952 UMGS across all 13,133 metagenomic samples. The depth penalty score was calculated by multiplying the missing coverage (100 − genome coverage) by the log-transformed mean read depth. The variation penalty score was based on the missing coverage multiplied by the depth coefficient of variation (standard deviation of read depth divided by the mean). Dashed red lines correspond to the 99th percentile, set as the upper threshold used to define genome presence. c, Number of UMGS detected in the corresponding number of metagenomic samples. The distribution of UMGS found in up to 100 samples is illustrated as an inset. The vertical dashed line represents the median value of all data.
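The two penalty scores described in this caption can be sketched as follows. Several details are assumptions rather than statements from the paper: the logarithm is taken as log10, the standard deviation is the population form, and the per-position depth list and function name are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple


def penalty_scores(coverage_pct: float, depths: Sequence[float]) -> Tuple[float, float]:
    """Return (depth_penalty, variation_penalty) for one genome in one sample.

    depth penalty     = (100 - coverage) * log(mean read depth)
    variation penalty = (100 - coverage) * (std of depth / mean depth)
    """
    mean_depth = mean(depths)
    if mean_depth <= 0:
        raise ValueError("mean read depth must be positive")
    missing_coverage = 100.0 - coverage_pct
    # Assumption: base-10 logarithm; the caption only says "log-transformed".
    depth_penalty = missing_coverage * math.log10(mean_depth)
    # Assumption: population standard deviation for the coefficient of variation.
    variation_penalty = missing_coverage * (pstdev(depths) / mean_depth)
    return depth_penalty, variation_penalty


if __name__ == "__main__":
    # Hypothetical genome: 85% covered, uneven read depth.
    print(penalty_scores(85.0, [2, 4, 30, 1, 3, 50]))
```

Both scores grow when a genome is only partly covered yet deeply or unevenly sequenced, which is the pattern the caption's 99th-percentile threshold is used to exclude when defining genome presence.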

Extended Data Fig. 8. Biosynthetic gene clusters found in the human gut species.

a, Number of BGCs found in the UMGS and the HGR genomes, subdivided by functional category. Only the 25 most abundant categories are depicted. PKS, polyketide synthases. b, Fraction of all BGCs that did not match the MIBiG database.

Extended Data Fig. 9. Functional capacity of cultured and uncultured species.

a, PCA based on GPs of the 553 HGR genomes and the 1,952 UMGS for the five most prevalent phyla (Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria and Tenericutes). b, Number of genes found to be enriched with an absolute effect size >0.2 in either the UMGS or HGR genomes across the analyses of each of the five major phyla, grouped by their corresponding KEGG functional category.
