Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle

Cell. 2019 Jan 24;176(3):649-662.e20.

doi: 10.1016/j.cell.2019.01.001. Epub 2019 Jan 17.

Francesco Asnicar 1, Serena Manara 1, Moreno Zolfo 1, Nicolai Karcher 1, Federica Armanini 1, Francesco Beghini 1, Paolo Manghi 1, Adrian Tett 1, Paolo Ghensi 1, Maria Carmen Collado 2, Benjamin L Rice 3, Casey DuLong 4, Xochitl C Morgan 5, Christopher D Golden 4, Christopher Quince 6, Curtis Huttenhower 7, Nicola Segata 8

Edoardo Pasolli et al. Cell. 2019.

Abstract

The body-wide human microbiome plays a role in health, but its full diversity remains uncharacterized, particularly outside of the gut and in international populations. We leveraged 9,428 metagenomes to reconstruct 154,723 microbial genomes (45% of high quality) spanning body sites, ages, countries, and lifestyles. We recapitulated 4,930 species-level genome bins (SGBs), 77% without genomes in public repositories (unknown SGBs [uSGBs]). uSGBs are prevalent (in 93% of well-assembled samples), expand underrepresented phyla, and are enriched in non-Westernized populations (40% of the total SGBs). We annotated 2.85 M genes in SGBs, many associated with conditions including infant development (94,000) or Westernization (106,000). SGBs and uSGBs permit deeper microbiome analyses and increase the average mappability of metagenomic reads from 67.76% to 87.51% in the gut (median 94.26%) and 65.14% to 82.34% in the mouth. We thus identify thousands of microbial genomes from yet-to-be-named species, expand the pangenomes of human-associated microbes, and allow better exploitation of metagenomic technologies.

Keywords: human microbiome; metagenomic assembly; metagenomic mappability; metagenomic meta-analysis; metagenomics; non-Westernized microbiomes; unexplored microbial diversity.

Copyright © 2019 The Author(s). Published by Elsevier Inc. All rights reserved.

Figures

Graphical abstract

Figure S1

Overview of the Functional and Metabolic Annotations of the Representatives of the SGBs and of the Whole Set of 154,723 Reconstructed Genomes, Related to Figure 1 (A) Ordination plot of the KEGG gene families annotated using eggNOG (see STAR Methods) of the 4,930 SGBs’ representatives, colored by the 14 most represented phyla. (B) Ordination plot of the UniRef50 gene families present in the 154,723 reconstructed genomes as annotated by mapping the genomes against both UniRef90 and UniRef50 (see STAR Methods). Ordination plots of the UniRef90 gene families for all the reconstructed genomes assigned to the (C) Fusobacteria and (D) Tenericutes phyla are also reported as examples of fine-grained functional differentiation.

Figure 1

4,930 SGBs Assembled from 9,428 Meta-analyzed Body-wide Metagenomes (A) A human-associated microbial phylogeny of representative genomes from each species-level genome bin (SGB). Figure S3A reports the same phylogeny but including isolate genomes not found in the human-associated metagenomes. (B) Overlap of SGBs containing both existing microbial genomes (including other metagenomic assemblies) and genomes reconstructed here (kSGBs), SGBs with only genomes reconstructed here and without existing isolate or metagenomically assembled genomes (uSGBs), and SGBs with only existing genomes and no genomes from our metagenomic assembly of human microbiomes (non-human SGBs). (C) Many SGBs contain no genomes from sequenced isolates or publicly available metagenomic assemblies (uSGBs). Only SGBs containing >10 genomes are shown. (D) Fraction of uSGBs and kSGBs as a function of the size of the SGBs (i.e., number of genomes in the SGB). (E) Distribution of the fraction of uSGBs in each sample by age category, body site, and lifestyle. (F) Distribution of the fraction of uSGBs in each study.
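
The three categories in panel (B) reduce to a rule on what each bin contains. The following is a minimal sketch of that rule, not the study's code: the function name and the `"reference"` / `"reconstructed"` source labels are hypothetical names chosen for illustration.

```python
def classify_sgb(genome_sources):
    """Label an SGB from the origins of its member genomes.

    genome_sources: iterable of "reference" (previously available genome,
    from an isolate or another metagenomic assembly) or "reconstructed"
    (assembled in the study). Both labels are hypothetical names.
    """
    sources = set(genome_sources)
    if not sources:
        raise ValueError("an SGB must contain at least one genome")
    unexpected = sources - {"reference", "reconstructed"}
    if unexpected:
        raise ValueError(f"unexpected source labels: {sorted(unexpected)}")
    if sources == {"reconstructed"}:
        return "uSGB"           # only genomes reconstructed in the study
    if sources == {"reference"}:
        return "non-human SGB"  # only previously available genomes
    return "kSGB"               # both kinds of genome are present


print(classify_sgb(["reconstructed", "reconstructed"]))
print(classify_sgb(["reference", "reconstructed"]))
print(classify_sgb(["reference"]))
```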

Figure S2

Overview of the Reconstructed SGBs and Criteria for SGB Definition and Taxonomic Assignment, Related to Figure 6 (A) Distribution of the distances of each reconstructed genome to the closest available isolate genomes, grouped by the class assigned to the matching isolate genomes. (B) The 4,930 identified species-level genome bins (SGBs) comprise a very variable fraction of already available genomes versus genomes we reconstructed from metagenomes. (C) Minimization criterion adopted to find the optimal cutoff in the hierarchical clustering of genomes to define SGBs. Two criteria are taken into account: minimization of the over-clustering error (C-i), and minimization of the under-clustering error (C-ii). Results showed a minimization of the error for a threshold equal to 0.05 (C-iii), which was thus adopted to discretize subtrees in the dendrogram and generate SGBs spanning ∼5% genetic diversity. (D) The same minimization criterion reported in (C-iii) for species-level bins is also adopted to identify the genomic diversity for genus-level and family-level bins.
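
Panel (C) describes cutting a hierarchical clustering of genomes at a distance of 0.05 to obtain SGBs. The sketch below illustrates that idea on a toy distance table. It assumes average linkage and a dictionary of precomputed pairwise distances; the function name, the input layout, and the example values are all hypothetical, and the naive loop is meant for illustration rather than for 150,000 genomes.

```python
from itertools import combinations


def cluster_at_cutoff(dist, cutoff=0.05):
    """Naive average-linkage agglomerative clustering, stopped at `cutoff`.

    dist: dict mapping frozenset({a, b}) -> pairwise genetic distance (0..1).
    Returns a list of sets of genome names; each set plays the role of a bin.
    Average linkage is an assumption of this sketch.
    """
    names = sorted({name for pair in dist for name in pair})
    clusters = [{name} for name in names]

    def avg_distance(c1, c2):
        # Mean of all pairwise distances between the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), best = min(
            (((i, j), avg_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > cutoff:
            break  # every remaining pair is farther apart than the cutoff
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Toy example: g1 and g2 differ by 1%, g3 is about 20% away from both.
toy = {
    frozenset({"g1", "g2"}): 0.01,
    frozenset({"g1", "g3"}): 0.20,
    frozenset({"g2", "g3"}): 0.22,
}
print(cluster_at_cutoff(toy))
```

With the toy distances, g1 and g2 merge into one bin while g3 stays on its own, because its average distance to the merged pair exceeds the 0.05 cutoff.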

Figure S3

Phylogenetic Trees for All SGBs and Reference Genomes and Subtrees of Saccharibacteria and Archaea, Related to Figure 1 (A) Phylogenetic tree that includes the representatives of the SGBs presented in Figure 1A together with all the non-human bins (represented in white in the external rings), for a total of 16,332 genomes (15,299 after the internal quality control in PhyloPhlAn). (B) Phylogenetic tree of the 337 reconstructed genomes taxonomically assigned to the candidate phylum Saccharibacteria present in the 108 SGBs, including available reference genomes (publicly available reference genomes are labeled with the “GCA” prefix). (C) Phylogenetic tree of the 675 archaeal genomes reconstructed in this study. 487 genomes belong to the Methanobrevibacter smithii kSGB (ID 714).

Figure 2

The Expanded Genome Set Substantially Increases the Mappability of Human Metagenomes (A) We mapped the subsampled original 9,428 metagenomes and 389 additional samples not considered for building the SGBs against the 154,723 reconstructed genomes and 80,990 previously available genomes. Raw-read mappability increased significantly (Mann-Whitney U test, p < 1e−50), e.g., from an average of 67.76% to 87.51% in the gut. Representative genomes refer to the highest-quality genomes selected from the 4,930 human SGBs and the 11,402 non-human SGBs. Extended statistics are in Figure S4. (B) Metagenomic read mappability increases more in non-Westernized than Westernized gut microbiomes (Welch's t test, p < 1e−50), both when considering samples used for SGBs’ reconstruction (26.50% average increase in 7,059 Westernized samples versus 96.56% in 454 non-Westernized samples) and when considering 264 additional samples not used for SGBs’ reconstruction (25.16% versus 117.40% average increase, respectively). (C) The gut microbiomes from Madagascar we sequenced here showed several highly abundant uSGBs and a large set of SGBs reconstructed in only subsets of the samples. Many kSGBs in this dataset do not contain isolate genomes but only previous metagenomic assemblies. The 25 most abundant SGBs are reported and ordered according to their average relative abundance. (D) Multidimensional scaling on datasets using the Bray-Curtis distance on per-dataset SGB prevalences highlights distinct microbial communities between Westernized and non-Westernized populations within and between body sites and age categories.
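
Panel (D) relies on the Bray-Curtis distance between per-dataset SGB prevalence profiles. As a reminder of what that distance computes, here is a minimal sketch; the function name and the example vectors are hypothetical, and the profiles are assumed to be equal-length lists of non-negative prevalences indexed by the same SGBs.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    total = sum(a + b for a, b in zip(u, v))
    if total == 0:
        raise ValueError("at least one profile must be non-zero")
    # Sum of absolute differences, normalised by the combined total.
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical prevalence of three SGBs in two datasets.
dataset_a = [0.9, 0.1, 0.0]
dataset_b = [0.2, 0.1, 0.7]
print(bray_curtis(dataset_a, dataset_b))
```

The value is 0 for identical profiles and 1 for profiles that share no SGB, which is what makes it a convenient input for multidimensional scaling.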

Figure S4

Improvement of Read Mappability Statistics by Considering the Set of Microbial Genomes We Assembled in This Work, Related to Figure 2 (A) Fraction of reads that can be mapped against different sets of genomes from isolate sequencing and the metagenomically reconstructed genomes. A subset of 132 full (i.e., not subsampled) metagenomes is shown (3 metagenomes randomly selected from each study). Samples are colored and grouped by body site. The colored part of the bar refers to the reads that can be mapped against a previously available reference genome, while the gray bars extend to highlight the total mappability we achieved using the 154,723 microbial genomes reconstructed in this study. (B) Percentage of increase in the mappability when also using the 154,723 reconstructed SGBs to map metagenomic reads. Boxplots represent values grouped by body site, lifestyle, age category (upper panel) and study (lower panel). The percentage of improvement is calculated with respect to the fraction of reads that could be mapped using only the full set of reference genomes. All the 9,428 metagenomes used in this study were mapped after being subsampled at 1% (see STAR Methods). Averaged statistics are reported in Figures 2A–2B.
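
The percentage of improvement in panel (B) is a relative gain over the reference-only baseline. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the example simply reuses the gut averages from the abstract (67.76% and 87.51%).

```python
def percent_increase(baseline_pct, expanded_pct):
    """Relative gain in mapped reads, expressed in percent of the baseline."""
    if baseline_pct <= 0:
        raise ValueError("baseline mappability must be positive")
    return 100.0 * (expanded_pct - baseline_pct) / baseline_pct


# Gut averages quoted in the abstract: reference genomes only vs. expanded set.
gain = percent_increase(67.76, 87.51)
print(f"{gain:.2f}% relative increase")
```

Note that this is the increase of the averages. The per-group figures in Figure 2B are averages of per-sample increases, so they need not match a ratio computed from averaged mappabilities.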

Figure 3

Several Prevalent Intestinal uSGBs Are Found within the Clostridiales Order Related to Ruminococcus and Faecalibacterium (A) All SGBs in the assembled phylogeny (Figure 1A) placed between reference genomes for Ruminococcus and Faecalibacterium species that are reported as collapsed trees. A maximum of 25 HQ genomes from each SGB are displayed, and SGBs with <3 genomes are left black. (B) The monophyletic clade with the six uSGBs and the kSGB containing _Gemmiger formicilis_ represent clearly divergent species with inter-species genetic distance typical of genus-level divergence (average 16.6%, SD 3.1% nucleotide distance). (C) A whole-genome phylogeny for the 1,806 genomes in _Ca._ Cibiobacter qucibialis (STAR Methods). Some subtrees associate with geography and non-Westernized populations, while others seem to be geography- and lifestyle-independent (see text). (D) Multidimensional scaling of genetic distances among genomes of _Ca._ Cibiobacter qucibialis highlights the divergence of strains carried by non-Westernized populations, with Chinese populations subclustering within the large cluster of Westernized populations. (E) Madagascar-associated strains of _Ca._ Cibiobacter qucibialis (uSGB 15286) uniquely possess the _trp_ operon for tryptophan metabolism (Table S7). Other functional clusters in Westernized strains from geographically heterogeneous populations include vitamin B12 and fatty acid biosynthesis and galactose metabolism. The KEGG functions present in >80% or in <20% of the samples were discarded except for significant associations with lifestyle.

Figure S5

Phylogenetic Trees for SGBs Placed between Ruminococcus and Faecalibacterium, Succinatimonas kSGB (ID 3677), and Two Elusimicrobia uSGBs, Related to Figures 3 and 5 (A) Phylogenetic tree of SGBs placed between reference genomes for Ruminococcus and Faecalibacterium species in Figure 1A (highlighted in red), as already reported in Figure 3A but without collapsed branches and including the two reference genomes GCA_000238635 and GCA_000437915 (also highlighted), originally labeled as Subdoligranulum sp. 4_3_54A2FAA and Subdoligranulum sp. CAG:314, respectively. (B) Phylogenetic tree of the Succinatimonas kSGB (ID 3677) including the only available reference genome. (C) Phylogenetic tree of the two Elusimicrobia uSGBs enriched in non-Westernized populations and of all the available Elusimicrobia reference genomes.

Figure 4

The Metagenomically Reconstructed Genomes Greatly Expand the Genetic and Functional Diversity of the Ten Bacteroides Species Most Prevalent in the Human Gut (A) Additional Bacteroides genomes we assembled from metagenomes increase the size of the ten most prevalent Bacteroides kSGBs from 4 to >500 times. (B) The expanded Bacteroides kSGBs account for much larger pangenomes that capture a greater functional potential. (C) Ordinations on intra-SGB genetic distances (fractions of nucleotide mutations in the core genome) highlight the genetic structure of Bacteroides species and that reference genomes were available only for a reduced subset of subspecies structures (additional ordinations are in Figure S6A).

Figure S6

Genetic Diversity and Correlation between Genetic and Functional Similarity for Bacteroides Species, Related to Figure 4 (A) MDSs on intra-SGB genetic distances for Bacteroides species not reported in Figure 4C. (B) Scatterplots for the ten most prevalent Bacteroides kSGBs showing the relation between pairs of genomes measured as branch length distance on the core-genome-based phylogenetic tree (x axis) and as branch length on the hierarchical clustering built on the presence and absence of pan-genes (phylogenomic distance, y axis).

Figure 5

SGBs and Single Reconstructed Genomes Associated with Westernized and Non-Westernized Lifestyles (A) 49 total large (>10 genomes) SGBs were significantly enriched (Fisher's test) in the set of 112 Madagascar gut metagenomes sequenced for this study, and 20 were significantly depleted (Fisher's test) relative to Western gut microbiomes (complete results in Table S6). Most Madagascar-enriched SGBs are uSGBs or contain only isolate sequences that were themselves assembled from other metagenomes in other studies. (B) 232 total SGBs were differentially present with respect to the total set of non-Westernized populations, again with the 40 most significant, excluding those already reported in (A), shown here (Fisher's test, complete results in Table S6). (C) The intra-SGB genetic structure of Succinatimonas spp., the bacterium most associated with non-Westernized lifestyles (multidimensional scaling [MDS] on percentage nucleotide distances between genomes). The few genomes assembled from Westernized countries are tightly clustering together, while strains from non-Westernized populations are distinct and not well represented by the only available co-assembled (but not cultivated) strain. (D) MDS of the two uSGBs (ID 19692 and ID 19694) enriched in the Madagascar cohort and available isolate genomes for the containing Elusimicrobia phylum (phylogeny in Figure S5C). The metagenomically assembled genomes in Elusimicrobia SGBs greatly diverge from the non-human-associated isolate genomes in the phylum. (E) Significant differences in functional potential between the 25 SGBs most strongly associated with Westernized and non-Westernized populations. We report the differential KEGG pathways (Fisher's test Bonferroni-corrected p < 0.05, full list in Table S6) whose components are found in the set of representative genomes for the 50 species (only three genomes per SGB).
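
The enrichment calls in this figure rest on Fisher's test applied to presence/absence counts, with a Bonferroni correction in panel (E). The sketch below shows one way such a test can be computed from a 2x2 table using only the standard library. It assumes a two-sided test (the caption does not state sidedness), and the function names, the sample counts, and the number of tests in the example are hypothetical, not values from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))  # tolerance for float ties
    return min(1.0, total)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts: an SGB detected in 60 of 112 samples from one cohort
# and in 100 of 7,000 samples from another.
p = fisher_exact_two_sided(60, 112 - 60, 100, 7000 - 100)
print(p, bonferroni(p, n_tests=1000))
```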

Figure 6

Methodology Overview and Quality Characteristics for the 154,723 Reconstructed Genomes (A) Overview of the overall strategy and datasets employed for the reconstruction of microbial genomes and their organizations in SGBs. (B) Completeness and contamination values estimated by CheckM are reported for LQ (low quality, completeness <50% or contamination >5%), MQ (completeness in the range [50%, 90%] and contamination <5%), and HQ (completeness >90%, contamination <5%, CMSeq strain heterogeneity <0.5%) genomes. LQ genomes are excluded from the rest of the analysis. (C) Comparisons between the genomes from UniRef/NCBI used as references and our reconstructed genomes.
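
The quality tiers in panel (B) can be read as a small decision rule. The sketch below encodes the thresholds as stated in the caption; the function name is hypothetical, and two boundary choices are assumptions of this sketch because the caption leaves them open: a contamination of exactly 5% is treated as LQ, and a genome that meets the HQ completeness threshold but not the strain-heterogeneity threshold is treated as MQ.

```python
def quality_tier(completeness, contamination, strain_heterogeneity):
    """Assign LQ / MQ / HQ from CheckM-style percentages.

    All three arguments are percentages (0-100).
    """
    # LQ: completeness <50% or contamination >5%.
    # Assumption: contamination of exactly 5% is also treated as LQ.
    if completeness < 50 or contamination >= 5:
        return "LQ"
    # HQ: completeness >90%, contamination <5%, strain heterogeneity <0.5%.
    if completeness > 90 and strain_heterogeneity < 0.5:
        return "HQ"
    # Assumption: everything else that passed the LQ filter counts as MQ.
    return "MQ"


print(quality_tier(95.0, 1.0, 0.1))
print(quality_tier(95.0, 1.0, 0.8))
print(quality_tier(40.0, 1.0, 0.0))
```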

Figure 7

Quality of the Single-Sample Assembled Genomes against Multiple Alternative Genome Reconstruction Approaches (A) Percentage identity between genomes from isolates (I) and genomes we reconstructed from metagenomes (M) for five Bifidobacterium species from the FerrettiP_2018 dataset (Ferretti et al., 2018). We mark isolates and metagenomes coming from the same specimen (big filled circles) and coming from specimens of the same mother-infant pair (small filled circles). In all cases, our automatic pipeline reconstructs genomes from metagenomes that are almost identical to the genomes of the expected isolated strains. (B) The strains of S. aureus and P. aeruginosa isolated from three patients almost perfectly match the genomes reconstructed from sputum metagenomes sequenced at multiple time points. In the only case in which an S. aureus genome from a metagenome does not match the strain isolated from a previous time point in the same patient, we verified with MLST typing that a clinical event of strain replacement from ST45 to ST273 occurred. (C) In the dataset by Nielsen et al. (2014), we successfully recover at >99.5% identity the strain of a B. animalis subspecies lactis present in a commercial probiotic product that was consumed by the enrolled subjects, even if the probiotic strain was at low relative abundance in the stool microbiome (<0.3% on average [Nielsen et al., 2014]). (D) Comparison of the 46 manually curated genomes (using anvi’o) with automatically assembled (using metaSPAdes) and binned (using MetaBAT2) genomes. (E) Example comparison between the set of single-sample assembled genomes and co-assembled genomes for a time series (n = 5) of gut metagenomes from a newborn. Several genomes reconstructed with the two approaches have the same phylogenetic placement, with single-sample assembly retrieving the same (or a very closely related) genome at multiple time points, and both methods retrieving some unique genomes. This is an example of the comprehensive comparison performed in the STAR Methods and reported in Table S2 and Figure S7B.

Figure S7

Comparison between MEGAHIT and metaSPAdes Assemblies and between Assembly and Co-assembly, Related to Figure 7 (A) Comparison between metaSPAdes and MEGAHIT assemblers across all the considered datasets confirms that metaSPAdes performs consistently better, especially in recovering long contigs. Stars indicate statistical significance (Welch's t test, p < 0.05). (B) Phylogenetic tree built on the genomes of gut adult metagenomes from 25 women from the FerrettiP_2018 dataset showing comparison between the set of single-sample assembled genomes (in green) and co-assembled genomes (in red). Several genomes reconstructed with the two approaches have the same phylogenetic placement, with single-sample assembly retrieving a total of 605 genomes spanning 257 SGBs, while co-assembly retrieved 172 genomes.

References

    1. Alneberg J., Bjarnason B.S., de Bruijn I., Schirmer M., Quick J., Ijaz U.Z., Lahti L., Loman N.J., Andersson A.F., Quince C. Binning metagenomic contigs by coverage and composition. Nat. Methods. 2014;11:1144–1146.
    2. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
    3. Asnicar F., Weingart G., Tickle T.L., Huttenhower C., Segata N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ. 2015;3:e1029.
    4. Asnicar F., Manara S., Zolfo M., Truong D.T., Scholz M., Armanini F., Ferretti P., Gorfer V., Pedrotti A., Tett A. Studying Vertical Microbiome Transmission from Mothers to Infants by Strain-Level Metagenomic Profiling. mSystems. 2017;2.
    5. Bäckhed F., Roswall J., Peng Y., Feng Q., Jia H., Kovatcheva-Datchary P., Li Y., Xia Y., Xie H., Zhong H. Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life. Cell Host Microbe. 2015;17:852.
