Natural variation in SAR11 marine bacterioplankton genomes inferred from metagenomic data

Larry J Wilhelm et al. Biol Direct. 2007.

Abstract

Background: One objective of metagenomics is to reconstruct information about specific uncultured organisms from fragmentary environmental DNA sequences. We used the genome of an isolate of the marine alphaproteobacterium SAR11 ('Candidatus Pelagibacter ubique'; strain HTCC1062), obtained from the cold, productive Oregon coast, as a query sequence to study variation in SAR11 metagenome sequence data from the Sargasso Sea, a warm, oligotrophic ocean gyre.

Results: The average amino acid identity of SAR11 genes encoded by the metagenomic data to the query genome was only 71%, indicating significant evolutionary divergence between the coastal isolates and Sargasso Sea populations. However, an analysis of gene neighbors indicated that SAR11 genes in the Sargasso Sea metagenomic data match the gene order of the HTCC1062 genome in 96% of cases (> 85,000 observations), and that rearrangements are most frequent at predicted operon boundaries. There were no conserved examples of genes with known functions being found in the coastal isolates, but not the Sargasso Sea metagenomic data, or vice versa, suggesting that core regions of these diverse SAR11 genomes are relatively conserved in gene content. However, four hypervariable regions were observed, which may encode properties associated with variation in SAR11 ecotypes. The largest of these, HVR2, is a 48 kb region flanked by the sole 5S and 23S genes in the HTCC1062 genome, and mainly encodes genes that determine cell surface properties. A comparison of two closely related 'Candidatus Pelagibacter' genomes (HTCC1062 and HTCC1002) revealed a number of "gene indels" in core regions. Most of these were found to be polymorphic in the metagenomic data and showed evidence of purifying selection, suggesting that the same "polymorphic gene indels" are maintained in physically isolated SAR11 populations.

Conclusion: These findings suggest that natural selection has conserved many core features of SAR11 genomes across broad oceanic scales, but significant variation was found associated with four hypervariable genome regions. The data also led to the hypothesis that some gene insertions and deletions might be polymorphisms, similar to allelic polymorphisms.

Figures

Figure 1

Maximum likelihood tree of recA amino acid sequences. The tree includes Sargasso Sea metagenomic data and predicted proteins from cultured isolates of Pelagibacter. The sequence data were derived from Vergin et al., 2007 [37], with the addition of a sequence from strain HTCC7211, which was recently isolated from the Sargasso Sea. All other HTCC strains are from the coastal Pacific Ocean. The non-HTCC numbers correspond to the fragment identifier in the Sargasso Sea data set. The scale bar indicates substitutions per nucleotide position.

Figure 2

Schematic diagram of procedures used to "bin" the classes of metagenomic data. The query genome, in this case HTCC1062, is represented on the x-axis. A) TBLASTN of protein sequences from the query genome against metagenomic data. "Homologous fragments" were defined as fragments of metagenomic data with expect scores of 1 × 10⁻¹⁰ or better to genes from the query genome. B) "Homologous fragments with synteny" contain homologs in the same gene order as the query genome, with as many as 5 gene gaps (gene deletions) allowed. C) Best-hit test. Fragments of metagenomic data pass the test if the nucleotide sequence of the fragment gene yields the corresponding query gene as the best hit in a BLASTX search of the NCBI nr database. D) The position of the fragment on the vertical axis corresponds to the average amino acid identity score of all the genes on the fragment.
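The binning steps in panels A, B and D can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Hit` record, its field names, and the helper functions are hypothetical, and the sketch assumes that each BLAST match has already been reduced to a query gene index, a start position on the fragment, an expect score, and a percent amino acid identity. It also assumes that a fragment may match the query gene order in either orientation, requires at least two matched genes, and ignores wrap-around on the circular genome. The best-hit test in panel C needs a database search and is left out.

```python
from dataclasses import dataclass
from typing import List

E_VALUE_CUTOFF = 1e-10  # expect-score threshold stated in the caption
MAX_GENE_GAPS = 5       # missing intervening query genes tolerated between neighbors


@dataclass
class Hit:
    """One query-gene match on a metagenomic fragment (hypothetical record)."""
    query_gene_index: int    # position of the gene in the query genome order
    fragment_start: int      # start coordinate of the match on the fragment
    e_value: float           # BLAST expect score
    percent_identity: float  # amino acid identity of the alignment


def homologous_hits(hits: List[Hit]) -> List[Hit]:
    """Panel A: keep matches at or below the expect-score cutoff."""
    return [h for h in hits if h.e_value <= E_VALUE_CUTOFF]


def is_syntenic(hits: List[Hit]) -> bool:
    """Panel B: matched genes follow the query gene order, with gaps allowed."""
    kept = sorted(homologous_hits(hits), key=lambda h: h.fragment_start)
    if len(kept) < 2:
        return False
    # Difference in query gene index between neighbors on the fragment.
    steps = [b.query_gene_index - a.query_gene_index for a, b in zip(kept, kept[1:])]
    # A step of 1 means adjacent genes; a step of 6 means 5 genes are missing.
    forward = all(1 <= s <= MAX_GENE_GAPS + 1 for s in steps)
    reverse = all(-(MAX_GENE_GAPS + 1) <= s <= -1 for s in steps)  # assumed: opposite strand
    return forward or reverse


def average_identity(hits: List[Hit]) -> float:
    """Panel D: mean amino acid identity over the homologous genes on a fragment."""
    kept = homologous_hits(hits)
    if not kept:
        raise ValueError("fragment has no homologous hits")
    return sum(h.percent_identity for h in kept) / len(kept)
```

For example, a fragment with matches to query genes 10, 11 and 14, in that order, would count as syntenic under these assumptions, because the two missing genes between 11 and 14 fall within the five-gene allowance.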

Figure 3

HTCC1062 genome coverage for the different classes of metagenomic data. The data for the figure are described numerically in Table 1. A) The number of homologous fragments (TBLASTX expect scores ≤ 1 × 10⁻¹⁰) for each HTCC1062 gene, plotted by position in the HTCC1062 genome. B) GC content of the HTCC1062 genome. C) The distribution of homologous fragments that passed the best-hit test, regardless of synteny. The data in this plot include fragments that cover one or more genes. The plotted amino acid identities are for the individual genes, not averaged as they are in the syntenic fragment plot below. D) Syntenic fragment coverage of the HTCC1062 genome as a function of gene position and amino acid identity. See Fig. 2 for an explanation of syntenic fragments. Fragments in this category ("bin") include parts of at least two genes that could be identified by TBLASTX. Regions of blue on the fragments indicate gaps. Syntenic fragments were allowed to be missing as many as five intervening genes (gaps) between the syntenic genes. Genes that encode ribosomal proteins are indicated in black.

Figure 4

Re-arrangements in the order of SAR11 genes in the Sargasso Sea metagenome, relative to the HTCC1062 genome. The genome of HTCC1062 is represented by the outer circle. Internal lines (chords) indicate SAR11 gene rearrangements found on environmental sequence fragments. The number of occurrences of each gene rearrangement is indicated by the color scale.

Figure 5

Detail of Figs 3C and 3D, in the vicinity of the proteorhodopsin gene. A) Homologous fragments that passed the best-hit test, and B) syntenic fragments. Regions of blue on the fragments indicate gaps. Only syntenic fragments containing 3 or more genes are shown. (SMRP) small multi-drug resistance protein, (ACAS) acyl-coenzyme A synthetase, (FD) ferredoxin, (TD) thioredoxin disulfide reductase, (GST) glutathione S-transferase, (DKS) DnaK suppressor protein.

Figure 6

Enlargement of HVR2. HTCC1062 syntenic fragment plot showing detail in the region of HVR2.

Figure 7

Deletion of duplicate genes in the HVR1 region of strains HTCC1002 and HTCC1062. Strain HTCC1002 appears at the top of the display and HTCC1062 at the bottom. One of four homologous Type V autotransporters is deleted in HTCC1062 relative to HTCC1002, and one of two homologous ammonium transporters is deleted in HTCC1002 relative to HTCC1062.

Figure 8

Summary of analysis of HTCC1002/HTCC1062 "gene indels" in core regions. The flow diagram shows how 62 HTCC1002/HTCC1062 gene indels and the data from the syntenic fragment plot were combined to choose 13 genes for the tests of selection data shown in Table 4.

Figure 9

Syntig coverage of HTCC1002/HTCC1062 deletions. Shaded areas highlight the syntig coverage above the genes in HTCC1062 that are deleted in closely related strain HTCC1002. Panels A-G show all 44 genes of this type with HTCC1062 gene numbers on the x-axis.

References

    1. Welch RA, Burland V, Plunkett G, 3rd, Redford P, Roesch P, Rasko D, Buckles EL, Liou SR, Boutin A, Hackett J, et al. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci USA. 2002;99:17020–17024. doi: 10.1073/pnas.252529799.
    2. Thompson JR, Pacocha S, Pharino C, Klepac-Ceraj V, Hunt DE, Benoit J, Sarma-Rupavtarm R, Distel DL, Polz MF. Genotypic diversity within a natural coastal bacterioplankton population. Science. 2005;307:1311–1313. doi: 10.1126/science.1106028.
    3. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci USA. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
    4. Konstantinidis KT, Tiedje JM. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA. 2005;102:2567–2572. doi: 10.1073/pnas.0409727102.
    5. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
