GUNC: detection of chimerism and contamination in prokaryotic genomes

Askarbek Orakov et al. Genome Biol. 2021.

Abstract

Genomes are critical units in microbiology, yet ascertaining quality in prokaryotic genome assemblies remains a formidable challenge. We present GUNC (the Genome UNClutterer), a tool that accurately detects and quantifies genome chimerism based on the lineage homogeneity of individual contigs using a genome's full complement of genes. GUNC complements existing approaches by targeting previously underdetected types of contamination: we conservatively estimate that 5.7% of genomes in GenBank, 5.2% in RefSeq, and 15-30% of pre-filtered "high-quality" metagenome-assembled genomes in recent studies are undetected chimeras. GUNC provides a fast and robust tool to substantially improve prokaryotic genome quality.

Keywords: Bioinformatics; Genome contamination; Genome quality; Metagenome-assembled genomes; Metagenomics.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1

GUNC quantifies chimerism in prokaryotic genomes. a Genome contamination may originate in vitro (e.g., from culture media, laboratory equipment or kits, index hopping during multiplexed sequencing) or in silico (contig misassembly, erroneous binning). Genomes are represented as circular chromosomes, contigs as sequences of genes (dots). b Two types of genome contamination can be distinguished operationally: redundant contamination by surplus genomic material (“more of the same”) and non-redundant contamination by non-overlapping fragments from distantly related lineages (“something new,” e.g., novel or distant orthologs). Different single-copy marker genes (SCGs) are shown as solid shapes, other genes as dashed circles; colors indicate different source lineages. c GUNC workflow. For a given query genome, genes are called using prodigal, then mapped to the GUNC reference database (based on proGenomes 2.1) using diamond to compute GUNC scores and to generate interactive Sankey diagrams to visualize genome taxonomic composition. GUNC quantifies genome chimerism and reference representation across taxonomic levels. Clade separation scores (CSS) are high if gene classification to distinct lineages (represented by different colors) follows contig boundaries. Reference representation scores (RRS) are high if genes map closely and consistently into the GUNC reference space. The top example illustrates a chimeric genome with good reference representation, the bottom example a non-contaminated genome that is not well represented in the GUNC reference
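The idea that a score should be high when gene lineage assignments follow contig boundaries can be illustrated with a small toy measure. The sketch below is not GUNC's CSS formula, which is not given here; it is an assumed stand-in based on conditional entropy, and the function names `entropy` and `toy_separation` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def toy_separation(genes):
    """Toy score in [0, 1]: 1 - H(lineage | contig) / H(lineage).

    `genes` is a list of (contig_id, lineage) pairs, one per gene.
    Illustrative assumption only; this is NOT the published GUNC CSS.
    """
    h_lineage = entropy(Counter(lin for _, lin in genes).values())
    if h_lineage == 0:
        return 0.0  # a single lineage: nothing to separate
    by_contig = defaultdict(Counter)
    for contig, lineage in genes:
        by_contig[contig][lineage] += 1
    n = len(genes)
    # Entropy of lineage within each contig, weighted by contig size
    h_cond = sum(
        sum(c.values()) / n * entropy(c.values()) for c in by_contig.values()
    )
    return 1.0 - h_cond / h_lineage


# Lineages split cleanly by contig (chimera-like) versus mixed within contigs
chimeric = [("c1", "A")] * 5 + [("c2", "B")] * 5
mixed = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "B")] * 2
print(toy_separation(chimeric), toy_separation(mixed))
```

In this toy setting, a genome whose two lineages sit on separate contigs scores at the top of the range, while one whose lineages are evenly interleaved within every contig scores at the bottom.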

Fig. 2

GUNC accurately detects chimerism in incrementally challenging simulation scenarios. a Overview of different types of simulation scenarios. Genomes (filled circles) were simulated as mixtures of lineages (horizontal lines) diverging at various taxonomic levels (columns) from clades (void circles) contained in the GUNC reference (solid lines) or not (dashed). See “Methods” for details. b Median CheckM completeness and contamination estimates (dashed lines) diverged from true values (solid lines) with increasing levels of simulated contamination (type 3a in panel a), whereas GUNC estimates of contamination (green; theoretically expected values as blue solid line) and effective number of surplus lineages (purple) were highly accurate. See Figure S2 for an equivalent plot on type 3b genomes. c–g Detection accuracy across simulation scenarios, quantified using F1-scores (y-axis) across increasing levels of simulated contamination (x-axis). Data shown for scenarios 3a (c, d), 3b (e), 4 (f), and 5a (g); full panels for types 3a and 3b in Figure S3. MIMAG criteria were defined as CheckM contamination < 10%, completeness ≥ 50% (medium) and contamination < 5%, completeness > 90% (high); note that MIMAG criteria on rRNA and tRNA presence were not applied; Cont, CheckM contamination; GUNC, default GUNC CSS cutoff of > 0.45.
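The thresholds named in this caption can be written out as a short sketch. The cutoffs are those stated above; the function names, the "low" label for genomes below the medium tier, and the example values are assumptions made for illustration.

```python
def mimag_tier(completeness, contamination):
    """Classify a genome by the CheckM-based MIMAG criteria in the caption.

    Both arguments are percentages. rRNA/tRNA criteria are not applied,
    as in the caption. The "low" label is an assumption for everything else.
    """
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def gunc_flags_chimeric(css, cutoff=0.45):
    """True if the clade separation score exceeds the default cutoff."""
    return css > cutoff


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical genome: passes MIMAG "high" but is flagged by the CSS cutoff
print(mimag_tier(95.0, 2.0), gunc_flags_chimeric(0.62))
```

The example shows the case the paper is concerned with: a genome can satisfy the CheckM-based criteria and still exceed the CSS cutoff.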

Fig. 3

Extensive undetected chimerism in public genome databases and large-scale MAG datasets. a Cumulative plots summarizing genome quality for various genome reference and MAG datasets. The y-axis shows the fraction of genomes passing GUNC filtering at increasing stringency (x-axis), up to the default CSS threshold of 0.45, conservatively ignoring species-level scores. Note that the Almeida, Pasolli, and Nayfach sets were pre-filtered using variations of the MIMAG medium criterion based on CheckM estimates. GTDB, Genome Taxonomy Database; GMGC, Global Microbial Gene Catalog. b Example of detected contamination in an isolate-derived reference genome for which around one fifth of genes were assigned to a different phylum, scattered across hundreds of small contigs. c Example of detected contamination in a MAG for which genes assigned to two major different phyla were well separated into distinct contigs. d Cumulative plots summarizing the quality of species-level genome bins (SGBs) defined by Pasolli et al. [13]. Lines indicate the fraction of SGBs (y-axis) containing at least one or exclusively chimeric genomes at increasingly stringent GUNC cutoffs (x-axis) conservatively ignoring species-level scores. For both series, intervals correspond to edge scenarios in which genomes with limited reference representation are either conservatively ignored (treated as non-chimeric, upper lines) or aggressively removed (lower lines); the true fraction of chimeric SGBs likely falls in between. e Differential filtering of MAGs in the GMGC set based on CheckM contamination (< 5%), CheckM completeness (> 90%), and GUNC (CSS < 0.45, ignoring species-level scores)
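A cumulative curve of the kind described for panel a can be computed as in the following sketch. It assumes one CSS value per genome and treats a genome as passing when its CSS does not exceed the cutoff; the function name and the example scores are hypothetical.

```python
def fraction_passing(css_scores, cutoffs):
    """Fraction of genomes with CSS <= cutoff, for each cutoff in turn.

    Listing cutoffs from lenient (1.0) down to the default (0.45)
    corresponds to increasing filtering stringency.
    """
    n = len(css_scores)
    return [sum(score <= cutoff for score in css_scores) / n for cutoff in cutoffs]


# Hypothetical CSS values for four genomes
scores = [0.1, 0.3, 0.5, 0.9]
print(fraction_passing(scores, [1.0, 0.75, 0.45]))
```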

References

    1. Koonin EV, Galperin MY. Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr Opin Genet Dev. 1997;7(6):757–763. doi: 10.1016/S0959-437X(97)80037-8.
    2. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnol. 2018;36(10):996–1004. doi: 10.1038/nbt.4229.
    3. Schloss PD, Girard RA, Martin T, Edwards J, Thrash JC. Status of the archaeal and bacterial census: an update. MBio. 2016;7(3):e00201–16.
    4. Allen EE, Banfield JF. Community genomics in microbial ecology and evolution. Nat Rev Microbiol. 2005;3(6):489–498. doi: 10.1038/nrmicro1157.
    5. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult C, Tomb J, Dougherty B, Merrick J, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269(5223):496–512. doi: 10.1126/science.7542800.
