VirSorter: mining viral signal from microbial genomic data

Simon Roux et al. PeerJ. 2015.

Abstract

Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations, including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter's prophage prediction capability is comparable to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequences (contigs) as short as 3 kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10 kb. Because VirSorter scales to large datasets, it can also be used in "reverse" to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA, whether derived from gene transfer agents, generalized transduction or contamination. Finally, VirSorter is made available through the iPlant Cyberinfrastructure, which provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems.

Keywords: Bacteriophage; Metagenomics; Prophage; Single-cell amplified genome; Viral metagenomics; Virus.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. VirSorter process: overview (A) and examples of viral sequence detection (B).

(A) Overview of the VirSorter process. The top part describes the different parts of the sequence analysis pipeline, and the bottom frame summarizes the classification into three categories of decreasing confidence, based on whether each metric is significant (green dot) or not (black cross). Viral “hallmark” genes or protein clusters (PCs) were identified by looking for genes typically of viral origin that are annotated as “major capsid protein,” “portal,” “terminase large subunit,” “spike,” “tail,” “virion formation” or “coat” and manually removing all protein domains with a potential overlap with microbial functions. (B) Examples of viral sequence detection by VirSorter. On top is the clearest case, in which a sequence harbors several viral hallmark genes as well as enrichment in viral-like genes (or virome-like when the genes are most similar to a viral metagenome sequence, when using the Viromes database). This type of detection is considered the most confident. The three examples below are different cases in which only one of the primary metrics is significant. Notably, these examples show how VirSorter can detect new viruses based on a significant depletion in characterized genes associated with a viral hallmark gene (case 3), and how the same number of genes can be a non-significant enrichment when considering all viruses, yet significant when looking at only the non-Caudovirales (case 4). These detections are still considered confident, although less sure than case 1. Finally, a last example (case 5) shows a more ambiguous situation, in which a sequence displays only secondary viral metrics but neither viral gene enrichment nor a viral hallmark gene. For these detections, at least one of the metrics must have an E-value lower than 10^-4 (note that significance scores used in VirSorter output files are computed as negative log10 transformations of E-values, and would here correspond to a score of 4 or more).
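
The score transformation and the three confidence tiers described in this caption can be illustrated with a short Python sketch. This is not VirSorter's implementation: the function names, the boolean inputs and the simplified decision rules are assumptions made for illustration, and only the negative log10 transformation and the 10^-4 threshold come from the caption.

```python
import math


def significance_score(e_value):
    """Negative log10 of an E-value (the transformation named in the caption)."""
    if e_value <= 0:
        raise ValueError("E-value must be positive")
    return -math.log10(e_value)


def confidence_category(has_hallmark, viral_enrichment, secondary_e_values,
                        threshold=1e-4):
    """Hypothetical, simplified tiering of a sequence (not VirSorter's code).

    1: hallmark gene(s) AND enrichment in viral-like genes (case 1)
    2: only one of the two primary metrics is significant (cases 2 to 4)
    3: no primary metric, but at least one secondary metric has an
       E-value below the threshold (case 5)
    None: no viral signal under these simplified rules
    """
    if has_hallmark and viral_enrichment:
        return 1
    if has_hallmark or viral_enrichment:
        return 2
    if any(e < threshold for e in secondary_e_values):
        return 3
    return None


if __name__ == "__main__":
    # An E-value of 1e-4 corresponds to a score of about 4.
    print(round(significance_score(1e-4), 6))
    print(confidence_category(True, True, []))        # most confident tier
    print(confidence_category(False, False, [1e-6]))  # secondary metrics only
```

In this sketch a lower category number means higher confidence, mirroring the "decreasing confidence" ordering of the caption.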

Figure 2. Accuracy of viral sequence predictions of VirSorter, PHAST, Phage_finder and PhiSpy on (A) complete microbial genomes, and (B) draft genomes from simulated SAGs including a microbial and viral genome.

For each set of predictions (i.e., each tool and set of options when applicable), the two metrics used to evaluate the tool performance are Recall (x-axis, proportion of known viral sequences or regions detected) and Precision (y-axis, proportion of predictions that corresponded to known viral sequences or regions). Prophages identified in the complete microbial genomes are compared to the list of manually curated prophages from Casjens (2003).
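
As a minimal illustration of the two metrics defined above, the following Python sketch computes Recall and Precision from a set of predicted sequence identifiers and a set of known viral sequence identifiers. The function name and the identifiers are hypothetical, and exact matching of identifiers is an assumption; comparing predicted and curated prophage regions in a real benchmark would require overlap rules that the caption does not specify.

```python
def recall_precision(predicted, known):
    """Return (recall, precision) for two sets of sequence identifiers.

    recall    = detected known viral sequences / all known viral sequences
    precision = predictions that are known viral sequences / all predictions
    """
    predicted = set(predicted)
    known = set(known)
    true_positives = len(predicted & known)
    recall = true_positives / len(known) if known else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    known_viral = {"contig_1", "contig_2", "contig_5"}
    predictions = {"contig_1", "contig_2", "contig_3", "contig_4"}
    recall, precision = recall_precision(predictions, known_viral)
    print(f"Recall={recall:.2f} Precision={precision:.2f}")
```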

Figure 3. Detection of viral sequences in microbial metagenomes by VirSorter.

(A) Average Recall (x-axis) and Precision (y-axis) of viral sequence detection by VirSorter in 10 simulated microbial metagenomes for different contig size thresholds. (B) Detection of viral sequences by VirSorter in simulated microbial metagenomes by contig size fraction.

References

    1. Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Research. 2012;40:1–13. doi: 10.1093/nar/gks406.
    2. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology. 2013;31:533–538. doi: 10.1038/nbt.2579.
    3. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
    4. Anantharaman K, Duhaime MB, Breier JA, Wendt K, Toner BM, Dick GJ. Sulfur oxidation genes in diverse deep-sea viruses. Science. 2014;344:757–760. doi: 10.1126/science.1252229.
    5. Boyd EF. Bacteriophage-encoded bacterial virulence factors and phage-pathogenicity island interactions. In: Łobocka M, Szybalski WT, editors. Advances in virus research. vol. 82. Amsterdam: Elsevier; 2012. pp. 91–118.

Grants and funding

This work was performed under the auspices of the Gordon and Betty Moore Foundation (#3790) through grants awarded to Matthew B. Sullivan. Simon Roux was partially supported by the University of Arizona Ecosystem Genomics Institute through a grant from the Technology and Research Initiative Fund through the Water, Environmental and Energy Solutions Initiative. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
