viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data (original) (raw)

Metaviral SPAdes: assembly of viruses from metagenomic data

Bioinformatics, 2020

Motivation Although the set of currently known viruses has been steadily expanding, only a tiny fraction of the Earth’s virome has been sequenced so far. Shotgun metagenomic sequencing provides an opportunity to reveal novel viruses but faces the computational challenge of identifying viral genomes that are often difficult to detect in metagenomic assemblies. Results We describe a MetaviralSPAdes tool for identifying viral genomes in metagenomic assembly graphs that is based on analyzing variations in the coverage depth between viruses and bacterial chromosomes. We benchmarked MetaviralSPAdes on diverse metagenomic datasets, verified our predictions using a set of virus-specific Hidden Markov Models and demonstrated that it improves on the state-of-the-art viral identification pipelines. Availability and implementation Metaviral SPAdes includes ViralAssembly, ViralVerify and ViralComplete modules that are available as standalone packages: https://github.com/ablab/spades/tree/metavir...

Assembly of viral genomes from metagenomes

Frontiers in Microbiology, 2014

Viral infections remain a serious global health issue. Metagenomic approaches are increasingly used in the detection of novel viral pathogens but also to generate complete genomes of uncultivated viruses. In silico identification of complete viral genomes from sequence data would allow rapid phylogenetic characterization of these new viruses. Often, however, complete viral genomes are not recovered, but rather several distinct contigs derived from a single entity are, some of which have no sequence homology to any known proteins. De novo assembly of single viruses from a metagenome is challenging, not only because of the lack of a reference genome, but also because of intrapopulation variation and uneven or insufficient coverage. Here we explored different assembly algorithms, remote homology searches, genome-specific sequence motifs, kmer frequency ranking, and coverage profile binning to detect and obtain viral target genomes from metagenomes. All methods were tested on 454-generated sequencing datasets containing three recently described RNA viruses with a relatively large genome which were divergent to previously known viruses from the viral families Rhabdoviridae and Coronaviridae. Depending on specific characteristics of the target virus and the metagenomic community, different assembly and in silico gap closure strategies were successful in obtaining near complete viral genomes.

Recovering full-length viral genomes from metagenomes

Frontiers in microbiology, 2015

Infectious disease metagenomics is driven by the question: "what is causing the disease?" in contrast to classical metagenome studies which are guided by "what is out there?" In case of a novel virus, a first step to eventually establishing etiology can be to recover a full-length viral genome from a metagenomic sample. However, retrieval of a full-length genome of a divergent virus is technically challenging and can be time-consuming and costly. Here we discuss different assembly and fragment linkage strategies such as iterative assembly, motif searches, k-mer frequency profiling, coverage profile binning, and other strategies used to recover genomes of potential viral pathogens in a timely and cost-effective manner.

Challenges in the analysis of viral metagenomes

Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.

Hecatomb: An End-to-End Research Platform for Viral Metagenomics

bioRxiv (Cold Spring Harbor Laboratory), 2022

Background: Analysis of viral diversity using modern sequencing technologies offers extraordinary opportunities for discovery. However, these analyses present a number of bioinformatic challenges due to viral genetic diversity and virome complexity. Due to the lack of conserved marker sequences, metagenomic detection of viral sequences requires a non-targeted, random (shotgun) approach. Annotation and enumeration of viral sequences relies on rigorous quality control and effective search strategies against appropriate reference databases. Virome analysis also benefits from the analysis of both individual metagenomic sequences as well as assembled contigs. Combined, virome analysis results in large amounts of data requiring sophisticated visualization and statistical tools. Results: Here we introduce Hecatomb, a bioinformatics platform enabling both read and contig based analysis. Hecatomb integrates query information from both amino acid and nucleotide reference sequence databases. Hecatomb integrates data collected throughout the workflow enabling analyst driven virome analysis and discovery. Hecatomb is available on GitHub at https://github.com/shandley/hecatomb. Conclusions: Hecatomb provides a single, modular software solution to the complex tasks required of many virome analysis. We demonstrate the value of the approach by applying Hecatomb to both a host-associated (enteric) and an environmental (marine) virome data set. Hecatomb provided data to determine true-or false-positive viral sequences in both data sets and revealed complex virome structure at distinct marine reef sites. .

Accurate viral genome reconstruction and host assignment with proximity-ligation sequencing

bioRxiv, 2021

Viruses play crucial roles in the ecology of microbial communities, yet they remain relatively understudied in their native environments. Despite many advancements in high-throughput whole-genome sequencing (WGS), sequence assembly, and annotation of viruses, the reconstruction of full-length viral genomes directly from metagenomic sequencing is possible only for the most abundant phages and requires long-read sequencing technologies. Additionally, the prediction of their cellular hosts remains difficult from conventional metagenomic sequencing alone. To address these gaps in the field and to accelerate the study of viruses directly in their native microbiomes, we developed an end-to-end bioinformatics platform for viral genome reconstruction and host attribution from metagenomic data using proximity-ligation sequencing (i.e., Hi-C). We demonstrate the capabilities of the platform by recovering and characterizing the metavirome of a variety of metagenomes, including a fecal microbio...

Linking environmental prokaryotic viruses and their host through CRISPRs

FEMS microbiology ecology, 2015

The ecological pressure that viruses place on microbial communities is not only based on predation, but also on gene transfer. In order to determine the potential impact of viruses and transduction, we need a better understanding of the dynamics of interactions between viruses and their hosts in the environment. Data on environmental viruses is scarce and methods for tracking their interactions with prokaryotes are needed. Clustered regularly interspaced short palindromic repeats (CRISPRs), which contain viral sequences in bacterial genomes, might help document the history of viral-host interactions in the environment. In this study, a bioinformatics network linking viruses and their hosts using CRISPR sequences obtained from metagenomic data was developed and applied to metagenomes from arctic glacial ice and soil. The application of our network approach showed that putative interactions were more commonly detected in the ice samples than the soil which would be consistent with the...

ViromeScan: a new tool for metagenomic viral community profiling

BMC Genomics, 2016

Background: Bioinformatics tools available for metagenomic sequencing analysis are principally devoted to the identification of microorganisms populating an ecological niche, but they usually do not consider viruses. Only some software have been designed to profile the viral sequences, however they are not efficient in the characterization of viruses in the context of complex communities, like the intestinal microbiota, containing bacteria, archeabacteria, eukaryotic microorganisms and viruses. In any case, a comprehensive description of the host-microbiota interactions can not ignore the profile of eukaryotic viruses within the virome, as viruses are definitely critical for the regulation of the host immunophenotype. Results: ViromeScan is an innovative metagenomic analysis tool that characterizes the taxonomy of the virome directly from raw data of next-generation sequencing. The tool uses hierarchical databases for eukaryotic viruses to unambiguously assign reads to viral species more accurately and >1000 fold faster than other existing approaches. We validated ViromeScan on synthetic microbial communities and applied it on metagenomic samples of the Human Microbiome Project, providing a sensitive eukaryotic virome profiling of different human body sites. Conclusions: ViromeScan allows the user to explore and taxonomically characterize the virome from metagenomic reads, efficiently denoising samples from reads of other microorganisms. This implies that users can fully characterize the microbiome, including bacteria and viruses, by shotgun metagenomic sequencing followed by different bioinformatic pipelines.

VirAmp: a galaxy-based viral genome assembly

2015

Background: Advances in next generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limits the ability of individual labs to perform further bioinformatics analysis. Findings: We have developed a multi-step viral genome assembly pipeline named VirAmp, which combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze, and interpret high coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy to use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). Conclusions: VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A fully functional demonstration instance of our system can be found at http://viramp.com/. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com.

VirAmp: a galaxy-based viral genome assembly pipeline

GigaScience, 2015

Background: Advances in next generation sequencing make it possible to obtain high-coverage sequence data for large numbers of viral strains in a short time. However, since most bioinformatics tools are developed for command line use, the selection and accessibility of computational tools for genome assembly and variation analysis limits the ability of individual labs to perform further bioinformatics analysis. Findings: We have developed a multi-step viral genome assembly pipeline named VirAmp, which combines existing tools and techniques and presents them to end users via a web-enabled Galaxy interface. Our pipeline allows users to assemble, analyze, and interpret high coverage viral sequencing data with an ease and efficiency that was not possible previously. Our software makes a large number of genome assembly and related tools available to life scientists and automates the currently recommended best practices into a single, easy to use interface. We tested our pipeline with three different datasets from human herpes simplex virus (HSV). Conclusions: VirAmp provides a user-friendly interface and a complete pipeline for viral genome analysis. We make our software available via an Amazon Elastic Cloud disk image that can be easily launched by anyone with an Amazon web service account. A fully functional demonstration instance of our system can be found at http://viramp.com/. We also maintain detailed documentation on each tool and methodology at http://docs.viramp.com.