Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling

doi: 10.1186/s13059-016-0969-1.

Keith Simmon, Chase Miller, Yi Qiao, Brett Kennedy, Tonya Di Sera, Erin H Graf, Keith D Tardif, Aurélie Kapusta, Shawn Rynearson, Chris Stockmann, Krista Queen, Suxiang Tong, Karl V Voelkerding, Anne Blaschke, Carrie L Byington, Seema Jain, Andrew Pavia, Krow Ampofo, Karen Eilbeck, Gabor Marth, Mark Yandell, Robert Schlaberg

Steven Flygare et al. Genome Biol. 2016.

Abstract

Background: High-throughput sequencing enables unbiased profiling of microbial communities, universal pathogen detection, and host response to infectious diseases. However, computation times and algorithmic inaccuracies have hindered adoption.

Results: We present Taxonomer, an ultrafast web tool for comprehensive metagenomics data analysis and interactive results visualization. Taxonomer is unique in providing integrated nucleotide- and protein-based classification and simultaneous host messenger RNA (mRNA) transcript profiling. Using real-world case studies, we show that Taxonomer detects previously unrecognized infections and reveals antiviral host mRNA expression profiles. To facilitate data sharing across geographic distances in outbreak settings, Taxonomer is publicly available through a web-based user interface.

Conclusions: Taxonomer enables rapid, accurate, and interactive analyses of metagenomics data on personal computers and mobile devices.

Keywords: Infectious disease diagnostics; Metagenomics; Microbiome; Pathogen detection.

Figures

Fig. 1

Overview of Taxonomer architecture and user interface. a Taxonomer’s architecture. Raw FASTA, FASTQ, or SRA files (with or without gzip compression) are the input for Taxonomer. For paired-end data, mate pairs are analyzed jointly. Taxonomer consists of four main modules. The “Binner” module categorizes (“bins”) reads into broad taxonomic groups (host and microbial), followed by comprehensive microbial and host gene expression profiling at the nucleotide level (“Classifier” module) or amino acid level (“Protonomer” and “Afterburner” modules). Normalized host gene expression (gene-level read counts) and microbial profiles can be downloaded. Read subsets can be downloaded for custom downstream analyses. b Taxonomer web service. To further remove barriers for academic and clinical adoption of metagenomics, we developed a web interface for Taxonomer that allows users to stream sequencing read files (stored locally or HTTP-accessible) to the analysis server and interactively visualize results in real time. Main features are described in grey boxes. Taxonomic classification of bacteria, fungi, and viruses is visualized as a sunburst graph (center), in which the size of a given slice represents the relative abundance at the read level. Taxonomic ranks are shown hierarchically with the highest rank in the center of the graph. Sequences that cannot be classified to the species level, either because they are shared between taxa or represent novel microorganisms, are collapsed to the lowest common ancestor and shown as part of slices that terminate at higher taxonomic ranks (e.g. genus, family)
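The lowest-common-ancestor collapse mentioned in the Fig. 1 caption can be illustrated with a minimal sketch. This is not Taxonomer's implementation; it assumes each candidate assignment is available as a root-to-species list of rank names, and the function name and example lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by all lineages.

    Each lineage is ordered from the highest rank (e.g. kingdom) down to
    the lowest (e.g. species). An empty input yields an empty lineage.
    """
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so the prefix never overruns one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical read whose sequence is shared by two species of one family.
candidate_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Escherichia",
     "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
     "Enterobacterales", "Enterobacteriaceae", "Shigella",
     "Shigella flexneri"],
]
collapsed = lowest_common_ancestor(candidate_lineages)
print(collapsed[-1])  # the rank at which the read's slice would terminate
```

Under these assumptions, a read that matches both species is reported at the family rank, which corresponds to a sunburst slice that terminates above the species level.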

Fig. 2

Performance of the “Classifier” module for bacterial and fungal classification and bacterial community profiling. a Taxonomer provides superior sensitivity and specificity for read-level bacterial classification compared to two other rapid classification tools, SURPI [32] and Kraken [30], when using each tool’s default settings and databases: nt (www.ncbi.nlm.nih.gov/nucleotide, SURPI), RefSeq (Kraken), and Greengenes 99 % [70] OTU (Taxonomer). Results for SURPI are based on correct identification by either (dark bar) or both (light bar) read mates. b Of the three commonly used reference databases RefSeq (n = 210,627; 5,242 bacterial genomes), Greengenes 99 % OTU (n = 203,452), and RDP (n = 2,929,433), Taxonomer provides the greatest read-level (top) and taxon-level (bottom, i.e. percentage of bacterial species identified) sensitivity for bacterial classification at only a moderate decrease in specificity when using the Greengenes database compared to the RDP and RefSeq databases (simulated 16S rDNA as in a). Because of its large size and greater completeness, the RDP database provides the greatest species-level specificity at the tradeoff of sensitivity. For ease of reference, the top right-most column is repeated from (a). c Bacterial classification accuracy of Taxonomer is similar to the RDP Classifier [35] and superior to Kraken at the read-level (top) and taxon-level (bottom, all using the Greengenes database). Given the applied criteria, BLAST [34] is less sensitive but more specific. d Taxonomer also performs similarly to the RDP Classifier and better than Kraken for classification of synthetic fungal internal transcribed spacer (ITS) sequences at the read-level (top) and taxon-level (bottom). e Taxonomer classifies bacterial 16S rRNA reads at >200-fold increased speed compared to the RDP Classifier (times for 1 CPU, multithreading not available for RDP Classifier) while providing highly comparable bacterial community profiles when using 16S rRNA gene amplicon sequencing and shotgun metagenomics. Spearman correlation coefficients (ρ) of abundance estimates are shown for Taxonomer and the RDP Classifier at the order and genus-levels using the Greengenes 99 % OTU reference database. *2.5 %; **1.9 %; ***2.5 %
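The distinction between read-level and taxon-level sensitivity in the Fig. 2 caption can be made concrete with a small sketch. The definitions below are assumptions for illustration (a read counts as correct only if its predicted species equals the simulated source species; a species counts as identified if at least one read is assigned to it) and may differ from the criteria applied in the paper. All names and data are hypothetical.

```python
from typing import Dict, Optional


def read_level_sensitivity(truth: Dict[str, str],
                           predicted: Dict[str, Optional[str]]) -> float:
    """Fraction of reads whose predicted species matches the true species."""
    correct = sum(
        1 for read_id, species in truth.items()
        if predicted.get(read_id) == species
    )
    return correct / len(truth)


def taxon_level_sensitivity(truth: Dict[str, str],
                            predicted: Dict[str, Optional[str]]) -> float:
    """Fraction of true species that receive at least one assigned read."""
    true_species = set(truth.values())
    identified = {s for s in predicted.values() if s in true_species}
    return len(identified) / len(true_species)


# Hypothetical simulated reads: read ID -> source species.
truth = {"r1": "species_A", "r2": "species_A",
         "r3": "species_B", "r4": "species_C"}
# Hypothetical classifier output; None marks an unclassified read.
predicted = {"r1": "species_A", "r2": "species_B",
             "r3": "species_B", "r4": None}

print(read_level_sensitivity(truth, predicted))
print(taxon_level_sensitivity(truth, predicted))
```

The two measures can diverge: a species is counted at the taxon level as soon as one read is assigned to it, even when many of its reads are misclassified.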

Fig. 3

Performance characteristics of the “Protonomer” module for virus detection. RNA-Seq data from 24 samples known to harbor respiratory viruses (Additional file 1: Figure S9 and Table S11) were binned and the “viral” and “unclassified” bins were taxonomically classified by Protonomer, RAPSearch2 [36] (default and fast settings), and DIAMOND [37] (default and sensitive settings). Mean pairwise, genome-level sequence identities of the 24 respiratory viruses to reference sequences in the NCBI nt database were 93.7 % (range, 75.9–99.8 %). a Sensitivity. Protonomer (94.6 ± 2.7 %) and RAPSearch2 (default, 95.0 ± 2.2 %; fast, 94.8 ± 2.2 %) were more sensitive than DIAMOND (default, 90.5 ± 2.7 %; sensitive, 90.5 ± 2.7 %). b Specificity. Conversely, Protonomer (90.7 ± 17.1 %) and DIAMOND (default: 92.0 ± 17.1 %, sensitive: 91.9 ± 14.9 %) provided higher specificity than RAPSearch2 in default mode (88.0 ± 20.0 %). c Analysis times. Protonomer classifies reads faster than RAPSearch2 (24-fold compared to default mode, 11-fold compared to fast mode) and DIAMOND (2.6-fold compared to default mode, 3.3-fold compared to sensitive mode). All tools were run on 16 CPUs

Fig. 4

Performance characteristics of the “Classifier” module for host transcript expression profiling. a Published RNA-seq data from a commercially available RNA standard (MAQC, Additional file 1: Table S12) were analyzed by Taxonomer, Sailfish, and Cufflinks and estimated transcript expression was compared to data obtained by quantitative PCR (qPCR). Gene-level Pearson and Spearman correlation coefficients for RNA-seq vs. qPCR were 0.85 and 0.84 for Taxonomer, 0.87 and 0.86 for Sailfish, and 0.80 and 0.80 for Cufflinks, respectively. b Application of Taxonomer to metagenomic RNA-seq data from routine respiratory samples from patients with influenza infection (n = 4). c Classification of viral sequencing reads by Protonomer and typing of this strain as influenza A(H1N1)pdm09 (top right sample from a). d Differential gene-level mRNA expression profiles from four patients with influenza A virus compared to asymptomatic controls (n = 40; top 50 differentially expressed genes are shown). Expression profiles for 17 genes were significantly higher in influenza-positive patients (Additional file 1: Table S5). e Expression profiles for the 17 most differentially expressed genes differentiate cases from controls (principal component analysis, PC1 and PC2 explaining 93.8 % of the total variance). f Normalized expression levels for individual patients of seven of the top 17 genes. Gene ontology assignments for enrichment of biological processes (g) and molecular functions (h) are shown
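As an illustration of the gene-level comparison in Fig. 4a, the sketch below computes Pearson and Spearman correlation coefficients between two paired expression vectors using only the standard library. It is not the authors' analysis code; the function names are hypothetical and the expression values are invented, so the output does not reproduce the coefficients reported in the caption.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented log-scale expression estimates for five genes.
rnaseq_estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
qpcr_estimates = [1.2, 1.9, 3.2, 3.8, 5.1]

print(pearson(rnaseq_estimates, qpcr_estimates))
print(spearman(rnaseq_estimates, qpcr_estimates))
```

Pearson measures linear agreement of the values themselves, whereas Spearman depends only on their ordering, which is why both are commonly reported side by side for expression comparisons.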

Fig. 5

Case studies. Detection of highly pathogenic viruses (a–c). To simulate viral detection and discovery in public health emergencies by Taxonomer, we removed all viral target protein sequences (as per corresponding publications [41–43]) from the reference database and analyzed published RNA-seq data with Taxonomer. The predicted viruses were detected in all cases: (a) novel Rhabdovirus in RNA-Seq data (SRR533978) from serum of a patient with hemorrhagic fever in the Democratic Republic of Congo (DRC), now known as Bas Congo Virus [41]; approximately 13 % of target reads from this highly divergent virus were classified at the family level (Rhabdoviridae) with genus-level assignments of Lyssavirus (1), Ephemerovirus (2), unassigned Rhabdoviridae (3), Tibrovirus (4), Sigmavirus (5); (b) avian influenza virus H7N9 in RNA-Seq data (SRR900273) from a throat swab of a patient in Shanghai with H7N9 infection [42]; (c) Ebola virus, strain Zaire 1995, in RNA-Seq data (SRR1553464) from serum of a patient with suspected Ebola virus disease in Sierra Leone [43]. Detection of previously unrecognized infections. d Taxonomer detected a previously unrecognized Chlamydophila psittaci infection (psittacosis) in plasma from a patient with suspected Ebola virus disease in Sierra Leone (SRR1564804) [43]. The 16S rRNA gene was covered a mean of 7035-fold, with the consensus 16S rRNA sequence from this isolate sharing 99.9 % identity with the type strain (6BC, ATCC VR-125, CPU68447), enabling reliable identification. Positions of two single nucleotide polymorphisms are highlighted in red. e Taxonomer detected a novel Anellovirus in a nasopharyngeal swab. Forty-four reads were classified at the family level (Anelloviridae) or below. Mapping reads back to a manually constructed viral consensus genome sequence showed 14-fold mean coverage, 68.5 % pairwise nucleotide-level identity and 44–60 % predicted protein identity with TTV-like mini virus isolate LIL-y1 (EF538880.1). f Identification of Mycoplasma yeatsii contamination in RNA-seq data from cultured iPS cells (right) compared to a non-contaminated iPS cell culture (left) based on read binning (top). High expression of rRNA is demonstrated by 32 % of RNA-Seq reads mapping to the M. yeatsii 16S rRNA gene (245,000X coverage, 99.4 % sequence identity with type strain GIH (MYU67946))
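The mean fold coverage and pairwise identity figures quoted in the Fig. 5 caption are standard alignment summaries. The sketch below shows one common way to compute them; it is an assumption about the general approach rather than the authors' pipeline, the function names are hypothetical, and the inputs are toy values unrelated to the reported results.

```python
from typing import Sequence


def mean_coverage(aligned_lengths: Sequence[int], reference_length: int) -> float:
    """Mean fold coverage: total aligned bases divided by reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return sum(aligned_lengths) / reference_length


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a gapped pairwise alignment.

    Both strings must have equal length and use '-' for gaps. Columns in
    which both sequences have a gap are ignored; a gap opposite a base
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy inputs: three aligned reads on a 125-base reference, and a
# six-column alignment with one substitution and one gap.
print(mean_coverage([100, 100, 50], 125))
print(percent_identity("ACGT-A", "ACCTTA"))
```

Identity values depend on how gap columns are counted, so figures from different tools are only comparable when the same convention is used.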
