Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis

doi: 10.1038/s41467-017-00050-4.

Marghoob Mohiyuddin 1, Robert Sebra 2, Hagen Tilgner 3, Pegah T Afshar 4, Kin Fai Au 5, Narges Bani Asadi 1, Mark B Gerstein 6, Wing Hung Wong 7, Michael P Snyder 3, Eric Schadt 2, Hugo Y K Lam 8

Sayed Mohammad Ebrahim Sahraeian et al. Nat Commun. 2017.

Abstract

RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome.

RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Fig. 1

The RNACocktail analysis protocol. RNACocktail is a comprehensive protocol of RNA-seq data analysis. The figure summarizes the widely used approaches for the key steps over the broad spectrum of RNA-seq analysis and also succinctly captures the possible workflows one can use to analyse RNA-seq data

Fig. 2

Performance of different alignment schemes. a Overlap between the splice junctions detected by different schemes and their validation rate on reliable junctions in the dbEST database. A reliable EST junction set consists of junctions supported by at least two ESTs. The sizes of the circles reflect the number of junctions called by each scheme. For each tool, the number of junctions called and the validation rates (in parentheses) are shown. Validation rates for each subset of junctions are also shown on the Venn diagram. b Read mapping analysis: distribution of mapping status of sequenced fragments (left) (for NA12878, MCF7, and SEQC samples, mapping status for paired-end reads is shown, while for hESC, the distribution reflects the percentage of uniquely mapped (blue), multi-mapped (orange), and unmapped (red) single-end reads), distribution of the number of soft-clipped bases in mapped fragments (middle), distribution of the number of mismatches in mapped fragments (right)
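The reliable-junction rule in the Fig. 2 caption can be sketched in a few lines of Python. This is only an illustration: the function names, the `(chromosome, donor, acceptor)` junction tuples, and the reading of "validation rate" as the fraction of called junctions found in the reliable set are assumptions, not details taken from the paper.

```python
from collections import Counter


def reliable_junctions(est_junctions, min_support=2):
    """Keep junctions observed in at least `min_support` ESTs.

    `est_junctions` holds one (chrom, donor, acceptor) tuple per EST
    observation, so a junction seen in two ESTs appears twice.
    """
    counts = Counter(est_junctions)
    return {junction for junction, n in counts.items() if n >= min_support}


def validation_rate(called, reliable):
    """Fraction of called junctions present in the reliable set (assumed definition)."""
    called = set(called)
    if not called:
        return 0.0
    return len(called & reliable) / len(called)


# Hypothetical toy data, not from the study.
est_observations = [
    ("chr1", 100, 200),
    ("chr1", 100, 200),
    ("chr1", 300, 400),
    ("chr2", 50, 90),
    ("chr2", 50, 90),
    ("chr2", 50, 90),
]
reliable = reliable_junctions(est_observations)
aligner_calls = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90), ("chr3", 1, 2)]
rate = validation_rate(aligner_calls, reliable)
```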

Fig. 3

Performance of different transcriptome reconstruction schemes. a Distribution of the number of exons per transcript for different transcriptome reconstruction algorithms. Labels reflect the assembler, the long-read aligner (for IDP), and the short-read aligner used, respectively, with “-” separation. b Sensitivity and precision of different transcriptome reconstruction approaches at gene and transcript levels. The GENCODE reference transcriptome annotation is used as the truth set. The evaluations on a more recent update of the MCF7 sample using the Iso-Seq pipeline resulted in a similar performance with only slight improvement. The union approaches that combined predictions from short reads and long reads (shown with a “+” in the label) slightly improved the performance of short-read isoform prediction schemes

Fig. 4

Performance of different de novo transcriptome assembly techniques. a Distribution of transcript length. b N10-N50 values. (Nx is the contig length for which at least x% of the assembled transcript nucleotides were found in contigs that were at least of Nx length.) c ExN50 value at different expression percentiles. For ExN50, the N50 statistic was computed at percentile x, for only the top most highly expressed transcripts that represent x% of the total normalized expression data. Expression values were measured using eXpress
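The Nx and ExN50 definitions in the Fig. 4 caption translate directly into code. The following is a minimal sketch, assuming contigs are given as a list of lengths and transcripts as `(length, expression)` pairs; the function names and toy values are hypothetical, and the expression values would come from a quantifier such as eXpress in the study's setting.

```python
def nx(lengths, x):
    """Return Nx: the largest length L such that contigs of length >= L
    together contain at least x% of all assembled nucleotides."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= x * total:
            return length
    return min(lengths)  # only reached if x > 100


def exn50(transcripts, x):
    """N50 over the most highly expressed transcripts that together
    account for x% of the total normalized expression.

    `transcripts` is a list of (length, expression) pairs.
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    total_expression = sum(expression for _, expression in ranked)
    chosen, running = [], 0.0
    for length, expression in ranked:
        chosen.append(length)
        running += expression
        if running * 100 >= x * total_expression:
            break
    return nx(chosen, 50)


# Hypothetical toy assembly.
contig_lengths = [100, 200, 300, 400]
n50 = nx(contig_lengths, 50)
n10 = nx(contig_lengths, 10)

toy_transcripts = [(300, 50.0), (1000, 30.0), (2000, 15.0), (100, 5.0)]
e90_n50 = exn50(toy_transcripts, 90)
```

Sorting by expression before computing N50 is what distinguishes ExN50 from plain N50: lowly expressed, often fragmentary contigs only enter the statistic at high percentiles.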

Fig. 5

Performance of transcript abundance estimators. a Clustering of different schemes based on the Spearman rank correlation of their log expressions on NA12878. b Distribution of log2-fold change of expressions between MCF7-100 and MCF7-300 samples. For each method, the dashed line represents the mean of the distribution and the dotted lines represent the quartiles. c Percentage of expression disagreement between MCF7-100 and MCF7-300 samples when low-expressed transcripts are discarded with different thresholds

Fig. 6

Performance of differential gene expression analysis tools on SEQC-A vs. SEQC-B samples. a Spearman rank correlation, root-mean-square deviation (RMSD), and AUC-30 scores for qRT-PCR measured genes. Spearman rank correlation and RMSD scores are measured between the log2-fold changes from qRT-PCR and from the RNA-seq tools. AUC-30 score represents the area under the ROC curve up to the false positive rate of 30%. b ROC analysis of qRT-PCR measured genes (left) and ERCC (right) genes. For each differential analysis tool, the plot reflects average performance when different alignment-based and alignment-free tools are used for abundance estimation, and the error bars show the maximum and minimum variations. Results for each tool combination are shown in Supplementary Figs. 30 and 35
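The three scores named in the Fig. 6 caption can be illustrated with a small, dependency-free Python sketch. The function names are hypothetical, and the partial AUC is left unnormalized here (so its maximum is 0.3 for a 30% cutoff); whether the study rescaled AUC-30 is not stated in the caption and is an assumption of this example.

```python
import math


def _average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = _average_ranks(a), _average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def rmsd(a, b):
    """Root-mean-square deviation between paired log2-fold changes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def partial_auc(scores, labels, max_fpr=0.3):
    """Unnormalized area under the ROC curve for FPR in [0, max_fpr].

    `labels` are 1 for truly differential genes and 0 otherwise;
    higher `scores` mean stronger evidence of differential expression.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative labels")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == score:  # ties move together
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the last segment at the FPR cutoff
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

Restricting the area to low false positive rates focuses the comparison on the operating range where a differential expression call list would realistically be used.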

Fig. 7

Performance of different variant calling (a–c), RNA editing (d, e) and RNA fusion (f) detection approaches. a Accuracy of detecting NA12878 high-confidence calls in the NIST gold standard. The analysis is restricted to the expressed exons identified by Cufflinks or StringTie. b Overlap between variants predicted by GATK and SAMtools. c Distribution of the mismatch types predicted by GATK that are missed in NIST HC calls (in StringTie’s expressed exons) in different genomic regions. d Distribution of RNA editing events detected in different genomic regions for NA12878. For the multiple-samples scheme, final editing sites include the rare variants in NA12878 that are supported by at least 3 out of 12 short-read samples in our analysis. For the pooled-samples scheme, final editing sites include the rare variants in NA12878 that are supported by at least 20 reads in the pooled alignment. e Measurement of RNA editing detection accuracy when some portion of SNPs is hidden from GIREMI. FDR represents the proportion of predicted edits that are among high-confidence genomic variants in NA12878. f Performance of different RNA fusion detection schemes on the MCF-7 sample
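The two support thresholds and the FDR definition in the Fig. 7 caption can be expressed as simple set filters. This sketch is illustrative only: the function names, the `(chromosome, position)` site encoding, and the input structures are assumptions, and it does not reproduce how GIREMI or the study's pipeline actually stores or filters calls.

```python
from collections import Counter


def multi_sample_sites(candidates, per_sample_calls, min_samples=3):
    """Keep candidate sites seen in at least `min_samples` samples.

    `per_sample_calls` is a list with one set of called sites per sample.
    """
    support = Counter()
    for calls in per_sample_calls:
        support.update(candidates & set(calls))
    return {site for site in candidates if support[site] >= min_samples}


def pooled_sites(candidates, pooled_read_support, min_reads=20):
    """Keep candidate sites backed by at least `min_reads` reads in a pooled alignment.

    `pooled_read_support` maps a site to its supporting read count.
    """
    return {s for s in candidates if pooled_read_support.get(s, 0) >= min_reads}


def editing_fdr(predicted_edits, genomic_variants):
    """Proportion of predicted edits that coincide with known genomic variants."""
    if not predicted_edits:
        return 0.0
    return len(predicted_edits & genomic_variants) / len(predicted_edits)


# Hypothetical toy data with four samples instead of twelve.
candidates = {("chr1", 100), ("chr1", 200), ("chr2", 50)}
samples = [
    {("chr1", 100), ("chr2", 50)},
    {("chr1", 100)},
    {("chr1", 100), ("chr1", 200)},
    {("chr2", 50)},
]
kept_multi = multi_sample_sites(candidates, samples)
kept_pooled = pooled_sites(candidates, {("chr1", 100): 25, ("chr1", 200): 19})
fdr = editing_fdr(candidates | {("chr3", 7)}, {("chr1", 200)})
```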

Fig. 8

The current RNACocktail computational pipeline. The pipeline is composed of high-accuracy tools in each step for general-purpose RNA-seq analysis
