A benchmark for RNA-seq quantification pipelines

doi: 10.1186/s13059-016-0940-1.

Mingxiang Teng, Carrie A Davis, Sarah Djebali, Alexander Dobin, Brenton R Graveley, Sheng Li, Christopher E Mason, Sara Olson, Dmitri Pervouchine, Cricket A Sloan, Xintao Wei, Lijun Zhan, Rafael A Irizarry

Mingxiang Teng et al. Genome Biol. 2016.

Abstract

Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance, due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate performance in terms of specificity and sensitivity, available as an R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.

Figures

Fig. 1

Estimated log fold changes stratified by transcript abundance on the simulation dataset. One example, based on Cufflinks quantification of two samples, is shown here. Black points are non-differential transcripts; blue points are differentially expressed transcripts that were simulated to have signal in both samples; red points are differentially expressed transcripts that were simulated to have signal in only one of the samples.

Fig. 2

Distribution of reported transcript quantifications for one sample of the simulation dataset (a) before and (b) after rescaling. Seven quantification methods are shown here.

Fig. 3

Standard deviations of transcript quantifications based on (a) an experimental dataset (GM12878) and (b) a simulation dataset (one of the cell lines). Seven quantification methods are shown here.
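As a rough illustration of the kind of summary plotted in Fig. 3, the sketch below computes the standard deviation of log2 quantifications between two replicates and averages it within abundance strata. The excerpt does not give the paper's exact definition, so the function name `replicate_sd_by_abundance`, the pseudocount, and the bin edges are all assumptions made for illustration; this is not the rnaseqcomp API.

```python
import math
from statistics import mean, stdev


def replicate_sd_by_abundance(rep1, rep2, bin_edges=(1.0, 10.0, 100.0), pseudocount=1.0):
    """Hypothetical helper: mean between-replicate SD of log2 values per abundance bin.

    rep1, rep2: per-transcript quantifications for two replicates (same order).
    bin_edges and pseudocount are illustrative assumptions, not values from the paper.
    Returns a dict mapping bin index (0 = lowest abundance) to the mean SD,
    or None for bins that contain no transcripts.
    """
    if len(rep1) != len(rep2):
        raise ValueError("replicates must cover the same transcripts")

    bins = {i: [] for i in range(len(bin_edges) + 1)}
    for x, y in zip(rep1, rep2):
        # Log-transform with a pseudocount so zero quantifications are defined.
        log_x = math.log2(x + pseudocount)
        log_y = math.log2(y + pseudocount)
        # Assign the transcript to a bin by its average raw abundance.
        average = (x + y) / 2
        index = sum(average >= edge for edge in bin_edges)
        bins[index].append(stdev([log_x, log_y]))

    return {i: (mean(values) if values else None) for i, values in bins.items()}
```

Under this definition, a method whose replicate quantifications agree perfectly has an SD of zero in every stratum, and larger values indicate less reproducible estimates at that abundance level.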

Fig. 4

Proportions of discordant expression calls based on (a) an experimental dataset (GM12878) and (b) a simulation dataset (one of the cell lines). Seven quantification methods are shown here.

Fig. 5

Proportion differences of transcript quantifications in genes with only two annotated transcripts based on (a) an experimental dataset (GM12878) and (b) a simulation dataset (one of the cell lines). Seven quantification methods are shown.
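One plausible reading of the proportion difference in Fig. 5 is sketched below: for a gene with exactly two annotated transcripts, take the share of the gene's total quantification assigned to the first transcript in each replicate, then compare the two shares. The function name `proportion_difference` and the handling of unexpressed genes are assumptions for illustration; the excerpt does not spell out the paper's exact formula.

```python
def proportion_difference(rep1, rep2):
    """Hypothetical helper for one gene with two annotated transcripts.

    rep1, rep2: (quantification of transcript A, quantification of transcript B)
    in replicate 1 and replicate 2.
    Returns the absolute difference in transcript A's share of the gene total,
    or None when the gene has zero total quantification in either replicate
    (an assumed convention, since the share is undefined there).
    """
    def share_of_first(pair):
        total = pair[0] + pair[1]
        return None if total == 0 else pair[0] / total

    share1 = share_of_first(rep1)
    share2 = share_of_first(rep2)
    if share1 is None or share2 is None:
        return None
    return abs(share1 - share2)
```

For example, a gene quantified as (3, 1) in one replicate and (1, 1) in the other has shares of 0.75 and 0.5, giving a difference of 0.25. Values near zero mean the method splits signal between the two isoforms consistently across replicates.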

Fig. 6

ROC curves indicating performance of quantification methods based on differential expression analysis of (a) an experimental dataset and (b) a simulation dataset. Seven quantification methods are shown. FP, false positive; TP, true positive.
