Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

Charlotte Soneson et al. F1000Res. 2015.

Abstract

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim to compare either abundance levels or transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
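The aggregation step the abstract refers to can be sketched in a few lines. The Python below is a minimal illustration, not the tximport implementation: the helper name `summarize_to_gene` and the toy values are hypothetical, and the abundance-weighted average transcript length is assumed here to be the quantity from which a gene-level offset would be derived.

```python
from collections import defaultdict


def summarize_to_gene(tx_tpm, tx_length, tx2gene):
    """Sum transcript TPMs per gene and compute an abundance-weighted
    average transcript length per gene (hypothetical helper)."""
    gene_tpm = defaultdict(float)
    weighted_len = defaultdict(float)
    for tx, tpm in tx_tpm.items():
        gene = tx2gene[tx]
        gene_tpm[gene] += tpm
        weighted_len[gene] += tpm * tx_length[tx]

    avg_len = {}
    for gene, total in gene_tpm.items():
        if total > 0:
            avg_len[gene] = weighted_len[gene] / total
        else:
            # Assumption: for an unexpressed gene, fall back to the
            # unweighted mean length of its quantified transcripts.
            lengths = [tx_length[t] for t in tx_tpm if tx2gene[t] == gene]
            avg_len[gene] = sum(lengths) / len(lengths)
    return dict(gene_tpm), avg_len


# Toy input: two isoforms of geneA, one isoform of geneB.
tx_tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
tx_length = {"tx1": 1000, "tx2": 3000, "tx3": 2000}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_tpm, avg_len = summarize_to_gene(tx_tpm, tx_length, tx2gene)
```

For geneA the weighted average length is (30 x 1000 + 10 x 3000) / 40 = 1500, so a shift in isoform usage changes this value even when the gene's total TPM stays the same.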

Keywords: RNA-seq; gene expression; quantification; transcriptomics.

Conflict of interest statement

Competing interests: No competing interests were disclosed.

Figures

Figure 1 (sim2).

A: Accuracy of gene- and transcript-level TPM estimates from Salmon and scaled FPKM estimates derived from simple counts from featureCounts, in one of the simulated samples (sampleA1). Spearman correlations are indicated in the respective panels. Top row: using the complete annotation. Bottom row: using an incomplete annotation, with 20% of the transcripts randomly removed. Gene-level estimates are more accurate than transcript-level estimates. Gene-level estimates from Salmon are more accurate than those from featureCounts. B: Distribution of the coefficients of variation of gene- and transcript-level abundance estimates from Salmon, calculated across 30 bootstrap samples of one of the simulated samples (sampleA1). Gene-level estimates are less variable than transcript-level estimates. C: An example of unidentifiable transcript-level estimates, as uneven coverage does not cover the critical regions that would determine the amount that each transcript is expressed, while gene-level estimation is still possible.
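The coefficient of variation in panel B can be illustrated with a small sketch. The Python below is illustrative only: the bootstrap values are invented, and `coefficient_of_variation` is a hypothetical helper (sample standard deviation divided by the mean), not code from the paper. The toy data mimic the situation in panel C, where reads shift between two isoforms across bootstrap replicates while the gene total is stable.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (NaN if the mean is 0)."""
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")
    return statistics.stdev(values) / mean


# Invented bootstrap estimates for two isoforms of one gene.
tx1_boot = [10, 14, 6, 12, 8]
tx2_boot = [10, 6, 14, 8, 12]
# Gene-level estimate per replicate: sum of the isoform estimates.
gene_boot = [a + b for a, b in zip(tx1_boot, tx2_boot)]

cv_tx1 = coefficient_of_variation(tx1_boot)
cv_tx2 = coefficient_of_variation(tx2_boot)
cv_gene = coefficient_of_variation(gene_boot)
```

In this constructed case both isoforms vary across replicates, but their sum is constant, so the gene-level coefficient of variation is zero.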

Figure 2 (sim2).

A: DTE detection performance on transcript- and gene-level, using edgeR applied to transcript-level estimated counts from Salmon. The statistical analysis was performed on transcript level and aggregated for each gene using the perGeneQValue function from the DEXSeq R package; aggregated results show higher detection power. The curves trace out the observed FDR and TPR for each significance cutoff value. The three circles mark the performance at adjusted p-value cutoffs of 0.01, 0.05 and 0.1. B: Schematic illustration of different ways in which differential transcript expression (DTE) can arise, in terms of absence or presence of differential gene expression (DGE) and differential transcript usage (DTU).
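Each point on such a curve is an (observed FDR, TPR) pair computed at one significance cutoff. A minimal Python sketch follows, assuming truth labels are known as in a simulation. The function name `observed_fdr_tpr`, the toy p-values, and the choice of `<=` for the cutoff comparison are assumptions made for illustration.

```python
def observed_fdr_tpr(adj_pvalues, is_truly_differential, cutoff):
    """Return (observed FDR, TPR) when calling features with adj. p <= cutoff."""
    called = [p <= cutoff for p in adj_pvalues]
    n_called = sum(called)
    n_true = sum(is_truly_differential)
    true_pos = sum(c and t for c, t in zip(called, is_truly_differential))

    # Convention assumed here: FDR is 0 when nothing is called.
    fdr = (n_called - true_pos) / n_called if n_called else 0.0
    tpr = true_pos / n_true if n_true else float("nan")
    return fdr, tpr


# Invented adjusted p-values and simulation truth for six features.
adj_p = [0.001, 0.02, 0.04, 0.08, 0.3, 0.6]
truth = [True, True, False, True, False, False]

# The three cutoffs marked by circles in the figure.
points = {c: observed_fdr_tpr(adj_p, truth, c) for c in (0.01, 0.05, 0.1)}
```

Sweeping the cutoff over all observed adjusted p-values instead of three fixed values would trace out the full curve.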

Figure 3 (sim2).

A: DGE detection performance of edgeR applied to three different count matrices (simplesum, scaledTPM, featureCounts), with or without including an offset representing the average transcript length (for simplesum and featureCounts, avetxl indicates that such offsets were used). Including the offset or using the scaledTPM count matrix leads to improved FDR control compared to using simplesum or featureCounts matrices without offset. The curves trace out the observed FDR and TPR for each significance cutoff value. The three circles mark the performance at adjusted p-value cutoffs of 0.01, 0.05 and 0.1. B: Stratification of the results in A by the presence of differential isoform usage. The improvement in FDR control seen in A results from an improved treatment of genes with differential isoform usage, while all methods perform similarly for genes without differential isoform usage.
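One way to read the scaledTPM matrix is as gene-level TPMs rescaled so that each sample's column sums to that sample's total estimated read count. The Python sketch below works under that assumption; the function name `scaled_tpm_counts` and the toy numbers are hypothetical and do not reproduce the package code.

```python
def scaled_tpm_counts(gene_tpm, total_counts):
    """Rescale gene-level TPMs so they sum to the sample's total read count."""
    tpm_sum = sum(gene_tpm.values())
    return {gene: tpm / tpm_sum * total_counts for gene, tpm in gene_tpm.items()}


# Toy sample: two genes, two million estimated reads in total.
gene_tpm = {"geneA": 600000.0, "geneB": 400000.0}
counts = scaled_tpm_counts(gene_tpm, total_counts=2_000_000)
```

Because TPM is already normalized for transcript length, counts built this way do not change when a gene's reads move between isoforms of different lengths at constant total TPM, which is consistent with the caption's observation that scaledTPM behaves like the offset-corrected matrices.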

Figure 4.

Comparison of log-fold change estimates from edgeR, based on simplesum and scaledTPM count matrices, in four different data sets.

For the simulated data set (sim2), where signals have been exaggerated to pinpoint underlying causes of various observations, genes with induced DTU (whose true overall log-fold change is 0) show a clear overestimation of log-fold changes when using simplesum counts. However, none of the real data sets contain a similar population of genes, suggesting that for many real data sets, simple gene counting leads to broadly the same conclusions as accounting for underlying changes in transcript usage.
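A small worked example shows why induced DTU inflates log-fold changes from summed counts. The Python sketch below is a deliberately extreme toy case, not data from the paper. It assumes that expected read counts are proportional to TPM times transcript length with the same proportionality constant in both conditions (the hypothetical `SCALE`), and that the gene switches completely from a 1000-base isoform to a 3000-base isoform at unchanged total TPM.

```python
import math

SCALE = 1e-3  # hypothetical reads per (TPM x base), equal in both conditions
LENGTHS = {"short": 1000, "long": 3000}

# Same total gene TPM in both conditions, but a complete isoform switch.
cond_a = {"short": 100.0, "long": 0.0}
cond_b = {"short": 0.0, "long": 100.0}


def simplesum_count(tpm_by_tx):
    """Gene count as the sum of expected transcript read counts."""
    return sum(SCALE * tpm * LENGTHS[tx] for tx, tpm in tpm_by_tx.items())


def average_length(tpm_by_tx):
    """Abundance-weighted average transcript length of the gene."""
    total = sum(tpm_by_tx.values())
    return sum(tpm * LENGTHS[tx] for tx, tpm in tpm_by_tx.items()) / total


# Log-fold change from summed counts alone.
lfc_simplesum = math.log2(simplesum_count(cond_b) / simplesum_count(cond_a))

# Log-fold change after dividing each count by the average transcript length.
lfc_length_corrected = math.log2(
    (simplesum_count(cond_b) / average_length(cond_b))
    / (simplesum_count(cond_a) / average_length(cond_a))
)
```

Here the summed counts differ threefold, giving a log2-fold change of log2(3), about 1.58, although the true change in gene abundance is 0. Dividing by the average transcript length removes the artifact. Real genes rarely switch isoforms this completely, which fits the observation that the effect is small in the real data sets.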
