Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

Charlotte Soneson et al. F1000Res. 2015.

Abstract

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
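The aggregation step described in the abstract can be illustrated with a small Python sketch. It is not the tximport implementation: the function name, the input dictionaries, and the numbers are all hypothetical, and the fallback for unexpressed genes is an assumption of this sketch. It sums transcript-level estimates to the gene level and derives an abundance-weighted average transcript length per gene, the kind of quantity from which an offset can be built.

```python
from collections import defaultdict

def summarize_to_gene(tx2gene, est_counts, tpm, eff_length):
    """Aggregate one sample's transcript-level estimates to the gene level.

    Returns {gene: (summed_counts, summed_tpm, avg_tx_length)}, where
    avg_tx_length is the TPM-weighted mean length of the gene's transcripts.
    """
    counts = defaultdict(float)
    abundance = defaultdict(float)
    weighted_len = defaultdict(float)
    lengths = defaultdict(list)
    for tx, gene in tx2gene.items():
        counts[gene] += est_counts[tx]
        abundance[gene] += tpm[tx]
        weighted_len[gene] += tpm[tx] * eff_length[tx]
        lengths[gene].append(eff_length[tx])

    summary = {}
    for gene in counts:
        if abundance[gene] > 0:
            avg_len = weighted_len[gene] / abundance[gene]
        else:
            # Assumption of this sketch: for an unexpressed gene, fall back
            # to the unweighted mean length of its transcripts.
            avg_len = sum(lengths[gene]) / len(lengths[gene])
        summary[gene] = (counts[gene], abundance[gene], avg_len)
    return summary

# Hypothetical toy input: two genes, three transcripts.
tx2gene = {"tx1": "G1", "tx2": "G1", "tx3": "G2"}
est_counts = {"tx1": 300.0, "tx2": 200.0, "tx3": 0.0}
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 0.0}
eff_length = {"tx1": 1000.0, "tx2": 2000.0, "tx3": 1500.0}

result = summarize_to_gene(tx2gene, est_counts, tpm, eff_length)
print(result)
```

In a count-based model, the logarithm of such an average length could be supplied as a gene- and sample-specific offset alongside the summed counts.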

Keywords: RNA-seq; gene expression; quantification; transcriptomics.

Conflict of interest statement

Competing interests: No competing interests were disclosed.

Figures

Figure 1 (sim2).

A: Accuracy of gene- and transcript-level TPM estimates from _Salmon_ and scaled FPKM estimates derived from simple counts from _featureCounts_, in one of the simulated samples (sampleA1). Spearman correlations are indicated in the respective panels. Top row: using the complete annotation. Bottom row: using an incomplete annotation, with 20% of the transcripts randomly removed. Gene-level estimates are more accurate than transcript-level estimates. Gene-level estimates from _Salmon_ are more accurate than those from _featureCounts_. B: Distribution of the coefficients of variation of gene- and transcript-level abundance estimates from _Salmon_, calculated across 30 bootstrap samples of one of the simulated samples (sampleA1). Gene-level estimates are less variable than transcript-level estimates. C: An example of unidentifiable transcript-level estimates, as uneven coverage does not cover the critical regions that would determine the amount that each transcript is expressed, while gene-level estimation is still possible.

Figure 2 (sim2).

A: DTE detection performance on transcript- and gene-level, using _edgeR_ applied to transcript-level estimated counts from _Salmon_. The statistical analysis was performed on transcript level and aggregated for each gene using the _perGeneQValue_ function from the _DEXSeq_ R package; aggregated results show higher detection power. The curves trace out the observed FDR and TPR for each significance cutoff value. The three circles mark the performance at adjusted p-value cutoffs of 0.01, 0.05 and 0.1. B: Schematic illustration of different ways in which differential transcript expression (DTE) can arise, in terms of absence or presence of differential gene expression (DGE) and differential transcript usage (DTU).
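The observed FDR and TPR that such curves trace out can be computed as in the following Python sketch. It assumes the true differential status of every feature is known, as in a simulation. The function name and the toy values are hypothetical and not taken from the paper.

```python
def fdr_tpr(adj_pvalues, is_truly_de, cutoff):
    """Observed FDR and TPR when calling features with adjusted p <= cutoff."""
    called = [f for f, p in adj_pvalues.items() if p <= cutoff]
    true_pos = sum(1 for f in called if is_truly_de[f])
    false_pos = len(called) - true_pos
    n_true = sum(1 for status in is_truly_de.values() if status)
    fdr = false_pos / len(called) if called else 0.0
    tpr = true_pos / n_true if n_true else 0.0
    return fdr, tpr

# Hypothetical adjusted p-values and simulation truth for five features.
adj_p = {"g1": 0.001, "g2": 0.02, "g3": 0.04, "g4": 0.08, "g5": 0.5}
truth = {"g1": True, "g2": True, "g3": False, "g4": True, "g5": False}

# One (FDR, TPR) point per cutoff, matching the three marked cutoffs.
points = {cutoff: fdr_tpr(adj_p, truth, cutoff) for cutoff in (0.01, 0.05, 0.1)}
print(points)
```

Sweeping the cutoff over all observed adjusted p-values, rather than only three values, would give the full curve.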

Figure 3 (sim2).

A: DGE detection performance of _edgeR_ applied to three different count matrices (simplesum, scaledTPM, featureCounts), with or without including an offset representing the average transcript length (for simplesum and featureCounts, avetxl indicates that such offsets were used). Including the offset or using the scaledTPM count matrix leads to improved FDR control compared to using simplesum or featureCounts matrices without offset. The curves trace out the observed FDR and TPR for each significance cutoff value. The three circles mark the performance at adjusted p-value cutoffs of 0.01, 0.05 and 0.1. B: Stratification of the results in A by the presence of differential isoform usage. The improvement in FDR control seen in A results from an improved treatment of genes with differential isoform usage, while all methods perform similarly for genes without differential isoform usage.
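One way to read the scaledTPM matrix named in this caption is as gene-level TPMs rescaled so that each sample's column sums to its library size. The Python sketch below works under that assumption. The function name, gene labels, and numbers are hypothetical, and the sketch is not the package's code.

```python
def scaled_tpm(gene_tpm, library_size):
    """Rescale one sample's gene-level TPMs so they sum to library_size."""
    total = sum(gene_tpm.values())
    return {gene: t * library_size / total for gene, t in gene_tpm.items()}

# Hypothetical sample with three genes and two million assigned reads.
gene_tpm = {"G1": 600000.0, "G2": 300000.0, "G3": 100000.0}
counts_like = scaled_tpm(gene_tpm, library_size=2_000_000)
print(counts_like)
```

Because TPM values are already normalized for transcript length, a matrix built this way would not scale with changes in average transcript length the way summed read counts do.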

Figure 4. Comparison of log-fold change estimates from edgeR, based on simplesum and scaledTPM count matrices, in four different data sets.

For the simulated data set (sim2), where signals have been exaggerated to pinpoint underlying causes of various observations, genes with induced DTU (whose true overall log-fold change is 0) show a clear overestimation of log-fold changes when using simplesum counts. However, none of the real data sets contain a similar population of genes, suggesting that for many real data sets, simple gene counting leads to overall similar conclusions as accounting for underlying changes in transcript usage.
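A toy calculation may help show why induced DTU can inflate log-fold changes from summed counts. The Python sketch below assumes that expected reads are proportional to the number of molecules times transcript length. The isoform lengths, molecule numbers, and helper names are hypothetical. The gene has the same total number of molecules in both conditions but switches from a mostly short to a mostly long isoform.

```python
import math

LENGTHS = {"short": 1000.0, "long": 4000.0}  # hypothetical isoform lengths

def expected_reads(molecules, reads_per_kb=10.0):
    """Expected read count, assuming reads scale with molecules x length."""
    return sum(m * LENGTHS[tx] / 1000.0 * reads_per_kb for tx, m in molecules.items())

def average_length(molecules):
    """Abundance-weighted average transcript length of the gene."""
    total = sum(molecules.values())
    return sum(m * LENGTHS[tx] for tx, m in molecules.items()) / total

# Same total (100 molecules) in both conditions; only isoform usage differs.
cond_a = {"short": 90.0, "long": 10.0}
cond_b = {"short": 10.0, "long": 90.0}

reads_a, reads_b = expected_reads(cond_a), expected_reads(cond_b)

# Log-fold change from summed counts alone.
naive_lfc = math.log2(reads_b / reads_a)

# Log-fold change after dividing each count by the average transcript length.
corrected_lfc = math.log2(
    (reads_b / average_length(cond_b)) / (reads_a / average_length(cond_a))
)
print(naive_lfc, corrected_lfc)
```

Under these assumptions the summed counts suggest a clearly positive log-fold change although the gene's total expression is unchanged, while the length-adjusted value is essentially zero.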

Cited by

Transcriptional profiling reveals potential involvement of microvillous TRPM5-expressing cells in viral infection of the olfactory epithelium.
Baxter BD, Larson ED, Merle L, Feinstein P, Polese AG, Bubak AN, Niemeyer CS, Hassell J Jr, Shepherd D, Ramakrishnan VR, Nagel MA, Restrepo D. BMC Genomics. 2021 Mar 30;22(1):224. doi: 10.1186/s12864-021-07528-y. PMID: 33781205 Free PMC article.
Evolution and Expression of the Immune System of a Facultatively Anadromous Salmonid.
Colgan TJ, Moran PA, Archer LC, Wynne R, Hutton SA, McGinnity P, Reed TE. Front Immunol. 2021 Feb 26;12:568729. doi: 10.3389/fimmu.2021.568729. eCollection 2021. PMID: 33717060 Free PMC article.
Stromal β-catenin activation impacts nephron progenitor differentiation in the developing kidney and may contribute to Wilms tumor.
Drake KA, Chaney CP, Das A, Roy P, Kwartler CS, Rakheja D, Carroll TJ. Development. 2020 Jul 31;147(21):dev189597. doi: 10.1242/dev.189597. PMID: 32541007 Free PMC article.
TAZ-CAMTA1 and YAP-TFE3 alter the TAZ/YAP transcriptome by recruiting the ATAC histone acetyltransferase complex.
Merritt N, Garcia K, Rajendran D, Lin ZY, Zhang X, Mitchell KA, Borcherding N, Fullenkamp C, Chimenti MS, Gingras AC, Harvey KF, Tanas MR. Elife. 2021 Apr 29;10:e62857. doi: 10.7554/eLife.62857. PMID: 33913810 Free PMC article.
Transcriptomic Signatures of Ageing Vary in Solitary and Social Forms of an Orchid Bee.
Séguret A, Stolle E, Fleites-Ayil FA, Quezada-Euán JJG, Hartfelder K, Meusemann K, Harrison MC, Soro A, Paxton RJ. Genome Biol Evol. 2021 Jun 8;13(6):evab075. doi: 10.1093/gbe/evab075. PMID: 33914875 Free PMC article.
