ERSSA Package Introduction (original) (raw)

In comparative RNA sequencing (RNA-seq) experiments, selecting the appropriate sample size is an important optimization step [1]. Empirical RNA-seq Sample Size Analysis (ERSSA) is a R software package designed to test whether an existing RNA-seq dataset has sufficient biological replicates to detect a majority of the differentially expressed genes (DEGs) between two conditions. In contrast to existing RNA-seq sample size analysis algorithms, ERSSA does not rely on any a priori assumptions about the data [2]. Rather, ERSSA takes a user-supplied RNA-seq sample set and evaluates the incremental improvement in identification of DEGs with each increase in sample size up to the total samples provided, enabling the user to determine whether sufficient biological replicates have been included.

Based on the number of replicates available (N for each of the two conditions), the algorithm subsamples at each step-wise replicate levels (n= 2, 3, 4, …, N-1) and uses existing differential expression (DE) analysis software (e.g., edgeR [8] and DESeq2 [9]) to measure the number of DEGs. As N increases, the set of all distinct subsamples for a particular n can be very large and experience with ERSSA shows that it is not necessary to evaluate the entire set. Instead, 30-50 subsamples at each replicate level typically provide sufficient evidence to evaluate the marginal return for each increase in sample size. If the number of DEGs identified is similar for n = N-2, N-1 and N, there may be little to be gained by analyzing further replicates. Conversely, if the number of DEGs identified is still increasing strongly as n approaches N, the user can expect to identify significantly more DEGs if they acquire additional samples.

When applied to a diverse set of RNA-seq experimental settings (human tissue, human population study and in vitro cell culture), ERSSA demonstrated proficiency in determining whether sufficient biological replicates have been included. Overall, ERSSA can be used as a flexible and easy-to-use tool that offers an alternative approach to identify the appropriate sample size in comparative RNA-seq studies.

Installation

Install the latest stable version of ERSSA by entering the following commands in R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ERSSA")

Usage

Utility

In this vignette, we demonstrate ERSSA’s analytical approach using an RNA-seq dataset containing 10 human heart samples and 10 skeletal muscle samples from GTEx [3] and ask whether 10 replicates are sufficient to identify a majority of DE genes (adjusted p-value < 0.05 and |log2(fold-change)| > 1). At the end of the ERSSA run, four plots are generated to summarize the results. For now, let’s briefly focus on the most important of these, the number of DEGs identified as a function of the number of replicates included in the analysis. In the present example, the average number of DEGs discovered increases approximately 3% from n=6 to n=7 and little to no improvement as n increases to 8 and beyond. This suggests that our example dataset with N=10 replicates is sufficient to identify the vast majority of DEGs. To verify this conclusion, an additional 15 human heart and 15 skeletal muscle samples from GTEx were added and the analysis was repeated with N=25. The results for n<10 obtained with N=25 gave similar mean and distribution of the number of DEGs identified as those obtained with N=10, validating the utility of the statistical subsampling approach. The rest of this vignette will further explore ERSSA’s functionalities using the 10-replicate GTEx heart vs. muscle dataset. We will also briefly go through two additional examples that help to illustrate the variety of experimental settings where ERSSA can be applied.

Load example data

First, let’s load the N=10 GTEx heart and muscle dataset into the R workspace. The data can be loaded into R directly from the ERSSA package using:

library(ERSSA)

# GTEx dataset with 10 heart and 10 muscle samples
# "full"" dataset contains all ensembl genes
data(condition_table.full)
data(count_table.full)

# For test purposes and faster run time, we will use a smaller "partial" dataset
# 4 heart and 4 muscle samples
# partial dataset contains 1000 genes
data(condition_table.partial)
data(count_table.partial)

# NOTE: the figures are generated from the "full" dataset

The original dataset was obtained from the recount2 project [6], which is a systematic effort to generate gene expression count tables from thousands of RNA-seq studies. To generate the objects loaded above, GTEx heart and muscle count tables were manually downloaded from the recount2 website and processed by the recount package. The first 10 samples were then selected and a corresponding condition table generated to complete this example dataset.

For any ERSSA analysis, we need a few essential inputs. First is the RNA-seq count table that contains genes on each row and samples on each column. ERSSA expects the input count table as a dataframe object with gene names as the index and sample IDs as column names. For example, the first few cells of our example count table looks like this:

head(count_table.full[,1:2])

##                    heart_SRR598148.txt heart_SRR598509.txt
## ENSG00000000003.14                 147                 153
## ENSG00000000005.5                    0                   5
## ENSG00000000419.12                 333                 222
## ENSG00000000457.13                 226                  95
## ENSG00000000460.16                 103                  64
## ENSG00000000938.12                 835                1093

Next, we need to supply a condition table in the form of a dataframe object that contains two columns and the same number of rows as the total number of samples. Column one contains the sample IDs exactly as they appear in the count tables and column two contains the corresponding condition names. Only two conditions are supported. Our full condition table is shown below:

condition_table.full

##                    name condition
## 1   heart_SRR598148.txt     heart
## 2   heart_SRR598509.txt     heart
## 3   heart_SRR598589.txt     heart
## 4   heart_SRR599025.txt     heart
## 5   heart_SRR599086.txt     heart
## 6   heart_SRR599249.txt     heart
## 7   heart_SRR599380.txt     heart
## 8   heart_SRR600474.txt     heart
## 9   heart_SRR600829.txt     heart
## 10  heart_SRR600852.txt     heart
## 11 muscle_SRR598044.txt    muscle
## 12 muscle_SRR598452.txt    muscle
## 13 muscle_SRR600656.txt    muscle
## 14 muscle_SRR600981.txt    muscle
## 15 muscle_SRR601387.txt    muscle
## 16 muscle_SRR601671.txt    muscle
## 17 muscle_SRR601695.txt    muscle
## 18 muscle_SRR601815.txt    muscle
## 19 muscle_SRR602010.txt    muscle
## 20 muscle_SRR603116.txt    muscle

Finally, we need to specify which condition to use as the control in the DE comparison. In this case, let’s set “heart” as the control condition.

Additional examples

Human population dataset

Comparison of two human populations illustrates the value of ERSSA when one is confronted with data sets that have a high degree of biological variability. The data for this example is available as part of the International HapMap Project, which performed RNA-seq on lymphoblastoid cell lines derived from 60 European and 69 Nigerian individuals [10,11]. Here, we manually downloaded preprocessed count and condition tables from the original ReCount project to serve as inputs for ERSSA. In this case, running ERSSA on the entire dataset is quite time consuming as well as unnecessary, so 25 replicates from each group were selected for ERSSA analysis.

Compare to the previous GTEx dataset that compared heart to muscle, the HapMap dataset’s DEG discovery increased much more slowly with additional replicates. It is also worth noting the large variability in the discovery when few replicates are used. For this dataset, the number of subsample combinations tested was increased from the default 30 to 50 to reduce the algorithm’s sensitivity to outliers in any particular replicate level. Unlike the GTEx dataset, we see that at least 15 replicates are needed to capture a majority of DEGs. Additionally, the mean discovery line hovers near the full dataset line past 17 replicates, serving as a good indication that we already have sufficient replicates to discovery a majority of DEGs.

Cell culture dataset

In this example, data were obtained by Fossum and coworkers in a study of the transcriptome response to regulation of an E26-related transcription factor, ETS homologous factor (EHF) [12]. As part of the study, EHF expression was depleted using siRNA and compared to negative control siRNA samples. For both conditions, 5 cell culture replicates were used (representative of the typical number of replicates used in most RNA-seq studies). Count and condition tables from recount2 website will serve as inputs for ERSSA. Here are the ERSSA plots with the 5 replicates dataset:

For this particular comparison, we found very few DEGs with |log2(fold-change)| > 1, so the cutoff is adjusted to 0.5.

In contrast to the two previous datasets, ERSSA plots suggest this study would benefit from including additional replicates. In the percent difference plot, moving from 4 replicates to 5 replicates improved discovery significantly by 14.3%. Base on the trend, adding additional replicates will likely to continue improve DEG discovery by high single digit percentages. Additionally, the TPR-FPR plot shows the mean TPR measured with 4 replicate subsamples is quite low at around 0.75. Of course, the caveat here is that since the discovery would most likely benefit from including additional replicates, the list of DEGs from the full dataset (N=5) is a poor representation of the ground truth. While, in certain cases, including additional replicates may not be feasible or economical, ERSSA allows user to answer the question whether their DE analysis would benefit from more biological replicates.

Built with

sessionInfo()

## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.2    ERSSA_1.26.0     BiocStyle_2.36.0
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.38.0 gtable_0.3.6               
##  [3] xfun_0.52                   bslib_0.9.0                
##  [5] Biobase_2.68.0              lattice_0.22-7             
##  [7] vctrs_0.6.5                 tools_4.5.0                
##  [9] generics_0.1.3              stats4_4.5.0               
## [11] parallel_4.5.0              tibble_3.2.1               
## [13] pkgconfig_2.0.3             Matrix_1.7-3               
## [15] RColorBrewer_1.1-3          S4Vectors_0.46.0           
## [17] lifecycle_1.0.4             GenomeInfoDbData_1.2.14    
## [19] farver_2.1.2                compiler_4.5.0             
## [21] textshaping_1.0.0           statmod_1.5.0              
## [23] munsell_0.5.1               DESeq2_1.48.0              
## [25] codetools_0.2-20            GenomeInfoDb_1.44.0        
## [27] htmltools_0.5.8.1           sass_0.4.10                
## [29] yaml_2.3.10                 pillar_1.10.2              
## [31] crayon_1.5.3                jquerylib_0.1.4            
## [33] BiocParallel_1.42.0         cachem_1.1.0               
## [35] DelayedArray_0.34.0         limma_3.64.0               
## [37] abind_1.4-8                 tidyselect_1.2.1           
## [39] locfit_1.5-9.12             digest_0.6.37              
## [41] dplyr_1.1.4                 bookdown_0.43              
## [43] labeling_0.4.3              fastmap_1.2.0              
## [45] grid_4.5.0                  colorspace_2.1-1           
## [47] cli_3.6.4                   SparseArray_1.8.0          
## [49] magrittr_2.0.3              S4Arrays_1.8.0             
## [51] withr_3.0.2                 edgeR_4.6.0                
## [53] scales_1.3.0                UCSC.utils_1.4.0           
## [55] rmarkdown_2.29              XVector_0.48.0             
## [57] httr_1.4.7                  matrixStats_1.5.0          
## [59] ragg_1.4.0                  evaluate_1.0.3             
## [61] knitr_1.50                  GenomicRanges_1.60.0       
## [63] IRanges_2.42.0              rlang_1.1.6                
## [65] Rcpp_1.0.14                 glue_1.8.0                 
## [67] BiocManager_1.30.25         BiocGenerics_0.54.0        
## [69] jsonlite_2.0.0              plyr_1.8.9                 
## [71] R6_2.6.1                    systemfonts_1.2.2          
## [73] MatrixGenerics_1.20.0

References

Appendix

Ching, Travers, Sijia Huang, and Lana X. Garmire. “Power Analysis and Sample Size Estimation for RNA-Seq Differential Expression.” RNA, September 22, 2014. https://doi.org/10.1261/rna.046011.114.
Hoskins, Stephanie Page, Derek Shyr, and Yu Shyr. “Sample Size Calculation for Differential Expression Analysis of RNA-Seq Data.” In Frontiers of Biostatistical Methods and Applications in Clinical Oncology, 359–79. Springer, Singapore, 2017. https://doi.org/10.1007/978-981-10-0126-0_22.
Melé, Marta, Pedro G. Ferreira, Ferran Reverter, David S. DeLuca, Jean Monlong, Michael Sammeth, Taylor R. Young, et al. “The Human Transcriptome across Tissues and Individuals.” Science 348, no. 6235 (May 8, 2015): 660–65. https://doi.org/10.1126/science.aaa0355.
Anders, Simon, Davis J. McCarthy, Yunshun Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, and Mark D. Robinson. “Count-Based Differential Expression Analysis of RNA Sequencing Data Using R and Bioconductor.” Nature Protocols 8, no. 9 (September 2013): 1765–86. https://doi.org/10.1038/nprot.2013.099.
Gentleman R, Carey V, Huber W, Hahne F (2018). genefilter: genefilter: methods for filtering genes from high-throughput experiments. R package version 1.62.0.
Collado-Torres, Leonardo, Abhinav Nellore, Kai Kammers, Shannon E. Ellis, Margaret A. Taub, Kasper D. Hansen, Andrew E. Jaffe, Ben Langmead, and Jeffrey T. Leek. “Reproducible RNA-Seq Analysis Using Recount2.” Nature Biotechnology 35, no. 4 (April 2017): 319–21. https://doi.org/10.1038/nbt.3838.
Morgan M, Obenchain V, Lang M, Thompson R, Turaga N (2018). BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.14.1, https://github.com/Bioconductor/BiocParallel.
Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140.
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
Montgomery, Stephen B., et al. “Transcriptome genetics using second generation sequencing in a Caucasian population.” Nature 464.7289 (2010): 773.
Pickrell, Joseph K., et al. “Understanding mechanisms underlying human gene expression variation with RNA sequencing.” Nature 464.7289 (2010): 768.
Fossum, Sara L., et al. “Ets homologous factor regulates pathways controlling response to injury in airway epithelial cells.” Nucleic acids research 42.22 (2014): 13588-13598.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.