User Guide (original) (raw)
Contents
- 1 Introduction
- 2 Installation
- 3 Interoperability with IsoformSwitchAnalyzeR
- 4 Paired differential expression and splicing
- 5 Over-Representation Analysis
- 6 Analysing ORA results
- 7 Session Info
- References
Introduction
pariedGSEA
is a user-friendly framework for paired differential gene expression and splicing analyses. Providing bulk RNA-seq count data, pariedGSEA
combines the results of_DESeq2_ (Love, Huber, and Anders 2014) and_DEXSeq_ (Anders, Reyes, and Huber 2012), aggregates the p-values to gene level and allows you to run a subsequent gene set over-representation analysis using_fgsea’s fora
function (Korotkevich, Sukhov, and Sergushichev 2019). Since version 0.99.2, you can also do the differential analyses usinglimma_ .
pairedGSEA
was developed to highlight the importance of differential splicing analysis. It was build in a way that yields comparable results between splicing and expression-related events. It, by default, accounts for surrogate variables in the data, and facilitates exploratory data analysis either through storing intermediate results or through plotting functions for the over-representation analysis.
This vignette will guide you through how to use the main functions ofpairedGSEA
.
pariedGSEA
assumes you have already preprocessed and aligned your sequencing reads to transcript level. Before starting, you should therefore have a counts matrix and a metadata file. This data may also be in the format of a_SummarizedExperiment_ (Morgan et al. 2022) orDESeqDataSet
.Importantly, please ensure that the rownames have the format:gene:transcript
.
The metadata should, as a minimum, contain the sample IDs
corresponding to the column names of the count matrix, a group
column containing information on which samples corresponds to the baseline (controls) and the case (condition). Bear in mind, the column names can be as you wish, but the names must be provided in the sample_col
and group_col
parameters, respectively.
To see an example of what such data could look like, please see ?pairedGSEA::example_se
.
Installation
To install this package, start R (version “4.3”) and enter:
Bioconductor dependencies
# Install Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq",
"fgsea", "sva", "BiocParallel"))
Install pairedGSEA
from Bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pairedGSEA")
Install development version from GitHub
# Install pairedGSEA from github
devtools::install_github("shdam/pairedGSEA", build_vignettes = TRUE)
Interoperability with IsoformSwitchAnalyzeR
IsoformSwitchAnalyzeR identifies, annotates, and visualizes Isoform Switches with Functional Consequences (from RNA-seq data).
Import and export between the packages with:
IsoformSwitchAnalyzeR::importPairedGSEA()
IsoformSwitchAnalyzeR::exportToPairedGSEA()
Paired differential expression and splicing
Preparing the experiment parameters
Running a paired Differential Gene Expression (DGE) and Differential Gene Splicing (DGS) analysis is the first step in the pairedGSEA
workflow.
But before running the paired_diff
function, we recommend storing the experiment parameters in a set of variables at the top of your script for future reference and easy access:
library("pairedGSEA")
# Defining plotting theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 20))
## Load data
# In this vignette we will use the included example Summarized Experiment.
# See ?example_se for more information about the data.
data("example_se")
if(FALSE){ # Examples of other data imports
# Example of count matrix
tx_count <- read.csv("path/to/counts.csv") # Or other load function
md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame
# Example of locally stored DESeqDataSet
dds <- readRDS("path/to/dds.RDS")
# Example of locally stored SummarizedExperiment
se <- readRDS("path/to/se.RDS")
}
## Experiment parameters
group_col <- "group_nr" # Column with the groups you would like to compare
sample_col <- "id" # Name of column that specifies the sample id of each sample.
# This is used to ensure the metadata and count data contains the same samples
# and to arrange the data according to the metadata
# (important for underlying tools)
baseline <- 1 # Name of baseline group
case <- 2 # Name of case group
experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is
# used in the file names that are stored if store_results is TRUE (recommended)
Running the analysis
The paired DGE/DGS analysis is run with paired_diff()
.paired_diff()
is essentially a wrapper function aroundDESeq2::DESeq
(Love, Huber, and Anders 2014) and DEXSeq::DEXSeq
(Anders, Reyes, and Huber 2012), the latter takes in the ballpark of 20-30 minutes to run depending on the size of the data and computational resources. Please visit their individual vignettes for further information.
set.seed(500) # For reproducible results
diff_results <- paired_diff(
object = example_se,
metadata = NULL, # Use with count matrix or if you want to change it in
# the input object
group_col = group_col,
sample_col = sample_col,
baseline = baseline,
case = case,
experiment_title = experiment_title,
store_results = FALSE # Set to TRUE (recommended) if you want
# to store intermediate results, such as the results on transcript level
)
#> Running TGFb treatment for 12h
#> Preparing metadata
#>
#> Removing 0 rows with a summed count lower than 10
#> Removing 108 rows with counts in less than 2 samples.
#> Running SVA
#> No significant surrogate variables
#>
#> Found 0 surrogate variables
#> Running DESeq2
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Extracting results
#> Initiating DEXSeq
#> Creating DEXSeqDataSet
#> converting counts to integer mode
#> Warning in DESeqDataSet(rse, design, ignoreRank = TRUE): some variables in
#> design formula are characters, converting to factors
#>
#> Running DEXSeq -- This might take a while
#> TGFb treatment for 12h is analysed.
#> Aggregating p values
#> Done.
After running the analyses, paired_diff
will aggregate the p-values to gene level using lancaster aggregation (Lancaster 1961) and calculate the FDR-adjusted p-values (see ?pairedGSEA:::aggregate_pvalue
for more information). For the DGE transcripts, the log2 fold changes will be aggregated using a weighted mean, whereas the DGS log2 fold changes will be assigned to the log2 fold change of the transcript with the lowest p-value. Use the latter with a grain of salt.
From here, feel free to analyse the gene-level results using your preferred method. If you set store_results = TRUE
, you could extract the transcript level results found in the results
folder under the names "*_expression_results.RDS" and "*_splicing_results.RDS" for the DGE and DGS analysis, respectively (The * corresponds to the provided experiment title).
Additional parameters
There are a range of other parameters you can play with to tailor the experience. Here, the default settings are showed. See ?pairedGSEA::paired_diff
for further details.
covariates = NULL,
run_sva = TRUE,
use_limma = FALSE,
prefilter = 10,
fit_type = "local",
test = "LRT",
quiet = FALSE,
parallel = TRUE,
BPPARAM = BiocParallel::bpparam(),
...
To highlight some examples of use:
- You can use
limma::eBayes
andlimma::diffSplice
for the analyses withuse_limma = TRUE
. - You can use additional columns in your metadata as covariates in the model matrix by setting
covariates
to a character vector of the specific names. This will be used in SVA, DGE, and DGS. - The test should be kept at default settings, but advanced users may use the “wald” test instead, if they wish.
- The
...
parameters will be fed toDESeq2::DESeq
, see their manual for options. - If you want a stricter, more loose, or more advanced pre-filtering (generally not necessary, as _DESeq2_and DEXSeq performs their own pre-filtering), you can set the parameter to a different value or NULL and use your own pre-filtering method directly on the counts matrix.
Over-Representation Analysis
pairedGSEA
comes with a wrapper function for fgsea::fora
(Korotkevich, Sukhov, and Sergushichev 2019). If you wish, feel free to use that directly or any other gene set analysis method - investigate the diff_results
object before use to see what information it contains.
The inbuilt wrapper also computes the relative risk for each gene set and an enrichment score (log2(relative_risk + 0.06)
, the pseudo count is for plotting purposes).
Running the inbuilt ORA function
Before you get going, you will need a list of gene sets (aka. pathways) according to the species you are working with and the category of gene sets of interest. For this purpose, feel free to use the msigdbr (Dolgalev 2022) wrapper function inpairedGSEA
: pairedGSEA::prepare_msigdb
. If you do, see ?pairedGSEA::prepare_msigdb
for further details.
The inbuilt ORA function is called paired_ora
and is run as follows
# Define gene sets in your preferred way
gene_sets <- pairedGSEA::prepare_msigdb(
species = "Homo sapiens",
collection = "C5",
gene_id_type = "ensembl_gene"
)
#> The 'msigdbdf' package must be installed to access the full dataset.
#> Please run the following command to install the 'msigdbdf' package:
#> install.packages('msigdbdf', repos = 'https://igordot.r-universe.dev')
ora <- paired_ora(
paired_diff_result = diff_results,
gene_sets = gene_sets,
experiment_title = NULL # experiment_title - if not NULL,
# the results will be stored in an RDS object in the 'results' folder
)
#> Running over-representation analyses
#> Joining result
Additional parameters
cutoff = 0.05,
min_size = 25,
quiet = FALSE
Please investigate the returned object to see the column names and what they contain.
Analysing ORA results
There are many options for investigating your ORA results.pairedGSEA
comes with an inbuilt scatter plot function that plots the enrichment score of DGE against those of DGS.
The function allows you to interactively look at the placement of the significant pathways using plotly (Sievert 2020). You can color specific points based on a regular expression for gene sets of interest.
Scatter plot of enrichment scores
# Scatter plot function with default settings
plot_ora(
ora,
paired = FALSE,
plotly = FALSE,
pattern = "Telomer", # Identify all gene sets about telomeres
cutoff = 0.05, # Only include significant gene sets
lines = TRUE # Guide lines
)
As mentioned, this function can be utilized in a few different ways. The default settings will plot the enrichment scores of each analysis and draw dashed lines for the y = x
line, y = 0
, and x = 0
. Remove those by setting lines = FALSE
.
Make the plot interactive with plotly = TRUE
.
Highlight gene sets containing a specific regex pattern by setting pattern
to the regex pattern of interest.
Session Info
sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] pairedGSEA_1.8.0 BiocStyle_2.36.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.1.2 biomaRt_2.64.0
#> [5] rlang_1.1.6 magrittr_2.0.3
#> [7] msigdbr_10.0.2 aggregation_1.0.1
#> [9] matrixStats_1.5.0 compiler_4.5.0
#> [11] RSQLite_2.3.9 mgcv_1.9-3
#> [13] png_0.1-8 vctrs_0.6.5
#> [15] sva_3.56.0 sysfonts_0.8.9
#> [17] stringr_1.5.1 pkgconfig_2.0.3
#> [19] crayon_1.5.3 fastmap_1.2.0
#> [21] magick_2.8.6 dbplyr_2.5.0
#> [23] XVector_0.48.0 labeling_0.4.3
#> [25] Rsamtools_2.24.0 rmarkdown_2.29
#> [27] UCSC.utils_1.4.0 tinytex_0.57
#> [29] bit_4.6.0 xfun_0.52
#> [31] showtext_0.9-7 cachem_1.1.0
#> [33] GenomeInfoDb_1.44.0 jsonlite_2.0.0
#> [35] progress_1.2.3 blob_1.2.4
#> [37] DelayedArray_0.34.0 BiocParallel_1.42.0
#> [39] parallel_4.5.0 prettyunits_1.2.0
#> [41] R6_2.6.1 bslib_0.9.0
#> [43] stringi_1.8.7 RColorBrewer_1.1-3
#> [45] limma_3.64.0 genefilter_1.90.0
#> [47] GenomicRanges_1.60.0 jquerylib_0.1.4
#> [49] assertthat_0.2.1 Rcpp_1.0.14
#> [51] bookdown_0.43 SummarizedExperiment_1.38.0
#> [53] knitr_1.50 IRanges_2.42.0
#> [55] Matrix_1.7-3 splines_4.5.0
#> [57] tidyselect_1.2.1 abind_1.4-8
#> [59] yaml_2.3.10 codetools_0.2-20
#> [61] hwriter_1.3.2.1 curl_6.2.2
#> [63] lattice_0.22-7 tibble_3.2.1
#> [65] withr_3.0.2 Biobase_2.68.0
#> [67] KEGGREST_1.48.0 evaluate_1.0.3
#> [69] survival_3.8-3 BiocFileCache_2.16.0
#> [71] xml2_1.3.8 Biostrings_2.76.0
#> [73] pillar_1.10.2 BiocManager_1.30.25
#> [75] filelock_1.0.3 MatrixGenerics_1.20.0
#> [77] stats4_4.5.0 generics_0.1.3
#> [79] S4Vectors_0.46.0 hms_1.1.3
#> [81] ggplot2_3.5.2 munsell_0.5.1
#> [83] scales_1.3.0 xtable_1.8-4
#> [85] glue_1.8.0 tools_4.5.0
#> [87] data.table_1.17.0 fgsea_1.34.0
#> [89] annotate_1.86.0 locfit_1.5-9.12
#> [91] babelgene_22.9 XML_3.99-0.18
#> [93] fastmatch_1.1-6 cowplot_1.1.3
#> [95] grid_4.5.0 DEXSeq_1.54.0
#> [97] edgeR_4.6.0 AnnotationDbi_1.70.0
#> [99] colorspace_2.1-1 nlme_3.1-168
#> [101] GenomeInfoDbData_1.2.14 showtextdb_3.0
#> [103] cli_3.6.4 rappdirs_0.3.3
#> [105] S4Arrays_1.8.0 dplyr_1.1.4
#> [107] gtable_0.3.6 DESeq2_1.48.0
#> [109] sass_0.4.10 digest_0.6.37
#> [111] BiocGenerics_0.54.0 SparseArray_1.8.0
#> [113] farver_2.1.2 geneplotter_1.86.0
#> [115] memoise_2.0.1 htmltools_0.5.8.1
#> [117] lifecycle_1.0.4 httr_1.4.7
#> [119] statmod_1.5.0 bit64_4.6.0-1
References
Anders, Simon, Alejandro Reyes, and Wolfgang Huber. 2012. “Detecting Differential Usage of Exons from Rna-Seq Data.” Genome Research 22 (10): 2008–17. https://doi.org/10.1101/gr.133744.111.
Korotkevich, Gennady, Vladimir Sukhov, and Alexey Sergushichev. 2019. “Fast Gene Set Enrichment Analysis.” bioRxiv. https://doi.org/10.1101/060012.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.