Package Details (original) (raw)
Contents
- 1 Motivation and Necessity
- 2 Installation
- 3 Example Workflow
- 4 Example GRN object
- 5 Input
- 5.1 Open chromatin and RNA-seq data
- 5.2 TF and TFBS data
* 5.2.1 HOCOMOCO-derived TFBS and download links
* 5.2.2 Other TFBS sources
* 5.2.3 Importing TF and TFBS from the JASPAR databases directly in R - 5.3 Sample metadata (optional but highly recommended)
- 5.4 Hi-C data (optional)
- 5.5 Capture Hi-C data / known promoter-enhancer links (optional)
- 5.6 SNP data (optional)
- 5.7 TF activity data (optional, coming soon)
- 6 Methodological Details and Basic Mode of Action
- 6.1 Data normalization
* 6.1.1 Normalization methods common for peaks and RNA data
* 6.1.2 Normalization methods specific for peaks data
* 6.1.3 Raw vs pre.normalized data - 6.2 TF-peak connections
* 6.2.1 TF-peak connection types
* 6.2.2 GC correction
* 6.2.3 TF Activity connections - 6.3 Peak-gene associations
- 6.4 Building eGRNs: Linking TF-peak and peak-gene links and filtering
- 6.5 Background eGRN
* 6.5.1 Methods
* 6.5.2 Object and output details related to the background links - 6.6 Constructing the eGRN graph
- 6.7 Enrichment analyses
- 6.8 Network visualization and visualization filtering
* 6.8.1 General comments
* 6.8.2 Changing the visualization parameters and layout
* 6.8.3 Filtering the network for visualization purposes - 6.9 Feature variation quantification
- 6.1 Data normalization
- 7 Output
- 7.1 GRN object and results stored within
* 7.1.1 Object details
* 7.1.2 Results in the object
* 7.1.3 Object summary - 7.2 Original and normalized data, annotations and provided metadata
* 7.2.1 Object data
* 7.2.2 Plots - 7.3 PCA plots and results
* 7.3.1 Object data
* 7.3.2 Plots - 7.4 TF-peak results and diagnostic plots
* 7.4.1 Object data
* 7.4.2 Plots
* 7.4.3 Extra plots for the GC correction - 7.5 Activator-repressor classification results and diagnostic plots
* 7.5.1 Object data
* 7.5.2 Plots - 7.6 Peak-gene results and diagnostic plots
* 7.6.1 Object data
* 7.6.2 Plots - 7.7 Filtered TF-peak-gene connections (eGRN table)
* 7.7.1 Object data
* 7.7.2 Plots - 7.8 TF-gene connections
* 7.8.1 Object data
* 7.8.2 Plots - 7.9 Connection summary plots
* 7.9.1 Object data
* 7.9.2 Plots - 7.10 eGRN graph
* 7.10.1 Object data
* 7.10.2 Plots - 7.11 Community information
* 7.11.1 Object data
* 7.11.2 Plots - 7.12 Enrichment results and plots
* 7.12.1 Object data
* 7.12.2 Plots - 7.13 Feature variation quantification
* 7.13.1 Object data
* 7.13.2 Plots - 7.14 SNP data
* 7.14.1 Object data
* 7.14.2 Plots
- 7.1 GRN object and results stored within
- 8 Guidelines, Recommendations, Limitations, Scope
- 8.1 Package scope
- 8.2 Number of samples
- 8.3 Peaks
- 8.4 RNA-Seq
- 8.5 Transcription factor binding sites (TFBS)
- 8.6 Choice of correlation methods for TF-peak, peak-gene, TF-gene connections and outlier robustness
- 8.7 Peak-gene p-values accuracy and violations
- 8.8 eGRNs from single-cell data
- 8.9 Recapitulating object history, function parameters etc
- 9 Memory footprint and execution time, feasibility with large datasets
- 10 References
- Appendix
Motivation and Necessity
Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have cell-type specific activity. This TF activity can be quantified with existing tools such as diffTF
and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a enhancer-mediated gene regulatory network (eGRN) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a eGRN using bulk RNA-seq and open chromatin (e.g., using ATAC-seq or ChIP-seq for open chromatin marks) and optionally TF activity data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which is connected to genes in the vicinity or within the same chromatin domain (TAD). We use a statistical framework to assign empirical FDRs and weights to all links using various statistical approaches.
In summary, we present a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types.
For the latest version of the paper, see the References.
Installation
GRaNIE
package and required dependency packages
GRaNIE
is available on Bioconductor. The package and installation instructions can be found here and are very simple.
The basic installation installs all required packages but not necessarily those that are listed under Suggests
unless you installed the package with BiocManager::install("GRaNIE", dependencies = TRUE)
.
However, we also provide instructions on how to install the newest version of the package outside of Bioconductor for users who do not use the newest version of Bioconductor and/or do not have the _devel_version of Bioconductor installed. For details, see the instructionshere.
Note that from the additional packages, only some of the genome annotation packages are actually required, which of them depends on your genome assembly version (see below).
Additional packages
Overall, we tried to minimize the installation burden and only require packages if they are absolutely necessary for the main workflow. If you want to pre-install also all optional packages, please consider the following options:
- Use
BiocManager::install("GRaNIE", dependencies = TRUE)
which however installs all annotation packages for all supported genomes, which may take a longer time due to the overall download size of multiple Gb. - Install only the required annotation packages, along with all other optional packages. You may use the following for it:
- Genome annotation packages (only one of the four is optionally needed for some functions, see section below, which of them depends on your genome assembly version):
*hg38
:BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg38", "TxDb.Hsapiens.UCSC.hg38.knownGene"))
*hg19
:BiocManager::install(c("org.Hs.eg.db", "BSgenome.Hsapiens.UCSC.hg19", "TxDb.Hsapiens.UCSC.hg19.knownGene"))
*mm10
:BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm10", "TxDb.Mmusculus.UCSC.mm10.knownGene"))
*mm39
:BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm39", "TxDb.Mmusculus.UCSC.mm39.knownGene"))
*mm9
:BiocManager::install(c("org.Mm.eg.db", "BSgenome.Mmusculus.UCSC.mm9", "TxDb.Mmusculus.UCSC.mm9.knownGene"))
*rn6
:BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn6", "TxDb.Rnorvegicus.UCSC.rn6.refGene"))
*rn7
:BiocManager::install(c("org.Rn.eg.db", "BSgenome.Rnorvegicus.UCSC.rn7", "TxDb.Rnorvegicus.UCSC.rn7.refGene"))
*dm6
:BiocManager::install(c("org.Dm.eg.db", "BSgenome.Dmelanogaster.UCSC.dm6", "TxDb.Dmelanogaster.UCSC.dm6.ensGene"))
- If you want to use the JASPAR database for TF and TFBS, you need the following extra packages:
BiocManager::install(c("JASPAR2024", "TFBSTools", "motifmatchr", "rbioapi"))
- For all other optional packages, you may execute this line for installing them in one go:
BiocManager::install(c("IHW", "WGCNA", "csaw", "EDASeq", "ChIPseeker", "variancePartition", "ReactomePA", "DOSE", "clusterProfiler", "BiocFileCache", "BiocParallel", "LDlinkR"))
- Genome annotation packages (only one of the four is optionally needed for some functions, see section below, which of them depends on your genome assembly version):
Detailed information about the scope of the optional packages
GRaNIE
has a number of packages that enhance the functionality for particular cases, may be required depending on certain parameters, only when using a particular functionality or that generally speed-up the computation. In detail, the following packages are currently optional and may be needed context-specifically:
- Genome assembly and annotation packages (only one of the four is optionally needed, which of them depends on your genome assembly version)
- all of the following packages are ONLY needed for either additional peak annotation in combination with
ChIPseeker
(see below) or if the chosen peak normalization method is GC-based (EDASeq_GC_peaks
,gcqn_peaks
)
*org.Hs.eg.db
: Needed only forhg19
orhg38
*org.Mm.eg.db
: Needed only formm9
ormm10
ormm39
*org.Rn.eg.db
: Needed only forrn6
orrn7
*org.Dm.eg.db
: Needed only fordm6
*BSgenome.Hsapiens.UCSC.hg19
,TxDb.Hsapiens.UCSC.hg19.knownGene
: Needed only forhg19
*BSgenome.Hsapiens.UCSC.hg38
,TxDb.Hsapiens.UCSC.hg38.knownGene
: Needed only forhg38
*BSgenome.Mmusculus.UCSC.mm39
,TxDb.Mmusculus.UCSC.mm39.knownGene
: Needed only formm39
*BSgenome.Mmusculus.UCSC.mm10
,TxDb.Mmusculus.UCSC.mm10.knownGene
: Needed only formm10
*BSgenome.Mmusculus.UCSC.mm9
,TxDb.Mmusculus.UCSC.mm9.knownGene
: Needed only formm9
*BSgenome.Rnorvegicus.UCSC.rn6
,TxDb.Rnorvegicus.UCSC.rn6.refGene
: Needed only forrn6
*BSgenome.Rnorvegicus.UCSC.rn7
,TxDb.Rnorvegicus.UCSC.rn7.refGene
: Needed only forrn7
*BSgenome.Dmelanogaster.UCSC.dm6
,TxDb.Dmelanogaster.UCSC.dm6.ensGene
: Needed only fordm6
ChIPseeker
: Not needed strictly speaking, but used only for the functionaddData()
to provide additional peak annotation, fully optional otherwise.
- all of the following packages are ONLY needed for either additional peak annotation in combination with
- TF and TFBS related packages
JASPAR2024
,TFBSTools
,motifmatchr
,rbioapi
: Needed only whensource = "JASPAR"
foraddTFBS()
for using JASPAR 2024 TFs and TFBS
- Additional statistical packages and for enhanced functionality
IHW
: Needed only for the functionfilterGRNAndConnectGenes()
for p-value adjustment of the raw peak-gene p-values when using theIHW
framework (parameterpeak_gene.fdr.method
)csaw
: Needed only for a cyclic LOESS normalization inaddData()
, which is the default normalization currently for adding TF activity data viaaddData_TFActivity()
EDASeq
: Provides additional, GC-aware normalization schemes for the peak data in in `addData().variancePartition
: Needed only for the functionadd_featureVariation()
to quantify multiple sources of biological and technical variation for features (TFs, peaks, and genes)WGCNA
: Needed only when settingcorMethod = "bicor"
in multiple supported functions for enabling another type of robust correlations as alternative to Spearman (an experimental, largely undocumented feature so far).
- Enrichment-associated packages
topGO
,ReactomePA
,DOSE
,clusterProfiler
: Needed only for all enrichment functions forGO
,Reactome
,DO
, andKEGG
respectively
- SNP-related extra packages
LDlinkR
: Needed only foraddSNPData()
to add SNP LD information.
- Computational efficacy
BiocFileCache
: Needed only forloadExampleObject()
, in cases when this function is executed multiple times, caching is faster than re-downloading the file anew every time the function is executed.BiocParallel
: Only used but highly recommended in the functionsaddConnections_peak_gene()
,add_TF_gene_correlation()
, andoverlapPeaksAndTFBS()
to speed-up computation when requesting multiple cores.
Example GRN object
For user convenience, we provide the function loadExampleObject()
that can load a supported GRN object from the GRaNIE
package from an arbitrary URL into R. This function can be used, for example, to load an example object that contains a small number of TFs that we specifically prepared for package use and exploring the package.
Simply run the following command:
GRN = loadExampleObject()
You can start by simply typing GRN
(i.e., the object name) into the console, and a summary of the GRN object is printed. The example object is always up to date with the most recent version of the package and everything that the package can calculate is contained in the object. Thus, you can run any function after loading the example object.
For more details on where the data from the example object comes from, see the workflow vignette with explanations.
Input
In our GRaNIE
approach, we integrate multiple data modalities. Here, we describe them in detail and their required format.
Open chromatin and RNA-seq data
Open chromatin data may come from ATAC-seq, DNAse-seq or ChIP-seq data for particular histone modifications that associate with open chromatin such as histone acetylation (e.g., H3K27ac). They all capture open chromatin either directly or indirectly, and while we primarily tested and used ATAC-seq while developing the package, the others should also be applicable for our framework. From here on, we will refer to these regions simply as peaks.
For RNA-seq, the data represent expression counts per gene across samples.
Here is a quick graphical representation which format is required to be compatible with our framework:
- columns are samples, rows are peaks / genes
- column names are required while row names are ignored
- except for one ID columns, all other columns must be numeric and represent counts per sample
- ID column:
- The name of the ID column can be anything and can be specific later in the pipeline. For peaks, we usually use
peakID
while for RNA-seq, we useEnsemblID
- for peaks, the required format is “chr:start-end”, with
chr
denoting the chromosome, followed by:
, and thenstart
,-
, andend
for the peak start and end, respectively. Coordinates are assumed to be zero-based exclusive, the standard for BED files, seehereor here for more information. In short, the first base in a chromosome is numbered 0, and the last base is not included in the feature. For example, a peak with the coordinateschr1:100-150
starts at position 100 (or the 101th base of the chromosome, as counting starts with 0), spans 50 bp and ends at 149 (and NOT 150).
- The name of the ID column can be anything and can be specific later in the pipeline. For peaks, we usually use
- counts should be raw if possible (that is, integers), but we also support pre-normalized data. See here for more information.
- peak and RNA-seq data may contain a distinct set of samples, with some samples overlapping but others not. This is no issue and as long as some samples are found in both of them, the
GRaNIE
pipeline can work with it. Note that only the shared samples between both data modalities are kept, however, so make sure that the sample names match between them and share as many samples as possible. See the methods part for guidelines on how many samples we recommend.
Note that peaks should not overlap. If they do, an informative error message is thrown and the user is requested to modify the peak input data so that no overlaps exist among all peaks. This can be done by either merging overlapping peaks or deleting those that overlap with other peaks based on other criteria such as peak signal, by keeping only the strongest peak, for example.
For guidelines on how many peaks are necessary or recommended, see the section below.
For guidelines on whether and how to prepare single-cell 10x multiome data, see here.
TF and TFBS data
TF and TFBS data is mandatory as input. As of March 2023, we offer two different ways of how to import TF and TFBS data into GRaNIE
.
First, we support the import of any TF and TFBS data irrespective of how it was generated or what it represents, thereby offering full flexibility for how our framework is used. The only drawback is that preparing the TF and TFBS data is a little more involved. This was the only and default way of how to use GRaNIE
until February 2023. In what follows, we explain more details.
Specifically, the package requires a bed
file per TF with TF binding sites (TFBS). TFBS can either be in-silico predicted, or experimentally verified, as long as genome-wide TFBS can be used.
Our recommended and default TF database is HOCOMOCO v12, see below for details.
HOCOMOCO-derived TFBS and download links
As a default, we use HOCOMOCO-derived TFBS predictions that were predicted with PWMScan
. For further method details, see the diffTF paper from our lab.
We provide TFBS predictions for HOCOMOCO-based TF motifs that were used with PWMScan
for hg19
, hg38
, mm10
and soon mm39
. As of 2024, we recommend using HOCOMOCO v12 (Website,Paper). Using this database requires only one extra step of downloading the database and preparing it for the use with GRaNIE.
We provide the following downloads that can be directly used with GRaNIE:
hg38:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download here)PWMScan
+ HOCOMOCO v11 (Download here)FIMO
+ HOCOMOCO v11 (Download here)
hg19:
PWMScan
+ HOCOMOCO v10 (Download here)
mm10:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download here)PWMScan
+ HOCOMOCO v10 (Download here)
mm39:
PWMScan
+ HOCOMOCO v12 INVIVO collection (Download here)
To use them, simply download the file, extract and then reference the folder within the motifFolder
argument in the addTFBS()
function - see also the Workflow Vignettefor examples.
Other TFBS sources
However, you may also use your own TFBS data, and we provide full flexibility in doing so. Only some manual preparation is necessary. Briefly, if you decide to use your own TFBS data, you have to prepare the following:
- a folder that contains one TFBS file per TF in
bed
orbed.gz
format, 6 columns - file names must be
{TF}{suffix}.{fileEnding}
, where{TF}
specifies the name of the TF,{suffix}
an optional and arbitrary string (we use_TFBS
, for example), and{fileEnding}
the file type (supported arebed
andbed.gz
). - the folder must also contain a so-called translation table that must be called
translationTable.csv
. This file must have the following structure: 3 columns (tab-separated), calledSYMBOL
,ENSEMBL
, andHOCOID
. The first column denotes a symbol or shortcut for the TF that is used throughout the pipeline (e.g.,AHR
), the second the ENSEMBL ID (without the dot suffix; e.g.,ENSG00000106546
) and the third column the prefix of how the file is called with the TFBS (e.g.,AHR.0.B
if the file forAHR
is calledAHR.0.B_TFBS.bed.gz
). - we provide example files for selected genome assemblies (hg19,hg38, mm9, mm10, mm39) that are fully compatible with
GRaNIE
as separate downloads. For more information, check the links above.
For more methodological details, details on how to construct these files, their exact format etc we refer to diffTF
paper andwebsite for details.
Importing TF and TFBS from the JASPAR databases directly in R
As of March 2023, we additionally a more user-friendly way of importing TF and TFBS data from the various JASPAR databases that are available in R/Bioconductor: JASPAR
2022 and 2024. While we so far tested only with the JASPAR2022
database / R package, the other ones should also work - if not, please let us know. Using JASPAR2024
as TF database is much easier and also quicker from a user perspective, and none of the additional complexities that were mentioned above for the custom import apply here. However, we do note that HOCOMOCO v12 produces larger GRNs from our (limited) test experience. For details, see the addTFBS()
andoverlapPeaksAndTFBS()
functions.
Note that we also implemented an option to customize which TFBS collection to use from the JASPAR
database via the function argumentJASPAR_useSpecificTaxGroup
from addTFBS()
. For example, this feature can be used to specify the whole TF collection irrespective of the genome assembly the data is in - which is useful for genomes such as rat or mouse, for which JASPAR only finds a few dozen TF at most if the specific genome was specified otherwise.
Be aware, though, that you may want to compare your eGRN across different TF and TFBS data - we are currently investigating the effect of using JASPAR
motives and cannot currently comment much on it except what we sumamrized above: HOCOMOCO v12 produces larger GRNs from our (limited) test experience. We will, however, update the vignette here once we have more results to share.
Hi-C data (optional)
Integration of Hi-C data is optional and serves as alternative to identifying peak-gene pairs to test for correlation based on a predefined and fixed neighborhood size (seeMethods).
If TAD domains are used, the neighborhood of a peak will be defined by the TAD the peak is located in, and all genes in the same TAD will be tested for correlation. If the TAD is very narrow, this consequently results in fewer genes to be tested as compared to a fixed 250 kb neighborhood (the default), for example, while the opposite is true for a large megabase-long TAD domain.
If Hi-C data are available, the pipeline expects a BED file format with at least 3 columns: chromosome name
, start
, and end
. An ID
column is optional and assumed to be in the 4th column, all additional columns are ignored.
For more details, see the R help (?addConnections_peak_gene
) and theMethods.
Capture Hi-C data / known promoter-enhancer links (optional)
It is now also possible to input, in addition to or as replacement of the regular peak-gene links (see here), known links that can either be Capture Hi-C links or more generally known promoter-enhancer links. If provided, the coordinates for both bait(the promoter coordinates) as well as other end (OE) coordinates (the enhancers) are overlapped with the gene promoters and peaks/enhancers as defined in the object, and the corresponding pair is added to the list of pairs to test for correlation, irrespective of their genomic distance. In principle, this even works for links that are defined on different chromosomes.
SNP data (optional)
Optionally, SNP data can also be integrated into our framework via the function addSNPData()
. The main idea is to annotate each peak with the information whether it overlaps any of the user-provided SNPs (and how many), and use this extra annotation later on for creating the TF-peak-gene eGRN within filterGRNAndConnectGenes()
. For more information, see here.
Optionally, we now also support identifying SNPS that are in LD with any of the user-provided SNPs to expand the set of SNPs that are peak-associated. For more details, see addSNPData()
.
SNP IDs have to be provided with their corresponding rsIDs
, and the genomic position is retrieved automatically within addSNPData()
by using biomaRt
.
TF activity data (optional, coming soon)
We also plan to make it possible to manually import TF activity or any other TF-specific data that can be used for generating TF-peaks links, in addition to our default approach of using TF expression. Stay tuned!
For more details, see here.
Methodological Details and Basic Mode of Action
In this section, we give methodological details and guidelines.
Data normalization
As everywhere in bioinformatics, proper data normalization for both RNA and open chromatin (peaks) data is very important. Thus, the chosen normalization method may have a large influence on the resulting eGRN, so make sure the choice of normalization is reasonable. Data normalization is performed when the data is imported into GRaNIE
in the addData()
function and cannot be changed thereafter.
In this section, we give details on the supported data normalization schemes currently available in the GRaNIE
framework.
Normalization methods common for peaks and RNA data
We currently support six choices of normalization of either peak or RNA-Seq data: limma_quantile
, DESeq2_sizeFactors
,limma_cyclicloess
, limma_scale
, csaw_cyclicLoess_orig
, csaw_TMM
plus none
to skip normalization altogether. The default for RNA-Seq is a quantile normalization, while for the open chromatin peak data, it isDESeq2_sizeFactors
(i.e., a “regular” DESeq2
size factor normalization). Importantly, some normalization methods such asDESeq2_sizeFactors
strictly require raw data, while others, such aslimma_quantile
, do not necessarily.
Normalization methods specific for peaks data
For peaks, two additional GC-based normalization schemes are offered:EDASeq_GC_peaks
and gcqn_peaks
. We refer to the R help for more details (?addData
).
Raw vs pre.normalized data
We recommend raw data as input, although it is also possible to provide pre-normalized data as input. Pre-normalizing data beforehand may be useful in case there are known batch effects, for example. Removing the batch effects from the counts before running GRaNIE
therefore may be a worthwhile strategy, a procedure that generally produces non-integer counts. In such a case, the user can then either normalize the data with another normalization method (one that is suitable also with non-integer values, see above) or skip additional normalization and use the counts as they are (choosing none
therefore for the normalization parameter).
TF-peak connections
To identify statistically significant TF-peak connections we implement a cell-type specific data-driven approach. In brief, we first calculate the Pearson correlation coefficients between the expression level of each TF and the open chromatin signal of each peak across all samples. We then use an empirical FDR procedure to identify statistically significant TF-peak connections. For this, for each TF, we split the peaks into two sets: A foreground set containing the peaks with a putative TFBS and a corresponding background set consisting of peaks without putative TFBS based on the TF-peak binding matrix calculated above. We then discretize the TF-peak correlation r into 40 bins in steps of 0.05 ranging from -1, -0.95, …, 0, …, 1 and calculate a bin-specific FDR value using two different directions. For more details on how we establish TF-peak links, please check the Supplement of the corresponding publication. See References for links.
We employ a shuffling-based approach to calculate background TF-peak links, as described in detail here.
TF-peak connection types
TF-peak connections are found by correlating a suitable measure related to TFs with peak accessibility, and then identifying statistically significant links. By default, TF expression is used as TF measure, but we also implemented an additional measure for establishing these links based on a measure called TF activity (see below).
GC correction
TFs and therefore also TF motifs can have very different GC preferences. Some TFs are known to bind to very GC-rich regions such as SP1. Thus, when calculating the TF-peak link FDRs, comparing a GC-rich or GC-poor foreground set of peaks with the full background (that is, all other peaks with no predicted site), the latter of which may exhibit a very different GC distribution, is potentially biasing the FDR values. We therefore offer an experimental feature to correct for this effect and to use a corresponding GC-matching background. The GC correction is enabled with useGCCorrection = TRUE
in the functionaddConnections_TF_peak()
. Note that the GC adjustment will take additional time and is currently much slower as compared to not adjusting the background.
Importantly, we note again that this feature is experimental as of now and we are still exploring its effects and plausibility.
When the GC correction is activated, the following is performed:
- The pre-calculated GC content of each peak is taken and binned into 10 GC bins (0-10%, 11-20%, …, 91-100%)
- For both TF-specific foreground and background, the relative frequencies of each of the GC bins is calculated (thus, the sum of all relative frequencies equals 1)
- We currently offer two different modes of how to create a GC-matched background.
- When
percBackground_resample
is set to a value > 0 (default is 75), the background size is fixed to this value (i.e., 75 translates to a background size of 75% of the original size), and no iterative procedure is run to automatically determine the background size (see b). - When
percBackground_resample = 0
, an automatic iterative procedure is run that determines the maximum value ofpercBackground_resample
(starting from a value of 100 - the full size of the background - down to 5 maximum in steps of 5 per iteration) so that all GC bins with a relative frequency of at least 5% in the foreground can be matched in sufficient numbers with the corresponding GC bin from the background. GC bins with a relative frequency of less than 5% are ignored due to their low relevance. If the background size as identified by the iterative procedure selects a background that is too small (< 1000 peaks), the selected background percentage is again increased in steps of 5% until at least 1000 peaks can be selected from. This is done for statistical reasons and ensures a minimum no of points in background, even if this sacrifices the mimicking of the distributions as described above.
- When
- Finally, after a value for the percentage of the background size to sample from, the background can now actually be sampled for each GC bin. If
percBackground_resample
is set toTRUE
, then the background distribution will mimic the foreground distribution perfectly even for GC bins that are very lowly abundant (i.e., those with a relative frequency of < 0.05 in the foreground). In summary, for GC bins for which not enough peaks are available in the background, a resampling scheme is performed that selects the required number of peaks by sampling with replacement from the background. If set toFALSE
, however, no resampling is performed and the number of peaks selected from the background is the maximum that can be selected. In this case, the relative frequencies in foreground and background may differ. Note that whenpercBackground_resample = 0
, this is only relevant for GC bins with a relative frequency of < 0.05, while otherwise, resampling may occur for an GC bin depending on the chosen value ofpercBackground_size
Rationale for choosing an appropriate number forpercBackground_resample
Ideally, one would want to select as many peaks in the background as possible to maximize the sample size and therefore randomness for better statistical sampling. However, this often fails because particular GC bins may be very dominant in the foreground for GC-rich TFs (e.g., the GC bin 71-80% has a relative frequency of 25%) while the background contains overall only 5% peaks with this GC bin. Thus, one cannot select 25% of all background peaks (as the relative frequency of the GC bins in the foreground is to be kept in the background) of the same GC bin without resampling. The only alternative is to not use the full background size as reference, but only a particular percentage of it so that the required absolute numbers of peaks to sample from decreases and eventually can be selected for all relevant GC bins. When choosing this option, the background size can be much smaller (e.g., 15% only), therefore also choosing only a relative small absolute number of peaks from the background. However, no resampling is done here, therefore not (by chance) giving particular peaks a higher weight.
You may ask which value to choose for percBackground_size
and whether to use GC correction at all, however? These are good questions, and while we cannot give any guaranteed reply, we can give recommendations and our experiences from a limited number of datasets:
- for most TFS, those that have no GC preference, whether or not the background is GC corrected should not make a big difference as foreground and background are sufficiently similar for these TFs (in theory, we do not to verify whether this is actually really true)
- for those with extreme GC preferences, in either direction, foreground and background will be very different, and the effects of the GC correction are most visible. Choose some TFs with known GC preferences such as SP1 and check the diagnostic plots to verify and get a feeling for how well the correction worked.
- we generally recommend running
GRaNIE
once with and once without GC correction to compare the eGRNs. What we sometimes see is that in the eGRN without a GC correction, the most abundant TFs actually have a GC-rich motif, indicating that they may be overrepresented in the eGRN. - for choosing an appropriate value of
percBackground_size
, we recommend leaving it at the default value of 75 first and potentially adjusting it after looking at the diagnostic plots. Higher values will result in (1) a higher number of GC bins that need resampling (and therefore potentially overrepresenting some peaks or indirectly causing a bias due to the higher percentage of resampled peaks) while (2) increasing the background size overall, which is better from a statistical sampling point of view. We think- outweighs (1) and therefore set the default to a relatively high value, but we are still checking and experimenting with this feature. This is why the GC correction is still marked as experimental. On the other hand, a lower value will result in the opposite of the two points mentioned above. As often in statistics and science, it is about finding a good compromise.
Judging how well the GC correction worked
We also provide detailed QC plots that summarize both the GC background selection as well as the differences for teh TF-peak connections with and without GC correction. For more details, seehere.
Reproducibility of GC correction
Lastly, we need to emphasize that the GC correction, due to the random sampling from the background, is not deterministic and may yield a slightly different number of connections each time it is run. The more bins are resampled (see discussion above), the more differences may arise
TF Activity connections
As explained above, TF-peak connections are found by correlation TF_expression_ with peak accessibility. In addition to expression, we also offer to identify statistically significant TF-peak links based on_TF Activity_. The concept of TF Activity is described in more detail in our diffTF
paper. In short, we define TF motif activity, or TF activity for short, as the effect of a TF on the state of chromatin as measured by chromatin accessibility or active chromatin marks (i.e., ATAC-seq, DNase sequencing [DNase-seq], or histone H3 lysine 27 acetylation [H3K27ac] ChIP-seq). A TF Activity score is therefore needed for each TF and each sample.
TF Activity information can either be calculated within the GRaNIE
framework using a simplified and empirical approach) or it can be calculated outside of our framework using designated methods and thenimported into our framework. We now describe these two choices in more detail.
Calculating TF Activity
In our GRaNIE
approach, we empirically estimate TF Activity for each TF with the following approach:
- normalize the raw peak counts by one of the supported normalization methods (see below)
- from the TF-peak accessibility matrix as calculated before, identify the subset of peaks with a TFBS overlap for the particular TF based on the user-provided TFBS data
- scaling and centering of the normalized accessibility scores per row (i.e., peak) so that row means are close to 0 for each peak
- the column means (i.e., sample means) from the scaled and centered counts are then taken as approximation for the TF Activity
By default, we currently offer the four different types of normalizing the raw data for calculating TF Activity. Options 2 to 4 are described in more detail in the section Data normalization above, while option 1 is currently only available for TF Activity and therefore explained below (this may change in the future):
- Cyclic LOESS normalization (default)
Local regression (LOESS) is a commonly used approach for fitting flexible non-linear functions, which involves computing many local linear regression fits and combining them. Briefly, a normalization factor is derived per gene and sample using thenormOffsets
function of thecsaw
package in R as opposed to using one size factor for each sample only as with the regular size factor normalization inDeSeq
. For each sample, a LOWESS (Locally Weighted Scatterplot Smoothing) curve is fitted to the log-counts against the log-average count. The fitted value for each bin pair is used as the generalized linear model offset for that sample. The use of the average count provides more stability than the average log-count when low counts are present. For more details, see thecsaw
package inR
and thenormOffsets
methods therein. - Standard size factor normalization from
DeSeq
- Quantile normalization
- No normalization
Importing TF Activity
Soon, it will also be possible to import TF Activity data into our framework as opposed to calculating it using the procedure as described above. This feature is currently in development and will be available soon.
Adding TF Activity TF-peak connections
Once TF Activity data is available, finding TF-peak links and assessing their significance is then done in complete analogy as for the TF expression data - just the input data is different (TF Activity as opposed to TF expression). The so-called connection type - _expression_or TF Activity, is stored in the GRN object and output tables and therefore allows to tailor and filter the resulting network accordingly. All output PDFs also contain the information whether a TF-peak link has been established via the TF expression or TF Activity.
Peak-gene associations
In the GRaNIE
framework, we construct peak-gene pairs completely independently and separately from the TF-peak links. For this, the function addConnections_peak_gene()
is used. In general, we offer two options to decide which peak-gene pairs to test for correlation:
- By default, without additional Hi-C data, we use a local neighborhood-based approach with a user-defined neighborhood size (default: 250 kb both up- and downstream of the peak; see parameter
promoterRange
for details) to select which peak-gene pairs to test. Thus, for a particular peak, all genes within 250 kb either up- or downstream of it would be selected and tested for correlation. - In the presence of additional topologically associating domain (TADs) data from Hi-C or similar approaches, for a particular peak, the package selects all peak-gene pairs within the same TAD for correlation. Noteworthy, peaks located outside of any TAD domain are ignored. The user has furthermore the choice to specify whether overlapping TADs should be merged beforehand or not.
- If other known promoter-enhancer links are provided, these are taken either in addition to the links as identified above or as a replacement thereof, depending on the parameter
knownLinks_useExclusively
that was provided as part of the function call. If multiple promoter-enhancer peaks overlap with the provided user-defined links, all unique combinations are taken and added to the list of peak-gene pairs to test. By default, only the TSS of genes is checked for overlap, while this can be changed with the parameteroverlapTypeGene
in analogy to how it works everywhere else.
For all approaches, the user can select between two options of how to exactly calculate the distance between a peak and a gene. That is, which part of the gene is taken as reference point: the 5’ end (TSS, the default) or the full gene including all exons and introns. For more information see the R help (?addConnections_peak_gene
and the parameter overlapTypeGene
in particular)
Importantly, as for TF-peaks, we also employ a shuffling-based approach to calculate background peak-gene links, as described in detailhere.
Building eGRNs: Linking TF-peak and peak-gene links and filtering
As GRaNIE
models TF-peak and peak-gene links independently from one another, the package function filterGRNAndConnectGenes()
can be used to combine them to a tripartite TF-peak-gene combination. In addition, filtering both TF-peaks and peak-gene links according to different criteria is important and can be user-adjusted flexibly. Importantly, for statistical reasons, multiple testing correction of the peak-gene raw p-values is only done after connecting them to TFs and filtering them according to user-defined choices. We offer a wide range of different adjustment methods, including Independent hypothesis weighting (IHW), a multiple testing procedure that increases power compared to the method of Benjamini and Hochberg by assigning data-driven weights to each hypothesis. For more details, see, for example,here.
When SNPs have been integrated (see here), the user can choose whether to 1. additionally add a SNP filter to keep or discard TF-peak-gene connections (e.g. by requiring a minimum number of SNPs that have to overlap a peak) 2. keep peaks (either as part of TF-peak or peak-gene connections) if they overlap with a user-defined minimum number of SNPs (e.g, at least 1).
Thus, integrating SNPs can either reduce the size of eGRN by adding another filtering step or increase it by extending the criteria for which we find a peak relevant enough: In addition to being connected to a TF, a peak may also be kept just because it overlaps with a SNP (independent of any TF connection).
After successful execution, the filtered TF-peak-gene network is stored in GRN@connections$all.filtered
(both real and background eGRN). It can be easily retrieved, along with additional options for which output columns to include, using getGRNConnections()
.
After a filtered eGRN has been stored in the GRN object, the user can run all network related analysis such as enrichment or visualization. See the next sections for details.
Importantly, after successful execution offilterGRNAndConnectGenes()
and re-population of the filtered eGRN inGRN@connections$all.filtered[["0"]]
, the eGRN graph as produced bybuild_eGRN_graph()
(GRN@graph
) is reset to avoid inconsistencies with mismatches between the filtered eGRN, the graph representation of it and associated enrichment results.
Background eGRN
Methods
In the GRaNIE
framework, in addition to the real connections, we also calculate background connections in each step based on randomized data. Calculating a background eGRN follows exactly the same principles and methods as the real eGRN, without exceptions. It is only the input data that is shuffled in different ways, for both TF-peak links as well as peak-gene links, as follows:
- TF-peak links: First, the rows of the binary 0/1 TF-peak overlap table are shuffled separately for each sample (i.e., per column) to produce a truly random overlap matrix. In addition, the sample labels for the RNA counts are shuffled. Thus, subsequently, the counts that are correlated between RNA and peak data are not from the same pair of samples in fact.
- peak-gene links: After creating the real and true table for the peak-gene pairs that fulfil the user-specified requirements for being tested for correlation (e.g., within the specified distance from one another as defined by the parameter
promoterRange
), the peaks are shuffled. This consequently means that the peak-gene pairs that are subsequently tested for correlation are (usually) not in the specified proximity anymore, but instead mostly on other chromosomes or at a very different location on the same chromosome. Notably, this preserves the degree distribution for both peaks and genes: that is, if a gene or peak is overrepresented, this is also true for the shuffled version. Second, in analogy to the background TF-peak links, the sample labels for the RNA counts are shuffled before peak-gene correlations are calculated. The latter can be controlled via the parametershuffleRNACounts
inaddConnections_peak_gene()
. Empirically, we found that only shuffling the peak-gene links is not sufficient to create a truly random background, which is why the default behavior is to do the shuffling in addition. Nevertheless, this can be user-controlled. - TF-peak-gene links (eGRN): Combining TF-peak and peak-gene links for the background data then again follows the exact same methods as the real eGRN. However, due to the randomized nature of both the TF-peak and peak-gene links, the background _eGRN_typically has very few or no statistically significant connections at all.
In summary, we want to emphasize that we currently do not simply permute the real eGRN network. Instead, we re-construct an _eGRN_using the very same methods but with randomized data, as outlined above. This allows us to judge and compare the connectivity for the real network as compared to a background eGRN.
Object and output details related to the background links
In the GRN
object (GRN@connections
slot, to be specific), we label the background data with a 1
, while the original data is labeled as0
. For example, GRN@connections$TF_peaks[["0"]]
stores the original connections, while GRN@connections$TF_peaks[["1"]]
stores those arising from the background.
Relevant output files consequently contain the background
label to designate that the corresponding background has been used to generate the plot, while the original data is labeled as original
.
Constructing the eGRN graph
After a filtered set of TF-peak-gene links (i.e., a full eGRN) has been built with filterGRNAndConnectGenes()
, the actual graph must be constructed using the function build_eGRN_graph()
based on the filtered eGRN (in GRN@connections$all.filtered[["0"]]
).
Importantly, two types of networks are constructed:
- TF-peak-gene network: The networks is tripartite and contains TF, peak, and gene nodes. For example, if a TF links to multiple peaks, all of which may link to the same target gene, these will be represented by separate and different edges in the network. In essence, this graph is in principle the same as the filtered _eGRN_table as produced by
filterGRNAndConnectGenes()
. - TF-gene network: The networks is bipartite and contains only TF and gene, but not peak nodes. Thus, if a TF links to multiple peaks in the underlying set of filtered connections, all of which may link to the same target gene, these will be represented as only one edge in the network after removing duplicate links. Also, particular edges may be removed from the graph such as direct TF-TF edges, depending on the chosen parameters (e..g,
allowLoops
).
For subsequent functions, either of the two networks is used, and the user can choose which type of network to use if both are possible (such as for visualizeGRN
).
For more information, see the R help (?build_eGRN_graph
).
Enrichment analyses
After the graph has been built, three types of enrichment analysis are available within the GRaNIE
framework:
- Enrichment based on the full network (function
calculateGeneralEnrichment
) - Enrichment based on communities (function
calculateCommunitiesEnrichment()
) - Enrichment based on TF regulons (function
calculateTFEnrichment()
)
All enrichment functions here use the TF-gene graph in the GRN
object and therefore require the function build_eGRN_graph()
to be run beforehand. They are, as explained above, based on the filtered network as obtained by filterGRNAndConnectGenes()
and include only the genes from the corresponding (sub)network.
All three functions have corresponding plot functions to visualize the enrichment. For more information such as supported ontologies, see the corresponding R help for the functions, all details are explained there.
For an explanation of the output files from the plot functions, seehere.
All results are stored in GRN@stats$Enrichment
. For example, the results from the enrichment results of TF E2F7.0.B
from the example GRN object (see function loadExampleObject()
) for the GO
biological process (BP) enrichment can be retrieved fromGRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$results
, while the parameters that were used to run the enrichment are correspondingly stored inGRN@stats$Enrichment$byTF$E2F7.0.B$GO_BP$parameters
.
Network visualization and visualization filtering
Changing the visualization parameters and layout
Visualizing the whole network usually works well for small to medium-sized networks that contain up to a few hundred edges, but may fail for larger eGRNs. Drawing thousands of nodes and edges at once without losing readability is almost impossible. However, here are a few tips on what you can do if your eGRN cannot be drawn or it looks not nice enough:
- increase the value of the parameter
maxEdgesToPlot
to say 1000 or even more, and check whether the eGRN can be drawn. It may also help to increase the dimensions of the PDF (ifplotAsPDF = TRUE
) by increasingpdf_width
andpdf_height
. - change the visualization layout by changing the
layout
parameter. Try all of the supported layouts (see?visualizeGRN
for a list), maybe one looks good enough.
Filtering the network for visualization purposes
If none of the plotting-related parameters improve the visualization or if you want to visually explore the network for particular features of the network (such as specific TFs, peaks, or genes), you can usefilterConnectionsForPlotting()
to filter a eGRN just for visualization purposes. This reduces the number of nodes and edges to plot and gives the user ultimate flexibility of what to visualize. For example, you can filter the network to just visualize the part of the network that is connected to a specific set of TFs (i.e, their regulons).
Alternatively, you can also re-filter the network again more stringently using filterGRNAndConnectGenes()
, but then all subsequent analyses have to be rerun as well (e.g., build_eGRN_graph()
and all enrichment functions). This may reduce the multiple testing burden for peak-gene p-values and recover connections that may not have passed the filter beforehand.
Feature variation quantification
One of our latest features we added to the package is to use thevariancePartition
package to quantify and interpret the multiple sources of biological and technical variation from the features (TFs, peaks, and genes) in a GRN
object. This can be done via the functionadd_featureVariation()
. This can be helpful to identify where exactly the observed variation actually comes from, which we believe is useful information when checking eGRNs and particular elements of it. Note that this is still a work of progress, however, and we did not yet test this feature for many datasets.
In essence, we run the main function fitExtractVarPartModel
of the package variancePartition
. This fits a linear (mixed) model to estimate the contribution of multiple sources of variation while simultaneously correcting for all other variables for the features in aGRN
object (TFs, peaks, and genes), given particular metadata. The function reports the fraction of variance attributable to each metadata variable. As input, the normalized count matrices are used.
The formula used for model fitting can either be provided manually or set to auto
. For the latter case, the formula will be build automatically based on all metadata variables as specified with themetadata
parameter. By default, numerical variables will be modeled as fixed effects, while variables that are defined as factors or can be converted to factors (characters and logical variables) are modeled as random effects as recommended by the variancePartition
package.
The user can also speciy whether to run the procedure only on the features from the filtered set of connections (the eGRN, the result offilterGRNAndConnectGenes()
) or on all filters (that is, for all features irrespective of whether or not they are part of the final_eGRN_).
For more information where the information is stored in the GRN
object and which QC plots are available, see here.
Output
Here, we describe the various output files that are produced by the pipeline. They are described in the order they are produced in the pipeline.
GRN object and results stored within
Object details
Our pipeline works and output a so-called GRN
object. The goal is simple: All information is stored in it, and by keeping everything within one object and sharing it with others, they have all the necessary data and information to run the GRaNIE
workflow. A consistent and simple workflow logic makes it easy and intuitive to work with it, similar to other packages such as DESeq2
.
Technically speaking, it is an S4 object of class GRN
. As you can see from the workflow vignette, almost all GRaNIE
functions return a GRN
object (with the notable exception of get
functions - i.e., all functions that start with get). ExceptinitializeGRN()
, which creates an empty GRN
object, they also all require a GRN
object as first argument, which makes is easy and intuitive to work with.
GRN
objects contain all data and results necessary for the various functions the package provides, and various extractor functions allow to extract information out of an GRN
object such as the various get
functions. In addition, printing a GRN
object results in an object summary that is printed (try it out and just type GRN
in the console if your GRN
object is called like this!). In the future, we aim to add more convenience functions. If you have specific ideas, please let us know.
The slots of a GRN
object are described in the R help, see?GRN-class
for details. While we work on general extractor functions for the various slots for optimal user experience, we currently suggest to also access and explore the data directly with the @
operator until we finalized it. For example, for a GRN
object called GRN
,GRN@config
accesses the configuration slot that contains all parameters and object metadata, and slotNames(GRN)
prints all available slots of the object.
Results in the object
During a typical GRaNIE
analysis, almost all results are automatically stored in GRN
object that is used as input (exceptions are indicated in the corresponding sections) and therefore, almost all functions return also the very same GRN
object, containing in addition the results of the function.
The only exception is the function getGRNConnections()
that can be used to extract the resulting eGRN network from a GRN
object and returns a separate data frame and NOT a GRN
object. For more information and an explanation about the various columns that are returned from the function, see the corresponding R help (?getGRNConnections
).
Object summary
A summary of a GRN
object can be retrieved with the functiongetGRNSummary()
. This returns a named list that summarizes all relevant aspects for a GRN
object. For more details, see the R help.
PCA plots and results
Object data
Currently, the actual PCA result data are not stored in the GRN
object, but we may change this. We will update the Vignette once this is done and mention it in the News/Changelog.
Plots
The pipeline outputs PCA plots for both peaks and RNA as well as original (i..e, the counts the user provided as input) and normalized (i.e., the counts after normalizing them if any normalization method has been provided) data. Thus, in total, 4 different PCA plots are produced, 2 per data modality (peaks and RNA) and 2 per data type (original and normalized counts).
Each PDF consists of three parts: PCA results based on the top 500, top 1000 and top 5000 features (see page headers, depending on the parametertopn
in plotPCA_all()
). For each part, different plot types are available and briefly explained in the following:
- Multi-density plot across all samples (1 page)
- Screeplot (1 page)
- Metadata correlation plot
- PCA plots with different metadata being colored (5 or more pages, depending on available metadata)
TF-peak results and diagnostic plots
Object data
The TF-peak connections are stored in GRN@connections$TF_peaks
. This list is structured as follows:
- the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
- inside both real and background links (i.e., within
GRN@connections$TF_peaks[["0"]]
andGRN@connections$TF_peaks[["1"]]
), a list with two elements is stored:main
, containing a data frame with the actual TF-peak connections along with additional propertiesconnectionStats
, containing a summary of the empirical FDR procedure, stratified by TF, correlation bin, FDR direction and TF connection type. This slots also contains details on GC-specific information if activated (see parameteruseGCCorrection = TRUE
inaddConnections_TF_peak
)
The slot GRN@stats$GC
is also populated:
- Irrespective of whether or not a GC correction has been done, overall statistics for the user-provided peaks with respect to GC properties and relative frequencies are stored in
GRN@stats$GC$peaks
. - If the GC adjustment of the background has been performed, additional elements within
GRN@stats$GC
are stored:TFs_GC_correction
: A data frame that summarized the GC correction on the TF and GC bin levelTFs_GC_correction_plots
: For each TF, the GC plots for each TF are stored, namely the same two plots that were part of the output plots (see below). For example, to plot the first plot of the TFAIRE.0.C
, simply typeGRN@stats$GC$TFs_GC_correction_plots$AIRE.0.C[[1]]
.
Plots
We provide multiple QC plots for the TF-peak connections as part of the output, as explained in the following.
Overall connection summary
First, we provide an overview of the total number of TF-peak connections for a range of typically used FDR values for both real and background TF-peak links, stratified by the TF-peak correlation bin. Typically, one sees three things:
- The number of true links is much larger than the number of background links. Often, there are barely any background links passing our statistical procedure of identifying only links that are different from noise.
- The number of links depends highly on the correlation bin. Often, only more extreme correlation bins are populated, simply because a low correlation coefficient close to 0 does not yield a significant p-value. However, depending on the total number of samples and other parameters, the threshold for when a correlation bin may give significant result is changing.
- The higher the FDR, the more links remain.
TF-specific plots
Connection summary
In addition, we provide TF-specific diagnostic plots for all TFs that are included in the GRaNIE
analysis. They summarize the FDR and number of connections, stratified by the connection type (seehere for methodological details), the FDR directionality and the TF-peak correlation bin for both real and background links (see here for methodological details).
The TF name is indicated in the title, and each page shows two plots. In each plot, the TF-peak FDR for each correlation bin (ranging from -1 to 1 in bins of size 0.05) is shown. The only difference between the two plots is the directionality upon which the FDR is empirically derived from: the upper plot is for the positive and the lower plot for the_negative_ direction (for more details, see the References). Each plot is also colored by the number of distinct TF-peak connections that fall into the particular bin. Mostly, correlation bins with smaller absolute correlation values have higher frequencies (i.e., more TF-peak links fall into them) while correlation bins with more extreme correlation values are less frequent. In the end, for the resulting TF-peak and ultimately the eGRN network, the directionality can be ignored and only those TF-peak links with small FDRs are kept, irrespective of the directionality.
Correlation plots
For a particular (set of) TF-peak pair(s), the underlying TF-peak data and the correlation can be visualized with plotCorrelations()
. For more information, see the R help for the function.
Activator-repressor classification results and diagnostic plots
Object data
It is also possible to extract the results from the AR classification out of a GRN
object. Currently, this can only be done manually, extractor functions are in the works that will further enhance the user experience. The results are stored in the slotGRN@data$TFs$classification[[permIndex]] [[connectionType]]$TF.classification
. Here, permIndex
refers to the original, non-permuted (0
) or permuted (1
) data, while connectionType
here is either expression
orTFActivity
, depending on whether the pipeline has also be run for TF Activity in addition to expression (see functionaddConnections_TF_peak()
). Thus, typically, the results for the original data are stored inGRN@data$TFs$classification[["0"]] [["expression"]]$TF.classification
. If intermediate results from the classification have not been deleted (the default is to delete them as they can occupy a large amount of memory in the object, see the parameters of AR_classification_wrapper
for details), they can be accessed similarly withinGRN@data$TFs$classification[[permIndex]] [[connectionType]]
:TF_cor_median_foreground
, TF_cor_median_background
,TF_peak_cor_foreground
, TF_peak_cor_background
, andact.rep.thres.l
.
Plots
The pipeline produces 3 different plot types related to the activator-repressor (AR) classification that can optionally be run as part of the GRaNIE
workflow. For each of the 3 types, plots are produced for both the original (labeled as original
) as well as the permuted (labeled as permuted
) data.
The AR classification is run for the RNA expression data (labeled asexpression
) and can additionally also be run for TF activity data (labeled as TFActivity
, see the function addConnections_TF_peak()
and its parameter options).
In the following, the 3 plot types are briefly explained:
- Summary heatmaps (files starting with
TF_classification_summaryHeatmap
): This is described in detail in the diffTFdocumentation.
We explain and summarize this type of plot in the Workflow Vignette. Please check there for details. - Classification stringency summary plots (files starting with
TF_classification_stringencyThresholds
): This is described in detail in the diffTFdocumentation. - Density plots per TF (files starting with
TF_classification_densityPlotsForegroundBackground
): Density plots for each TF, with one TF per page. The plot shows the foreground (red, labeled asMotif
) and background (gray, labeled asNon-motif
) densities of the correlation coefficient (either Pearson or Spearman, see x-axis label) from peaks with (foreground) or without (background) a (predicted) TFBS in the peak for the particular TF. The numbers in the parenthesis summarize the underlying total number of peaks.
Peak-gene results and diagnostic plots
Object data
The peak-gene connections are stored in GRN@connections$peak_genes
. This list is structured as follows:
- the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
- inside both real and background links (i.e., within
GRN@connections$peak_genes[["0"]]
andGRN@connections$peak_genes[["1"]]
), a data frame with the peak-gene connections along with additional metadata (e.g.,peak.ID
,gene.ENSEMBL
,peak_gene.distance
), their connection origin (e.gTADs
,knownLinks
orneighborhood
), and statistical properties are stored (peak_gene.r
,peak_gene.p_raw
).
Plots
The function plotDiagnosticPlots_peakGene()
or indirectly the functionaddConnections_peak_gene()
(when plotDiagnosticPlots = TRUE
) plots various diagnostic plots for the peak-gene links that are imperative for understanding the biological system and resulting eGRN. In what follows, we describe them briefly, along with some notes on expected patterns, implications etc. Note that this section is subject to regular change.
The main QC summary figure on page 1 is divided into two rows: the upper row focuses on the peak-gene raw p-value from the correlation tests, while the lower row focuses on the peak-gene correlation coefficient. The left side visualizes the data of the corresponding metrics via density plots, while the right side bins the metrics and visualizes them with barplots for highlighting differences between real and background data as well as negatively and positively correlated peak-gene links (denoted as r+
and r-
, respectively).
We will now discuss and explain both of them in more detail.
Correlation raw p-value distribution
First and most importantly, we focus on the distribution of the raw p-values from the correlation tests (i.e., peak accessibility correlated with gene expression) of all peak-gene links. We can investigate these from multiple perspectives.
Let’s start with a density plot. The upper left plot shows the raw p-value density for the particular gene type as indicated in the title (here: all gene types), stratified on two levels:
- Real (left) versus background (right) connections
- Connections that have a negative (
r-
, gray) and positive (r+
, black) correlation coefficient, respectively
Generally, we also consider r-
connections as background. The reasoning for this is as follows: Since chromatin accessibility in regulatory elements is generally associated with active gene regulation and transcription, we only expect positive correlations for functional peak-gene links. Notably, the same is still true even for repressor-bound elements, where binding of most repressors leads to loss of both accessibility and transcription. Negative correlations have no clear biological meaning and therefore may indicate remaining batch effects or random noise.
What we would like to see is:
- background connections show little to no signal, with a flat line along the x-axis (i.e., a flat p-value histogram), with little to no difference between
r+
andr-
connections - for the real connections and
r+
links, a strong peak at small p-values, and a (marginally) flat distribution for higher ones (similar to a well-calibrated raw p-value distribution for any hypothesis test such as differential expression). Forr-
links, the peak at small p-values should be much smaller and ideally the curve is completely flat. However, from the many datasets we examined, this is rarely the case.
If any of these quality controls fails, it indicates at least one assumption of the framework is violated. This could be due to an issue with data normalization, the underlying biology and what the eGRN represents, or an issue with data size or covariates influencing the results. Often, when the number of samples is small, the QC does not look satisfactory. Thus, it is important to use as many samples as possible, preferably at least 12 or so (seehere for more details).
To quantify the difference between r+
and r-
links within each connection type (background vs real), we can also plot the results in form of ratios rather than densities for either the r+
/ r-
or the real / background dimension. These plots are shown in the upper right panel of the summary plot.
For the r+
/ r-
dimension and background data, the ratios should be close to 1 across all p-value bins, while for the real data, a high ratio is typically seen for small p-values. In general, the difference between the background and real bar should be large for small p-values and close to 1 for larger ones.
For the real / background dimension, what we want to see is again a high ratio for small p-value bins for the r+
links, indicating that when comparing background vs real, there are many more small p-value links in real data as compared to background. This usually does not hold true for the r-
links, though, as can be seen also from the plot: the gray bars are smaller and closer to 1 across the whole binned p-value range.
Correlation coefficient distribution
We also provide a summary of the distribution of the peak-gene correlation coefficient. Typically, one sees that the distribution of the real links is wider (i.e., more links with high absolute values - in either direction) and less pronounced around 0 as compared to the background, an indication of the captured signal in the real data.
Correlation plots
For a particular (set of) peak-gene pair(s), the underlying peak-gene data and the correlation can be visualized with plotCorrelations()
. For more information, see the R help for the function.
Filtered TF-peak-gene connections (eGRN table)
Object data
The peak-gene connections are stored in GRN@connections$all.filtered
. This list is structured as follows:
- the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
- inside both real and background links (i.e., within
GRN@connections$all.filtered[["0"]]
andGRN@connections$all.filtered[["1"]]
), a data frame with the TF-peak-gene connections along with additional metadata and statistical properties are stored. This table is not to be used directly, however; instead, the functiongetGRNConnections()
has to be used to retrieve them and add additional information as needed along with it.
Plots
No plots are available here, but the eGRN can be visualized withvisualizeGRN()
.
TF-gene connections
Object data
The peak-gene connections are stored inGRN@connections$$TF_genes.filtered
. This list is structured as follows:
- the first level constitutes, in analogy to many other places within the object, whether the real or background links are stored. For more details, see here.
- inside both real and background links (i.e., within
GRN@connections$$TF_genes.filtered[["0"]]
andGRN@connections$$TF_genes.filtered[["1"]]
), a data frame with the TF-gene connections along with additional metadata and statistical properties are stored. This table is not to be used directly, however; instead, the functiongetGRNConnections()
withinclude_TF_gene_correlations
has to be used to retrieve them.
Plots
For a particular (set of) TF-gene pair(s), the underlying TF-gene data and the correlation can be visualized with plotCorrelations()
. For more information, see the R help for the function.
In addition, the eGRN can be visualized with visualizeGRN()
.
Connection summary plots
Object data
The results of the function generateStatsSummary()
is stored inGRN@stats$connections
. This table contains a lot of information, including but not limited to the various parameters the function iterated over to generate multiple eGRNs for different parameter thresholds and combinations such as the number of distinct TFs, peaks, genes, their connectivity, etc.
Plots
We currently offer two different connection summary PDFs, both of which are produced from the function plot_stats_connectionSummary()
. Both PDFs shows the number of connections for each node type (TF, peak, and gene), while in the boxplots, peaks are further differentiated into TF-peak and peak-gene entities. They also iterate over various parameters and plot one plot per page and parameter combination, as indicated in the title:
allowMissingTFs
:TRUE
orFALSE
(i.e., allow TFs to be missing when summarizing the eGRN network. If set toTRUE
, a valid connection may consist of just peak to gene, with no TF being connected to the peak. For more details, see theR
help forplot_stats_connectionSummary()
)allowMissingGenes
:TRUE
orFALSE
(i.e., allow genes to be missing when summarizing the eGRN network. If set toTRUE
, a valid connection may consist of just TF to peak, with no gene being connected to the peak. For more details, see theR
help forplot_stats_connectionSummary()
)TF_peak.connectionType
. Eitherexpression
orTFActivity
to denote which connection type the summary is based on.
Both plot types compare the connectivity for the real and background data (denoted as Network type
in the boxplot PDF), which allows a better judgment of the connectivity from the real data.
An example page for the summary heatmap looks like this:
Here, two heatmaps are shown, one for real (top) and one for background (bottom) network. Each of them shows for different combinations of TF-peak and peak-gene FDRs (0.01 to 0.2) the number of unique node types for the given FDR combination (here: TFs).
Second, a multi-page summary PDF for the connections in form of a boxplot, as exemplified with the following Figure:
eGRN graph
Object data
The actual graph structure is stored in GRN@graph
after successful execution of the function build_eGRN_graph()
.It contains two different kind of graphs, namely the TF-peak-gene and the TF-gene graph. Each of them is associated with a list element with the same name, which stores also the table that was used to create the graph along with the graph itself and the used parameters for creating it.
Plots
The function visualizeGRN()
can be used for visualization of the_eGRN_ graph.
Enrichment results and plots
Object data
All enrichment results are stored inside GRN@stats$Enrichment
. This contains a nested list with the named elements general
, byCommunity
, and byTF
for general, community-specific and TF-specific enrichments, respectively.
Inside each of these elements, there is further nesting by the chosen ontology (e.g, ` GO_BP
), and whether results or the parameters of the enrichment function should be retrieved.
Plots
General enrichment
The output is structured as follows: - one page per ontology for the whole eGRN with an enrichment summary that includes the ontology terms (y-axis), the gene ratio (the number of genes found divided by the total number of genes in the foreground) on the x-axis, the raw p-value (color) and the absolute number of genes found (dot size). The plot title gives some additional information such as the chosen ontology, the statistic used, the used background, the number of genes in foreground and background, respectively, and the chosen multiple testing adjustment (if applicable; in some cases, this is ignored due to the way the enrichment calculation works)
TF enrichment
The output is structured in analogy to the community enrichment, but instead of communities, the regulon that a particular TF spans is chosen as subnetwork (i.e., all genes the TF is connected to).
Feature variation quantification
Object data
The results are added to GRN@stats$variancePartition
as well as within the various feature-related elements of GRN@annotation
. The former slot and results can be used for the various diagnostic and plotting functions from the variancePartition
package.
The easiest way to retrieve and include the results into the final_eGRN_ is to rerun the function getGRNConnections()
and setinclude_variancePartitionResults = TRUE
, as the results are NOT added automatically to GRN@connections$all.filtered
either.
Plots
QC plots for the feature variation results are available via the function plotDiagnosticPlots_featureVariation()
.
We are currently working on updating details for this section. Stay tuned, coming soon!
SNP data
Object data
SNP data is added via the function addSNPData()
. Whenadd_SNPs_LD = TRUE
, additional information for SNPs in LD with any of the user-provided input SNPs are stored within GRN@annotation$SNPs
andGRN@annotation$SNPs_filtered
for the unfiltered and filtered SNP table, respectively, and the list of peak-associated SNPs is extended to SNPs in LD with overlapping SNPs from the user input.
Final peak-SNP overlaps are stored in GRN@data$peaks$counts_metadata
. For more details, see ?addSNPData
.
Plots
No plots are specifically available for the SNp data at the moment.
Guidelines, Recommendations, Limitations, Scope
In this section, we provide a few guidelines and recommendations that may be helpful for your analysis.
Package scope
In this section, we want explicitly mention the designated scope of theGRaNIE
package, its limitations and additional / companion packages that may be used subsequently or beforehand.
This section is still being completed, and we regularly extend it.
Number of samples
The number of samples is very important. Due to the nature of the method (correlating sample data), a small number of samples is likely to violate the assumptions of our framework, in particular the peak-gene links (see here for more details). We generally recommend at least 10, better 15 samples that are shared between the RNA and peak data - remember that in the GRaNIE
framework, only samples for which both modalities are available can be included in the analysis. Thus, the overlap of RNA and peak data should be at least 10-15 samples, the more the better in general.
Peaks
The number of peaks that is provided as input matters greatly for the resulting eGRN and its connectivity. From our experience, this number should be in a reasonable range so that there is enough data and TFBS overlap to build a eGRN, but also not too many so that the whole pipeline runs unnecessarily long. We have good experience with the number of peaks ranging between 50,000 and 200,000, although these are not hard thresholds but rather recommendations.
With respect to the recommended width of the peaks, we usually use peaks that have a width of a couple of hundred base pairs until a few kb, while the default is to filter peaks if they are wider than 10,000 bp (parameter maxSize_peaks
in the function filterData()
). Remember: peaks are used to overlap them with TFBS, so if a particular peak is too narrow, the likelihood of not overlapping with any (predicted) TFBS from any TF increases, and such a peak is subsequently essentially ignored.
RNA-Seq
The following list is subject to change and provides some rough guidelines for the RNA-Seq data:
- We recommend using raw counts if possible, and checking carefully in a PCA whether any batch effects are visible (provide and check with metadata such as age or gender - the more, the better).
- Genes with very small counts across samples are advised to be removed by running the function
filterData()
, see the argumentminNormalizedMeanRNA
for more information. You may want to check beforehand how many gens have a row mean of >1. This number is usually in the tens of thousands.
Transcription factor binding sites (TFBS)
TFBS are a crucial input for any GRaNIE
analysis. Our GRaNIE
approach is very agnostic as to how these files are generated - as long as one BED file per TF is provided with TFBS positions, the TF can be integrated. As explained above, we usually work with TFBS as predicted by PWMScan
based on HOCOMOCO
TF motifs, while this is by no means a requirement of the pipeline. Instead, JASPAR
TFBS or TFBS from any other database can also be used. The total number of TF and TFBS per TF seems more relevant here, due to the way we integrate TFBS: We create a binary 0/1 overlap matrix for each peak and TF, with 0 indicating that no TFBS for a particular TF overlaps with a particular peak, while 1 indicates that at least 1 TFBS from the TFBS input data does indeed overlap with the particular peak by at least 1 bp. Thus, having more TFBS in general also increases the number of 1s and therefore the_foreground_ of the TF (see the diagnostic plots) while it makes the foreground also more noisy if the TFBS list contains too many false positives. As always in biology, this is a trade-off.
Choice of correlation methods for TF-peak, peak-gene, TF-gene connections and outlier robustness
This section is partially based on and inspired by this nice summary about association measures in the context of gene expression networks.
We currently offer 3 different choices for how to measure the correlation (generally defined as an association measure that is used to estimate the relationships between two random variables) between different pairs of features such as TF-peaks, TF-genes, and peak-genes. All correlation coefficients take on values between −1 and 1 where negative values indicate an inverse relationship.
- Pearson correlation, which measures the extent of a linear relationship, is the most widely used correlation measure and well known so that no further explanations are necessary here. It is, however, very prone to outliers, which is why we offer two additional measures of correlation that are deemed to be more robust.
- Spearman correlation is based on ranks, and measures the extent of a monotonic relationship between x and y. it is also well known and less susceptible to individual outliers.
- Biweight midcorrelation or short bicor is a median based correlation measure, and has been found to be more robust than the Pearson correlation but often more powerful than the Spearman correlation. In particular, it has been shown to be more robust in evaluating similarity in gene expression networks and is often used for weighted correlation network analysis.
The choice depends on the user, and is dependent on whether or not the data contain outlier points that may heavily affect the correlation measures. While we generally advise to remove obvious outlier samples, as identified in a PCA, for example, from the analysis altogether in case they come from a single sample, in other cases this may not be the best approach and a more robust correlation measure such as Spearman or bicor seems more appropriate. We are currently also testing and evaluating which of them are best under different criteria and data typoes. For single-cell derived data, this choice seems to be particularly relevant.
Peak-gene p-values accuracy and violations
As outlined already above, the peak-gene diagnostic plots are very important to check and to verify that the assumptions of our framework and the underlying data are met. We highly recommend doing so, and we provide detailed recommendations and our experiences here.
eGRNs from single-cell data
We now also provide some preliminary guidelines on whether and how to use GRaNIE
for single-cell eGRN inference. For more details, see thedesignated vignette for single-cell data.
Recapitulating object history, function parameters etc
A nice little helper feature for the GRaNIE
package is that the object history, including all function calls that have been applied to the object, including the function parameters and their actual values (unless a variable has been provided, then only the variable name is stored and not the actual value), the date, and the actual call are stored in the actual GRN
object in a nested list insideGRN@GRN@config$functionParameters
. This looks like the following, for example, for the function add_TF_gene_correlation
that has been executed at 2023-01-23 21:50:52 CET:
> GRN@config$functionParameters$add_TF_gene_correlation$`2023-01-23_21:50:52`
$call
add_TF_gene_correlation(GRN = GRN, nCores = 5)
$parameters
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>a</mi><mi>r</mi><mi>a</mi><mi>m</mi><mi>e</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">parameters</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8095em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ers</span></span></span></span>GRN
GRN
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>a</mi><mi>r</mi><mi>a</mi><mi>m</mi><mi>e</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">parameters</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8095em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ers</span></span></span></span>nCores
[1] 5
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>a</mi><mi>r</mi><mi>a</mi><mi>m</mi><mi>e</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">parameters</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8095em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ers</span></span></span></span>corMethod
[1] "pearson"
<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mi>p</mi><mi>a</mi><mi>r</mi><mi>a</mi><mi>m</mi><mi>e</mi><mi>t</mi><mi>e</mi><mi>r</mi><mi>s</mi></mrow><annotation encoding="application/x-tex">parameters</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8095em;vertical-align:-0.1944em;"></span><span class="mord mathnormal">p</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">am</span><span class="mord mathnormal">e</span><span class="mord mathnormal">t</span><span class="mord mathnormal">ers</span></span></span></span>forceRerun
[1] FALSE
This can be very helpful for recapitulating at a later point which functions have been applied to a GRN
object, and to verify specific parameter assignments.
Memory footprint and execution time, feasibility with large datasets
In this section, we will give an overview over CPU and memory requirements when running or planning an analysis with GRaNIE
.
Both CPU time and memory footprint primarily depend on similar factors, namely the number of
- TFs
- peaks
- samples
While the number of TFs is typically similar across analyses when the default database is used (HOCOMOCO
+ PWMScan
), the number of peaks and samples may vary greatly across analyses.
For optimized CPU times, make sure to have the package BiocParallel
installed, which is not automatically installed with GRANIE
, even though it should be already installed for most Bioconductor installations as it is one of the core packages.
CPU time
A typical analysis runs within an hour or two on a standard machine with 2 to 4 cores or so. CPU time non-surprisingly depends primarily on the number of used cores for those functions that support multiple cores and that are time-consuming in nature.
Memory footprint
The maximum memory footprint even for a large dataset during a typicalGRaNIE
workflow usually does not exceed a few GB in R
. Instead, it is usually in the range of 1-2 GB only maximum. Thus, GRaNIE
can usually be run on a normal laptop without problems.
We recommend not using more than a few 100,000 or so peaks as the memory footprint as well as running time may otherwise increase notably. However, there is no package-defined upper limit, it all depends on the available system memory.
Given that our approach is correlation-based, it seems preferable to maximize the number of samples while retaining only peaks that carry a biological signal that is differing among samples.
The size of the GRN
object itself is typically in the range of a few hundred MB, but can go up to 1-2 GB for large datasets. If you have troubles with big datasets, let us know! We always look for ways to further optimize the memory footprint. However, with the recent optimizations to store the count matrices as sparse matrix if the fraction of 0s is too large, the overall memory footprint significantly reduced.
References
Appendix
1. [GRaNIE and GRaNPA: inference and evaluation of enhancer-mediated gene regulatory networks](https://www.embopress.org/doi/full/10.15252/msb.202311627)
2. [Comparison of co-expression measures: mutual information, correlation, and model based indices](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3586947/)