BioTIP: an R-package for Characterization of Biological Tipping-Points (original) (raw)
As scRNA-seq is becoming more popular, there is a vast amount of information and computational tools. We recommend Bioconductor Workshops which have a variety of topics for not only beginner users but also novice users and developers. Helpful information is available at https://bioconductor.github.io/BiocWorkshops/. For orchestrating scRNA-seq analysis with Bioconductor, we recommend the online tutorial https://osca.bioconductor.org/. The following figure shows BioTIP’s analytic pipeline of single-cell RNA-seq analysis, as an alternative to differential expression analysis - instead of discovering marker genes from the differential-expression analysis between stable states, BioTIP introduces the tipping-point analysis that focuses on unstable transition states. The orange arrows show that BioTIP is applied to cell state ensembles (clusters) without the pseudo-order information. The dashed orange arrow shows that BioTIP may use pseudo-order information to focus cell states on the trajectory of interests.
Fig 2. BioTIP Single-Cell Expression Data-Analytical Pipeline.
Advanced Estimation for Pearson Correlation Coefficient Matrix
Ic (Index of criticity) and DNB (dynamic network module which we renamed as MCI) are two existing tipping-point models applied to gene expression profiles (Mojtahedi M 2016, Chen 2012). These two modules have an intrinsic limit in the standard correlation calculation that is inappropriate for large-scale genomic data (Schafer 2005). This limit establishes a bias towards small-sized populations. Having differences in population sizes of 10-fold, ScRNAseq data is particularly vulnerable to this deficiency. We propose an advanced estimation of correlation matrices (Schafer 2005) to correct the bias of the existing mathematical models. The implication is the functions ave.cor.shrink
. Using this shrinkage estimation strategy, we refined the tipping-point identification with Ic score, by calling the function getIc
and setting the parameter fun=’BioTIP’. We also refined the CTS identification with MCI score, by calling the function getMCI
and setting the parameter fun=’BioTIP’. Remember that when running corresponding functions for random scores (e.g., simulation_Ic
, simulation_Ic_sample
, and simulationMCI
, also set the parameter fun=’BioTIP.’ Following is an example of applications to scRNA-seq data.
Data Preprocessing with Trajectory Building Tools
A published dataset, GSE81682, is utilized for BioTIP application to scRNA-sequence data for demonstration purposes.
To apply BioTIP, the input data is the output of a pseudotime (the measure of how far a cell traveled in its process) construction approach. Currently, there are 459 scRNA-seq analytics tools developed by August 2019 (Zappiaet al. 2019). Three candidate tools were widely used for trajectory construction including Seurat, Monocle 3, and STREAM (Chen et al 2019). STREAM has been stated as the only trajectory inference method explicitly implementing a mapping procedure, using the old cells as a reference for new cells (Chen et al 2019). STREAM is available as an interactive website, a bioconda package using Python, and as a command-line with Docker. Ultimately, it has been determined that STREAM has good recall and precision for data with multi-branching structures of high precision.
This study collected and performed scRNA-seq using hematopoietic stem and progenitor cells (HSPCs) from ten female mice (Nestorowa 2016). For more detailed information on hematopoietic cell gating, please refer to http://blood.stemcells.cam.ac.uk/data/. The rows in these data represent genes while columns represent cells.The raw data has been adjusted with normalization of the library size and log2 transformation. For data that has not yet been log2 transformed please refer to the example of log2 transformation below.
Here, we demonstrated the utilization of the STREAM Python package to obtain trajectory outputs (Chen et al. 2019). Using STREAM, the expression profiles of approximate 4000 (10% of the originally measured genes) informative genes in 1656 single cells were pre-selected. We downloaded the data matrix from STREAM publication. The rows in the matrix represent genes while columns represent cells. The raw data has been adjusted with normalization of the library size and log2 transformation.
STREAM needs to be installed via bioconda. A detailed instruction for installing STREAM could be found at the STREAM github repository: https://github.com/pinellolab/STREAM/blob/master/README.md. A set of parameters can be adjusted for STREAM analysis. For the dataset from Nestorowa et al., we will use the set of parameters summarized by the following command line:
stream -m data_Nestorowa.tsv -l cell_label.tsv -c cell_color_label.tsv --norm --loess_frac 1 --pca_n_PC 30 --umap
--norm
normalizes the data matrix; --loess_frac
indicates the fraction of the data used in LOESS regression with a default value of 0.1 which needs to be increased if the dataset is relatively small; --pca_n_PC
indicates the number of PC selected; --umap
uses UMAP for visualization.
STREAM analysis on Nestorowa dataset yields the following output:
We will be focusing on the path S1-S0-S3-S4 in our analysis.
We begin by installing the ‘BioTIP’ package and other dependent packages, using Bioconductor.To avoid errors, it is recommended to use the latest version of R.
library(BioTIP)
Once BioTIP is installed we must load dependent R packages:
#Load stringr library
library(stringr)
#BioTIP dependent libraries
library(cluster)
library(psych)
library(stringr)
library(GenomicRanges)
# library(Hmisc)
library(MASS)
library(igraph)
# library(RCurl)
Once the required packages are installed we want to become familiar with the data using read.delim
function. The dim
function will show you the dimensions of your data.
## familiarize yourself with data
cell_info = read.delim("cell_info.tsv", head = T, sep = '\t', row.names=1)
dim(cell_info)
#> [1] 1645 13
## in case R automatically reformated characters in cell labels, match cell labels in the data and in the infor table
rownames(cell_info) = sub('LT-HSC', 'LT.HSC', rownames(cell_info))
## focus on one branch, e.g., S4-S3-S0-S1
## no filter because it was already 4k genes, 10% of the originally measured genes in the single cell experiment
cell_info <- subset(cell_info, branch_id_alias %in% c("('S4', 'S3')", "('S3', 'S0')", "('S1', 'S0')" ))
## check cell sub-population sizes along the STREAM outputs
(t = table(cell_info$label, cell_info$branch_id_alias))
#>
#> ('S1', 'S0') ('S3', 'S0') ('S4', 'S3')
#> CMP 29 68 92
#> GMP 6 15 10
#> LMPP 162 44 0
#> LTHSC 219 14 28
#> MEP 7 17 326
#> MPP 99 0 2
#> MPP1 60 0 0
#> MPP2 12 5 5
#> MPP3 91 31 0
## focus on sub-populations of cells with more than 10 cells
( idx = apply(t, 2, function(x) names(which(x>10))) )
#> $`('S1', 'S0')`
#> [1] "CMP" "LMPP" "LTHSC" "MPP" "MPP1" "MPP2" "MPP3"
#>
#> $`('S3', 'S0')`
#> [1] "CMP" "GMP" "LMPP" "LTHSC" "MEP" "MPP3"
#>
#> $`('S4', 'S3')`
#> [1] "CMP" "LTHSC" "MEP"
## generate a list of samples (cells) of interest
samplesL = sapply(names(idx),
function(x) sapply(idx[[x]], function(y)
rownames(subset(cell_info, branch_id_alias == x & label == y))))
## check the number of sub-populations along the pseudotime trajectory (generated by STREAM for example)
## recommend to focus on each branch, respectively
names(samplesL[[1]])
#> [1] "CMP" "LMPP" "LTHSC" "MPP" "MPP1" "MPP2" "MPP3"
## merge sub-populations at each pseudo-time 'state'
samplesL = do.call(c, samplesL)
lengths(samplesL)
#> ('S1', 'S0').CMP ('S1', 'S0').LMPP ('S1', 'S0').LTHSC ('S1', 'S0').MPP
#> 29 162 219 99
#> ('S1', 'S0').MPP1 ('S1', 'S0').MPP2 ('S1', 'S0').MPP3 ('S3', 'S0').CMP
#> 60 12 91 68
#> ('S3', 'S0').GMP ('S3', 'S0').LMPP ('S3', 'S0').LTHSC ('S3', 'S0').MEP
#> 15 44 14 17
#> ('S3', 'S0').MPP3 ('S4', 'S3').CMP ('S4', 'S3').LTHSC ('S4', 'S3').MEP
#> 31 92 28 326
## load normalized gene expression matrix, refer to the example below if not yet normalized
## e.g, here is a STREAM-published data matrix of normalized gene expression matrix
## when analyzing SingleCellExperiment object SCE, translate to matrix using the following code
# counts <- logcounts(SCE)
## data matrix imported from our github repository due to large size
githuburl = "https://github.com/ysun98/BioTIPBigData/blob/master/data_Nestorowa.tsv?raw=true"
counts = read.table(url(githuburl), head=T, sep="\t", row.names=1)
dim(counts)
#> [1] 4768 1645
if (class(counts)=="data.frame") counts = as.matrix(counts)
all(colnames(counts) %in% as.character(rownames(cell_info)))
#> [1] FALSE
## log2 transformation example
if(max(counts)>20) counts = log2(counts)
## Check the overall distribution using histogram
# hist (counts, 100)
Predicting Tipping Point (Advanced Index of Critical Transition (IC))
Before the identification of CTS, we first ask whether the tipping point could be inferred from the global transcriptome of this dataset by estimating IC score from random genes. For the theoretical details, please see the original publication of the IC score.
## First, estimate the random Ic-scores by permuting the expression values of genes
RandomIc_g = list()
set.seed(2020)
C= 10 # for real data analysis C = at least 500
i = 1
n <- 200 # randomly pick up 200 genes
RandomIc_g[[i]] <- simulation_Ic(n, samplesL, counts,
B=C,
fun="BioTIP")
names(RandomIc_g)[i] = paste0("Simulation",i,"gene")
medianIc = apply(RandomIc_g[[i]],1,median)
par(mfrow = c(1,1))
plot_Ic_Simulation(medianIc, RandomIc_g[[i]], las = 2, ylab="BioTIP",
main= paste("Simulation using",n,"transcripts"),
fun="boxplot", ylim=c(0,0.3))
Gene Pre-selection
Log-transformed data are now ready for gene preselection. To this end, we have two functions: sd_selection
and optimize.sd_selection
. In both functions, the number of preselected genes are sensitive to two key parameters: ‘opt.cutoff’ and ‘opt.per’. For these two parameters, we recommend a range of 100-500 genes to be preselected per cell state. We use optimize.sd_selection
for scRNA-seq data in contrast to sd_selection
for limited sample sizes (demonstrated in ‘An Identification of Critical Tipping Point using Bulk RNA-seq’). This function optimize.sd_selection
is an optimization of the function sd_selection
, optimize.sd_selection
selects highly oscillating transcriptsusing sub-sampling strategy repeatedly.
The amount of selected transcripts is based on your input of cutoff value arbitrary between 0.01 and 0.2 with a default value of 0.01. If each state contains more than 10 cells the default cutoff value is suggested.
## setup parameters for gene preselection
B=10 ##for demonstration purposes use 10, when running dataset we recommend at least 100
##optimize and one nonoptimize for single cell use optimize with B 100 or higher for demo purpose only we run
opt.cutoff = 0.2
opt.per = 0.8
## Commented out because of calculation expense, recommend to save the calculated data file after one calculation for future use
## set.seed(100)
## subcounts = optimize.sd_selection(counts, samplesL, cutoff = opt.cutoff, percent=opt.per, B)
## save( subcounts, file='BioTIP_Output_xy/subcounts.optimize_80per.rData')
data("subcounts.optimize_80per")
Network Partition
The getNetwork
function links any two preselected genesthat are co-expressed based onPearson correlation coefficient. An fdr smaller than the cutoff value indicates co-expression. This function’s output is an R ‘igraph’ object. The representation of genes of interest can be visualized using ‘igraphL’.
The getCluster_methods
function clusters transcript nodes from the correlation network which was generated by getNetwork
. We suggest each module to contain over 200 nodesfor downstream analysis.
## use 0.01-0.2 to ensure gene-gene co-expression is significant
## increase fdr cutoff to obtain larger gene sets
network.fdr = 0.1
min.size = 50
igraphL = getNetwork(subcounts, fdr=network.fdr)
#> ('S1', 'S0').CMP:400 nodes
#> ('S1', 'S0').LMPP:401 nodes
#> ('S1', 'S0').LTHSC:474 nodes
#> ('S1', 'S0').MPP:363 nodes
#> ('S1', 'S0').MPP1:398 nodes
#> ('S1', 'S0').MPP2:248 nodes
#> ('S1', 'S0').MPP3:366 nodes
#> ('S2', 'S0').LMPP:336 nodes
#> ('S2', 'S0').MPP3:498 nodes
#> ('S3', 'S0').CMP:390 nodes
#> ('S3', 'S0').GMP:444 nodes
#> ('S3', 'S0').LMPP:433 nodes
#> ('S3', 'S0').LTHSC:380 nodes
#> ('S3', 'S0').MEP:302 nodes
#> ('S3', 'S0').MPP3:331 nodes
#> ('S4', 'S3').CMP:399 nodes
#> ('S4', 'S3').LTHSC:463 nodes
#> ('S4', 'S3').MEP:671 nodes
#> ('S5', 'S3').CMP:376 nodes
#> ('S5', 'S3').GMP:407 nodes
#> ('S5', 'S3').MPP3:340 nodes
names(igraphL)
#> [1] "('S1', 'S0').CMP" "('S1', 'S0').LMPP" "('S1', 'S0').LTHSC"
#> [4] "('S1', 'S0').MPP" "('S1', 'S0').MPP1" "('S1', 'S0').MPP2"
#> [7] "('S1', 'S0').MPP3" "('S2', 'S0').LMPP" "('S2', 'S0').MPP3"
#> [10] "('S3', 'S0').CMP" "('S3', 'S0').GMP" "('S3', 'S0').LMPP"
#> [13] "('S3', 'S0').LTHSC" "('S3', 'S0').MEP" "('S3', 'S0').MPP3"
#> [16] "('S4', 'S3').CMP" "('S4', 'S3').LTHSC" "('S4', 'S3').MEP"
#> [19] "('S5', 'S3').CMP" "('S5', 'S3').GMP" "('S5', 'S3').MPP3"
## Network partition using random walk
cluster = getCluster_methods(igraphL)
tmp = igraphL[["('S3', 'S0').CMP"]]
E(tmp)$width <- E(tmp)$weight*3
V(tmp)$community= cluster[["('S3', 'S0').CMP"]]$membership
mark.groups = table(cluster[["('S3', 'S0').CMP"]]$membership)
colrs = rainbow(length(mark.groups), alpha = 0.3)
V(tmp)$label <- NA
plot(tmp, vertex.color=colrs[V(tmp)$community],vertex.size = 5,
mark.groups=cluster[["('S3', 'S0').CMP"]])
Generated above is a graph view of network modulation determined by the random-walk approach for the preselected transcripts at the LIR state. Background colors represent different clusters.
Identifying putative Critical Transition Signals (CTS) using the DNB Module
Each module is a cluster of network nodes (transcripts) that are linked by correlation in the previous step. Each cell state may have multiple modules identified. Biomodules are the modules with relatively higher MCI scores in the system that has discrete states, while empirically significant.
We have four steps to find the states of interest (i.e. tipping point candidates) and their CTS candidates. We first use the getMCI
function to obtain MCI scores of the states. Then the user determines how many states may be of interest using function getTopMCI
and parameter ‘n’. Another parameter ‘min’ is important for the output. It determines the minimum module size for calculating the top MCI scores. Third, we get state IDs and their MCI scores for the states of interest. Here, there are two steps to follow: Step a. for each state of interest we get maximum scores using getMaxMCImember
and getMaxStats
. Step b. we use parameter n to obtain modules with relatively high scores besides the top scores. The getNextMaxStats
function will be applied, we demonstrate the application here (n=2). Next we obtain the CTS of each state of interest using getCTS
.
The following code documents the four steps:
fun = 'BioTIP'
## Commented out because of calculation expense, recommend to save the calculated data file after one calculation for future use
## membersL = getMCI(cluster, subcounts, adjust.size = F, fun)
## save(membersL, file="membersL.RData", compress=TRUE)
data("membersL")
par(mar=c(1,1,2,1))
plotBar_MCI(membersL, ylim=c(0,30), min=50)
## Decide how many states of interest, here is 4
n.state.candidate <- 4
topMCI = getTopMCI(membersL[["members"]], membersL[["MCI"]], membersL[["MCI"]],
min=min.size,
n=n.state.candidate)
names(topMCI)
#> [1] "('S4', 'S3').CMP" "('S1', 'S0').LTHSC" "('S4', 'S3').MEP"
#> [4] "('S2', 'S0').MPP3"
## Obtain state ID and MCI statistics for the n=3 leading MCI scores per state
maxMCIms = getMaxMCImember(membersL[["members"]], membersL[["MCI"]],
min =min.size,
n=3)
names(maxMCIms)
#> [1] "idx" "members" "2topest.members" "3topest.members"
## list the maximum MCI score per state, for all states
maxMCI = getMaxStats(membersL[['MCI']], maxMCIms[['idx']])
unlist(maxMCI)
#> ('S1', 'S0').LMPP ('S1', 'S0').LTHSC ('S1', 'S0').MPP ('S1', 'S0').MPP1
#> 1.734805 22.807675 2.859558 6.068915
#> ('S2', 'S0').MPP3 ('S3', 'S0').LMPP ('S4', 'S3').CMP ('S4', 'S3').MEP
#> 9.734967 7.637239 26.468265 12.167936
#> ('S5', 'S3').CMP
#> 1.394558
#### extract biomodule candidates in the following steps: ####
## record the gene members per toppest module for each of these states of interest
CTS = getCTS(maxMCI[names(topMCI)], maxMCIms[["members"]][names(topMCI)])
#> Length: 58
#> Length: 63
#> Length: 474
#> Length: 98
## tmp calculates the number of bars within each named state
tmp = unlist(lapply(maxMCIms[['idx']][names(topMCI)], length))
## here returns all the groups with exactly 2 bars
(whoistop2nd = names(tmp[tmp==2]))
#> [1] "('S4', 'S3').CMP" "('S1', 'S0').LTHSC" "('S4', 'S3').MEP"
## here returns all the groups with exactly 3 bars
(whoistop3rd = names(tmp[tmp==3]))
#> character(0)
## add the gene members of the 2nd toppest biomodue in the states with exactly 2 bars
if(length(whoistop2nd)>0) CTS = append(CTS, maxMCIms[["2topest.members"]][whoistop2nd])
names(CTS)
#> [1] "('S4', 'S3').CMP" "('S1', 'S0').LTHSC" "('S4', 'S3').MEP"
#> [4] "('S2', 'S0').MPP3" "('S4', 'S3').CMP" "('S1', 'S0').LTHSC"
#> [7] "('S4', 'S3').MEP"
## add the gene members of the 2nd toppest biomodue in the states with exactly 3 bars
if(length(whoistop3rd)>0) CTS = append(CTS, maxMCIms[["2topest.members"]][whoistop3rd])
names(CTS)
#> [1] "('S4', 'S3').CMP" "('S1', 'S0').LTHSC" "('S4', 'S3').MEP"
#> [4] "('S2', 'S0').MPP3" "('S4', 'S3').CMP" "('S1', 'S0').LTHSC"
#> [7] "('S4', 'S3').MEP"
## add the gene members of the 3rd toppest biomodue in the states with exactly 3 bars
if(length(whoistop3rd)>0) CTS = append(CTS, maxMCIms[["3topest.members"]][whoistop3rd])
names(CTS)
#> [1] "('S4', 'S3').CMP" "('S1', 'S0').LTHSC" "('S4', 'S3').MEP"
#> [4] "('S2', 'S0').MPP3" "('S4', 'S3').CMP" "('S1', 'S0').LTHSC"
#> [7] "('S4', 'S3').MEP"
Finding Tipping Point and Evaluating CTS
To find the tipping point of the above-detected tipping point candidates we calculate the Index of Critical Transition or the Ic score using the getIc
function. The plotIc
function is used to obtain a line plot of the Ic score. The function simulation_Ic
calculates random Ic scores by permutating transcript expression values. The function simulation_Ic_sample
calculates random Ic scores by shuffling cell labels. Only the states with significant Ic scores are where the tipping points will be identified.
plot_Ic_Simulation
function compares observed Ic scores with simulated Ic scores across states. The parameter ‘fun’ can be adjusted to produce either line plot or box plot.
plot_SS_Simulation
function compares observed with simulated Ic scores across states, given p-value based on the delta scores. Given one biomodule, we calculate the Ic scores across states. Then the delta score calculates the distance between the first largest and second largest Ic scores among states. Similarly, in the simulation cases we have simulated delta scores. When the module presents a significant delta score it is the identified CTS.
#### extract CTS scores for each biomodule candidate in the following steps: ####
## first to record the max MCI for the n.state.candidate
maxMCI <- maxMCI[names(CTS)[1:n.state.candidate]]
maxMCI
## then applendix the 2nd highest MCI score (if existing) for the states with exactly 2 bars
if(length(whoistop2nd)>0) maxMCI <- c(maxMCI, getNextMaxStats(membersL[['MCI']], idL=maxMCIms[['idx']], whoistop2nd))
names(maxMCI)
## applendix the 2nd highest MCI score (if existing) for the states with exactly 3 bars
if(length(whoistop3rd)>0) maxMCI <- c(maxMCI, getNextMaxStats(membersL[['MCI']], idL=maxMCIms[['idx']], whoistop3rd))
names(maxMCI)
## then applendix the 3rd highest MCI score (if existing) for the states with exactly 3 bars
if(length(whoistop3rd)>0) maxMCI <- c(maxMCI, getNextMaxStats(membersL[['MCI']], idL=maxMCIms[['idx']], whoistop3rd, which.next=3))
maxMCI
## to ensure the same order between maxMCI and CTS
all(names(CTS) == names(maxMCI))
## estimate empritical significance from the MCI scores
## M is precalculated correlation matrix for large dataset (>2k genes), will be reused in the downstream simulation analysis
#counts = read.table("data_Nestorowa.tsv")
#if (class(counts)=="data.frame") counts = as.matrix(counts)
#M <- cor.shrink(counts, Y = NULL, MARGIN = 1, shrink = TRUE)
#save(M, file="cor.shrink_M.RData", compress=TRUE)
#dim(M)
## C is the runs of permutations to estimate random scores
#C = 10 # for real data analysis C = at least 200
#RandomMCI = list()
#n <- length(CTS) # number of CTS candidates
#set.seed(2020)
#for (i in 1:n) #
# i=1; par(mfrow=c(1,1))
# x <- length(CTS[[i]])
# RandomMCI[[i]] <- simulationMCI(x, samplesL, counts, B=C, fun="BioTIP", M=M)
# dim(RandomMCI)
# plot_MCI_Simulation(maxMCI[i], RandomMCI[[i]], las=2,
# ylim=c(0, max(maxMCI[i], 2*RandomMCI[[i]])),
# main=paste(names(maxMCI)[i], length(CTS[[i]]), "genes",
# "\n","vs. ",C, "times of gene-permutation"),
# which2point=names(maxMCI)[i])
######## Finding Tipping Point #################
newIc_score = lapply(CTS, function(x) getIc(counts, samplesL, x, fun="BioTIP", PCC_sample.target = 'average'))
names(newIc_score) <- names(CTS)
We provided the detail step for calculating the M matrix and plotting MCI simulation but skipped the calculation of M matrix for size and run time concerns. The result of the simulation is described in the plot below:
######## verify using IC score #################
## First, estimate the random Ic-scores by permuating the expresion values of genes
RandomIc_g = list()
set.seed(2020)
C= 10 # for real data analysis C = at least 500
# for(i in 1:length(CTS)){ Not to run the full loop B# }
i = 1
CTS <- CTS[[i]]
n <- length(CTS)
RandomIc_g[[i]] <- simulation_Ic(n, samplesL, counts,
B=C,
fun="BioTIP")
names(RandomIc_g)[i] = names(CTS)[i]
# par(mfrow=c(1,length(int)))
# for(i in 1:length(newIc_score)){ Not to run the full loop B plotting
par(mfrow = c(1,1))
n = length(CTS[[i]])
plot_Ic_Simulation(newIc_score[[i]], RandomIc_g[[i]], las = 2, ylab="BioTIP",
main= paste(names(newIc_score)[i],"(",n," transcripts)"),
#fun="matplot", which2point=names(newIc_score)[i])
fun="boxplot", which2point=names(newIc_score)[i], ylim=c(0,0.5))
interesting = which(names(samplesL) == names(newIc_score[i]))
p = length(which(RandomIc_g[[i]][interesting,] >= newIc_score[[i]][names(newIc_score)[i]]))
p = p/ncol(RandomIc_g[[i]])
# first p value (p1) calculated for exactly at tipping point
p2 = length(which(RandomIc_g[[i]] >= newIc_score[[i]][names(newIc_score)[i]]))
p2 = p2/ncol(RandomIc_g[[i]])
p2 = p2/nrow(RandomIc_g[[i]])
# second p value (p2) calculated across all statuses
## local Kernel Density Plot
d <- density(RandomIc_g[[i]]) # returns the density data
plot(d, xlim=range(c(newIc_score[[i]],RandomIc_g[[i]])),
main=paste("Random genes: p.Local=",p)) # plots the results
abline(v=newIc_score[[i]][names(newIc_score)[i]], col="green")
## global Kernel Density Plot
d <- density(unlist(RandomIc_g)) # returns the density data
plot(d, xlim=range(c(newIc_score[[i]],unlist(RandomIc_g))),
main=paste("Random genes: p.Global=",p2)) # plots the results
abline(v=newIc_score[[i]][names(newIc_score)[i]], col="green")
# } Not to run the full loop B plotting
## Second, estimate the random Ic-scores by randomly shulffing cell labels
RandomIc_s = list()
set.seed(2020)
# for(i in 1:length(CTS)){ Not to run the full loop C
i = 1
RandomIc_s[[i]] <- matrix(nrow=length(samplesL), ncol=C)
rownames(RandomIc_s[[i]]) = names(samplesL)
for(j in 1:length(samplesL)) {
ns <- length(samplesL[[j]]) # of cells at the state of interest
RandomIc_s[[i]][j,] <- simulation_Ic_sample(counts, ns,
Ic=BioTIP_score[x],
genes=CTS,
B=C,
fun="BioTIP")
}
names(RandomIc_s)[i] = names(CTS)[i]
# } Not to run the full loop C
# par(mfrow=c(1,length(int)))
# for(i in 1:length(newIc_score)){ Not to run the full loop C plotting
n = length(CTS[[i]])
plot_Ic_Simulation(newIc_score[[i]], RandomIc_s[[i]], las = 2, ylab="BioTIP",
main= paste(names(newIc_score)[i],"(",n," transcripts)"),
fun="boxplot", which2point=names(newIc_score)[i])
interesting = which(names(samplesL) == names(newIc_score[i]))
p = length(which(RandomIc_s[[i]][interesting,] >= newIc_score[[i]][names(newIc_score)[i]]))
p = p/ncol(RandomIc_s[[i]])
# first p value (p1) calculated for exactly at tipping point
p2 = length(which(RandomIc_s[[i]] >= newIc_score[[i]][names(newIc_score)[i]]))
p2 = p2/ncol(RandomIc_s[[i]])
p2 = p2/nrow(RandomIc_s[[i]])
p
p2
# } Not to run the full loop C plotting
## Alternatively, p values is estimated from delta scores
P3 = plot_SS_Simulation(newIc_score[[i]], RandomIc_s[[i]],
xlim=c(0, max(newIc_score[[i]], RandomIc_s[[i]])),
las=2,
main=paste(names(CTS)[i], length(CTS[[i]]), "genes, ", "\n", C, "Shuffling labels"))
P3
#> [1] 0.5
Note that in order to save running time in this vignette, we omitted the full loop and scaled down the runs of permutations when estimating random scores because the main goal of this document is to guide the users through the analysis pipeline and the related functions. In our full experiment, we estimated radom scores for all four of ('S4', 'S3').CMP
, ('S1', 'S0').LTHSC
, ('S4', 'S3').MEP
, and ('S2', 'S0').MPP3
with 200 runs of permutations. Our results can be summarized in the following charts:
Inferring Tipping Point-Driven Transcription Factors
Characterizing the tipping point by CTS is a key to understanding biological systems that are non-stationary, high-dimensional, and noisy, being fascinating to study but hard to understand. Using BioTIP, one can identify non-random critical-transition-characteristic CTSs. To evaluate whether these gene features inform key factors of biological progression that were previously unexplored, motif enrichment analysis could be applied. We recommend the tools HOMER (http://homer.ucsd.edu/homer/motif/), MEME Suite (http://meme-suite.org/), and the Bioconductor package PWMEnrich (https://www.bioconductor.org/packages/release/bioc/html/PWMEnrich.html).