Applying MIRA to a Biological Question (original) (raw)
Preparation: Start with appropriately formatted single-nucleotide resolution methylation data, region sets – each with genomic regions corresponding to some biological annotation, and a sample annotation file that has sample name (sampleName column) and sample type (sampleType column). See ?SummarizedExperimentToDataTable
for options for converting formats from some other DNA methylation packages to our format.
1. Do a quick run-through of the MIRA workflow with each region set with a few samples from each condition/sample type to make sure the region sizes are reasonable (discussed later).
1.1. Aggregate methylation data across regions to get a MIRA profile for each region set, sample combination.
1.2. Calculate MIRA score for each region set, sample combination based on the shape of the MIRA profile.
2. Based on preliminary results, pick a region size for each region set depending on what the MIRA profiles look like.
3. Run the full MIRA analysis with chosen region sizes.
3.1. Aggregate methylation data across regions to get a MIRA profile for each region set, sample combination.
3.2. Calculate MIRA score for each region set, sample combination based on the shape of the MIRA profile.
Initial Run-through
First, let’s load the libraries and read in the data files. A sample annotation file was included with the package:
library(MIRA)
library(data.table) # for the functions: fread, setkey, merge
library(GenomicRanges) # for the functions: GRanges, resize
library(ggplot2) # for the function: ylab
exampleAnnoDT2 <- fread(system.file("extdata", "exampleAnnoDT2.txt",
package="MIRA"))
Next, construct file names for the DNA methylation samples and the region sets (downloaded through the link in the Input
section).
# 12 Ewing samples: T1-T12
pathToData <- "/path/to/MIRA_Ewing_Vignette_Files/"
ewingFiles <- paste0(pathToData, "EWS_T", 1:12, ".bed")
# 12 muscle related samples, 3 of each type
muscleFiles <- c("Hsmm_", "Hsmmt_", "Hsmmfshd_","Hsmmtubefshd_")
muscleFiles <- paste0(pathToData, muscleFiles, rep(1:3, each=4), ".bed")
RRBSFiles <- c(ewingFiles, muscleFiles)
regionSetFiles <- paste0(pathToData, c("sknmc_specific.bed", "muscle_specific.bed"))
Let’s read the files into R:
BSDTList <- lapply(X=RRBSFiles, FUN=BSreadBiSeq)
regionSets <- lapply(X=regionSetFiles, FUN=fread)
We need to do a bit of preprocessing before we plug the data into MIRA. The BSreadBiSeq
function reads DNA methylation calls from the software Biseqmethcalling, reading the methylation calls into data.table
objects. We read in the region sets with fread
so they will also be data.table
objects. First, let’s add sample names to the list of methylation data.tables
. Then let’s convert the region sets from data.table
objects to GRanges
objects, including chromosome, start, and end info. In most cases, strand information is not needed by MIRA and should be omitted.
names(BSDTList) <- tools::file_path_sans_ext(basename(RRBSFiles))
regionSets <- lapply(X=regionSets,
FUN=function(x) setNames(x, c("chr", "start", "end")))
regionSets <- lapply(X=regionSets, FUN=
function(x) GRanges(seqnames=x$chr,
ranges=IRanges(x$start, x$end)))
We also need to pick initial region sizes. To get the methylation profile around our sites of interest, we are resizing the regions to 5000 bp. Also, using the same region size for all regions in a set will make the final methylation profile easier to interpret. This is a wide initial guess compared to the average region sizes for the region sets (around 151 bp for both region sets), but we will adjust this later on.
regionSets[[1]] <- resize(regionSets[[1]], 5000, fix="center")
regionSets[[2]] <- resize(regionSets[[2]], 5000, fix="center")
names(regionSets) <- c("Ewing_Specific", "Muscle_Specific")
Now we are ready to run MIRA on a subset of the samples – three Ewing and three muscle samples. First, we aggregate the methylation data into bins with aggregateMethyl
. MIRA divides each region into bins and finds methylation that is contained in each bin in each region. Then matching bins are aggregated over all regions (all bin1’s together, all bin2’s together, etc.). The bin number (binNum
) will affect the resolution or noisiness of the data – a higher bin number will allow more resolution since each bin will correspond to a smaller genomic region but with the trade-off of possibly increasing noise due to less methylation data in each bin. An odd bin number is recommended for the sake of symmetry and MIRA scoring.
subBSDTList <- BSDTList[c(1, 5, 9, 13, 17, 21)]
bigBin <- lapply(X=subBSDTList, FUN=aggregateMethyl, GRList=regionSets,
binNum=21, minBaseCovPerBin=0)
We then combine all samples into one data.table
and add a column for sample name based on the names of the list items:
bigBinDT1 <- rbindNamedList(bigBin, newColName = "sampleName")
If you are running the actual samples on your own device, you can skip the following step. Otherwise, let’s load the binned methylation data (bigBinDT1
) that would have been obtained by the previous step.
# from data included with MIRA package
data(bigBinDT1)
Next we add on annotation data: the experimental condition or sample type for each sample. This is needed in order to group samples by sample type, which we will do when plotting the methylation profiles and MIRA scores. We will use functions from the data.table
package to do this, merging based on the sampleName
column.
setkey(exampleAnnoDT2, sampleName)
setkey(bigBinDT1, sampleName)
bigBinDT1 <- merge(bigBinDT1, exampleAnnoDT2, all.x=TRUE)
Let’s look at the MIRA profiles to see if we should change our region sizes.
plotMIRAProfiles(binnedRegDT=bigBinDT1)
Choosing Appropriate Region Sizes
While the methylation profiles look pretty good, the dips are a bit too narrow for the muscle-related samples when looking at the muscle-specific region set. The scoring function in MIRA could run into problems if the dips are too narrow since it expects the dips to be at least 5 bins wide from outside edge to outside edge (that number includes the outside edges). Let’s use smaller region sizes for both region sets: 4000 bp for the Ewing DHS regions and 1750 bp for the muscle DHS regions (the choice is somewhat arbitrary). The size of the regions should not be decreased too much; if the regions are too small, there might be more noise since there would be less methylation data for each bin. Also, the regions must be wide enough to capture the outer edges of the dip since the outer edges are used for the MIRA score.
To implement our new region sizes:
regionSets[[1]] <- resize(regionSets[[1]], 4000, fix="center") # Ewing
regionSets[[2]] <- resize(regionSets[[2]], 1750, fix="center") # muscle
Full MIRA Analysis
Now we can run MIRA with all our samples. Let’s aggregate methylation data again with aggregateMethyl
.
bigBin <- lapply(X=BSDTList, FUN=aggregateMethyl, GRList=regionSets,
binNum=21, minBaseCovPerBin=0)
bigBinDT2 <- rbindNamedList(bigBin)
If you are running the actual samples on your own device, you can skip the following step. Otherwise, let’s load the binned methylation data (bigBinDT2
) that would have been obtained by the previous step.
# from data included with MIRA package
data(bigBinDT2)
We add annotation data using the same process as before:
# data.table functions are used to add info from annotation object to bigBinDT2
setkey(exampleAnnoDT2, sampleName)
setkey(bigBinDT2, sampleName)
bigBinDT2 <- merge(bigBinDT2, exampleAnnoDT2, all.x=TRUE)
Let’s look at the MIRA profiles:
plotMIRAProfiles(binnedRegDT=bigBinDT2)
Next we normalize the binned methylation data (the MIRA profile). Normalizing by subtracting the minimum methylation bin value from each sample/region set combination generally brings the center MIRA profiles to a similar methylation value, making the MIRA score more about the shape of the profile and less about the average methylation. We do this with data.table
syntax.
bigBinDT2[, methylProp := methylProp - min(methylProp) + .05, by=.(featureID, sampleName)]
normPlot <- plotMIRAProfiles(binnedRegDT=bigBinDT2)
normPlot + ylab("Normalized DNA Methylation (%)")
Now we can calculate MIRA scores for all samples with the calcMIRAScore
function.
sampleScores <- calcMIRAScore(bigBinDT2,
shoulderShift="auto",
regionSetIDColName="featureID",
sampleIDColName="sampleName")
head(sampleScores)
## featureID sampleName score
## <char> <char> <num>
## 1: Ewing_Specific EWS_T1 2.180698
## 2: Ewing_Specific EWS_T10 2.114649
## 3: Ewing_Specific EWS_T11 2.064900
## 4: Ewing_Specific EWS_T12 2.290313
## 5: Ewing_Specific EWS_T2 1.987881
## 6: Ewing_Specific EWS_T3 2.230074
Finally, let’s add the annotation data back on and take a look at the scores!
setkey(exampleAnnoDT2, sampleName)
setkey(sampleScores, sampleName)
sampleScores <- merge(sampleScores, exampleAnnoDT2, all.x=TRUE)
plotMIRAScores(sampleScores)
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## ℹ The deprecated feature was likely used in the MIRA package.
## Please report the issue at <https://github.com/databio/MIRA>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.