BayesSpace (original) (raw)

Preparing your experiment for BayesSpace

Loading data

BayesSpace supports three ways of loading a SingleCellExperiment for analysis.

Visium datasets processed with Space Ranger can be loaded directly via the readVisium() function. This function takes only the path to the Space Ranger output directory (containing the spatial/ and filtered_feature_bc_matrix/ subdirectories) and returns a SingleCellExperiment.

sce <- readVisium("path/to/spaceranger/outs/")

Second, all datasets analyzed for the BayesSpace manuscript are readily accessible via the getRDS() function. This function takes two arguments - the name of the dataset, and the name of the sample in the dataset.

melanoma <- getRDS(dataset="2018_thrane_melanoma", sample="ST_mel1_rep2")

Finally, SingleCellExperiment objects can be constructed manually from a counts matrix and tables of row and column data. BayesSpace only requires that spot array coordinates be provided as columns named array_row and array_col in colData. (Note that enhancement of Visium datasets additionally requires the pixel coordinates of each spot in the tissue image, but in this case the dataset should be loaded with readVisium(), which loads these data automatically.)

library(Matrix)

rowData <- read.csv("path/to/rowData.csv", stringsAsFactors=FALSE)
colData <- read.csv("path/to/colData.csv", stringsAsFactors=FALSE, row.names=1)
counts <- read.csv("path/to/counts.csv.gz",
                   row.names=1, check.names=F, stringsAsFactors=FALSE))

sce <- SingleCellExperiment(assays=list(counts=as(counts, "dgCMatrix")),
                            rowData=rowData,
                            colData=colData)

We’ll continue with the melanoma sample from the 2018 Spatial Transcriptomics paper for the remaining examples in this vignette.

Pre-processing data

BayesSpace requires minimal data pre-processing, but we provide a helper function to automate it.

spatialPreprocess() log-normalizes the count matrix and performs PCA on the top n.HVGs highly variable genes, keeping the top n.PCs principal components. Additionally, the spatial sequencing platform is added as metadata in the SingleCellExperiment for downstream analyses. If you do not wish to rerun PCA, running spatialPreprocess() with the flag skip.PCA=TRUE will only add the metadata BayesSpace requires.

set.seed(102)
melanoma <- spatialPreprocess(melanoma, platform="ST", 
                              n.PCs=7, n.HVGs=2000, log.normalize=TRUE)

Clustering

Selecting the number of clusters

We can use the qTune() and qPlot() functions to help choose q, the number of clusters to use in our analysis.

qTune() runs the BayesSpace clustering algorithm for multiple specified values of q (by default, 3 through 7) and computes their average pseudo-log-likelihood. It accepts any arguments to spatialCluster().
qPlot() plots the pseudo-log-likelihood as a function of q; we suggest choosing a q around the elbow of this plot.

melanoma <- qTune(melanoma, qs=seq(2, 10), platform="ST", d=7, cores=2)
qPlot(melanoma)

Clustering with BayesSpace

The spatialCluster() function clusters the spots, and adds the predicted cluster labels to the SingleCellExperiment. Typically, as we did for the analyses in the paper, we suggest running with at least 10,000 iterations (nrep=10000), but we use 2,000 iteration in this demonstration for the sake of runtime. (Note that a random seed must be set in order for the results to be reproducible.)

set.seed(149)
melanoma <- spatialCluster(melanoma, q=4, platform="ST", d=7,
                           init.method="mclust", model="t", gamma=2,
                           nrep=2000, burn.in=100,
                           save.chain=TRUE)

Both the mclust initialization (cluster.init) and the BayesSpace cluster assignments (spatial.cluster) are now available in the SingleCellExperiment’s colData.

head(colData(melanoma))
#> DataFrame with 6 rows and 9 columns
#>          barcode in_tissue array_row array_col sizeFactor  spot.idx
#>      <character> <logical> <numeric> <numeric>  <numeric> <integer>
#> 7x15        7x15      TRUE         7        15   0.795588         1
#> 7x16        7x16      TRUE         7        16   0.307304         2
#> 7x17        7x17      TRUE         7        17   0.331247         3
#> 7x18        7x18      TRUE         7        18   0.420747         4
#> 8x13        8x13      TRUE         8        13   0.255453         5
#> 8x14        8x14      TRUE         8        14   1.473439         6
#>      spot.neighbors cluster.init spatial.cluster
#>         <character>    <numeric>       <numeric>
#> 7x15            2,7            1               1
#> 7x16          1,3,8            1               1
#> 7x17          2,4,9            2               2
#> 7x18           3,10            3               2
#> 8x13           6,12            1               1
#> 8x14         5,7,13            1               1

Visualizing spatial clusters

We can plot the cluster assignments over the spatial locations of the spots with clusterPlot().

As clusterPlot() returns a ggplot object, it can be customized by composing with familiar ggplot2 functions. Additionally, the argument palette sets the colors used for each cluster, and clusterPlot() takes additional arguments to geom_polygon() such as size or color to control the aesthetics of the spot borders.

clusterPlot(melanoma, palette=c("purple", "red", "blue", "yellow"), color="black") +
  theme_bw() +
  xlab("Column") +
  ylab("Row") +
  labs(fill="BayesSpace\ncluster", title="Spatial clustering of ST_mel1_rep2")

Enhanced resolution

Clustering at enhanced resolution

The spatialEnhance() function will enhance the resolution of the principal components, and add these PCs as well as predicted cluster labels at subspot resolution to a new SingleCellExperiment. As with our demonstration of spatialCluster() above, we are using fewer iterations for the purpose of this example (nrep=2000) than we recommend in practice (nrep=100000 or greater).

Note that the jitter_scale parameter should be tuned so that proposals for updating subspot-level expression are accepted around 30% of the time. This can be evaluated using mcmcChain(melanoma.enhanced, "Ychange"), where the chain should stabilize to 0.25-0.40. Typically 1000-2500 iterations are sufficient to evaluate if jitter_scale should be increased if acceptance is too high or decreased if acceptance is too low.

We have implemented another way to avoid manually tuning jitter_scale, which uses adaptive MCMC. To use this feature, simply set jitter.scale=0. BayesSpace will automatically tune this parameter to achieve the best acceptance rate. Note that such tuning by default only happens in the burn-in phase.

After tuning, proceed to a full run of spatialEnhance with more iterations.

melanoma.enhanced <- spatialEnhance(melanoma, q=4, platform="ST", d=7,
                                    model="t", gamma=2,
                                    jitter.prior=0.3, jitter.scale=0,
                                    nrep=2000, burn.in=100,
                                    save.chain=TRUE, cores=2)

The enhanced SingleCellExperiment includes an index to the parent spot in the original sce (spot.idx), along with an index to the subspot. It adds the offsets to the original spot coordinates, and provides the enhanced cluster label (spatial.cluster).

head(colData(melanoma.enhanced))
#> DataFrame with 6 rows and 9 columns
#>              spot.idx spot.neighbors subspot.idx subspot.neighbors  spot.row
#>             <numeric>    <character>   <integer>       <character> <numeric>
#> subspot_1.1         1            2,7           1           294,880         7
#> subspot_2.1         2          1,3,8           1       295,587,881         7
#> subspot_3.1         3          2,4,9           1       296,588,882         7
#> subspot_4.1         4           3,10           1       297,589,883         7
#> subspot_5.1         5           6,12           1           298,884         8
#> subspot_6.1         6         5,7,13           1       299,591,885         8
#>              spot.col array_row array_col spatial.cluster
#>             <numeric> <numeric> <numeric>       <numeric>
#> subspot_1.1        15   6.66667   14.6667               1
#> subspot_2.1        16   6.66667   15.6667               1
#> subspot_3.1        17   6.66667   16.6667               2
#> subspot_4.1        18   6.66667   17.6667               2
#> subspot_5.1        13   7.66667   12.6667               1
#> subspot_6.1        14   7.66667   13.6667               1

We can plot the enhanced cluster assignments as above.

clusterPlot(melanoma.enhanced)

Enhancing the resolution of gene expression

BayesSpace operates on the principal components of the gene expression matrix, and spatialEnhance() therefore computes enhanced resolution PC vectors. Enhanced gene expression is not computed directly, and is instead imputed using a regression algorithm. For each gene, a model using the PC vectors of each spot is trained to predict the spot-level gene expression, and the fitted model is used to predict subspot expression from the subspot PCs.

Gene expression enhancement is implemented in the enhanceFeatures() function. BayesSpace predicts expression with xgboost by default, but linear and Dirichlet regression are also available via the model argument. When using xgboost, we suggest automatically tuning the nrounds parameter by setting it to 0, although this comes at the cost of increased runtime (~4x slower than a pre-specified nrounds in practice).

enhanceFeatures() can be used to impute subspot-level expression for all genes, or for a subset of genes of interest. Here, we’ll demonstrate by enhancing the expression of four marker genes: PMEL (ENSG00000185664, melanoma), CD2 (ENSG00000116824, T-cells), CD19 (ENSG00000177455, B-cells), and COL1A1 (ENSG00000108821, fibroblasts).

markers <- c("ENSG00000185664", "ENSG00000116824", "ENSG00000177455", "ENSG00000108821")
melanoma.enhanced <- enhanceFeatures(melanoma.enhanced, melanoma,
                                     feature_names=markers,
                                     nrounds=0)

By default, log-normalized expression (logcounts(sce)) is imputed, although other assays or arbitrary feature matrices can be specified.

logcounts(melanoma.enhanced)[markers, 1:5]
#>                 subspot_1.1 subspot_2.1 subspot_3.1 subspot_4.1 subspot_5.1
#> ENSG00000185664   2.6628363   1.9758757   2.5970280   2.5534587   2.6945219
#> ENSG00000116824   0.4270058   0.3931560   0.3871010   0.3545035   0.4673422
#> ENSG00000177455   0.5908179   0.3442686   0.4279284   0.4189567   0.4279284
#> ENSG00000108821   0.0000000   1.0713838   1.4132689   1.0773389   0.9019099

Diagnostic measures from each predictive model, such as rmse when using xgboost, are added to the rowData of the enhanced dataset.

rowData(melanoma.enhanced)[markers, ]
#> DataFrame with 4 rows and 4 columns
#>                         gene_id   gene_name    is.HVG enhanceFeatures.rmse
#>                     <character> <character> <logical>            <numeric>
#> ENSG00000185664 ENSG00000185664        PMEL      TRUE             0.752284
#> ENSG00000116824 ENSG00000116824         CD2      TRUE             0.655391
#> ENSG00000177455 ENSG00000177455        CD19      TRUE             0.711007
#> ENSG00000108821 ENSG00000108821      COL1A1      TRUE             0.548235

Visualizing enhanced gene expression

Spatial gene expression is visualized with featurePlot().

featurePlot(melanoma.enhanced, "ENSG00000185664")

Here, we compare the spatial expression of the imputed marker genes.

enhanced.plots <- purrr::map(markers, function(x) featurePlot(melanoma.enhanced, x))
patchwork::wrap_plots(enhanced.plots, ncol=2)

And we can compare to the spot-level expression.

spot.plots <- purrr::map(markers, function(x) featurePlot(melanoma, x))
patchwork::wrap_plots(c(enhanced.plots, spot.plots), ncol=4)

Accessing Markov chains

If save.chain is set to TRUE in either spatialCluster() or spatialEnhance(), the chain associated with the respective MCMC run is preserved to disk as an HDF5 file. The path to this file is stored in the SingleCellExperiment’s metadata at metadata(sce)$h5.chain, and can be read directly using mcmcChain().

The chain is provided as a coda::mcmc object, which can be analyzed with TidyBayes or as a matrix. The object has one row per iteration, with the values of the parameters concatenated across the row. Columns are named with the parameter name and index (if any).

chain <- mcmcChain(melanoma)
chain[1:5, 1:5]
#>      lambda[1,1] lambda[1,2] lambda[1,3]  lambda[1,4] lambda[1,5]
#> [1,]   0.0100000  0.00000000  0.00000000 0.000000e+00  0.00000000
#> [2,]   0.1836711  0.01019059  0.08664012 3.994578e-02 -0.02443957
#> [3,]   0.1790958 -0.04576098  0.07102969 8.286855e-05 -0.10919746
#> [4,]   0.1730928 -0.02089801  0.08429716 2.803978e-02 -0.10322589
#> [5,]   0.1923552 -0.02112783  0.07563260 2.088451e-02 -0.04688843

To remove the HDF5 file from disk and remove its path from the metadata, use removeChain().