scDesign3 marginal distribution for genes (original) (raw)

Introduction

In this tutorial, we explain different forms of the function that can be used when fitting the marginal distribution for each gene.

Notation

The following notations are used:

\({\mathbf{Y}} = [Y_{ij}] \in \mathbb{R}^{n \times m}\): the cell-by-feature matrix with \(n\) cells as rows, \(m\) features as columns, and \(Y_{ij}\) as the measurement of feature \(j\) in cell \(i\); for single-cell sequencing data, \({\mathbf{Y}}\) is often a count matrix.
\(\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times p}\): the cell-by-state-covariate matrix with \(n\) cells as rows and \(p\) cell-state covariates as columns; example covariates are cell type, cell pseudotime, and cell spatial locations.
\(\mathbf{Z} = [\mathbf{b}, \mathbf{c}]\): \(\mathbf{b} = (b_1, \ldots, b_n)^T\) has \(b_i \in \{1, \ldots, B \}\) representing cell \(i\)’s batch, and \(\mathbf{c} = (c_1, \ldots, c_n)^T\) has \(c_i \in \{1, \ldots, C \}\) representing cell \(i\)’s condition.

For each feature \(j=1,\ldots,m\) in every cell \(i=1,\ldots,n\), the measurement \(Y_{ij}\)—conditional on cell \(i\)’s state covariates \(\mathbf{x_i}\) and design covariates \(\mathbf{z}_i = (b_i, c_i)^T\)—is assumed to follow a distribution \(F_{j}( \cdot~|~\mathbf{x}_i, \mathbf{z}_i~;~\mu_{ij}, \sigma_{ij}, p_{ij})\), which is specified as the generalized additive model for location, scale and shape (GAMLSS). The various specifications of \(f_{jc_i}(\cdot)\), \(g_{jc_i}(\cdot)\), and \(h_{jc_i}(\cdot)\) are summarized in the next section. \[\begin{equation} \begin{cases} Y_{ij}~|~\mathbf{x}_i, \mathbf{z}_i &\overset{\mathrm{ind}}{\sim} F_{j}( \cdot~|~\mathbf{x}_i, \mathbf{z}_i~;~\mu_{ij}, \sigma_{ij}, p_{ij})\\ \theta_{j}(\mu_{ij}) &= \alpha_{j0} + \alpha_{jb_i} + \alpha_{jc_i} + f_{jc_i}(\mathbf{x}_i) \\ \log(\sigma_{ij}) &= \beta_{j0}+ \beta_{jb_i} + \beta_{jc_i} + g_{jc_i}(\mathbf{x}_i) \\ \operatorname{logit}(p_{ij}) &= \gamma_{j0} + \gamma_{jb_i}+ \gamma_{jc_i}+ h_{jc_i}(\mathbf{x}_i) \\ \end{cases} \, \end{equation}\]

Summary

Covariate type	Covariate form	Function form	Explaination	Geometric meaning	Code Example
Discrete cell type	\(x_i \in \left\{1, \ldots, K_C\right\}\)	\(f_{jc_i}(x_i) = \alpha_{jc_ix_i}\)	Cell type \(x_i\) has the effect \(\alpha_{jc_ix_i}\); for identifiability, \(\alpha_{jc_ix_i} = 0\) if \(x_i = 1\)	One intercept for each cell type	mu_formula = “cell_type”
Continuous pseudotime in one lineage	\(x_i \in [0,\infty)\)	\(f_{jc_i}({x}_i) = \sum_{k = 1}^Kb_{jc_ik}(x_{i})\beta_{jc_ik}\)	\(b_{jc_ik}(\cdot)\) is a basis function of cubic spline; \(K\) is the dimension of the basis	A curve along the pseudotime	mu_formula = “s(pseudotime)”
Continuous pseudotimes in \(p\) lineages	\(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T \in [0,\infty)^{p}\)	\(f_{jc_i}(\mathbf{x}_i) = \sum_{l = 1}^p \sum_{k = 1}^Kb_{jc_ilk}(x_{il})\beta_{jc_ilk}\)	\(b_{jc_ilk}(\cdot)\) is a basis function of cubic spline; \(K\) is the the dimension of the basis (default \(K=10\))	One curve along each lineage	mu_formula = “s(pseudotime1, k = 10, by = l1, bs = ‘cr’) + s(pseudotime2, k = 10, by = l2, bs = ‘cr’)”, \(p = 2\) in this case
Spatial location	\(\mathbf{x}_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^{2}\)	\(f_{jc_i}(\mathbf{x}_i) = f_{jc_i}^{\operatorname{GP}}(x_{i1}, x_{i2}, K)\)	\(f_{jc_i}^{\operatorname{GP}}(\cdot, \cdot, K)\) is a Gaussian process smoother; \(K\) is the dimension of the basis (default \(K=400\))	A smooth surface	mu_formula = “s(spatial1, spatial2, bs = ‘gp’, k = 400)”
Note: For simplicity, we only show the form of \(f_{jc_i}(\cdot)\) because \(g_{jc_i}(\cdot)\) and \(h_{jc_i}(\cdot)\) have the same form.