Bayesian approach to single-cell differential expression analysis

Peter V Kharchenko et al. Nat Methods. 2014 Jul.

Abstract

Single-cell data provide a means to dissect the composition of complex tissues and specialized cellular environments. However, the analysis of such measurements is complicated by high levels of technical noise and intrinsic biological variability. We describe a probabilistic model of expression-magnitude distortions typical of single-cell RNA-sequencing measurements, which enables detection of differential expression signatures and identification of subpopulations of cells in a way that is more tolerant of noise.

Figures

Figure 1. Modeling single-cell RNA-seq measurement as a mixture of two processes

a. Types of cell-to-cell variability observed in single-cell RNA-seq measurements. A smoothed scatter plot compares gene expression estimates from two cells of the same type (MEF cells), illustrating the prevalence of drop-out events, over-dispersion, and high-magnitude outliers.

b. Single-cell variability throws off standard RNA-seq analysis methods, with top differentially expressed genes influenced by differences in drop-out (Rnaseh2a) or outlier (Bmp4) events. The examples are taken from a CuffDiff2 comparison of 10 ESC and 10 MEF cells, with triangles showing expression magnitudes observed in different cells, and whiskers spanning the range of observed expression magnitudes.

c. To identify a reliable set of genes for fitting model parameters, our approach initially uses cross-comparison of single-cell measurements (using cells of the same type, e.g. MEF), determining whether the transcript is likely to have been successfully amplified in both experiments (correlated component). The true expression magnitude of such genes is estimated as the median expression level across cells in which the gene appears in a correlated component.

d. Each single-cell measurement is modeled as a mixture of drop-out and successful amplification processes. The parameters of the distributions and the magnitude-dependent mixing of the two processes are determined based on the expected population expression averages of genes appearing in many correlated components (panel c).

e. Drop-out rates vary between different cell types. The rate of transcript detection failures (drop-out events) depends on the average expression magnitude of a gene in the cell population, and varies among the cells. In the Islam et al. dataset, higher drop-out frequencies are observed for mouse ES cells compared to MEF cells.

f. Drop-out rates for 4-, 8- and 16-cell embryo samples examined by Deng et al. using a recently developed protocol also show systematic differences.
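
Panel d describes each measurement as a mixture of a drop-out process and a successful-amplification process, with mixing that depends on expression magnitude. A minimal Python sketch of such a mixture is given here. The caption does not name the component distributions, so the low-rate Poisson drop-out component, the negative binomial amplification component, the logistic mixing curve, and every function name and parameter value below are illustrative assumptions, not the fitted model.

```python
import math


def poisson_logpmf(k, lam):
    """Log-probability of count k under a Poisson with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def negbin_logpmf(k, mean, size):
    """Log-probability of count k under a negative binomial
    with the given mean and variance = mean + mean**2 / size."""
    p = size / (size + mean)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log1p(-p))


def dropout_prob(log_mean, intercept, slope):
    """Assumed logistic mixing curve: with slope < 0 the drop-out
    probability falls as expression magnitude rises."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * log_mean)))


def mixture_likelihood(count, mean, *, intercept=2.0, slope=-1.0,
                       size=5.0, background=0.1):
    """P(count | expected magnitude `mean`) under the two-process mixture.
    All default parameter values are made up for illustration."""
    pi = dropout_prob(math.log(mean), intercept, slope)
    drop = math.exp(poisson_logpmf(count, background))  # drop-out component
    amp = math.exp(negbin_logpmf(count, mean, size))    # amplified component
    return pi * drop + (1.0 - pi) * amp
```

Under these assumed parameters, a zero count for a gene with expected magnitude 50 is far more probable under the mixture than under the amplification component alone, which is the effect a drop-out term is meant to capture.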

Figure 2. Applying single-cell models for differential expression and subpopulation analyses

a. The model fitted for each single cell is used to estimate the likelihood that a gene is expressed at any particular level (i.e. posterior distribution) given the observed data (colored curves). The approach estimates the joint posterior distribution for the overall level within each cell type (black curves), and the expression fold difference between the cell types (middle plot). The example demonstrates expression differences of Sox2 between all ES and MEF cells measured by Islam et al. The plots show the posterior probability of expression magnitudes in proximal (top) and distal (bottom) cells. The posterior probability of the fold-expression difference magnitude is shown in the middle plot with the associated raw P-value of differential expression.

b. Differential expression of Dazl between cells of 8-cell and 16-cell mouse embryo stages, as determined by the SCDE method. A regulator factor expressed in mammalian embryos, Dazl is expressed at earlier stages, and shows a drop-off between the 8- and 16-cell stages.

c. The ability of different analysis methods to detect differentially expressed genes is shown using the false/true positive rate relationship (ROC curve), using traditional bulk expression measurements as a benchmark. The SCDE method shows higher sensitivity in the low false-positive range, as well as higher overall performance, as measured by area under the curve (AUC) scores.

d. Performance of error-model-based transcriptional similarity measures in distinguishing ES and MEF cell types. The plot shows the fraction of correctly classified cells, assessed for increasingly difficult classification problems by iteratively excluding up to 7000 of the most informative genes (i.e. genes differentially expressed between ES and MEF, x-axis). The 95% confidence bands are shown in light shading. Transcriptional similarity measures that take into account direct or reciprocal drop-out event probability show consistently better classification performance than Pearson linear correlation or the Bray-Curtis similarity measure.
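
The calculation in panel a (per-cell likelihoods combined into a posterior for each cell type, then a posterior for the fold difference) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the per-cell error model uses a fixed drop-out rate with made-up parameters, the prior is flat on a log2 grid, cells are combined by a plain product of likelihoods, and the two groups are treated as independent. The function names, the grid, and the example counts are all hypothetical and none of this reproduces the SCDE implementation.

```python
import math

# Candidate expression magnitudes: 1 .. 4096 in log2 steps of 0.25.
GRID = [2 ** (i / 4) for i in range(49)]


def log_cell_likelihood(count, mean, dropout=0.2, size=5.0, background=0.1):
    """Log P(count | mean) for one cell under an assumed error model:
    Poisson background (drop-out) mixed with negative binomial amplification."""
    p = size / (size + mean)
    log_nb = (math.lgamma(count + size) - math.lgamma(size)
              - math.lgamma(count + 1)
              + size * math.log(p) + count * math.log1p(-p))
    log_pois = count * math.log(background) - background - math.lgamma(count + 1)
    a = math.log(dropout) + log_pois
    b = math.log(1.0 - dropout) + log_nb
    top = max(a, b)  # log-sum-exp keeps tiny probabilities from underflowing
    return top + math.log(math.exp(a - top) + math.exp(b - top))


def joint_posterior(counts):
    """Posterior over GRID for a group's expression level, assuming a flat
    prior on the grid and independent cells."""
    log_post = [sum(log_cell_likelihood(c, m) for c in counts) for m in GRID]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def fold_difference_posterior(post_a, post_b):
    """Posterior of log2(level_a / level_b), assuming independent groups.
    Keys are log2 fold differences in steps of 0.25."""
    diff = {}
    for i, pa in enumerate(post_a):
        for j, pb in enumerate(post_b):
            d = (i - j) / 4
            diff[d] = diff.get(d, 0.0) + pa * pb
    return diff


# Made-up read counts for one gene in two groups of cells.
group_a = [120, 0, 95, 150, 0, 110]   # expressed, with two drop-outs
group_b = [0, 2, 0, 1, 0, 0]          # at or near background
post_a = joint_posterior(group_a)
post_b = joint_posterior(group_b)
fold = fold_difference_posterior(post_a, post_b)
prob_a_higher = sum(p for d, p in fold.items() if d > 0)
```

Because the zeros in `group_a` can be explained by the drop-out component, they do not pull that group's posterior toward low expression, and `prob_a_higher` stays close to 1 for these example counts.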

References

    1. Tang F, et al. Nat Methods. 2009;6:377–382.
    2. Islam S, et al. Genome Res. 2011;21:1160–1167.
    3. Hashimshony T, Wagner F, Sher N, Yanai I. Cell Rep. 2012;2:666–673.
    4. Ramskold D, et al. Nat Biotechnol. 2012;30:777–782.
    5. Dalerba P, et al. Nat Biotechnol. 2011;29:1120–1127.
