Revealing the vectors of cellular identity with single-cell genomics

Review

Allon Wagner et al. Nat Biotechnol. 2016.

Abstract

Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges, from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities.

Conflict of interest statement

COMPETING FINANCIAL INTERESTS

A.R. is a member of the Scientific Advisory Board for Thermo Fisher Scientific and Syros Pharmaceuticals and a consultant for Driver Group. A.W. and N.Y. declare no competing financial interests.

Figures

Figure 1

(a) A cell participates simultaneously in multiple biological contexts. The illustration depicts a particular cell (highlighted in blue) as it experiences multiple concurrent contexts that shape its identity simultaneously (from left to right): environmental stimuli, such as nutrient availability or the binding of a signaling molecule to a receptor; a specific state on a developmental trajectory; the cell cycle; and a spatial context, which determines its physical environment (e.g., oxygen availability), cellular neighbors, and developmental cues (e.g., morphogen gradients). (b) The biological factors affecting the cell combine to create its unique, instantaneous identity, which is captured in the cell’s molecular profile. Computational methods dissect the molecular profile and tease apart facets of the cell’s identity, which are akin to ‘basis vectors’ that span a space of possible cellular identities. Key examples include (counterclockwise from top): (1) division into discrete types (e.g., cell populations in the retina (A.R. and colleagues)); cell type frequency can vary by multiple orders of magnitude from the most abundant to the rarest subtype; (2) continuous phenotypes (e.g., the pro-inflammatory potential of each individual T cell, quantified through a gene expression signature derived from bulk pathogenic T cell profiles (N.Y., A.R. and colleagues)); (3) temporal progression (e.g., normal differentiation, such as hematopoiesis); (4) temporal vacillation between cellular states (e.g., oscillation through cell cycle; data taken from A.R. and colleagues); (5) physical locations: a schematic representation of an embryo at 50% epiboly (only half is shown), divided into discrete spatial bins; independent in situ hybridization data of landmark genes allows inferring spatial bins (highlighted) from which single cells had likely originated (figure adapted from A.R. and colleagues). 
The scatterplots represent single cells (dots) projected onto two dimensions (e.g., first two principal components or using t-SNE).
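The two-dimensional projection mentioned for the scatterplots can be sketched as follows. This is a minimal illustration of projecting cells onto the first two principal components, not the procedure used for the figure; the function name `project_pca_2d` and the toy data (60 cells, 20 genes, two shifted groups) are assumptions made for the example.

```python
import numpy as np

def project_pca_2d(expr):
    """Project cells (rows) onto the first two principal components."""
    centered = expr - expr.mean(axis=0)  # center each gene (column)
    # SVD of the centered matrix; rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape: (n_cells, 2)

# Hypothetical toy data: two groups of 30 cells whose expression is
# shifted apart in every gene.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 30)
expr = rng.normal(size=(60, 20)) + 3.0 * group[:, None]

coords = project_pca_2d(expr)
print(coords.shape)
```

In this toy setting the first component mostly captures the shift between the two groups, which is the kind of structure such scatterplots are meant to expose.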

Figure 2

Biological and technical factors combine to determine the measured genomic profiles of single cells; computational methods remove technical effects and tease apart facets of the biological variation. The sources of variation that affect single-cell genomics data are (1) technical factors that reflect variance due to the experimental process (e.g., batch effects); (2) factors that are intrinsic to the process under study (e.g., transcription) and reflect stochastic fluctuations (e.g., transcriptional or translational bursts in mRNA or proteins) that do not correlate between two alleles of the same gene; and (3) factors that are extrinsic to the process under study, reflecting the presence of different cell types and states (e.g., concentrations of key transcription, translation, or metabolic factors). Computational methods are needed to remove the nuisance technical variation (although they typically cannot completely eliminate it) before the biological variation can be confidently explored. Most single-cell studies explore allele-extrinsic factors and can be classified as either cell-centric or gene-centric. Cell-centric analyses aim to catalog the cells into phenotypic groups, whether discrete (e.g., clustering) or continuous (e.g., temporal ordering). Gene-centric analyses aim to understand the dynamics and regulation of the generating mechanism (e.g., transcriptional circuits).
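As a rough illustration of removing a known technical factor such as a batch effect, the sketch below fits a linear model with one biological condition and one batch indicator by least squares. It assumes a simple additive model and invented toy values (`true_effects`, the two designs); it is not a method taken from the review. It also shows why the experimental design matters: when condition and batch coincide, the design matrix loses rank and the two effects cannot be told apart.

```python
import numpy as np

def design_matrix(condition, batch):
    """Columns: intercept, biological condition, technical batch."""
    return np.column_stack([np.ones(len(condition)), condition, batch])

# Balanced design: each condition appears in both batches.
cond_bal = np.array([0, 0, 1, 1, 0, 0, 1, 1])
batch_bal = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Confounded design: condition and batch coincide.
cond_conf = np.array([0, 0, 0, 0, 1, 1, 1, 1])
batch_conf = cond_conf.copy()

X_bal = design_matrix(cond_bal, batch_bal)
X_conf = design_matrix(cond_conf, batch_conf)

rank_bal = np.linalg.matrix_rank(X_bal)    # full column rank
rank_conf = np.linalg.matrix_rank(X_conf)  # rank-deficient

# Simulate one gene under the balanced design (hypothetical effect sizes)
# and recover the biological effect while accounting for the batch.
rng = np.random.default_rng(1)
true_effects = np.array([5.0, 2.0, -1.5])  # intercept, condition, batch
y = X_bal @ true_effects + rng.normal(scale=0.1, size=len(cond_bal))
est, *_ = np.linalg.lstsq(X_bal, y, rcond=None)

print(rank_bal, rank_conf)
print(est)
```

With the balanced design the estimated condition effect lands close to the simulated one; with the confounded design no estimator could separate the two columns.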

Figure 3

Technical confounders of single-cell RNA-seq and computational methods to handle them. (a) Batch effects. This source of technical variability can be mitigated by careful experimental design. The upper panel shows a design in which two biological conditions (“1” and “2”; for example, wild type and knockout) are distributed evenly between two technical batches (“Prep a” and “Prep b”). This allows statistical methods to account for the batch effect. In contrast, in the lower panel, the biological variation cannot be separated from the batch effect. (b) Library quality. The primary principal components of single-cell gene expression correlate strongly with quality metrics such as number of aligned reads and library complexity (A.R. and colleagues). A typical example is provided. The y axis shows the –log10 P value of the Spearman correlation between each of 18 quality metrics (color coded) and one of the primary principal components of the unnormalized expression data (FPKM units; data is taken from N.Y., A.R. and colleagues, in which the quality metrics are described; SZ = size, STD = standard deviation). (c) Dropouts and amplification bias. Because of the minute quantities of starting material in single cells, expressed transcripts may not be detected because they stochastically failed to amplify; on the other hand, the massive amplification exaggerates any source of technical noise. (d) Latent technical confounders. These can be identified using matrix factorization (notation follows ref. ; visualization adapted from ref. 216). The observed expression (Y, m samples by n genes) is often assumed to be a linear combination of biological and technical factors for statistical tractability.
It can be generally modeled as a sum of (i) X: p biological factors (either known a priori, for example genetic background, or latent, in which case p is unknown); (ii) Z: n known technical covariates (e.g., experimental batches); (iii) W: k latent factors of technical noise (k is unknown); and (iv) random noise ε with zero mean. α, β, γ determine the influence of each factor on every gene, with β representing the biology of interest and α, γ being nuisance factors that need to be properly handled before β can be inferred. (e) The prevalence of dropouts is modeled through a zero-inflated model: gene expression is modeled as a mixture of two distributions, the ‘real’ one, observed when a transcript is successfully amplified, which reflects the true mRNA abundance (p_success, in orange), and a ‘dropout’ one, which occurs when a transcript fails to amplify (p_dropout, in teal). The mixing ratio π depends on the transcript’s real expression, since it has been empirically observed that weakly expressed transcripts are more prone to dropouts. (f) Modeling dropout probabilities based on empirical data. Left and middle, false-negative rate curves (computed for each cell) describe the probability of a dropout event (y axis) as a logistic function of transcript abundance in the corresponding bulk population. Right, the inferred rates down-weight the effect of possible dropout events. Each dot represents the expression of one gene in two arbitrary single cells (x and y axes). Undetected genes (circled) are down-weighted when computing the correlation between the expression profiles of the two cells (data obtained from N.Y., A.R. and colleagues).
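The idea in panel (f), a logistic false-negative curve used to down-weight likely dropouts when correlating two cells, can be sketched roughly as below. This is an illustration under stated assumptions rather than the authors' implementation: the `midpoint` and `slope` parameters, the weighting rule (an undetected gene receives weight 1 minus its dropout probability, a detected gene weight 1), and all toy values are hypothetical.

```python
import numpy as np

def dropout_probability(bulk_log_expr, midpoint, slope):
    """Logistic false-negative curve: P(dropout) falls as abundance rises."""
    return 1.0 / (1.0 + np.exp(slope * (bulk_log_expr - midpoint)))

def weighted_pearson(x, y, w):
    """Pearson correlation with non-negative per-gene weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

def detection_weights(cell, p_drop):
    """Assumed rule: zeros are weighted by the chance they are true zeros."""
    return np.where(cell == 0, 1.0 - p_drop, 1.0)

# Hypothetical toy data: log abundance of six genes in bulk and two cells.
bulk = np.array([0.5, 1.0, 2.0, 4.0, 6.0, 8.0])
cell_a = np.array([0.0, 0.0, 2.1, 3.8, 6.2, 7.9])
cell_b = np.array([0.6, 0.0, 1.9, 4.1, 0.0, 8.1])

p_drop = dropout_probability(bulk, midpoint=2.0, slope=1.5)
w = detection_weights(cell_a, p_drop) * detection_weights(cell_b, p_drop)

r_plain = np.corrcoef(cell_a, cell_b)[0, 1]
r_weighted = weighted_pearson(cell_a, cell_b, w)
print(r_plain, r_weighted)
```

Zeros in weakly expressed genes contribute little to the weighted correlation, whereas a zero in a gene that is abundant in bulk keeps nearly full weight, since under this curve it is unlikely to be a dropout.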

References

    1. Gaublomme JT, et al. Single-cell genomics unveils critical regulators of Th17 cell pathogenicity. Cell. 2015;163:1400–1412.
    2. Shalek AK, et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014;510:363–369.
    3. Shalek AK, et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013;498:236–240.
    4. Zeisel A, et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142.
    5. Grün D, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525:251–255.
