Revealing the vectors of cellular identity with single-cell genomics
Review
Allon Wagner et al. Nat Biotechnol. 2016.
Abstract
Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges, from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities.
Conflict of interest statement
COMPETING FINANCIAL INTERESTS
A.R. is a member of the Scientific Advisory Board for Thermo Fisher Scientific and Syros Pharmaceuticals and a consultant for Driver Group. A.W. and N.Y. declare no competing financial interests.
Figures
Figure 1
(a) A cell participates simultaneously in multiple biological contexts. The illustration depicts a particular cell (highlighted in blue) as it experiences multiple concurrent contexts that shape its identity simultaneously (from left to right): environmental stimuli, such as nutrient availability or the binding of a signaling molecule to a receptor; a specific state on a developmental trajectory; the cell cycle; and a spatial context, which determines its physical environment (e.g., oxygen availability), cellular neighbors, and developmental cues (e.g., morphogen gradients). (b) The biological factors affecting the cell combine to create its unique, instantaneous identity, which is captured in the cell’s molecular profile. Computational methods dissect the molecular profile and tease apart facets of the cell’s identity, which are akin to ‘basis vectors’ that span a space of possible cellular identities. Key examples include (counterclockwise from top): (1) division into discrete types (e.g., cell populations in the retina (A.R. and colleagues)); cell type frequency can vary by multiple orders of magnitude from the most abundant to the rarest subtype; (2) continuous phenotypes (e.g., the pro-inflammatory potential of each individual T cell, quantified through a gene expression signature derived from bulk pathogenic T cell profiles (N.Y., A.R. and colleagues)); (3) temporal progression (e.g., normal differentiation, such as hematopoiesis); (4) temporal vacillation between cellular states (e.g., oscillation through cell cycle; data taken from A.R. and colleagues); (5) physical locations: a schematic representation of an embryo at 50% epiboly (only half is shown), divided into discrete spatial bins; independent in situ hybridization data of landmark genes allows inferring spatial bins (highlighted) from which single cells had likely originated (figure adapted from A.R. and colleagues). 
The scatterplots represent single cells (dots) projected onto two dimensions (e.g., first two principal components or using t-SNE).
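As an illustration of the two-dimensional projections mentioned in the caption, the following is a minimal sketch of projecting cells onto their first two principal components. It assumes a cells-by-genes expression matrix; the function name `project_pca_2d` and the toy data are hypothetical and are not taken from the review.

```python
import numpy as np

def project_pca_2d(expr):
    """Project cells (rows) onto their first two principal components.

    expr: array of shape (n_cells, n_genes), e.g. log-transformed expression.
    Returns an array of shape (n_cells, 2) with the PC1 and PC2 scores.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)  # center each gene across cells
    # SVD of the centered matrix; rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy data (hypothetical): two groups of 50 cells measured on 20 genes.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(50, 20))
group_b = rng.normal(3.0, 1.0, size=(50, 20))
coords = project_pca_2d(np.vstack([group_a, group_b]))
print(coords.shape)
```

Each row of `coords` would correspond to one dot in such a scatterplot. t-SNE, the other option named in the caption, is a nonlinear embedding and is not covered by this sketch.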
Figure 2
Biological and technical factors combine to determine the measured genomic profiles of single cells; computational methods remove technical effects and tease apart facets of the biological variation. The sources of variation that affect single-cell genomics data are (1) technical factors that reflect variance due to the experimental process (e.g., batch effects); (2) factors that are intrinsic to the process under study (e.g., transcription) and reflect stochastic fluctuations (e.g., transcriptional or translational bursts in mRNA or proteins) that do not correlate between two alleles of the same gene; and (3) factors that are extrinsic to the process under study, reflecting the presence of different cell types and states (e.g., concentrations of key transcription, translation, or metabolic factors). Computational methods are needed to remove the nuisance technical variation (although they typically cannot completely eliminate it) before the biological variation can be confidently explored. Most single-cell studies explore allele-extrinsic factors and can be classified as either cell-centric or gene-centric. Cell-centric analyses aim to catalog the cells into phenotypic groups, whether discrete (e.g., clustering) or continuous (e.g., temporal ordering). Gene-centric analyses aim to understand the dynamics and regulation of the generating mechanism (e.g., transcriptional circuits).
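One simple way to remove a known technical factor, such as a batch label, is to regress it out of every gene. The sketch below assumes the technical effect is linear and additive and that it is not confounded with the biology of interest. It is an illustration only, not the method of any particular study cited in the review, and the name `regress_out` and the toy data are hypothetical.

```python
import numpy as np

def regress_out(expr, covariates):
    """Remove the linear effect of known technical covariates from each gene.

    expr: (n_cells, n_genes) expression matrix.
    covariates: (n_cells, n_cov) known technical covariates (e.g. a batch indicator).
    Returns the corrected matrix; per-gene means are preserved.
    """
    expr = np.asarray(expr, dtype=float)
    z = np.asarray(covariates, dtype=float)
    z = z - z.mean(axis=0)              # center covariates so gene means are kept
    gene_means = expr.mean(axis=0)
    # Least-squares fit of every gene on the centered covariates.
    coef, *_ = np.linalg.lstsq(z, expr - gene_means, rcond=None)
    return expr - z @ coef

# Toy data (hypothetical): 40 cells, 10 genes, two batches; batch 1 is shifted by +2.
rng = np.random.default_rng(1)
batch = np.tile([0.0, 1.0], 20).reshape(-1, 1)
expr = rng.normal(5.0, 1.0, size=(40, 10)) + 2.0 * batch
corrected = regress_out(expr, batch)

in_b = batch[:, 0] == 1.0
gap_before = expr[in_b].mean(axis=0) - expr[~in_b].mean(axis=0)
gap_after = corrected[in_b].mean(axis=0) - corrected[~in_b].mean(axis=0)
print(np.abs(gap_before).mean(), np.abs(gap_after).max())
```

If the batch label coincided exactly with the biological condition, the same regression would also remove the biological signal, which is why this sketch assumes a design in which the two are not confounded.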
Figure 3
Technical confounders of single-cell RNA-seq and computational methods to handle them. (a) Batch effects. This source of technical variability can be mitigated by careful experimental design. The upper panel shows a design in which two biological conditions (“1” and “2”; for example, wild type and knockout) are distributed evenly between two technical batches (“Prep a” and “Prep b”). This allows statistical methods to account for the batch effect. In contrast, in the lower panel, the biological variation cannot be separated from the batch effect. (b) Library quality. The primary principal components of single-cell gene expression correlate strongly with quality metrics such as number of aligned reads and library complexity (A.R. and colleagues). A typical example is provided. The y axis shows the –log10 P value of the Spearman correlation between each of 18 quality metrics (color coded) and one of the primary principal components of the unnormalized expression data (FPKM units; data are taken from N.Y., A.R. and colleagues, in which the quality metrics are described; SZ = size, STD = standard deviation). (c) Dropouts and amplification bias. Because of the minute quantities of starting material in single cells, expressed transcripts may not be detected because they stochastically failed to amplify; on the other hand, the massive amplification exaggerates any source of technical noise. (d) Latent technical confounders. These can be identified using matrix factorization (notation follows ref. ; visualization adapted from ref. 216). The observed expression (Y, m samples by n genes) is often assumed to be a linear combination of biological and technical factors for statistical tractability.
It can be generally modeled as a sum of (i) X: p biological factors (either known a priori, for example genetic background, or latent, in which case p is unknown); (ii) Z: n known technical covariates (e.g., experimental batches); (iii) W: k latent factors of technical noise (k is unknown); (iv) random noise ε with zero mean. α, β and γ determine the influence of each factor on every gene, with β representing the biology of interest and α and γ being nuisance factors that need to be properly handled before β can be inferred. (e) The prevalence of dropouts is modeled through a zero-inflated model: gene expression is modeled as a mixture of two distributions, a ‘real’ component, observed when a transcript is successfully amplified, which reflects the true mRNA abundance (p_success, in orange), and a ‘dropout’ component, which occurs when a transcript fails to amplify (p_dropout, in teal). The mixing ratio π depends on the transcript’s real expression, since it has been empirically observed that weakly expressed transcripts are more prone to dropouts. (f) Modeling dropout probabilities based on empirical data. Left and middle, false-negative rate curves (computed for each cell) describe the probability of a dropout event (y axis) as a logistic function of transcript abundance in the corresponding bulk population. Right, the inferred rates down-weight the effect of possible dropout events. Each dot represents the expression of one gene in two arbitrary single cells (x and y axes). Undetected genes (circled) are down-weighted when computing the correlation between the expression profiles of the two cells (data obtained from N.Y., A.R. and colleagues).
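The idea in panels (e) and (f) can be sketched as follows: a per-cell logistic curve gives the probability of a dropout as a function of bulk abundance, and undetected genes are down-weighted by that probability when two cells are correlated. This is a minimal sketch under stated assumptions. The logistic parameters `intercept` and `slope`, the choice of weight `1 - P(dropout)` for undetected genes, the product of the two cells' weights, and all function names are hypothetical and are not taken from the cited studies.

```python
import numpy as np

def dropout_probability(bulk_abundance, intercept, slope):
    """Logistic false-negative curve: P(dropout) decreases with bulk abundance.

    intercept and slope are hypothetical per-cell parameters (slope > 0).
    """
    a = np.asarray(bulk_abundance, dtype=float)
    return 1.0 / (1.0 + np.exp(-(intercept - slope * a)))

def dropout_weights(cell_expr, p_dropout):
    """Weight 1 for detected genes; an undetected gene gets 1 - P(dropout),
    so a zero is trusted less when a dropout is likely."""
    return np.where(np.asarray(cell_expr) > 0, 1.0, 1.0 - np.asarray(p_dropout))

def weighted_correlation(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    vx = np.sum(w * (x - mx) ** 2)
    vy = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(vx * vy)

# Toy example (hypothetical values): five genes, two cells.
bulk = np.array([1.0, 2.0, 3.0, 5.0, 8.0])    # abundance in the bulk population
cell_x = np.array([0.0, 0.0, 3.0, 5.0, 8.0])  # two weakly expressed genes undetected
cell_y = np.array([1.0, 2.0, 3.0, 5.0, 8.0])

p = dropout_probability(bulk, intercept=4.0, slope=1.0)
w = dropout_weights(cell_x, p) * dropout_weights(cell_y, p)

r_plain = np.corrcoef(cell_x, cell_y)[0, 1]
r_weighted = weighted_correlation(cell_x, cell_y, w)
print(r_plain, r_weighted)
```

With uniform weights, `weighted_correlation` reduces to the ordinary Pearson correlation, so the weighting affects only genes that were not detected in at least one of the two cells.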
Similar articles
- Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq.
Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, Sabeti PC. Elife. 2019 Jul 8;8:e43803. doi: 10.7554/eLife.43803. PMID: 31282856. Free PMC article.
- The challenges of modeling mammalian biocomplexity.
Nicholson JK, Holmes E, Lindon JC, Wilson ID. Nat Biotechnol. 2004 Oct;22(10):1268-74. doi: 10.1038/nbt1015. PMID: 15470467. Review.
- Metabolomics and the challenges ahead.
Mendes P. Brief Bioinform. 2006 Jun;7(2):127. doi: 10.1093/bib/bbl010. PMID: 16914436. No abstract available.
- Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.
Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, Kharchenko PV. Nat Methods. 2016 Mar;13(3):241-4. doi: 10.1038/nmeth.3734. Epub 2016 Jan 18. PMID: 26780092. Free PMC article.
- The next wave in metabolome analysis.
Nielsen J, Oliver S. Trends Biotechnol. 2005 Nov;23(11):544-6. doi: 10.1016/j.tibtech.2005.08.005. Epub 2005 Sep 12. PMID: 16154652. Review.
Cited by
- Analysis of circulating breast cancer cell heterogeneity and interactions with peripheral blood mononuclear cells.
Brechbuhl HM, Vinod-Paul K, Gillen AE, Kopin EG, Gibney K, Elias AD, Hayashi M, Sartorius CA, Kabos P. Mol Carcinog. 2020 Oct;59(10):1129-1139. doi: 10.1002/mc.23242. Epub 2020 Aug 21. PMID: 32822091. Free PMC article.
- Multi-Objective Optimized Fuzzy Clustering for Detecting Cell Clusters from Single-Cell Expression Profiles.
Mallik S, Zhao Z. Genes (Basel). 2019 Aug 13;10(8):611. doi: 10.3390/genes10080611. PMID: 31412637. Free PMC article.
- Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq.
Kotliar D, Veres A, Nagy MA, Tabrizi S, Hodis E, Melton DA, Sabeti PC. Elife. 2019 Jul 8;8:e43803. doi: 10.7554/eLife.43803. PMID: 31282856. Free PMC article.
- From bench to bedside: Single-cell analysis for cancer immunotherapy.
Davis-Marcisak EF, Deshpande A, Stein-O'Brien GL, Ho WJ, Laheru D, Jaffee EM, Fertig EJ, Kagohara LT. Cancer Cell. 2021 Aug 9;39(8):1062-1080. doi: 10.1016/j.ccell.2021.07.004. Epub 2021 Jul 29. PMID: 34329587. Free PMC article. Review.
- Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data.
Ozgode Yigin B, Saygili G. Sci Rep. 2023 Apr 21;13(1):6567. doi: 10.1038/s41598-023-32966-x. PMID: 37085593. Free PMC article.
References
- Zeisel A, et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142.
- Grün D, et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525:251–255.
Grants and funding
- U24 CA180922/CA/NCI NIH HHS/United States
- P30 CA014051/CA/NCI NIH HHS/United States
- U01 MH105979/MH/NIMH NIH HHS/United States
- U01 MH105960/MH/NIMH NIH HHS/United States
- HHMI/Howard Hughes Medical Institute/United States
- P50 HG006193/HG/NHGRI NIH HHS/United States
- U24 AI118672/AI/NIAID NIH HHS/United States
- RM1 HG006193/HG/NHGRI NIH HHS/United States