Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor
Megan Crow et al. Nat Commun. 2018.
Abstract
Single-cell RNA-sequencing (scRNA-seq) technology provides a new avenue to discover and characterize cell types; however, the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine its replicability. Meta-analysis is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that quantifies the degree to which cell types replicate across datasets, and enables rapid identification of clusters with high similarity. We first measure the replicability of neuronal identity, comparing results across eight technically and biologically diverse datasets to define best practices for more complex assessments. We then apply this to novel interneuron subtypes, finding that 24/45 subtypes have evidence of replication, which enables the identification of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.
Conflict of interest statement
The authors declare no competing interests.
Figures
Fig. 1
MetaNeighbor quantifies cell-type identity across experiments. a Schematic representation of gene set co-expression across individual cells. Cell types are indicated by their color. b Similarity between cells is measured by taking the correlation of gene set expression between individual cells. On the top left of the panel, gene set expression between two cells, A and B, is plotted. There is a weak correlation between these cells. On the bottom left of the panel we see the correlation between cells A and C, which are strongly correlated. By taking the correlations between all pairs of cells we can build a cell network (right), where every node is a cell and the edges represent how similar each cell is to each other cell. c The cell network that was generated in b can be extended to include data from multiple experiments (multiple datasets). The generation of this multi-dataset network is the first step of MetaNeighbor. d The cross-validation and scoring scheme of MetaNeighbor is demonstrated in this panel. To assess cell-type identity across experiments we use neighbor voting in cross-validation, systematically hiding the labels from one dataset at a time for testing. Cells within the test set are predicted as similar to the cell types from other training sets using a neighbor-voting formalism. Whether these scores prioritize cells as the correct type within the dataset determines the performance, expressed as the AUROC. In other words, comparative assessment of cells occurs only within a dataset, but is based only on training information from outside that dataset. This is then repeated for all gene sets of interest
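The neighbor-voting and cross-validation scheme in panel d can be illustrated with a minimal sketch. It is not the MetaNeighbor implementation: the simulated data, the function names, and the rescaling of correlations to non-negative edge weights are all assumptions made for illustration. Each dataset is held out in turn, its cells are scored from the labels of the other datasets only, and the AUROC is computed within the held-out dataset.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def neighbor_vote(test_cells, train_cells, train_labels):
    """Score each test cell by the share of its edge weight that goes to positive training cells."""
    scores = []
    for cell in test_cells:
        # Assumption: map correlation from [-1, 1] to [0, 1] so edge weights are non-negative.
        weights = [(pearson(cell, t) + 1) / 2 for t in train_cells]
        positive_weight = sum(w for w, is_pos in zip(weights, train_labels) if is_pos)
        scores.append(positive_weight / sum(weights))
    return scores

def simulate_dataset(rng, n_per_type=10, n_genes=20):
    """Toy data: positives express the first half of the gene set, negatives the second half."""
    cells, labels = [], []
    for is_pos in (True, False):
        for _ in range(n_per_type):
            profile = [3.0 if (g < n_genes // 2) == is_pos else 0.0 for g in range(n_genes)]
            cells.append([v + rng.gauss(0, 1) for v in profile])
            labels.append(is_pos)
    return cells, labels

rng = random.Random(0)
datasets = [simulate_dataset(rng) for _ in range(3)]

# Hide one dataset's labels at a time, train on the rest, score only within the held-out dataset.
aurocs = []
for i, (test_cells, test_labels) in enumerate(datasets):
    train_cells = [c for j, (cells, _) in enumerate(datasets) if j != i for c in cells]
    train_labels = [l for j, (_, labels) in enumerate(datasets) if j != i for l in labels]
    aurocs.append(auroc(neighbor_vote(test_cells, train_cells, train_labels), test_labels))

mean_auroc = sum(aurocs) / len(aurocs)
print(f"mean AUROC across held-out datasets: {mean_auroc:.2f}")
```

In this toy setting the two simulated types are well separated, so the AUROC is far above the 0.5 null. Repeating the loop for each gene set of interest would give one score per gene set.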
Fig. 2
Cell-type identity is widely represented in the transcriptome. a,b Distribution of AUROC scores from MetaNeighbor for discriminating neurons from non-neuronal cells (“task one”, a) and for distinguishing excitatory vs. inhibitory neurons (“task two”, b). GO scores are in black and random gene set scores are plotted in gray. Dashed gray lines indicate the null expectation for correctly guessing cell identity (AUROC = 0.5). For both tasks, almost any gene set can be used to improve performance above the null, suggesting widespread encoding of cell identity across the transcriptome. c Comparison of GO group scores across tasks. GO groups at the extremes of the distribution are labeled. Most gene sets have higher performance for task one, and a number of groups have high performance for both tasks (e.g., transmembrane transport). d Task one AUROC scores for each gene set are plotted with respect to the number of genes. A strong, positive relationship is observed between gene set size and AUROC score, regardless of whether genes were chosen randomly or based on shared functions. e Distribution of AUROC scores for task one using 100 sets of 100 randomly chosen genes, or 800 randomly chosen genes. The mean AUROC score is significantly improved with the use of larger gene sets (mean for 100 genes = 0.80 ± 0.05, mean for 800 genes = 0.90 ± 0.03). f Relationship between AUROC score and coefficient of variation. Points indicate individual gene set performances, the line shows the running average and the SD of the line is indicated by the shaded region. Task one was re-run using sets of genes chosen based on mean coefficient of variation across datasets. A strong positive relationship was observed between this factor and performance (r_s ~ 0.67)
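Panel f ranks genes by their mean coefficient of variation across datasets. The following sketch shows one way such a ranking could be computed. The gene names and expression values are invented, and this is not necessarily the exact selection procedure used in the paper or in the MetaNeighbor package.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean; defined as 0 for genes with zero mean."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else 0.0

def mean_cv_per_gene(datasets):
    """datasets: list of {gene: [expression per cell]}. Returns {gene: mean CV} for shared genes."""
    shared = set.intersection(*(set(d) for d in datasets))
    return {g: mean([coefficient_of_variation(d[g]) for d in datasets]) for g in shared}

def top_variable_genes(datasets, k):
    """The k genes with the highest mean CV (ties broken alphabetically)."""
    cv = mean_cv_per_gene(datasets)
    return sorted(cv, key=lambda g: (-cv[g], g))[:k]

# Invented toy values: GeneA is flat, GeneB switches on and off, GeneC varies mildly.
dataset_1 = {"GeneA": [1, 1, 1, 1], "GeneB": [0, 4, 0, 4], "GeneC": [2, 3, 2, 3]}
dataset_2 = {"GeneA": [2, 2, 2, 2], "GeneB": [0, 6, 0, 6], "GeneC": [4, 6, 4, 6]}

selected = top_variable_genes([dataset_1, dataset_2], k=2)
print(selected)
```

Averaging the per-dataset values, rather than pooling all cells first, keeps a gene from ranking highly only because its baseline expression differs between experiments.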
Fig. 3
Empirical modeling demonstrates that MetaNeighbor readily identifies rare and transcriptionally subtle cell types. a Schematic of the empirical model. For simplicity only a single dataset is depicted. (Top left) In this dataset, we begin with an expression matrix containing gene expression levels for two cell types comprising 10 cells each. Here we will be assessing the replicability of cell type 1 (“positives”) relative to cell type 2 (“negatives”). (Top right) We first adjust cell rarity by randomly sampling subsets of the original expression matrix. In the schematic, incidence is set to 20% (two positives and eight negatives). In addition, we partition two negatives from the original data for later use. (Middle) Next, we adjust transcriptional subtlety by randomly sampling genes from a given fraction of the transcriptome. Gene expression in the positives will be replaced with data from the unused negatives, creating a modeled cell type varying from the negative class only in a subset of its genes. (Bottom) All datasets are combined and MetaNeighbor is run to assess the replicability of the positives at each level of rarity and subtlety. b MetaNeighbor results for empirical modeling of excitatory neuron rarity and subtlety, repeated 100 times. Mean performance for the top GO group is in black; performance for 20 randomly chosen genes is shown in red; dashed lines indicate 20% rarity and solid lines show 1% rarity. MetaNeighbor is robust to differences in cell rarity, and can reliably distinguish between types even when they are very similar (AUROC > 0.7 at >88% subtlety). c MetaNeighbor results for empirical modeling of excitatory neuron rarity and subtlety using highly variable genes (HVGs), repeated 100 times. Performance for the HVG varying set is shown in black, performance for the HVG static is shown in red; dashed lines indicate 20% rarity and solid lines show 1% rarity. HVGs allow for robust identification of positives even when cells are rare or differences are subtle
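The sampling steps in the panel a schematic can be sketched as follows. The function name, its parameters, and the toy matrices are hypothetical. The sketch also assumes that "subtlety" is the fraction of genes in the positives that are overwritten with values from held-back negatives, which matches the caption's description but is not taken from the authors' code.

```python
import random

def model_cell_type(positives, negatives, n_cells, incidence, subtlety, rng):
    """Return (modeled positives, sampled negatives, indices of replaced genes) for one dataset."""
    n_genes = len(positives[0])
    n_pos = max(1, round(n_cells * incidence))   # rarity: number of positives to keep
    n_neg = n_cells - n_pos
    if n_pos > len(positives) or n_neg + n_pos > len(negatives):
        raise ValueError("not enough cells for the requested incidence")

    sampled_pos = rng.sample(positives, n_pos)
    shuffled_neg = rng.sample(negatives, len(negatives))
    sampled_neg = shuffled_neg[:n_neg]
    donors = shuffled_neg[n_neg:n_neg + n_pos]   # unused negatives, set aside as donors

    # Subtlety: overwrite a random fraction of genes in each positive with donor values.
    replaced = set(rng.sample(range(n_genes), round(n_genes * subtlety)))
    modeled_pos = [
        [donor[g] if g in replaced else cell[g] for g in range(n_genes)]
        for cell, donor in zip(sampled_pos, donors)
    ]
    return modeled_pos, sampled_neg, replaced

# Toy matrices matching the schematic: two cell types with 10 cells each (10 genes per cell).
rng = random.Random(1)
positives = [[1.0] * 10 for _ in range(10)]
negatives = [[0.0] * 10 for _ in range(10)]

modeled_pos, sampled_neg, replaced = model_cell_type(
    positives, negatives, n_cells=10, incidence=0.2, subtlety=0.5, rng=rng
)
print(len(modeled_pos), len(sampled_neg), sorted(replaced))
```

With 20% incidence this yields two modeled positives and eight negatives, as in the schematic. Each modeled positive differs from the negative class only in the genes that were not replaced.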
Fig. 4
Cross-dataset analysis of interneuron diversity. a Heatmap of AUROC scores between interneuron subtypes based on the highly variable gene set (HVG). Dendrograms were generated by hierarchical clustering of Euclidean distances using average linkage. Row and column colors indicate data origin and marker expression. Clustering of AUROC score profiles recapitulates known cell-type structure, with major branches representing the Pv, Sst, and Htr3a lineages. b Boxplots of GO performance (3888 sets) for each putatively replicated subtype, ordered by their AUROC score from the highly variable gene set. Subtypes are labeled with the names from Tasic et al. A positive relationship is observed between AUROC scores from the highly variable set and the average AUROC score for each subtype
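The dendrograms in panel a come from hierarchical clustering of Euclidean distances with average linkage. A minimal pure-Python sketch of that procedure is shown below. The cluster names and AUROC profiles are invented toy values, not results from the paper, and a real analysis would typically rely on an established clustering library.

```python
from itertools import combinations
from math import dist

def average_linkage(profiles):
    """Agglomerative clustering. Returns the merges as (members_a, members_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []

    def linkage(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs.
        return sum(dist(profiles[x], profiles[y]) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented AUROC profiles: each row is one cluster's scores against the four clusters.
profiles = {
    "Pv_a": [0.95, 0.90, 0.20, 0.15],
    "Pv_b": [0.90, 0.95, 0.25, 0.20],
    "Sst_a": [0.20, 0.25, 0.95, 0.90],
    "Sst_b": [0.15, 0.20, 0.90, 0.95],
}

merges = average_linkage(profiles)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, f"at distance {distance:.2f}")
```

Clusters with similar AUROC profiles merge first, so in this toy example the two within-lineage pairs join before the two lineages are linked.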
Fig. 5
Replicated subtypes show consistent differential expression. a (Top) Heatmap of FDR-adjusted P values of top differentially expressed genes among replicated interneuron subtypes (NB only 10 subtypes are shown as no differentially expressed genes were found for the Ndnf Car4 subtype). Subtype names are listed at the top of the columns and are labeled as in Tasic et al. Many genes are commonly differentially expressed among multiple subtypes, but combinatorial patterns distinguish them. b Standardized Ptn expression is plotted across the three experiments, where each box represents an interneuron subtype. Boxes bound the quartiles, middle lines represent the median, whiskers extend to 1.5 times the interquartile range, and values outside of this range are shown as individual points. High, but variable expression is observed across the three Sst Chodl types. c Confocal images of co-immunostaining for _Ptn_-CreER;Ai14 with RFP and Nos1 antibodies in adult mouse cortex. _Ptn_-CreER;Ai14 expression was induced with low-dose tamoxifen postnatally. Clear co-labeling is observed in a deep layer (L6) long projecting neuron
Similar articles
- Detection of high variability in gene expression from single-cell RNA-seq profiling.
Chen HI, Jin Y, Huang Y, Chen Y. BMC Genomics. 2016 Aug 22;17 Suppl 7(Suppl 7):508. doi: 10.1186/s12864-016-2897-6. PMID: 27556924. Free PMC article.
- A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa.
Zhang H, Lee CAA, Li Z, Garbe JR, Eide CR, Petegrosso R, Kuang R, Tolar J. PLoS Comput Biol. 2018 Apr 9;14(4):e1006053. doi: 10.1371/journal.pcbi.1006053. eCollection 2018 Apr. PMID: 29630593. Free PMC article.
- Improving replicability in single-cell RNA-Seq cell type discovery with Dune.
Roux de Bézieux H, Street K, Fischer S, Van den Berge K, Chance R, Risso D, Gillis J, Ngai J, Purdom E, Dudoit S. BMC Bioinformatics. 2024 May 24;25(1):198. doi: 10.1186/s12859-024-05814-6. PMID: 38789920. Free PMC article.
- Single-Cell RNA Sequencing for Studying Human Cancers.
Aran D. Annu Rev Biomed Data Sci. 2023 Aug 10;6:1-22. doi: 10.1146/annurev-biodatasci-020722-091857. Epub 2023 Apr 11. PMID: 37040737. Review.
- Scaling up reproducible research for single-cell transcriptomics using MetaNeighbor.
Fischer S, Crow M, Harris BD, Gillis J. Nat Protoc. 2021 Aug;16(8):4031-4067. doi: 10.1038/s41596-021-00575-5. Epub 2021 Jul 7. PMID: 34234317. Free PMC article. Review.
Cited by
- Single-cell Proteomics: Progress and Prospects.
Kelly RT. Mol Cell Proteomics. 2020 Nov;19(11):1739-1748. doi: 10.1074/mcp.R120.002234. Epub 2020 Aug 26. PMID: 32847821. Free PMC article. Review.
- Single-cell RNA sequencing of cultured human endometrial CD140b+CD146+ perivascular cells highlights the importance of in vivo microenvironment.
Cao D, Chan RWS, Ng EHY, Gemzell-Danielsson K, Yeung WSB. Stem Cell Res Ther. 2021 May 29;12(1):306. doi: 10.1186/s13287-021-02354-1. PMID: 34051872. Free PMC article.
- Enhanced feature matching in single-cell proteomics characterizes IFN-γ response and co-existence of cell states.
Krull KK, Ali SA, Krijgsveld J. Nat Commun. 2024 Sep 26;15(1):8262. doi: 10.1038/s41467-024-52605-x. PMID: 39327420. Free PMC article.
- Molecular and cellular evolution of the amygdala across species analyzed by single-nucleus transcriptome profiling.
Yu B, Zhang Q, Lin L, Zhou X, Ma W, Wen S, Li C, Wang W, Wu Q, Wang X, Li XM. Cell Discov. 2023 Feb 14;9(1):19. doi: 10.1038/s41421-022-00506-y. PMID: 36788214. Free PMC article.
- A single-cell transcriptomic atlas tracking the neural basis of division of labour in an ant superorganism.
Li Q, Wang M, Zhang P, Liu Y, Guo Q, Zhu Y, Wen T, Dai X, Zhang X, Nagel M, Dethlefsen BH, Xie N, Zhao J, Jiang W, Han L, Wu L, Zhong W, Wang Z, Wei X, Dai W, Liu L, Xu X, Lu H, Yang H, Wang J, Boomsma JJ, Liu C, Zhang G, Liu W. Nat Ecol Evol. 2022 Aug;6(8):1191-1204. doi: 10.1038/s41559-022-01784-1. Epub 2022 Jun 16. PMID: 35711063. Free PMC article.
Grants and funding
- R01 MH094705/MH/NIMH NIH HHS/United States
- R01 MH113005/MH/NIMH NIH HHS/United States
- R01 MH109665/MH/NIMH NIH HHS/United States
- F32 MH114501/MH/NIMH NIH HHS/United States
- R01 LM012736/LM/NLM NIH HHS/United States