Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor

Megan Crow et al. Nat Commun. 2018.

Abstract

Single-cell RNA-sequencing (scRNA-seq) technology provides a new avenue to discover and characterize cell types; however, the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine its replicability. Meta-analysis is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that quantifies the degree to which cell types replicate across datasets, and enables rapid identification of clusters with high similarity. We first measure the replicability of neuronal identity, comparing results across eight technically and biologically diverse datasets to define best practices for more complex assessments. We then apply this to novel interneuron subtypes, finding that 24/45 subtypes have evidence of replication, which enables the identification of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1

MetaNeighbor quantifies cell-type identity across experiments. a Schematic representation of gene set co-expression across individual cells. Cell types are indicated by their color. b Similarity between cells is measured by taking the correlation of gene set expression between individual cells. On the top left of the panel, gene set expression between two cells, A and B, is plotted. There is a weak correlation between these cells. On the bottom left of the panel we see the correlation between cells A and C, which are strongly correlated. By taking the correlations between all pairs of cells we can build a cell network (right), where every node is a cell and the edges represent how similar each cell is to each other cell. c The cell network that was generated in b can be extended to include data from multiple experiments (multiple datasets). The generation of this multi-dataset network is the first step of MetaNeighbor. d The cross-validation and scoring scheme of MetaNeighbor is demonstrated in this panel. To assess cell-type identity across experiments we use neighbor voting in cross-validation, systematically hiding the labels from one dataset at a time for testing. Cells within the test set are predicted as similar to the cell types from other training sets using a neighbor-voting formalism. Whether these scores prioritize cells as the correct type within the dataset determines the performance, expressed as the AUROC. In other words, comparative assessment of cells occurs only within a dataset, but is based only on training information from outside that dataset. This is then repeated for all gene sets of interest
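The neighbor-voting and AUROC scoring described in panel d can be illustrated with a minimal sketch. This is not the MetaNeighbor implementation: the function names (`pearson`, `auroc`, `neighbor_vote_auroc`), the use of Pearson correlation shifted to be non-negative as the edge weight, and the toy expression values are all assumptions made for illustration. Labels in the held-out dataset are used only for scoring, never for voting.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, is_pos in zip(ranks, labels) if is_pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def neighbor_vote_auroc(train, test, cell_type):
    """Score held-out cells by the weighted vote of labeled training cells.

    train, test: lists of (expression_vector, label) from different datasets.
    """
    scores = []
    for expr, _ in test:
        # Assumption: shift correlations into [0, 2] so edge weights are non-negative.
        weights = [pearson(expr, t_expr) + 1.0 for t_expr, _ in train]
        votes = sum(w for w, (_, lab) in zip(weights, train) if lab == cell_type)
        total = sum(weights)
        scores.append(votes / total if total else 0.0)
    labels = [lab == cell_type for _, lab in test]
    return auroc(scores, labels)

# Toy data (hypothetical): four genes, two cell types, two "datasets".
train = [([5, 1, 4, 0], "neuron"), ([6, 0, 5, 1], "neuron"),
         ([0, 5, 1, 4], "glia"), ([1, 6, 0, 5], "glia")]
test = [([4, 1, 5, 0], "neuron"), ([5, 0, 6, 1], "neuron"),
        ([0, 4, 1, 6], "glia"), ([1, 5, 0, 4], "glia")]
score = neighbor_vote_auroc(train, test, "neuron")
```

In a full cross-validation, each dataset would take a turn as the held-out test set and the procedure would be repeated for every gene set of interest.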

Fig. 2

Cell-type identity is widely represented in the transcriptome. a,b Distribution of AUROC scores from MetaNeighbor for discriminating neurons from non-neuronal cells (“task one”, a) and for distinguishing excitatory vs. inhibitory neurons (“task two”, b). GO scores are in black and random gene set scores are plotted in gray. Dashed gray lines indicate the null expectation for correctly guessing cell identity (AUROC = 0.5). For both tasks, almost any gene set can be used to improve performance above the null, suggesting widespread encoding of cell identity across the transcriptome. c Comparison of GO group scores across tasks. GO groups at the extremes of the distribution are labeled. Most gene sets have higher performance for task one, and a number of groups have high performance for both tasks (e.g., transmembrane transport). d Task one AUROC scores for each gene set are plotted with respect to the number of genes. A strong, positive relationship is observed between gene set size and AUROC score, regardless of whether genes were chosen randomly or based on shared functions. e Distribution of AUROC scores for task one using 100 sets of 100 randomly chosen genes, or 800 randomly chosen genes. The mean AUROC score is significantly improved with the use of larger gene sets (mean 100 = 0.80 ± 0.05, mean 800 = 0.90 ± 0.03). f Relationship between AUROC score and coefficient of variation. Points indicate individual gene set performances, the line shows the running average and the SD of the line is indicated by the shaded region. Task one was re-run using sets of genes chosen based on mean coefficient of variation across datasets. A strong positive relationship was observed between this factor and performance (r_s ~ 0.67)

Fig. 3

Empirical modeling demonstrates that MetaNeighbor readily identifies rare and transcriptionally subtle cell types. a Schematic of the empirical model. For simplicity only a single dataset is depicted. (Top left) In this dataset, we begin with an expression matrix containing gene expression levels for two cell types comprising 10 cells each. Here we will be assessing the replicability of cell type 1 (“positives”) relative to cell type 2 (“negatives”). (Top right) We first adjust cell rarity by randomly sampling subsets of the original expression matrix. In the schematic, incidence is set to 20% (two positives and eight negatives). In addition, we partition two negatives from the original data for later use. (Middle) Next, we adjust transcriptional subtlety by randomly sampling genes from a given fraction of the transcriptome. Gene expression in the positives will be replaced with data from the unused negatives, creating a modeled cell type varying from the negative class only in a subset of its genes. (Bottom) All datasets are combined and MetaNeighbor is run to assess the replicability of the positives at each level of rarity and subtlety. b MetaNeighbor results for empirical modeling of excitatory neuron rarity and subtlety, repeated 100 times. Mean performance for the top GO group is in black; performance for 20 randomly chosen genes is shown in red; dashed lines indicate 20% rarity and solid lines show 1% rarity. MetaNeighbor is robust to differences in cell rarity, and can reliably distinguish between types even when they are very similar (AUROC > 0.7 at >88% subtlety). c MetaNeighbor results for empirical modeling of excitatory neuron rarity and subtlety using highly variable genes (HVGs), repeated 100 times. Performance for the HVG varying set is shown in black, performance for the HVG static is shown in red; dashed lines indicate 20% rarity and solid lines show 1% rarity. HVGs allow for robust identification of positives even when cells are rare or differences are subtle
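The rarity and subtlety adjustments in panel a can be sketched as follows. This is only an illustration under stated assumptions: the function name `model_cell_type`, its parameters, and the choice to overwrite the same gene subset in every sampled positive are hypothetical and not taken from the authors' code.

```python
import random

def model_cell_type(positives, negatives, n_cells, incidence, subtlety, seed=0):
    """Return (modeled_positives, kept_negatives) for one dataset.

    positives, negatives: lists of per-cell expression vectors (same gene order).
    n_cells:   number of cells to keep after subsampling.
    incidence: target fraction of positives among the kept cells (rarity).
    subtlety:  fraction of genes in each positive overwritten with values
               from a held-out negative cell.
    """
    rng = random.Random(seed)
    n_pos = max(1, round(incidence * n_cells))
    n_neg = n_cells - n_pos
    # One extra negative per positive is set aside as an expression donor.
    if n_pos > len(positives) or n_neg + n_pos > len(negatives):
        raise ValueError("not enough cells to sample from")

    sampled_pos = rng.sample(positives, n_pos)
    sampled_neg = rng.sample(negatives, n_neg + n_pos)
    kept_neg, donors = sampled_neg[:n_neg], sampled_neg[n_neg:]

    # Assumption: the same randomly chosen genes are overwritten in every positive.
    n_genes = len(positives[0])
    swap_genes = set(rng.sample(range(n_genes), round(subtlety * n_genes)))
    modeled = [
        [donor[g] if g in swap_genes else pos[g] for g in range(n_genes)]
        for pos, donor in zip(sampled_pos, donors)
    ]
    return modeled, kept_neg

# Toy data mirroring the schematic: 10 positives and 10 negatives, 10 genes each.
positives = [[1.0] * 10 for _ in range(10)]
negatives = [[0.0] * 10 for _ in range(10)]
modeled, kept = model_cell_type(positives, negatives,
                                n_cells=10, incidence=0.2, subtlety=0.8)
```

With these settings the sketch keeps two modeled positives and eight negatives, and each modeled positive differs from the negative class in only two of its ten genes.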

Fig. 4

Cross-dataset analysis of interneuron diversity. a Heatmap of AUROC scores between interneuron subtypes based on the highly variable gene set (HVG). Dendrograms were generated by hierarchical clustering of Euclidean distances using average linkage. Row and column colors indicate data origin and marker expression. Clustering of AUROC score profiles recapitulates known cell-type structure, with major branches representing the Pv, Sst, and Htr3a lineages. b Boxplots of GO performance (3888 sets) for each putatively replicated subtype, ordered by their AUROC score from the highly variable gene set. Subtypes are labeled with the names from Tasic et al. A positive relationship is observed between AUROC scores from the highly variable set and the average AUROC score for each subtype

Fig. 5

Replicated subtypes show consistent differential expression. a (Top) Heatmap of FDR-adjusted P values of top differentially expressed genes among replicated interneuron subtypes (NB only 10 subtypes are shown as no differentially expressed genes were found for the Ndnf Car4 subtype). Subtype names are listed at the top of the columns and are labeled as in Tasic et al. Many genes are commonly differentially expressed among multiple subtypes, but combinatorial patterns distinguish them. b Standardized Ptn expression is plotted across the three experiments, where each box represents an interneuron subtype. Boxes bound the quartiles, middle lines represent the median, whiskers extend to 1.5 times the interquartile range, and values outside of this range are shown as individual points. High, but variable expression is observed across the three Sst Chodl types. c Confocal images of co-immunostaining for _Ptn_-CreER;Ai14 with RFP and Nos1 antibodies in adult mouse cortex. _Ptn_-CreER;Ai14 expression was induced with low-dose tamoxifen postnatally. Clear co-labeling is observed in a deep layer (L6) long projecting neuron
