Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Nat Biotechnol. 2018 Jun;36(5):421-427.

doi: 10.1038/nbt.4091. Epub 2018 Apr 2.

Laleh Haghverdi et al. Nat Biotechnol. 2018 Jun.

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.

Conflict of interest statement

Competing interests

The authors declare no competing financial interests.

Figures

Figure 1

Schematics of batch effect correction by MNN. (a) Batch 1 and batch 2 in high dimensions with an almost orthogonal batch effect difference between them. (b) The algorithm identifies matching cell types by finding mutual nearest neighbouring pairs of cells (grey box). (c) Batch correction vectors are calculated between the MNN pairs. (d) Batch 1 is regarded as the reference and batch 2 is integrated into it by subtraction of correction vectors. (e) The integrated data are considered as the reference and the procedure is repeated for integration of any new batch.
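The steps in panels (b) to (d) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`k_nearest`, `mnn_pairs`, `mnn_correct`), the use of Euclidean distance, and the choice to average all pair-wise correction vectors into a single global vector are assumptions made here for brevity. Cells are represented as lists of expression values.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two expression vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_nearest(query, reference, k):
    """For each query cell, return the index set of its k nearest reference cells."""
    neighbours = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: sq_dist(q, reference[j]))
        neighbours.append(set(order[:k]))
    return neighbours


def mnn_pairs(batch1, batch2, k=1):
    """Pairs (i, j) where cell i of batch 1 and cell j of batch 2 are in each other's k nearest neighbours."""
    nn_1_in_2 = k_nearest(batch1, batch2, k)
    nn_2_in_1 = k_nearest(batch2, batch1, k)
    return [(i, j)
            for i, nbrs in enumerate(nn_1_in_2)
            for j in sorted(nbrs)
            if i in nn_2_in_1[j]]


def mnn_correct(batch1, batch2, k=1):
    """Integrate batch 2 into the reference batch 1.

    Simplifying assumption of this sketch: one averaged correction
    vector is subtracted from every cell of batch 2.
    """
    pairs = mnn_pairs(batch1, batch2, k)
    if not pairs:
        raise ValueError("no mutual nearest neighbours found")
    dim = len(batch1[0])
    # Correction vector of each pair: batch-2 cell minus its batch-1 partner.
    vectors = [[batch2[j][d] - batch1[i][d] for d in range(dim)] for i, j in pairs]
    mean_vec = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    corrected = [[cell[d] - mean_vec[d] for d in range(dim)] for cell in batch2]
    return corrected, pairs


# Toy data: batch 2 holds two of the three batch-1 cells, shifted along the second axis.
reference = [[0, 0], [1, 0], [10, 10]]
shifted = [[0, 5], [1, 5]]
corrected, pairs = mnn_correct(reference, shifted, k=1)
```

In this toy input the third reference cell has no counterpart in the second batch, so it forms no mutual pair and does not influence the correction. This is the property the caption relies on: only a shared subset of the population is needed.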

Figure 2

_t_-SNE plots of simulated scRNA-seq data containing two batches of different cell types (with each batch containing n=1000 cells), (a) before and after correction with (b) our MNN method, (c) limma or (d) ComBat. In this simulation, each batch (closed circle or open triangle) contained different numbers of cells in each of three cell types (specified by colour).

Figure 3

_t_-SNE plots of scRNA-seq count data for cells from the haematopoietic lineage, prepared in two batches using different technologies (SMART-seq2 with n=1920 cells, closed circle; MARS-seq with n=2729 cells, open circle). Plots were generated (a) before and after batch correction using (b) our MNN method, (c) limma or (d) ComBat. Cells are coloured according to their annotated cell type. (e) The expected hierarchy of haematopoietic cell types. PCA plots of scRNA-seq count data for the cell types common to the two batches of the haematopoietic lineage (SMART-seq2 with n=791 cells; MARS-seq with n=2729 cells), generated (f) before and after batch correction using (g) our MNN method, (h) limma or (i) ComBat.

Figure 4

Application of MNN batch correction to pancreas cells using four data sets (GSE81076 with n=1007, GSE86473 with n=2331, GSE85241 with n=1595 and E-MTAB-5061 with n=2163 cells) measured on two different platforms, CEL-seq(2) and SMART-seq2. _t_-SNE plots for (a) uncorrected (raw) data and (b) data corrected with our MNN method. The different batches are represented by four colours in the top panels of (a) and (b), whilst the different cell types are denoted in the bottom panels by distinct colours. (c) Combining data sets by using MNN correction increases the power to detect differentially expressed genes. Volcano plots of differential expression testing in a single data set (GSE81076; _δ_-cells=54, _γ_-cells=19, left panel) and using the new cell type labels after MNN correction (Combined; _δ_-cells=428, _γ_-cells=425, right panel). The y-axis represents the -log10 Benjamini-Hochberg adjusted p-value (values above 100 are censored at 100 for comparable scales), and the x-axis is the log2 fold change of expression between the two cell types. Individual gene symbols are labelled where |log2 fold change| > 3. More genes are consistently differentially expressed at an FDR of 5% in the combined data sets. (d) Venn diagrams representing the intersection of differentially expressed genes using the cell type labels after batch correction (blue circle) and using the original cell type labels from each individual study (orange circle). Numbers in each segment are the total number of DE genes between δ and γ islet cells in each batch. Each Venn diagram corresponds to a batch in which both cell types are present.
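The y-axis quantity in panel (c) can be reproduced with the standard Benjamini-Hochberg step-up adjustment followed by a capped -log10 transform. The sketch below is a generic illustration in plain Python, not the authors' analysis code; the function names `benjamini_hochberg` and `neg_log10_capped` and the example p-values are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def neg_log10_capped(p, cap=100.0):
    """-log10(p), censored at `cap` so that plots share a comparable scale."""
    if p <= 0.0:
        return cap
    return min(-math.log10(p), cap)


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
y_values = [neg_log10_capped(p) for p in adjusted]
significant = [p <= 0.05 for p in adjusted]  # FDR threshold of 5%
```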

Figure 5

MNN batch correction scales to tens of thousands of cells. _t_-SNE plots of scRNA-seq data of human peripheral blood mononuclear cells and T cells (n=73039 cells), prior to batch correction (a, c) and following MNN correction (b, d). Individual points are coloured by their original cell type labels (c, d) and by the study batch of origin (a, b). (e) CPU time increases linearly with the number of input cells to MNN correction. Points represent the number of sub-sampled cells; the red dashed line represents the linear fit between CPU time (minutes) and number of cells.
