Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors (original) (raw)

Accession codes

Primary accessions

ArrayExpress

Gene Expression Omnibus

Referenced accessions

ArrayExpress

References

  1. Jaitin, D.A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
    Article CAS PubMed PubMed Central Google Scholar
  2. Klein, A.M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
    Article CAS PubMed PubMed Central Google Scholar
  3. Macosko, E.Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
    Article CAS PubMed PubMed Central Google Scholar
  4. Gierahn, T.M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).
    Article CAS PubMed PubMed Central Google Scholar
  5. Hicks, S.C., Townes, F.W., Teng, M. & Irizarry, R.A. Missing data and technical variability in single-cell RNA-sequencing experiments. Preprint at https://www.biorxiv.org/content/early/2017/05/08/025528/ (2017).
  6. Tung, P.Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
    Article CAS PubMed PubMed Central Google Scholar
  7. Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
    Article PubMed PubMed Central Google Scholar
  8. Johnson, W.E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
    Article PubMed Google Scholar
  9. Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
    Article CAS PubMed PubMed Central Google Scholar
  10. Leek, J.T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
    Article PubMed Central Google Scholar
  11. Spitzer, M.H. et al. An interactive reference framework for modeling a dynamic immune system. Science 349, 1259425 (2015).
    Article PubMed PubMed Central Google Scholar
  12. Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
    Article CAS PubMed PubMed Central Google Scholar
  13. Scialdone, A. et al. Resolving early mesoderm diversification through single-cell expression profiling. Nature 535, 289–293 (2016).
    Article CAS PubMed PubMed Central Google Scholar
  14. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
    Article CAS PubMed PubMed Central Google Scholar
  15. Bendall, S.C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
    Article CAS PubMed PubMed Central Google Scholar
  16. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
    Google Scholar
  17. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
    Article CAS PubMed Google Scholar
  18. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
    Article CAS PubMed Google Scholar
  19. Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2016).
    Article CAS PubMed Google Scholar
  20. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
    Article PubMed PubMed Central Google Scholar
  21. Muraro, M.J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).
    Article CAS PubMed PubMed Central Google Scholar
  22. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
    Article CAS PubMed PubMed Central Google Scholar
  23. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
    Article CAS PubMed PubMed Central Google Scholar
  24. Zheng, G.X.Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
    Article CAS PubMed PubMed Central Google Scholar
  25. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
    Article CAS PubMed Google Scholar
  26. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
    Article CAS PubMed Google Scholar
  27. Liao, Y., Smyth, G.K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
    Article CAS PubMed Google Scholar
  28. Lun, A.T., Bach, K. & Marioni, J.C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
    Article PubMed Google Scholar
  29. Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).
    Article CAS PubMed PubMed Central Google Scholar
  30. Pons, P. & Latapy, M. Computing communities in large networks using random walks. ISCIS 3733, 284–293 (2005).
    Google Scholar
  31. Buttner, M., Miao, Z., Wolf, A., Teichmann, S.A. & Theis, F.J. Assessment of batch-correction methods for scRNA-seq data with a new test metric. Preprint at https://www.biorxiv.org/content/early/2017/10/09/200345/ (2017).
  32. Brandani, G.B. et al. Quantifying disorder through conditional entropy: an application to fluid mixing. PloS One 6, e65617 (2013).
    Article Google Scholar

Download references

Acknowledgements

We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).

Author information

Authors and Affiliations

  1. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
    Laleh Haghverdi & John C Marioni
  2. Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
    Laleh Haghverdi
  3. Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
    Aaron T L Lun & John C Marioni
  4. Wellcome Trust Sanger Institute, Cambridge, UK
    Michael D Morgan & John C Marioni

Authors

  1. Laleh Haghverdi
    You can also search for this author inPubMed Google Scholar
  2. Aaron T L Lun
    You can also search for this author inPubMed Google Scholar
  3. Michael D Morgan
    You can also search for this author inPubMed Google Scholar
  4. John C Marioni
    You can also search for this author inPubMed Google Scholar

Contributions

L.H. developed the method and the computational tools, performed the analysis and wrote the paper. A.T.L.L. developed the method and the computational tools and wrote the paper. M.D.M. developed the method, performed the analysis and wrote the paper. J.C.M. developed the method, wrote the paper and supervised the study.

Corresponding author

Correspondence toJohn C Marioni.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 MNN corrects nonconstant batch effects.

By using locally linear corrections, MNN can handle non-constant batch effects, here simulated as a small angle rotation of data on two-dimensional x-y coordinates. Each shown batch contains 400 cells (points). The reference batch is shown in red and the second (rotated) batch is shown in black for (a) raw (uncorrected) data and (b) data after MNN correction.

Supplementary Figure 2 Simulation of batch effect in two batches with identical cell-type composition.

_t_-SNE plots of (a) the raw (uncorrected) simulated data, and the simulation data corrected by (b) our MNN approach, (c) limma and (d) ComBat. The filled circles and open triangles represent cells from the first and second batch respectively. The three different cell types are shown by different colours. While there is a split between cells of the same cell type in the uncorrected data, all batch correction methods remove the batch effect successfully for this simple example and yield clusters consistent with the original simulated cell types. The data were simulated to have identical cell type compositions (0.2/0.3/0.5) in both batches, with each batch containing 1000 cells.

Supplementary Figure 3 Analysis of the hematopoietic data by using all 3,904 highly variable genes.

_t_-SNE plots for (a) uncorrected data and data after correction by (b) our MNN approach, (c) limma and (d) ComBat. Cells are coloured according to their batch labels. (e) Histogram of the angle (f) between the first two SVD components of the reference data (SMART-seq2) and the correction vectors of the MARS-seq data calculated by MNN. Diffusion maps of the haematopoietic data after MNN correction, shown on the (f) first two diffusion components, (g) first and the third diffusion components, and (h) second and the third diffusion components.

Supplementary Figure 4 Analysis of the hematopoietic data by using 1,500 genes randomly subsampled from the highly variable gene set.

_t_-SNE plots for (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell types. The same t -SNE plots are coloured according to batch for (e) uncorrected and batch-corrected data from (f) MNN, (g) limma and (h) ComBat. PCA plots for shared cell types (the SMART-seq2 batch with n=791 cells and the MARS-seq batch with n=2729 cells) between the two batches for (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat.

Supplementary Figure 5 Analysis of the pancreas data by using all 2,507 highly variable genes.

_t_-SNE plots of (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell type labels. _t_-SNE plots were also generated for (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots were also generated for (i) uncorrected and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histograms of the angle between the batch effect vectors and the first two SVDs for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 6 Analysis of pancreas data on 1,500 genes randomly subsampled from the highly variable gene set.

_t_-SNE plots of (a) uncorrected data and data corrected by (b) our MNN approach, (c) limma and (d) ComBat, coloured according to cell type labels. _t_-SNE plots of (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots of (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histogram of the angle between the batch effect vectors and the first two SVD components for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 7 Locally varying batch correction versus global (i.e., constant vector) batch correction.

_t_-SNE plots of pancreas data (GSE81076 with n=1007, GSE86473 with n= 2331, GSE85241 with n=1595 and E-MTAB-5061 with n=2163 cells) after batch correction with (a,c) MNN allowing for local batch vectors (default) or (b,d) MNN with a single global batch vector for all cells, coloured according to cell type labels (a,b) or batch (c,d). PCA plots of pancreas data after batch correction with (e) MNN allowing for local batch vectors (default) or (f) MNN with a single global batch vector for all cells, coloured according to batch. (g) Silhouette coefficients for clustering according to cell types after correction with two alternative settings of MNN, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. The difference between the Silhouette coefficients is not significant (two-sided Welch's test p-value=0.97). (h) Entropy of batch mixing on the first two PCs for batch-corrected data with the two alternative settings of MNN, with n=100 (i.e. number of bootstraps) observations for each boxplot Allowing for local batch vectors has significantly (two-sided Welch's test p-value = 0.00001) larger entropy compared to the use of a global batch vector. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary information

Rights and permissions

About this article

Cite this article

Haghverdi, L., Lun, A., Morgan, M. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.Nat Biotechnol 36, 421–427 (2018). https://doi.org/10.1038/nbt.4091

Download citation