Removal of batch effects using distribution-matching residual networks

Uri Shaham et al. Bioinformatics. 2017.

Abstract

Motivation: Sources of variability in experimentally derived data include measurement error in addition to the physical phenomena of interest. This measurement error is a combination of systematic components, originating from the measuring instrument, and random measurement errors. Several novel biological technologies, such as mass cytometry and single-cell RNA-seq (scRNA-seq), are plagued with systematic errors that may severely affect statistical analysis if the data are not properly calibrated.

Results: We propose a novel deep learning approach for removing systematic batch effects. Our method is based on a residual neural network, trained to minimize the Maximum Mean Discrepancy between the multivariate distributions of two replicates, measured in different batches. We apply our method to mass cytometry and scRNA-seq datasets, and demonstrate that it effectively attenuates batch effects.
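The training objective described above can be illustrated with a minimal sketch of a squared Maximum Mean Discrepancy estimate between two samples. The Gaussian kernel, the bandwidth `sigma`, the synthetic two-marker data and the helper `residual_shift` are all assumptions made for illustration; in particular, `residual_shift` is only a fixed mean shift standing in for a trained residual network, and none of this is taken from the authors' code.

```python
import math
import random


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs, y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


def residual_shift(x, delta):
    """Toy residual map x -> x + delta (hypothetical stand-in for a ResNet)."""
    return [xi + di for xi, di in zip(x, delta)]


def column_means(rows):
    """Per-coordinate mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Synthetic "batches": the source sample is a shifted copy of the target law.
rng = random.Random(0)
target = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
source = [[rng.gauss(1.5, 1.0), rng.gauss(-1.0, 1.0)] for _ in range(200)]

before = mmd_squared(source, target)

# Assumed correction: move the source mean onto the target mean.
delta = [t - s for s, t in zip(column_means(source), column_means(target))]
calibrated = [residual_shift(x, delta) for x in source]

after = mmd_squared(calibrated, target)
print(f"MMD^2 before: {before:.4f}, after: {after:.4f}")
```

In this toy setting the estimate drops once the source sample is shifted onto the target. A trained network would instead learn the correction by minimizing this kind of quantity over its weights.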

Availability and implementation: Our code and data are publicly available at https://github.com/ushaham/BatchEffectRemoval.git.

Contact: yuval.kluger@yale.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1

Calibration of CyTOF data. Projection of the source (red) and target (blue) samples on the first two principal components of the target data. Left: before calibration. Right: after calibration

Fig. 2

A typical ResNet block

Fig. 3

Quality of calibration in terms of the marginal distribution of each marker. Empirical cumulative distribution functions of the first three markers in the CyTOF calibration experiment. In each plot the full, dashed and dotted curves correspond to the target, source and calibrated source samples, respectively. For each marker the full and dotted curves are substantially closer than the full and dashed curves

Fig. 4

Calibration of CyTOF data: CD8+ T cells of the source (red) and target (blue) samples in the (CD28, GzB) plane. Left: before calibration. Center: calibration using MLP. Right: calibration using ResNet

Fig. 5

Histograms of the 25 _P_-values of Kolmogorov-Smirnov tests, comparing the distributions of the calibrated data with the target distribution of each of the 25 markers

Fig. 6

Calibration of scRNA-seq. Top: _t_-SNE plots before (left) and after (right) calibration using MMD-ResNet. Bottom: Calibration of cells with high expression of Prkca. _t_-SNE plots before calibration (left), after calibration using ComBat (middle) and after calibration using MMD-ResNet (right)

Cited by

Single Cell RNA Sequencing in Autoimmune Inflammatory Rheumatic Diseases: Current Applications, Challenges and a Step Toward Precision Medicine.
Kuret T, Sodin-Šemrl S, Leskošek B, Ferk P. Front Med (Lausanne). 2022 Jan 18;8:822804. doi: 10.3389/fmed.2021.822804. eCollection 2021. PMID: 35118101. Free PMC article. Review.
In-silico generation of high-dimensional immune response data in patients using a deep neural network.
Fallahzadeh R, Bidoki NH, Stelzer IA, Becker M, Marić I, Chang AL, Culos A, Phongpreecha T, Xenochristou M, De Francesco D, Espinosa C, Berson E, Verdonk F, Angst MS, Gaudilliere B, Aghaeepour N. Cytometry A. 2023 May;103(5):392-404. doi: 10.1002/cyto.a.24709. Epub 2022 Dec 27. PMID: 36507780. Free PMC article.
deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors.
Zou B, Zhang T, Zhou R, Jiang X, Yang H, Jin X, Bai Y. Front Genet. 2021 Aug 10;12:708981. doi: 10.3389/fgene.2021.708981. eCollection 2021. PMID: 34447413. Free PMC article.
A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization.
Monnier L, Cournède PH. PLoS Comput Biol. 2024 Feb 22;20(2):e1011880. doi: 10.1371/journal.pcbi.1011880. eCollection 2024 Feb. PMID: 38386700. Free PMC article.
CytofIn enables integrated analysis of public mass cytometry datasets using generalized anchors.
Lo YC, Keyes TJ, Jager A, Sarno J, Domizi P, Majeti R, Sakamoto KM, Lacayo N, Mullighan CG, Waters J, Sahaf B, Bendall SC, Davis KL. Nat Commun. 2022 Feb 17;13(1):934. doi: 10.1038/s41467-022-28484-5. PMID: 35177627. Free PMC article.
