Earth Mover's Distance (EMD): A True Metric for Comparing Biomarker Expression Levels in Cell Populations
Comparative Study
. 2016 Mar 23;11(3):e0151859.
doi: 10.1371/journal.pone.0151859. eCollection 2016.
Noah Zimmerman 2, Stephen Meehan 2, Connor Meehan 3, Jeffrey Waters 1, Eliver E B Ghosn 1, Alexander Filatenkov 4, Gleb A Kolyagin 5, Yael Gernez 1 6, Shanel Tsuda 1, Wayne Moore 1, Richard B Moss 7, Leonore A Herzenberg 1, Guenther Walther 2
- PMID: 27008164
- PMCID: PMC4805242
Darya Y Orlova et al. PLoS One. 2016.
Abstract
Changes in the frequencies of cell subsets that (co)express characteristic biomarkers, or levels of the biomarkers on the subsets, are widely used as indices of drug response, disease prognosis, stem cell reconstitution, etc. However, although the currently available computational "gating" tools accurately reveal subset frequencies and marker expression levels, they fail to enable statistically reliable judgements as to whether these frequencies and expression levels differ significantly between/among subject groups. Here we introduce a flow cytometry data analysis pipeline that includes the Earth Mover's Distance (EMD) metric as a solution to this problem. EMD is well known as an informative quantitative measure of differences between distributions. We present three exemplary studies showing that EMD 1) reveals clinically relevant shifts in two markers on blood basophils responding to an offending allergen; 2) shows that ablative tumor radiation induces significant changes in the murine colon cancer tumor microenvironment; and 3) ranks immunological differences in mouse peritoneal cavity cells harvested from three genetically distinct mouse strains.
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Figures
Fig 1. EMD score increases linearly with the growing separation between two populations.
Panel (a) of Fig 1 shows two normal distributions: a large population (black) and a smaller population (green). The green population starts with its mean at the same position as the black population, and its mean increases along the x axis in fixed increments (2 standard deviations) in each of the successive panels. At each step, we calculate the probability binning (PB) statistic (T (χ)) [7], which is based on p-values, and the Earth Mover’s Distance (EMD, described in detail in the Results section) between the “unstimulated” first panel in (a) and the joint distribution of the main (black) population with the stimulated population (green). As the green population moves further from the black population, both the PB and the EMD increase monotonically. However, when the green population gets past 2 standard deviations from the black population, the PB plateaus because the two distributions have reached “maximum” separation based on the PB statistic. No additional movement of the green population will provide further evidence about the hypothesis that these two populations are the same, although a larger separation clearly carries biologically relevant information. In contrast, EMD continues to increase linearly with the growing separation of the green population. This example illustrates the two shortcomings of using p-values to quantitate change: even a small change that may not be meaningful can be highly significant and thus produce a large value of the PB statistic, and larger changes may not increase this statistic further, although from a biological point of view it would be desirable for them to do so. While the data for this figure were generated synthetically, one can imagine an experiment in which increasing amounts of a drug are applied, causing a subset of cells to increase expression of a marker based on the amount of the drug.
In order to correlate the amount of drug with the level of expression in a reliable fashion, one needs a true distance metric to measure the magnitude of the change in distributions. This figure appeared originally in the PhD thesis written by one of the authors (Noah Zimmerman), which was accepted in 2011 by Stanford University and is available online [14]. Reprinted from [14] under a CC BY license, with permission from Noah Zimmerman, original copyright 2011.
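The linear growth described in this caption can be reproduced with a minimal sketch, under stated assumptions: the data are one-dimensional and synthetic, both samples have the same number of events, and the 9000/1000 split between the large and small populations is chosen here for illustration (the caption does not give the sizes). The helper `emd_1d` is a hypothetical name; it uses the standard result that, for equal-sized one-dimensional samples, EMD is the mean absolute difference between sorted values. It is not the authors' implementation.

```python
import random


def emd_1d(xs, ys):
    """EMD between two equal-sized 1-D samples in which every event has equal weight."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal transport plan pairs the sorted values in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)
main = [rng.gauss(0.0, 1.0) for _ in range(9000)]    # large "black" population
subset = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # smaller "green" population

unstimulated = main + subset  # first panel: both populations share the same mean

scores = []
for step in range(6):
    shift = 2.0 * step  # the green mean moves 2 standard deviations per panel
    stimulated = main + [x + shift for x in subset]
    scores.append(emd_1d(unstimulated, stimulated))

for step, score in enumerate(scores):
    print(f"shift = {2 * step} SD, EMD = {score:.3f}")
```

Because only the shifted 10% of events have to be moved, each score in this sketch equals 0.1 times the shift, so it keeps growing in proportion to the separation instead of levelling off.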
Fig 2. Data analysis workflow for application of EMD to flow cytometry data.
Flow data were collected with H-D flow instruments available in the Stanford Shared FACS Facility and preprocessed with AutoGate software (freely available). The third step (classification analysis) was applied only to the first two studies described in the Results section.
Fig 3. Analysis of basophil activation status.
(a) To identify basophils, we used the following gating sequence (shown in the figure by the red arrows): FSC-A/SSC-A (total white blood cells)→ FSC-A/FSC-H (singlets)→ CD41a/live/dead (CD41a⁻ live)→ Dump [CD3, CD66b, HLA-DR]/CD123 (Dump⁻, CD123⁺⁺)→ use EMD with CD203c/CD63 [19] to determine basophil activation status. (b) An example of the basophil response to stimulation with the A. fumigatus allergen. Here, MFI represents median fluorescence intensity.
Fig 4. EMD scores based on expression of two independent flow cytometry markers more accurately distinguish allergic (CF-ABPA) from non-allergic (CF) patients.
(a) This panel compares EMD scores for the combined expression of CD203c and CD63 with “classical” median fluorescence intensity (MFI) values computed separately for the expression of each marker and with MFI values computed for the combined expression of CD203c and CD63. (b) Performance comparison between EMD and two other representative “metrics”: one based on a test statistic, Chi-Square (ChS) [2], and the other a distance measure, Mahalanobis Distance (MD). All six measures were calculated relative to each sample’s unstimulated control. Data are shown for 20/45 CF patients (10 with CF-ABPA and 10 with only CF) drawn from a previously published CF-ABPA study [19, 23] and selected as described in Materials and Methods. For each CF (n = 10) and CF-ABPA (n = 10) patient sample, we calculated the EMD on the CD63 and CD203c channels between the unstimulated controls and samples stimulated with the A. fumigatus allergen/extract. Using the SVM method, we then defined thresholds (red dashed lines at 123.67 for MFI CD203c, 112.75 for MFI CD63, 0.92 for MD CD203c/CD63, 426.66 for CD203c/CD63 ChS and 0.03 for EMD CD203c/CD63) to distinguish allergic/positive responses from non-allergic/negative responses. Specifically, we tried all possible combinations of 5 CF and 5 CF-ABPA patients and used the scores in each case as a training set to find an SVM threshold that divides the dataset into two categories (CF and CF-ABPA) with the lowest possible misclassification rate.
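The split-and-threshold procedure in this caption can be sketched roughly as follows. This is not an SVM: as a simpler stand-in, it picks the midpoint cut with the fewest training errors on a single score. The scores are invented for illustration and are not data from the study, the groups have 6 samples each (3 used for training) rather than 10 and 5 to keep the run short, and evaluating each threshold on the held-out samples is an assumption of this sketch. The names `best_threshold` and `held_out_errors` are hypothetical.

```python
from itertools import combinations

# Hypothetical scores, invented for illustration only (not data from the study).
negative = [0.010, 0.012, 0.015, 0.020, 0.022, 0.028]  # stand-in for non-allergic samples
positive = [0.025, 0.040, 0.050, 0.060, 0.080, 0.100]  # stand-in for allergic samples


def best_threshold(neg, pos):
    """Return (threshold, errors) for the midpoint cut with the fewest training errors."""
    values = sorted(neg + pos)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        # A negative at or above the cut, or a positive below it, is misclassified.
        return sum(x >= t for x in neg) + sum(x < t for x in pos)

    t = min(candidates, key=errors)
    return t, errors(t)


def held_out_errors(neg, pos, k):
    """Train on every choice of k samples per group and count errors on the remainder."""
    results = []
    for neg_train in combinations(neg, k):
        for pos_train in combinations(pos, k):
            t, _ = best_threshold(list(neg_train), list(pos_train))
            neg_test = [x for x in neg if x not in neg_train]
            pos_test = [x for x in pos if x not in pos_train]
            wrong = sum(x >= t for x in neg_test) + sum(x < t for x in pos_test)
            results.append((t, wrong))
    return results


results = held_out_errors(negative, positive, k=3)
mean_errors = sum(w for _, w in results) / len(results)
print(f"{len(results)} splits, mean held-out errors = {mean_errors:.2f}")
```

A real analysis would replace `best_threshold` with a fitted SVM and use the study's own scores; the loop over training combinations is the part this sketch is meant to illustrate.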
Fig 5. Combined EMD score for three pairs of biomarkers distinguishes mice that received tumor radiotherapy from untreated tumor-bearing mice.
Tumor infiltrating cells from tumor-bearing mice that received tumor radiotherapy (n = 4, red dots) or were untreated (n = 3, blue dots) were gated according to three different strategies: CD4/CD8 expression on live lymphocytes (dead⁻/SSC-A; FSC-A/SSC-A); CD25/CD4 expression on B220⁻/CD4hi live lymphocytes; and CD25/CD8 expression on live lymphocytes. We then calculated three EMD scores for the following combinations of biomarkers and cell populations: CD8/CD4 expression on lymphocytes, CD25/CD4 expression on B220⁻/CD4hi lymphocytes, and CD25/CD8 expression on lymphocytes. The EMD scores were calculated relative to the control sample (a mouse that did not receive tumor radiotherapy). The threshold (plane) was defined using the SVM method.
Fig 6. EMD scores detect relatively small differences between wild-type strains and detect the much larger differences between the wild-type mice and knockout mice.
(a) We used the following gating strategy (according to [26]): dead⁻/FSC-A→FSC-H/FSC-A. Then we performed the EMD comparison for the expression level of CD19/CD5 biomarkers on peritoneal cells among three mouse strains. EMD values represent the EMD scores between the first BALB/c sample (reference) and the other 3 subsequent samples, including a replicate for BALB/c cells. (b) Mouse spleen cells from BALB/c and C57BL/6 mice were stained and analyzed using 13-parameter high-dimensional FACS. We then used the following gating strategy: FSC-H/FSC-A (singlets)→dead⁻/SSC-A (live). Each plot displays the expression level of CD8 and CD25 for spleen cells from the corresponding mouse strain. In this figure, a replicate is the same sample run several times. EMD scores for inter-sample variability are significantly lower than EMD scores for inter-strain variability.
Fig 7. EMD clearly differentiates allergic (n = 13) and non-allergic (n = 9) groups of samples, while PB and MD are not able to differentiate them with sensitivity and specificity comparable to EMD.
This figure appeared originally in the PhD thesis written by one of the authors (Noah Zimmerman), which was accepted in 2011 by Stanford University and is available online [14]. The data come from a peanut allergy study by Gernez et al. [31]. Fold change is calculated as the ratio of the EMD (PB or MD) between the unstimulated sample and the sample stimulated with the offending allergen, normalized by the EMD (PB or MD) between the unstimulated sample and the sample stimulated with a non-offending allergen. Reprinted from [14] under a CC BY license, with permission from Noah Zimmerman, original copyright 2011.
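The fold-change normalization in this caption can be written compactly. The symbols are shorthand introduced here, not notation from the paper: U is the unstimulated sample, S_off the sample stimulated with the offending allergen, S_non the sample stimulated with a non-offending allergen, and D any one of the three measures.

```latex
\[
\mathrm{FC}_D \;=\; \frac{D\left(U,\, S_{\mathrm{off}}\right)}{D\left(U,\, S_{\mathrm{non}}\right)},
\qquad D \in \{\mathrm{EMD},\ \mathrm{PB},\ \mathrm{MD}\}
\]
```

Under this reading, a value near 1 means the offending allergen moved the distribution about as much as the non-offending one.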
References
- Drouet M and Lees O. Clinical applications of flow cytometry in hematology and immunology. Biol Cell. 1993; 78: 73–78.
- Sheskin D. Handbook of parametric and nonparametric statistical procedures. 2nd ed. Chapman & Hall/CRC, Boca Raton; 2000.
- Roederer M and Hardy RR. Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry. 2001; 45: 56–64.