Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets - PubMed (original) (raw)

Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets

Ricard Argelaguet et al. Mol Syst Biol. 2018.

Abstract

Multi-omics studies promise the improved characterization of biological processes across molecular layers. However, methods for the unsupervised integration of the resulting heterogeneous data sets are lacking. We present Multi-Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in multi-omics data sets. MOFA infers a set of (hidden) factors that capture biological and technical sources of variability. It disentangles axes of heterogeneity that are shared across multiple modalities and those specific to individual data modalities. The learnt factors enable a variety of downstream analyses, including identification of sample subgroups, data imputation and the detection of outlier samples. We applied MOFA to a cohort of 200 patient samples of chronic lymphocytic leukaemia, profiled for somatic mutations, RNA expression, DNA methylation and ex vivo drug responses. MOFA identified major dimensions of disease heterogeneity, including immunoglobulin heavy-chain variable region status, trisomy of chromosome 12 and previously underappreciated drivers, such as response to oxidative stress. In a second application, we used MOFA to analyse single-cell multi-omics data, identifying coordinated transcriptional and epigenetic changes along cell differentiation.

Keywords: data integration; dimensionality reduction; multi‐omics; personalized medicine; single‐cell omics.

© 2018 The Authors. Published under the terms of the CC BY 4.0 license.

PubMed Disclaimer

Figures

Figure 1

Figure 1. Multi‐Omics Factor Analysis: model overview and downstream analyses

  1. Model overview: MOFA takes M data matrices as input (Y 1,…, Y M), one or more from each data modality, with co‐occurrent samples but features that are not necessarily related and that can differ in numbers. MOFA decomposes these matrices into a matrix of factors (Z) for each sample and M weight matrices, one for each data modality (W 1,.., W M). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol in the data matrices denotes missing values.
  2. The fitted MOFA model can be queried for different downstream analyses, including (i) variance decomposition, assessing the proportion of variance explained by each factor in each data modality, (ii) semi‐automated factor annotation based on the inspection of loadings and gene set enrichment analysis, (iii) visualization of the samples in the factor space and (iv) imputation of missing values, including missing assays.

Figure EV1

Figure EV1. Scalability of MOFA, GFA and iCluster

Time required for model training for GFA (red), MOFA (blue) and iCluster (green) as a function of number of factors K, number of features D, number of samples N and number of views M. Baseline parameters were M = 3, K = 10, D = 1,000 and N = 100 and 5% missing values. Shown are average time across 10 trials, and error bars denote standard deviation. iCluster is only shown for the lowest M as all other settings require on average more than 200 min for training.

Figure 2

Figure 2. Application of MOFA to a study of chronic lymphocytic leukaemia

  1. A
    Study overview and data types. Data modalities are shown in different rows (D = number of features) and samples (N) in columns, with missing samples shown using grey bars.
  2. B, C
    (B) Proportion of total variance explained (R 2) by individual factors for each assay and (C) cumulative proportion of total variance explained.
  3. D
    Absolute loadings of the top features of Factors 1 and 2 in the Mutations data.
  4. E
    Visualization of samples using Factors 1 and 2. The colours denote the IGHV status of the tumours; symbol shape and colour tone indicate chromosome 12 trisomy status.
  5. F
    Number of enriched Reactome gene sets per factor based on the gene expression data (FDR < 1%). The colours denote categories of related pathways defined as in Appendix Table S2.

Figure 3

Figure 3. Characterization of the inferred factor associated with the differentiation state of the cell of origin

  1. Beeswarm plot with Factor 1 values for each sample with colours corresponding to three groups found by 3‐means clustering with low factor values (LZ), intermediate factor values (IZ) and high factor values (HZ).
  2. Absolute loadings for the genes with the largest absolute weights in the mRNA data. Plus or minus symbols on the right indicate the sign of the loading. Genes highlighted in orange were previously described as prognostic markers in CLL and associated with IGHV status (Vasconcelos et al, 2005; Maloum et al, 2009; Trojani et al, 2012; Morabito et al, 2015; Plesingerova et al, 2017).
  3. Heatmap of gene expression values for genes with the largest weights as in (B).
  4. Absolute loadings of the drugs with the largest weights, annotated by target category.
  5. Drug response curves for two of the drugs with top weights, stratified by the clusters as in (A).

Figure EV2

Figure EV2. Characterization of Factor 5 (oxidative stress response factor) in the CLL data

  1. Beeswarm plot of Factor 5. Colours denote the expression of TNF, an inflammatory stress marker.
  2. Gene set enrichment analysis for the top Reactome pathways in the mRNA data (_t_‐test, Materials and Methods).
  3. Heatmap of gene expression values for the six genes with largest loading. Samples are ordered by their factor values.
  4. Scaled loadings for the top drugs with the largest loading, annotated by target category.
  5. Heatmap of drug response values for the top three drugs with largest loading.

Figure EV3

Figure EV3. Prediction of IGHV status based on Factor 1 in the CLL data and validation on outlier cases on independent assays

  1. Beeswarm plot of Factor 1 with colours denoting agreement between predicted and clinical labels as in (B).
  2. Pie chart showing total numbers for agreement of imputed labels with clinical label.
  3. Sample‐to‐sample correlation matrix based on drug response data.
  4. Sample‐to‐sample correlation matrix based on methylation data.
  5. Drug response to ONO‐4509 (not included in the training data): Boxplots for the viability values in response to ONO‐4509. The three outlier samples are shown in the middle; on the left and right, the viabilities of the other M‐CLL and U‐CLL samples are shown, respectively. The panels show different drug concentrations tested. Boxes represent the first and third quartiles of the values for M‐CLL and U‐CLL samples, for individual patients the single value.
  6. Whole exome sequencing data on IGHV genes (not included in the training data): the number of mutations found on IGHV genes using whole exome sequencing is shown on the _y_‐axis, separately for U‐CLL and M‐CLL samples. The three outlier samples are labelled.

Figure EV4

Figure EV4. Imputation of missing values in the drug response assay of the CLL data

  1. A, B
    Considered were MOFA, SoftImpute, imputation by feature‐wise mean (Mean) and k‐nearest neighbour (kNN). Shown are averages of the mean squared error (MSE) across 15 imputation experiments for increasing fractions of missing data, considering (A) values missing at random and (B) entire assay missing for samples at random. Error bars denote plus or minus two standard error.

Figure 4

Figure 4. Relationship between clinical data and latent factors

  1. Association of MOFA factors to time to next treatment using a univariate Cox regression with N = 174 samples (96 of which are uncensored cases) and _P_‐values based on the Wald statistic. Error bars denote 95% confidence intervals. Numbers on the right denote _P_‐values for each predictor.
  2. Kaplan–Meier plots measuring time to next treatment for the individual MOFA factors. The cut‐points on each factor were chosen using maximally selected rank statistics (Hothorn & Lausen, 2003), and _P_‐values were calculated using a log‐rank test on the resulting groups.
  3. Prediction accuracy of time to treatment for N = 174 patients using multivariate Cox regression trained using the 10 factors derived using MOFA, as well using the first 10 components obtained from PCA applied to the corresponding single data modalities and the full data set (assessed on hold‐out data). Shown are average values of Harrell's C‐index from fivefold cross‐validation. Error bars denote standard error of the mean.

Figure 5

Figure 5. Application of MOFA to a single‐cell multi‐omics study

  1. A
    Study overview and data types. Data modalities are shown in different rows (D = number of features) and samples (N) in columns, with missing samples shown using grey bars.
  2. B, C
    (B) Fraction of the variance explained (R 2) by individual factors for each data modality and (C) cumulative proportion of variance explained.
  3. D
    Absolute loadings of Factor 1 (bottom) and Factor 2 (top) in the mRNA data. Labelled genes in Factor 1 are known markers of pluripotency (Mohammed et al, 2017) and genes labelled in Factor 2 are known differentiation markers (Fuchs, 1988).
  4. E
    Scatterplot of Factors 1 and 2. Colours denote culture conditions. The grey arrow illustrates the differentiation trajectory from naive pluripotent cells via primed pluripotent cells to differentiated cells.

Figure EV5

Figure EV5. Transcriptomic and epigenetic changes associated with Factor 1 in the scMT data

  1. RNA expression changes for the top 20 genes with largest weight on Factor 1.
  2. DNA methylation rate changes for the top 20 CpG sites with largest weight. Shown is a non‐linear loess regression model fit per CpG site.
  3. RNA expression changes for the top 20 genes with largest weight on Factor 2.

Similar articles

Cited by

References

    1. Akavia UD, Litvin O, Kim J, Sanchez‐Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe'er D (2010) An integrated approach to uncover drivers of cancer. Cell 143: 1005–1017 - PMC - PubMed
    1. Åkerfelt M, Morimoto RI, Sistonen L (2010) Heat shock factors: integrators of cell stress, development and lifespan. Nat Rev Mol Cell Biol 11: 545 - PMC - PubMed
    1. Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX, Krueger F, Smallwood S, Ponting CP, Voet T, Kelsey G, Stegle O, Reik W (2016) Parallel single‐cell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods 13: 229–232 - PMC - PubMed
    1. Auclair G, Guibert S, Bender A, Weber M (2014) Ontogeny of CpG island methylation and specificity of DNMT3 methyltransferases during embryonic development in the mouse. Genome Biol 15: 545 - PMC - PubMed
    1. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57: 289–300

Publication types

MeSH terms

Substances

LinkOut - more resources