Multi-Omics Factor Analysis-a framework for unsupervised integration of multi-omics data sets
Ricard Argelaguet et al. Mol Syst Biol. 2018.
Abstract
Multi-omics studies promise the improved characterization of biological processes across molecular layers. However, methods for the unsupervised integration of the resulting heterogeneous data sets are lacking. We present Multi-Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in multi-omics data sets. MOFA infers a set of (hidden) factors that capture biological and technical sources of variability. It disentangles axes of heterogeneity that are shared across multiple modalities and those specific to individual data modalities. The learnt factors enable a variety of downstream analyses, including identification of sample subgroups, data imputation and the detection of outlier samples. We applied MOFA to a cohort of 200 patient samples of chronic lymphocytic leukaemia, profiled for somatic mutations, RNA expression, DNA methylation and ex vivo drug responses. MOFA identified major dimensions of disease heterogeneity, including immunoglobulin heavy-chain variable region status, trisomy of chromosome 12 and previously underappreciated drivers, such as response to oxidative stress. In a second application, we used MOFA to analyse single-cell multi-omics data, identifying coordinated transcriptional and epigenetic changes along cell differentiation.
Keywords: data integration; dimensionality reduction; multi‐omics; personalized medicine; single‐cell omics.
© 2018 The Authors. Published under the terms of the CC BY 4.0 license.
Figures
Figure 1. Multi‐Omics Factor Analysis: model overview and downstream analyses
- Model overview: MOFA takes M data matrices as input (Y^1, …, Y^M), one or more from each data modality, with co‐occurrent samples but features that are not necessarily related and that can differ in numbers. MOFA decomposes these matrices into a matrix of factors (Z) for each sample and M weight matrices, one for each data modality (W^1, …, W^M). White cells in the weight matrices correspond to zeros, i.e. inactive features, whereas the cross symbol in the data matrices denotes missing values.
- The fitted MOFA model can be queried for different downstream analyses, including (i) variance decomposition, assessing the proportion of variance explained by each factor in each data modality, (ii) semi‐automated factor annotation based on the inspection of loadings and gene set enrichment analysis, (iii) visualization of the samples in the factor space and (iv) imputation of missing values, including missing assays.
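The decomposition and the variance decomposition in (i) can be illustrated with a small sketch. It is not MOFA's inference procedure: the factors and weights are simply given here, the function names `reconstruct` and `r2` are hypothetical, the toy numbers are invented, and the R^2 definition (one minus the residual sum of squares over the total sum of squares on centred data) is an assumption made for illustration.

```python
# Toy sketch of Y^m ≈ Z (W^m)^T and a per-factor, per-view R^2.
# Assumptions: Z and W are given (nothing is inferred), features are centred,
# and None marks a missing value. Names and numbers are illustrative only.

def reconstruct(Z, W, factors=None):
    """Return the samples x features matrix built from the chosen factors."""
    ks = list(range(len(Z[0]))) if factors is None else list(factors)
    return [[sum(z[k] * w[k] for k in ks) for w in W] for z in Z]

def r2(Y, Y_hat):
    """1 - RSS/TSS computed over observed (non-None) entries only."""
    rss = 0.0
    tss = 0.0
    for y_row, p_row in zip(Y, Y_hat):
        for y, p in zip(y_row, p_row):
            if y is None:
                continue  # missing values are skipped
            rss += (y - p) ** 2
            tss += y ** 2
    return 1.0 - rss / tss

# N = 4 samples, K = 2 factors (columns of Z are centred and orthogonal).
Z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

# One weight matrix per view (features x factors); zero rows/columns are inactive.
W = {
    "view_a": [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # only factor 0 is active
    "view_b": [[1.0, 1.0], [0.0, 2.0]],              # both factors are active
}

# Noise-free toy data, with one entry of view_a marked as missing.
Y = {m: reconstruct(Z, W[m]) for m in W}
Y["view_a"][0][2] = None

# R^2 of each single factor in each view.
r2_table = {
    m: [r2(Y[m], reconstruct(Z, W[m], [k])) for k in range(2)]
    for m in W
}

for m, values in r2_table.items():
    print(m, [round(v, 3) for v in values])
```

In this toy setting factor 0 is specific to `view_a` in terms of what it explains there, while both factors contribute to `view_b`, which mirrors the distinction between shared and modality-specific axes of variation.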
Figure EV1. Scalability of MOFA, GFA and iCluster
Time required for model training for GFA (red), MOFA (blue) and iCluster (green) as a function of number of factors K, number of features D, number of samples N and number of views M. Baseline parameters were M = 3, K = 10, D = 1,000 and N = 100 and 5% missing values. Shown is the average time across 10 trials, and error bars denote standard deviation. iCluster is only shown for the lowest M as all other settings require on average more than 200 min for training.
Figure 2. Application of MOFA to a study of chronic lymphocytic leukaemia
- A
Study overview and data types. Data modalities are shown in different rows (D = number of features) and samples (N) in columns, with missing samples shown using grey bars.
- B, C
(B) Proportion of total variance explained (R^2) by individual factors for each assay and (C) cumulative proportion of total variance explained.
- D
Absolute loadings of the top features of Factors 1 and 2 in the Mutations data.
- E
Visualization of samples using Factors 1 and 2. The colours denote the IGHV status of the tumours; symbol shape and colour tone indicate chromosome 12 trisomy status.
- F
Number of enriched Reactome gene sets per factor based on the gene expression data (FDR < 1%). The colours denote categories of related pathways defined as in Appendix Table S2.
Figure 3. Characterization of the inferred factor associated with the differentiation state of the cell of origin
- Beeswarm plot with Factor 1 values for each sample with colours corresponding to three groups found by 3‐means clustering with low factor values (LZ), intermediate factor values (IZ) and high factor values (HZ).
- Absolute loadings for the genes with the largest absolute weights in the mRNA data. Plus or minus symbols on the right indicate the sign of the loading. Genes highlighted in orange were previously described as prognostic markers in CLL and associated with IGHV status (Vasconcelos et al, 2005; Maloum et al, 2009; Trojani et al, 2012; Morabito et al, 2015; Plesingerova et al, 2017).
- Heatmap of gene expression values for genes with the largest weights as in (B).
- Absolute loadings of the drugs with the largest weights, annotated by target category.
- Drug response curves for two of the drugs with top weights, stratified by the clusters as in (A).
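The grouping of samples into LZ, IZ and HZ by 3‐means clustering on a single factor can be sketched as below. This is a minimal one‐dimensional Lloyd's algorithm under stated assumptions: the factor values are invented, the function name `kmeans_1d` is hypothetical, and the quantile‐based initialisation is a choice made here for reproducibility, not something taken from the paper.

```python
# Minimal 1-D k-means (Lloyd's algorithm) on factor values.
# Assumption: toy factor values; initial centres are evenly spaced quantiles.

def kmeans_1d(values, k=3, n_iter=100):
    ordered = sorted(values)
    n = len(ordered)
    centres = [ordered[int((i + 0.5) * n / k)] for i in range(k)]

    def assign(current):
        # Index of the nearest centre for every value.
        return [min(range(k), key=lambda j: abs(v - current[j])) for v in values]

    for _ in range(n_iter):
        labels = assign(centres)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centre if a cluster happens to be empty.
            updated.append(sum(members) / len(members) if members else centres[j])
        if updated == centres:
            break
        centres = updated
    return assign(centres), centres

factor_values = [-2.1, -1.9, -2.0, 0.1, -0.1, 0.0, 1.9, 2.0, 2.2]
labels, centres = kmeans_1d(factor_values, k=3)

# Name the clusters by the rank of their centre: low, intermediate, high.
rank = {j: r for r, j in enumerate(sorted(range(3), key=lambda j: centres[j]))}
names = ["LZ", "IZ", "HZ"]
groups = [names[rank[lab]] for lab in labels]
print(groups)
```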
Figure EV2. Characterization of Factor 5 (oxidative stress response factor) in the CLL data
- Beeswarm plot of Factor 5. Colours denote the expression of TNF, an inflammatory stress marker.
- Gene set enrichment analysis for the top Reactome pathways in the mRNA data (_t_‐test, Materials and Methods).
- Heatmap of gene expression values for the six genes with largest loading. Samples are ordered by their factor values.
- Scaled loadings for the top drugs with the largest loading, annotated by target category.
- Heatmap of drug response values for the top three drugs with largest loading.
Figure EV3. Prediction of IGHV status based on Factor 1 in the CLL data and validation on outlier cases on independent assays
- Beeswarm plot of Factor 1 with colours denoting agreement between predicted and clinical labels as in (B).
- Pie chart showing total numbers for agreement of imputed labels with clinical label.
- Sample‐to‐sample correlation matrix based on drug response data.
- Sample‐to‐sample correlation matrix based on methylation data.
- Drug response to ONO‐4509 (not included in the training data): Boxplots for the viability values in response to ONO‐4509. The three outlier samples are shown in the middle; on the left and right, the viabilities of the other M‐CLL and U‐CLL samples are shown, respectively. The panels show different drug concentrations tested. Boxes represent the first and third quartiles of the values for M‐CLL and U‐CLL samples; for individual patients, the single value is shown.
- Whole exome sequencing data on IGHV genes (not included in the training data): the number of mutations found on IGHV genes using whole exome sequencing is shown on the _y_‐axis, separately for U‐CLL and M‐CLL samples. The three outlier samples are labelled.
Figure EV4. Imputation of missing values in the drug response assay of the CLL data
- A, B
The methods compared were MOFA, SoftImpute, imputation by feature‐wise mean (Mean) and k‐nearest neighbour (kNN). Shown are averages of the mean squared error (MSE) across 15 imputation experiments for increasing fractions of missing data, considering (A) values missing at random and (B) entire assay missing for samples at random. Error bars denote plus or minus two standard errors.
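The evaluation scheme in this caption (hide known values, impute them, score by MSE, repeat) can be sketched as follows. Only the feature‐wise mean baseline is implemented; MOFA, SoftImpute and kNN are not. The data are simulated, all function names are hypothetical, and the second masking mode hides whole rows of a single matrix as a stand‐in for "entire assay missing for samples".

```python
import random

# Sketch of a masking benchmark for imputation, scored by MSE on hidden cells.
# Assumptions: simulated data, feature-wise mean imputation only.

def mask_at_random(Y, fraction, rng):
    """Hide roughly `fraction` of all cells, chosen uniformly at random."""
    cells = [(i, j) for i in range(len(Y)) for j in range(len(Y[0]))]
    hidden = set(rng.sample(cells, int(round(fraction * len(cells)))))
    masked = [[None if (i, j) in hidden else Y[i][j] for j in range(len(Y[0]))]
              for i in range(len(Y))]
    return masked, hidden

def mask_whole_samples(Y, fraction, rng):
    """Hide every feature of a random subset of samples (rows)."""
    rows = rng.sample(range(len(Y)), int(round(fraction * len(Y))))
    hidden = {(i, j) for i in rows for j in range(len(Y[0]))}
    masked = [[None if (i, j) in hidden else Y[i][j] for j in range(len(Y[0]))]
              for i in range(len(Y))]
    return masked, hidden

def impute_feature_mean(masked):
    """Replace each missing cell by the mean of the observed cells in its column."""
    n_features = len(masked[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in masked if row[j] is not None]
        # Fall back to 0.0 if a feature has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_features)]
            for row in masked]

def mse_on_hidden(Y, imputed, hidden):
    """Mean squared error restricted to the cells that were hidden."""
    return sum((Y[i][j] - imputed[i][j]) ** 2 for i, j in hidden) / len(hidden)

rng = random.Random(0)
Y = [[rng.gauss(j, 1.0) for j in range(5)] for _ in range(20)]  # 20 samples, 5 features

results = {}
for name, masker in [("random", mask_at_random), ("whole_sample", mask_whole_samples)]:
    for fraction in (0.1, 0.3, 0.5):
        errors = []
        for _ in range(15):  # 15 repeats, as in the caption
            masked, hidden = masker(Y, fraction, rng)
            errors.append(mse_on_hidden(Y, impute_feature_mean(masked), hidden))
        results[(name, fraction)] = sum(errors) / len(errors)

for key, value in sorted(results.items()):
    print(key, round(value, 3))
```

Any other imputer with the same input and output shape could be passed through the same loop to compare methods on identical masks.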
Figure 4. Relationship between clinical data and latent factors
- Association of MOFA factors to time to next treatment using a univariate Cox regression with N = 174 samples (96 of which are uncensored cases) and _P_‐values based on the Wald statistic. Error bars denote 95% confidence intervals. Numbers on the right denote _P_‐values for each predictor.
- Kaplan–Meier plots measuring time to next treatment for the individual MOFA factors. The cut‐points on each factor were chosen using maximally selected rank statistics (Hothorn & Lausen, 2003), and _P_‐values were calculated using a log‐rank test on the resulting groups.
- Prediction accuracy of time to treatment for N = 174 patients using multivariate Cox regression trained using the 10 factors derived using MOFA, as well as using the first 10 components obtained from PCA applied to the corresponding single data modalities and the full data set (assessed on hold‐out data). Shown are average values of Harrell's C‐index from fivefold cross‐validation. Error bars denote standard error of the mean.
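Harrell's C‐index, the accuracy measure used in the last panel, can be computed directly from predicted risk scores. The sketch below covers only the metric, not the Cox regression fit or the cross‐validation; the function name `harrell_c` is hypothetical, the survival data are invented, and ties in the risk score are counted as half‐concordant, which is one common convention.

```python
# Sketch of Harrell's C-index for right-censored data.
# Assumptions: toy data; higher risk score should mean shorter time to event;
# tied risk scores count as 0.5.

def harrell_c(times, events, risk):
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i has an observed event
            # strictly before the (event or censoring) time of sample j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

times = [2, 4, 6, 8, 10]          # time to next treatment or to censoring
events = [1, 1, 0, 1, 0]          # 1 = event observed, 0 = censored
risk = [0.9, 0.7, 0.8, 0.2, 0.1]  # hypothetical predicted risk scores

c_index = harrell_c(times, events, risk)
print(c_index)  # 0.875: 7 of the 8 comparable pairs are ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk.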
Figure 5. Application of MOFA to a single‐cell multi‐omics study
- A
Study overview and data types. Data modalities are shown in different rows (D = number of features) and samples (N) in columns, with missing samples shown using grey bars.
- B, C
(B) Fraction of the variance explained (R^2) by individual factors for each data modality and (C) cumulative proportion of variance explained.
- D
Absolute loadings of Factor 1 (bottom) and Factor 2 (top) in the mRNA data. Labelled genes in Factor 1 are known markers of pluripotency (Mohammed et al, 2017) and genes labelled in Factor 2 are known differentiation markers (Fuchs, 1988).
- E
Scatterplot of Factors 1 and 2. Colours denote culture conditions. The grey arrow illustrates the differentiation trajectory from naive pluripotent cells via primed pluripotent cells to differentiated cells.
Figure EV5. Transcriptomic and epigenetic changes associated with Factor 1 in the scMT data
- RNA expression changes for the top 20 genes with largest weight on Factor 1.
- DNA methylation rate changes for the top 20 CpG sites with largest weight. Shown is a non‐linear loess regression model fit per CpG site.
- RNA expression changes for the top 20 genes with largest weight on Factor 2.