Meta-analysis using SIAMCAT (original) (raw)
Contents
- 1 About This Vignette
- 2 Compare Associations
- 3 Study as Confounding Factor
- 4 ML Meta-analysis
- 5 Session Info
About This Vignette
In this vignette, we want to demonstrate how SIAMCAT
can facilitate metagenomic meta-analyses, focussing both on association testing and ML workflows. As an example, we use five different studies of Crohn’s disease (CD), since we have taxonomic profiles from five different metagenomic datasets available. Those studies are:
Setup
library("tidyverse")
library("SIAMCAT")
First, we load the data for all studies, which are available for download from Zenodo. The raw data have been preprocessed and taxonomically profiled withmOTUs2 and were then aggregated at genus level.
# base url for data download
data.loc <- 'https://zenodo.org/api/files/d81e429c-870f-44e0-a44a-2a4aa541b6c1/'
# datasets
datasets <- c('metaHIT', 'Lewis_2015', 'He_2017', 'Franzosa_2019', 'HMP2')
# metadata
meta.all <- read_tsv(paste0(data.loc, 'meta_all_cd.tsv'))
# features
feat <- read.table(paste0(data.loc, 'feat_genus_cd.tsv'),
check.names = FALSE, stringsAsFactors = FALSE, quote = '',
sep='\t')
feat <- as.matrix(feat)
# check that metadata and features agree
stopifnot(all(colnames(feat) == meta.all$Sample_ID))
Let us have a look at the distribution of groups across the studies
table(meta.all$Study, meta.all$Group)
##
## CD CTR
## Franzosa_2019 88 56
## HMP2 583 357
## He_2017 49 53
## Lewis_2015 294 25
## metaHIT 21 71
Some of the studies contain more than one sample for the same subject. For example, the HMP2 publication focussed on the longitudinal aspect of CD. Therefore. we want to take this into account when training and evaluating the machine learning model (see the vignette about Machine learning pitfalls) and when performing the association testing. Thus, it will be convenient to create a second metadata table containing a single entry for each individual.
meta.ind <- meta.all %>%
group_by(Individual_ID) %>%
filter(Timepoint==min(Timepoint)) %>%
ungroup()
Compare Associations
Compute Associations with SIAMCAT
To test for associations, we can encapsulate each dataset into a differentSIAMCAT
object and use the check.associations
function:
assoc.list <- list()
for (d in datasets){
# filter metadata and convert to dataframe
meta.train <- meta.ind %>%
filter(Study==d) %>%
as.data.frame()
rownames(meta.train) <- meta.train$Sample_ID
# create SIAMCAT object
sc.obj <- siamcat(feat=feat, meta=meta.train, label='Group', case='CD')
# test for associations
sc.obj <- check.associations(sc.obj, log.n0=1e-05,
feature.type = 'original')
# extract the associations and save them in the assoc.list
temp <- associations(sc.obj)
temp$genus <- rownames(temp)
assoc.list[[d]] <- temp %>%
select(genus, fc, auc, p.adj) %>%
mutate(Study=d)
}
# combine all associations
df.assoc <- bind_rows(assoc.list)
df.assoc <- df.assoc %>% filter(genus!='unclassified')
head(df.assoc)
## genus fc auc p.adj Study
## 159730 Thermovenabulum...1 159730 Thermovenabulum 0 0.5 NaN metaHIT
## 42447 Anaerobranca...2 42447 Anaerobranca 0 0.5 NaN metaHIT
## 1562 Desulfotomaculum...3 1562 Desulfotomaculum 0 0.5 NaN metaHIT
## 60919 Sanguibacter...4 60919 Sanguibacter 0 0.5 NaN metaHIT
## 357 Agrobacterium...5 357 Agrobacterium 0 0.5 NaN metaHIT
## 392332 Geoalkalibacter...6 392332 Geoalkalibacter 0 0.5 NaN metaHIT
Plot Heatmap for Interesting Genera
Now, we can compare the associations stored in the df.assoc
tibble. For example, we can extract features which are very strongly associated with the label (single-feature AUROC > 0.75 or < 0.25) in at least one of the studies and plot the generalized fold change as heatmap.
genera.of.interest <- df.assoc %>%
group_by(genus) %>%
summarise(m=mean(auc), n.filt=any(auc < 0.25 | auc > 0.75),
.groups='keep') %>%
filter(n.filt) %>%
arrange(m)
After we extracted the genera, we plot them:
df.assoc %>%
# take only genera of interest
filter(genus %in% genera.of.interest$genus) %>%
# convert to factor to enforce an ordering by mean AUC
mutate(genus=factor(genus, levels = rev(genera.of.interest$genus))) %>%
# convert to factor to enforce ordering again
mutate(Study=factor(Study, levels = datasets)) %>%
# annotate the cells in the heatmap with stars
mutate(l=case_when(p.adj < 0.01~'*', TRUE~'')) %>%
ggplot(aes(y=genus, x=Study, fill=fc)) +
geom_tile() +
scale_fill_gradient2(low = '#3B6FB6', high='#D41645', mid = 'white',
limits=c(-2.7, 2.7), name='Generalized\nfold change') +
theme_minimal() +
geom_text(aes(label=l)) +
theme(panel.grid = element_blank()) +
xlab('') + ylab('') +
theme(axis.text = element_text(size=6))
Study as Confounding Factor
Additionally, we can check how differences between studies might influence the variance of specific genera. To do so, we create a singel SIAMCAT
object which holds the complete datasets and then we run the check.confounder
function.
df.meta <- meta.ind %>%
as.data.frame()
rownames(df.meta) <- df.meta$Sample_ID
sc.obj <- siamcat(feat=feat, meta=df.meta, label='Group', case='CD')
## + starting create.label
## Label used as case:
## CD
## Label used as control:
## CTR
## + finished create.label.from.metadata in 0.002 s
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 504 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.066 s
check.confounders(sc.obj, fn.plot = './confounder_plot_cd_meta.pdf',
feature.type='original')
## Finished checking metadata for confounders, results plotted to: ./confounder_plot_cd_meta.pdf
The resulting variance plot shows that some genera are strongly impacated by differences between studies, other genera not so much. Of note, the genera that vary most with the label (CD vs controls) do not show a lot of variance across studies.
Session Info
sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 SIAMCAT_2.12.0 phyloseq_1.52.0 mlr3_0.23.0
## [5] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [9] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [13] ggplot2_3.5.2 tidyverse_2.0.0 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 shape_1.4.6.1
## [4] magrittr_2.0.3 magick_2.8.6 farver_2.1.2
## [7] corrplot_0.95 nloptr_2.2.1 rmarkdown_2.29
## [10] vctrs_0.6.5 multtest_2.64.0 minqa_1.2.8
## [13] PRROC_1.4 tinytex_0.57 rstatix_0.7.2
## [16] htmltools_0.5.8.1 progress_1.2.3 curl_6.2.2
## [19] broom_1.0.8 Rhdf5lib_1.30.0 Formula_1.2-5
## [22] rhdf5_2.52.0 pROC_1.18.5 sass_0.4.10
## [25] parallelly_1.43.0 bslib_0.9.0 plyr_1.8.9
## [28] palmerpenguins_0.1.1 mlr3tuning_1.3.0 cachem_1.1.0
## [31] uuid_1.2-1 igraph_2.1.4 lifecycle_1.0.4
## [34] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-3
## [37] R6_2.6.1 fastmap_1.2.0 GenomeInfoDbData_1.2.14
## [40] rbibutils_2.3 future_1.40.0 digest_0.6.37
## [43] numDeriv_2016.8-1.1 colorspace_2.1-1 S4Vectors_0.46.0
## [46] mlr3misc_0.16.0 vegan_2.6-10 labeling_0.4.3
## [49] timechange_0.3.0 httr_1.4.7 abind_1.4-8
## [52] mgcv_1.9-3 compiler_4.5.0 beanplot_1.3.1
## [55] bit64_4.6.0-1 withr_3.0.2 backports_1.5.0
## [58] carData_3.0-5 ggsignif_0.6.4 LiblineaR_2.10-24
## [61] MASS_7.3-65 biomformat_1.36.0 permute_0.9-7
## [64] tools_4.5.0 ape_5.8-1 glue_1.8.0
## [67] lgr_0.4.4 nlme_3.1-168 rhdf5filters_1.20.0
## [70] grid_4.5.0 checkmate_2.3.2 gridBase_0.4-7
## [73] cluster_2.1.8.1 reshape2_1.4.4 ade4_1.7-23
## [76] generics_0.1.3 gtable_0.3.6 tzdb_0.5.0
## [79] data.table_1.17.0 hms_1.1.3 car_3.1-3
## [82] XVector_0.48.0 BiocGenerics_0.54.0 foreach_1.5.2
## [85] pillar_1.10.2 vroom_1.6.5 bbotk_1.5.0
## [88] splines_4.5.0 lattice_0.22-7 bit_4.6.0
## [91] survival_3.8-3 tidyselect_1.2.1 Biostrings_2.76.0
## [94] knitr_1.50 reformulas_0.4.0 infotheo_1.2.0.1
## [97] gridExtra_2.3 bookdown_0.43 IRanges_2.42.0
## [100] stats4_4.5.0 xfun_0.52 Biobase_2.68.0
## [103] matrixStats_1.5.0 stringi_1.8.7 UCSC.utils_1.4.0
## [106] yaml_2.3.10 boot_1.3-31 evaluate_1.0.3
## [109] codetools_0.2-20 archive_1.1.12 BiocManager_1.30.25
## [112] cli_3.6.4 Rdpack_2.6.4 munsell_0.5.1
## [115] jquerylib_0.1.4 mlr3learners_0.10.0 Rcpp_1.0.14
## [118] GenomeInfoDb_1.44.0 globals_0.16.3 parallel_4.5.0
## [121] prettyunits_1.2.0 paradox_1.0.1 lme4_1.1-37
## [124] listenv_0.9.1 glmnet_4.1-8 lmerTest_3.1-3
## [127] scales_1.3.0 crayon_1.5.3 rlang_1.1.6
## [130] mlr3measures_1.0.0