Data analysis of metabolomics and other omics datasets using the structToolbox (original) (raw)

Introduction

The ‘structToolbox’ includes an extensive set of data (pre-)processing and analysis tools for metabolomics and other omics, with a strong emphasis on statistics and machine learning. The methods and tools have been implemented using class-based templates available via the struct (Statistics in R Using Class-based Templates) package. The aim of this vignette is to introduce the reader to basic and more advanced structToolbox-based operations and implementations, such as the use of struct objects, getting/setting methods/parameters, and building workflows for the analysis of mass spectrometry (MS) and nuclear magnetic resonance (NMR)-based Metabolomics and proteomics datasets. The workflows demonstrated here include a wide range of methods and tools including pre-processing such as filtering, normalisation and scaling, followed by univariate and/or multivariate statistics, and machine learning approaches.

Getting started

The latest version of structToolbox compatible with your current R version can be installed using BiocManager.

# install BiocManager if not present
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# install structToolbox and dependencies
BiocManager::install("structToolbox")

A number of additional packages are needed for this vignette.

## install additional bioc packages for vignette if needed
#BiocManager::install(c('pmp', 'ropls', 'BiocFileCache'))

## install additional CRAN packages if needed
#install.packages(c('cowplot', 'openxlsx'))

suppressPackageStartupMessages({
    # Bioconductor packages
    library(structToolbox)
    library(pmp)
    library(ropls)
    library(BiocFileCache)
  
    # CRAN libraries
    library(ggplot2)
    library(gridExtra)
    library(cowplot)
    library(openxlsx)
})


# use the BiocFileCache
bfc <- BiocFileCache(ask = FALSE)

Introduction to struct objects, including models, model sequences, model charts and ontology.

Introduction

PCA (Principal Component Analysis) and PLS (Partial Least Squares) are commonly applied methods for exploring and analysing multivariate datasets. Here we use these two statistical methods to demonstrate the different types of struct (STatistics in R Using Class Templates) objects that are available as part of the structToolbox and how these objects (i.e. class templates) can be used to conduct unsupervised and supervised multivariate statistical analysis.

Dataset

For demonstration purposes we will use the “Iris” dataset. This famous (Fisher’s or Anderson’s) dataset contains measurements of sepal length and width and petal length and width, in centimeters, for 50 flowers from each of 3 class of Iris. The class are Iris setosa, versicolor, and virginica. See here (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/iris.html) for more information.

Note: this vignette is also compatible with the Direct infusion mass spectrometry metabolomics “benchmark” dataset described in Kirwan et al., Sci Data 1, 140012 (2014) (https://doi.org/10.1038/sdata.2014.12).

Both datasets are available as part of the structToolbox package and already prepared as a DatasetExperiment object.

## Iris dataset (comment if using MTBLS79 benchmark data)
D = iris_DatasetExperiment()
D$sample_meta$class = D$sample_meta$Species

## MTBLS (comment if using Iris data)
# D = MTBLS79_DatasetExperiment(filtered=TRUE)
# M = pqn_norm(qc_label='QC',factor_name='sample_type') + 
#   knn_impute(neighbours=5) +
#   glog_transform(qc_label='QC',factor_name='sample_type') +
#   filter_smeta(mode='exclude',levels='QC',factor_name='sample_type')
# M = model_apply(M,D)
# D = predicted(M)

# show info
D
## A "DatasetExperiment" object
## ----------------------------
## name:          Fisher's Iris dataset
## description:   This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of
##                  the variables sepal length and width and petal length and width,
##                  respectively, for 50 flowers from each of 3 species of iris. The species are
##                  Iris setosa, versicolor, and virginica.
## data:          150 rows x 4 columns
## sample_meta:   150 rows x 2 columns
## variable_meta: 4 rows x 1 columns

DatasetExperiment objects

The DatasetExperiment object is an extension of the SummarizedExperiment class used by the Bioconductor community. It contains three main parts:

  1. data A data frame containing the measured data for each sample.
  2. sample_meta A data frame of additional information related to the samples e.g. group labels.
  3. variable_meta A data frame of additional information related to the variables (features) e.g. annotations

Like all struct objects it also contains name and description fields (called “slots” is R language).

A key difference between DatasetExperiment and SummarizedExperiment objects is that the data is transposed. i.e. for DatasetExperiment objects the samples are in rows and the features are in columns, while the opposite is true for SummarizedExperiment objects.

All slots are accessible using dollar notation.

# show some data
head(D$data[,1:4])
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

Using struct model objects

Statistical models

Before we can apply e.g. PCA we first need to create a PCA object. This object contains all the inputs, outputs and methods needed to apply PCA. We can set parameters such as the number of components when the PCA model is created, but we can also use dollar notation to change/view it later.

P = PCA(number_components=15)
P$number_components=5
P$number_components
## [1] 5

The inputs for a model can be listed using param_ids(object):

param_ids(P)
## [1] "number_components"

Or a summary of the object can be printed to the console:

P
## A "PCA" object
## --------------
## name:          Principal Component Analysis (PCA)
## description:   PCA is a multivariate data reduction technique. It summarises the data in a smaller number of
##                  Principal Components that maximise variance.
## input params:  number_components 
## outputs:       scores, loadings, eigenvalues, ssx, correlation, that 
## predicted:     that
## seq_in:        data

Model sequences

Unless you have good reason not to, it is usually sensible to mean centre the columns of the data before PCA. Using the STRUCT framework we can create a model sequence that will mean centre and then apply PCA to the mean centred data.

M = mean_centre() + PCA(number_components = 4)

In structToolbox mean centring and PCA are both model objects, and joining them using “+” creates a model_sequence object. In a model_sequence the outputs of the first object (mean centring) are automatically passed to the inputs of the second object (PCA), which allows you chain together modelling steps in order to build a workflow.

The objects in the model_sequence can be accessed by indexing, and we can combine this with dollar notation. For example, the PCA object is the second object in our sequence and we can access the number of components as follows:

M[2]$number_components
## [1] 4

Training/testing models

Model and model_sequence objects need to be trained using data in the form of a DatasetExperiment object. For example, the PCA model sequence we created (M) can be trained using the iris DatasetExperiment object (‘D’).

M = model_train(M,D)

This model sequence has now mean centred the original data and calculated the PCA scores and loadings.

Model objects can be used to generate predictions for test datasets. For the PCA model sequence this involves mean centring the test data using the mean of training data, and the projecting the centred test data onto the PCA model using the loadings. The outputs are all stored in the model sequence and can be accessed using dollar notation. For this example we will just use the training data again (sometimes called autoprediction), which for PCA allows us to explore the training data in more detail.

M = model_predict(M,D)

Sometimes models don’t make use the training/test approach e.g. univariate statsitics, filtering etc. For these models the model_apply method can be used instead. For models that do provide training/test methods, model_apply applies autoprediction by default i.e. it is a short-cut for applying model_train and model_predict to the same data.

M = model_apply(M,D)

The available outputs for an object can be listed and accessed like input params, using dollar notation:

output_ids(M[2])
## [1] "scores"      "loadings"    "eigenvalues" "ssx"         "correlation"
## [6] "that"
M[2]$scores
## A "DatasetExperiment" object
## ----------------------------
## name:          
## description:   
## data:          150 rows x 4 columns
## sample_meta:   150 rows x 2 columns
## variable_meta: 4 rows x 1 columns

Model charts

The struct framework includes chart objects. Charts associated with a model object can be listed.

chart_names(M[2])
## [1] "pca_biplot"           "pca_correlation_plot" "pca_dstat_plot"      
## [4] "pca_loadings_plot"    "pca_scores_plot"      "pca_scree_plot"

Like model objects, chart objects need to be created before they can be used. Here we will plot the PCA scores plot for our mean centred PCA model.

C = pca_scores_plot(factor_name='class') # colour by class
chart_plot(C,M[2])

Note that indexing the PCA model is required because the pca_scores_plot object requires a PCA object as input, not a model_sequence.

If we make changes to the input parameters of the chart, chart_plot must be called again to see the effects.

# add petal width to meta data of pca scores
M[2]$scores$sample_meta$example=D$data[,1]
# update plot
C$factor_name='example'
chart_plot(C,M[2])

The chart_plot method returns a ggplot object so that you can easily combine it with other plots using the gridExtra or cowplot packages for example.

# scores plot
C1 = pca_scores_plot(factor_name='class') # colour by class
g1 = chart_plot(C1,M[2])

# scree plot
C2 = pca_scree_plot()
g2 = chart_plot(C2,M[2])

# arange in grid
grid.arrange(grobs=list(g1,g2),nrow=1)

Ontology

Within the struct framework (and structToolbox) an ontology slot is provided to allow for standardardised definitions for objects and its inputs and outputs using the Ontology Lookup Service (OLS).

For example, STATO is a general purpose STATistics Ontology (http://stato-ontology.org). From the webpage:

Its aim is to provide coverage for processes such as statistical tests, their conditions of application, and information needed or resulting from statistical methods, such as probability distributions, variables, spread and variation metrics. STATO also covers aspects of experimental design and description of plots and graphical representations commonly used to provide visual cues of data distribution or layout and to assist review of the results.

The ontology for an object can be set by assigning the ontology term identifier to the ontology slot of any struct_class object at design time. The ids can be listed using $ notation:

# create an example PCA object
P=PCA()

# ontology for the PCA object
P$ontology
## [1] "OBI:0200051"

The ontology method can be used obtain more detailed ontology information. When cache = NULL the struct package will automatically attempt to use the OLS API (via the rols package) to obtain a name and description for the provided identifiers. Here we used cached versions of the ontology definitions provided in the structToolbox package to prevent issues connecting to the OLS API when building the package.

ontology(P,cache = ontology_cache()) # set cache = NULL (default) for online use
## [[1]]
## An object of class "ontology_list"
## Slot "terms":
## [[1]]
## term id:       OBI:0200051
## ontology:      obi
## label:         principal components analysis dimensionality reduction
## description:   A principal components analysis dimensionality reduction is a dimensionality reduction
##                  achieved by applying principal components analysis and by keeping low-order principal
##                  components and excluding higher-order ones.
## iri:           http://purl.obolibrary.org/obo/OBI_0200051
## 
## 
## 
## [[2]]
## An object of class "ontology_list"
## Slot "terms":
## [[1]]
## term id:       STATO:0000555
## ontology:      stato
## label:         number of predictive components
## description:   number of predictive components is a count used as input to the principle component analysis
##                  (PCA)
## iri:           http://purl.obolibrary.org/obo/STATO_0000555

Note that the ontology method returns definitions for the object (PCA) and the inputs/outputs (number_of_components).

Validating supervised statistical models

Validation is an important aspect of chemometric modelling. The struct framework enables this kind of iterative model testing through iterator objects.

Cross-validation

Cross validation is a common technique for assessing the performance of classification models. For this example we will use a Partial least squares-discriminant analysis (PLS-DA) model. Data should be mean centred prior to PLS, so we will build a model sequence first.

M = mean_centre() + PLSDA(number_components=2,factor_name='class')
M
## A model_seq object containing:
## 
## [1]
## A "mean_centre" object
## ----------------------
## name:          Mean centre
## description:   The mean sample is subtracted from all samples in the data matrix. The features in the centred
##                  matrix all have zero mean.
## input params:  mode 
## outputs:       centred, mean_data, mean_sample_meta 
## predicted:     centred
## seq_in:        data
## 
## [2]
## A "PLSDA" object
## ----------------
## name:          Partial least squares discriminant analysis
## description:   PLS is a multivariate regression technique that extracts latent variables maximising
##                  covariance between the input data and the response. The Discriminant Analysis
##                  variant uses group labels in the response variable. For >2 groups a 1-vs-all
##                  approach is used. Group membership can be predicted for test samples based on
##                  a probability estimate of group membership, or the estimated y-value.
## input params:  number_components, factor_name, pred_method 
## outputs:       scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold, sr, sr_pvalue 
## predicted:     pred
## seq_in:        data

iterator objects like the k-fold cross-validation object (kfold_xval) can be created just like any other struct object. Parameters can be set at creation using the equals sign, and accessed or changed later using dollar notation.

# create object
XCV = kfold_xval(folds=5,factor_name='class')

# change the number of folds
XCV$folds=10
XCV$folds
## [1] 10

The model to be cross-validated can be set/accessed using the models method.

models(XCV)=M
models(XCV)
## A model_seq object containing:
## 
## [1]
## A "mean_centre" object
## ----------------------
## name:          Mean centre
## description:   The mean sample is subtracted from all samples in the data matrix. The features in the centred
##                  matrix all have zero mean.
## input params:  mode 
## outputs:       centred, mean_data, mean_sample_meta 
## predicted:     centred
## seq_in:        data
## 
## [2]
## A "PLSDA" object
## ----------------
## name:          Partial least squares discriminant analysis
## description:   PLS is a multivariate regression technique that extracts latent variables maximising
##                  covariance between the input data and the response. The Discriminant Analysis
##                  variant uses group labels in the response variable. For >2 groups a 1-vs-all
##                  approach is used. Group membership can be predicted for test samples based on
##                  a probability estimate of group membership, or the estimated y-value.
## input params:  number_components, factor_name, pred_method 
## outputs:       scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold, sr, sr_pvalue 
## predicted:     pred
## seq_in:        data

Alternatively, iterators can be combined with models using the multiplication symbol was shorthand for the models assignement method:

# cross validation of a mean centred PLSDA model
XCV = kfold_xval(
        folds=5,
        method='venetian',
        factor_name='class') * 
      (mean_centre() + PLSDA(factor_name='class'))

The run method can be used with any iterator object. The iterator will then run the set model or model sequence multiple times.

In this case we run cross-validation 5 times, splitting the data into different training and test sets each time.

The run method also needs a metric to be specified, which is another type of struct object. This metric may be calculated once after all iterations, or after each iteration, depending on the iterator type (resampling, permutation etc). For cross-validation we will calculate “balanced accuracy” after all iterations.

XCV = run(XCV,D,balanced_accuracy())
XCV$metric
##              metric mean sd
## 1 balanced_accuracy 0.11 NA

Note The balanced_accuracy metric actually reports 1-accuracy, so a value of 0 indicates perfect performance. The standard deviation “sd” is NA in this example because there is only one permutation.

Like other struct objects, iterators can have chart objects associated with them. The chart_names function will list them for an object.

chart_names(XCV)
## [1] "kfoldxcv_grid"   "kfoldxcv_metric"

Charts for iterator objects can be plotted in the same way as charts for any other object.

C = kfoldxcv_grid(
  factor_name='class',
  level=levels(D$sample_meta$class)[2]) # first level
chart_plot(C,XCV)

It is possible to combine multiple iterators by using the multiplication symbol. This is equivalent to nesting one iterator inside the other. For example, we can repeat our cross-validation multiple times by permuting the sample order.

# permute sample order 10 times and run cross-validation
P = permute_sample_order(number_of_permutations = 10) * 
    kfold_xval(folds=5,factor_name='class')*
    (mean_centre() + PLSDA(factor_name='class',number_components=2))
P = run(P,D,balanced_accuracy())
P$metric
##              metric   mean          sd
## 1 balanced_accuracy 0.1095 0.004972145

Session Info

sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] openxlsx_4.2.8       cowplot_1.1.3        gridExtra_2.3       
##  [4] ggplot2_3.5.2        BiocFileCache_2.16.0 dbplyr_2.5.0        
##  [7] ropls_1.40.0         pmp_1.20.0           structToolbox_1.20.0
## [10] struct_1.20.0        BiocStyle_2.36.0    
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                   MultiDataSet_1.36.0        
##   [3] httr2_1.1.2                 rlang_1.1.6                
##   [5] magrittr_2.0.3              e1071_1.7-16               
##   [7] matrixStats_1.5.0           compiler_4.5.0             
##   [9] RSQLite_2.3.9               vctrs_0.6.5                
##  [11] reshape2_1.4.4              stringr_1.5.1              
##  [13] pkgconfig_2.0.3             crayon_1.5.3               
##  [15] fastmap_1.2.0               magick_2.8.6               
##  [17] XVector_0.48.0              labeling_0.4.3             
##  [19] rmarkdown_2.29              UCSC.utils_1.4.0           
##  [21] itertools_0.1-3             tinytex_0.57               
##  [23] purrr_1.0.4                 bit_4.6.0                  
##  [25] xfun_0.52                   MultiAssayExperiment_1.34.0
##  [27] randomForest_4.7-1.2        cachem_1.1.0               
##  [29] GenomeInfoDb_1.44.0         jsonlite_2.0.0             
##  [31] blob_1.2.4                  DelayedArray_0.34.0        
##  [33] parallel_4.5.0              rols_3.4.0                 
##  [35] R6_2.6.1                    bslib_0.9.0                
##  [37] stringi_1.8.7               limma_3.64.0               
##  [39] GenomicRanges_1.60.0        jquerylib_0.1.4            
##  [41] Rcpp_1.0.14                 bookdown_0.43              
##  [43] SummarizedExperiment_1.38.0 iterators_1.0.14           
##  [45] knitr_1.50                  IRanges_2.42.0             
##  [47] BiocBaseUtils_1.10.0        Matrix_1.7-3               
##  [49] tidyselect_1.2.1            abind_1.4-8                
##  [51] yaml_2.3.10                 codetools_0.2-20           
##  [53] curl_6.2.2                  doRNG_1.8.6.2              
##  [55] lattice_0.22-7              tibble_3.2.1               
##  [57] plyr_1.8.9                  withr_3.0.2                
##  [59] Biobase_2.68.0              evaluate_1.0.3             
##  [61] ontologyIndex_2.12          isoband_0.2.7              
##  [63] proxy_0.4-27                zip_2.3.2                  
##  [65] filelock_1.0.3              pillar_1.10.2              
##  [67] BiocManager_1.30.25         MatrixGenerics_1.20.0      
##  [69] rngtools_1.5.2              foreach_1.5.2              
##  [71] stats4_4.5.0                generics_0.1.3             
##  [73] sp_2.2-0                    S4Vectors_0.46.0           
##  [75] munsell_0.5.1               scales_1.3.0               
##  [77] calibrate_1.7.7             class_7.3-23               
##  [79] glue_1.8.0                  tools_4.5.0                
##  [81] grid_4.5.0                  impute_1.82.0              
##  [83] missForest_1.5              colorspace_2.1-1           
##  [85] GenomeInfoDbData_1.2.14     cli_3.6.4                  
##  [87] rappdirs_0.3.3              viridisLite_0.4.2          
##  [89] ggthemes_5.1.0              S4Arrays_1.8.0             
##  [91] dplyr_1.1.4                 pls_2.8-5                  
##  [93] pcaMethods_2.0.0            gtable_0.3.6               
##  [95] sass_0.4.10                 digest_0.6.37              
##  [97] BiocGenerics_0.54.0         SparseArray_1.8.0          
##  [99] farver_2.1.2                memoise_2.0.1              
## [101] htmltools_0.5.8.1           lifecycle_1.0.4            
## [103] httr_1.4.7                  statmod_1.5.0              
## [105] qqman_0.1.9                 bit64_4.6.0-1              
## [107] MASS_7.3-65