Non-targeted metabolomics preprocessing and data wrangling (original) (raw)

To install notame, install BiocManager first, if it is not installed. Afterwards use the install function from BiocManager.

Drift correction

Untargeted LC-MS metabolomics data usually suffers from signal intensity drift, i.e. systematic variation in the signal intensity as a function of injection order. Features of a dataset often have very different drift patterns. This means that to correct for the drift, we need a tool that can adapt to different drift patterns.

The approach used in this package models the drift separately for each feature, using pooled QC samples and cubic spline regression. The smoothing parameter controls the “wiggliness” of the function: the larger the smoothing parameter, the closer the spline is to a linear function. By default, the smoothing parameter is chosen automatically using cross validation, and can vary between 0.5 and 1.5, where 1.5 is large enough to guarantee that the curve is practically linear. This interval seems to work well in practice.

We run the drift correction on log-transformed data. Log-transformed data follows the assumptions of the drift function model better, and avoids some practical issues. The corrected feature abundance for a sample with injection order \(i\) is then computed using the following formula:\[x_{corrected}(i) = x_{original}(i) + \overline{x}_{QC} - f_{drift}(i)\]

library(notame)
## Loading required package: ggplot2
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: Seqinfo
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
data(toy_notame_set, package = "notame")
se <- toy_notame_set
names(assays(se)) <- "rawbundances"
# Mark missing values as NA
se <- mark_nas(se, value = 0)
# Correct drift
se <- correct_drift(se, name = "dc")
## INFO [2025-10-30 01:26:57] Starting drift correction
## INFO [2025-10-30 01:27:00] Recomputing quality metrics for drift corrected data
## INFO [2025-10-30 01:27:00] Drift correction performed
## INFO [2025-10-30 01:27:00] Inspecting drift correction results
## INFO [2025-10-30 01:27:00] Original quality metrics missing, recomputing
## INFO [2025-10-30 01:27:00] Drift correction results inspected: Drift_corrected: 100%

Flagging low quality compounds

LC-MS metabolomics datasets contain a significant number of low-quality features. We use three established quality metrics based on pooled QC samples to flag those features: detection rate, relative standard deviation and dispersion ratio(Broadhurst et al. 2018). We are using the robust versions of relative standard deviation and dispersion ratio as they are less affected by a single outlying QC and since the original RSD works best if the data is normally distributed, which is often not the case with our data. In addition, one can use flag_contaminants() to, well, flag contaminants.

By default, we use a slightly more lenient detection threshold, 0.5. The recommended limits for RSD and D-ratio are 0.2 and 0.4, respectively. Since we are using the robust alternatives, we have added an additional condition: signals with classic RSD, RSD* and basic D-ratio all below 0.1 are kept. This additional condition prevents the removal of signals with very low values in all but a few samples. These signals tend to have a very high value of D- ratio*, since the median absolute deviation of the biological samples is not affected by the large concentration in a handful of samples, causing the D- ratio* to overestimate the significance of random errors in measurements of QC samples. Thus, other quality metrics are applied, with a conservative limit of 0.1 to ensure that only good quality signals are kept this way.

se <- flag_detection(se, assay.type = "dc")
## INFO [2025-10-30 01:27:00] 
## 1% of features flagged for low detection rate
se <- flag_quality(se, assay.type = "dc")
## INFO [2025-10-30 01:27:00] 
## 71% of features flagged for low quality
head(rowData(se)$Flag)
## [1] NA            "Low_quality" "Low_quality" "Low_quality" "Low_quality"
## [6] "Low_quality"

Flagged features are not removed from the data, but the flag information is stored in the “Flag” column of rowData(se). In notame, notameViz and notameStats packages, the flagged features are silently ignored in operations that involve muliple features at a time, including random forest imputation, PCA visualization and multivariate statistical models. For feature-wise statistics, the statistical model is fit for flagged features, but FDR correction is only done for non-flagged features. One can also setall_features = TRUE in these functions to include all features in the model.

Imputation of missing values

LC-MS data often contain missing values. Note that even if some statistical methods state that they “can deal with missing values”, they sometimes just ignore samples or variables with at least one missing value, or impute the missing values before applying the actual method.

There are many different imputation strategies out there. We have tested a bunch of them on our data and we concluded that random forest imputation works best for our data (Kokla et al. 2019). To be brief, the method predicts the missing part of the data by fitting a random forest on the observed part of the data (Stekhoven and Bühlmann 2011). The downside of random forest imputation is that it is rather slow. Thus, we recommend running the imputation separately for each analysis mode if the dataset has more than 100 samples, or if there are lots of missing values.

set.seed(2025)
se <- impute_rf(se, assay.type = "dc", name = "imputed")
## INFO [2025-10-30 01:27:00] 
## Starting random forest imputation at 2025-10-30 01:27:00.961396
## INFO [2025-10-30 01:27:01] Out-of-bag error in random forest imputation: 0.645
## INFO [2025-10-30 01:27:01] Random forest imputation finished at 2025-10-30 01:27:01.55125

Simple imputation strategies included in impute_simple() are imputation by constant value, imputation by mean, median, minimum or half minimum value or by a small random value. Simple imputation strategies rely on the assumption that missingness is mostly caused by the fact that the metabolite concentration is below the detection limit.

Utilities

Data can be read with import_from_excel(), which includes checks and preparation of metadata. To accommodate typical output from peak-picking software such as Agilent’s MassHunter or MS-DIAL, the output is transformed into a spreadsheet for import_from_excel(). Alternatively, data in R can be wrangled and passed to the SummarizedExperiment() constructor.

Figure 1: Structure of spreadsheet for import_from_excel()

General utilities include combined_data() for representing the instance in adata.frame suitable for plotting and various functions for data wrangling. For keeping track of the analysis, notame offers a logging system operated using init_log(), log_text() and finish_log(). notame also keeps track of all the external packages used, offering you references for each. To see and log a list of references, use citations().

Parallellization is used in many feature-wise calculations and is provided by the BiocParallel package. BiocParallel defaults to a parallel backend. For small-scale testing on Windows, it can be quicker to use serial execution:

BiocParallel::register(BiocParallel::SerialParam())