Storing Mass Spectrometry Data in SQL Databases (original) (raw)
Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (ORCID:https://orcid.org/0000-0002-6977-7147), Chong Tang [ctb], Laurent Gatto [ctb] (ORCID: https://orcid.org/0000-0002-1520-2268)
Compiled: Tue Apr 15 19:24:34 2025
Introduction
The Spectra Bioconductor package provides a flexible and expandable infrastructure for Mass Spectrometry (MS) data. The package supports interchangeable use of different backends that provide additional file support or different ways to store and represent MS data. The_MsBackendSql_ package provides backends to store data from whole MS experiments in SQL databases. The data in such databases can be easily (and efficiently) accessed using Spectra
objects that use the MsBackendSql
class as an interface to the data in the database. Such Spectra
objects have a minimal memory footprint and hence allow analysis of very large data sets even on computers with limited hardware capabilities. For certain operations, the performance of this data representation is superior to that of other low-memory (on-disk) data representations such as Spectra
’s MsBackendMzR
backend. Finally, the MsBackendSql
supports also remote data access to e.g. a central database server hosting several large MS data sets.
Installation
The package can be installed with the BiocManager
package. To installBiocManager
use install.packages("BiocManager")
and, after that,BiocManager::install("MsBackendSql")
to install this package.
Creating and using MsBackendSql
SQL databases
MsBackendSql
SQL databases can be created either by importing (raw) MS data from MS data files using the createMsBackendSqlDatabase()
or using thebackendInitialize()
function by providing in addition to the database connection also the full MS data to import as a DataFrame
. In the first example we use the createMsBackendSqlDatabase()
function to import the full MS data from the provided MS data files into an (empty) database. Below we first create an empty SQLite database (in a temporary file) and use thecreateMsBackendSqlDatabase()
function to create all necessary tables in that database and import the MS data from two mzML files (from the r Biocpkg("msdata")
package).
library(RSQLite)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
library(Spectra)
library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
dbDisconnect(con)
By default (with parameters blob = TRUE
and peaksStorageMode = "blob2"
) the peaks data matrix of each spectrum is stored as a BLOB data type into the database (one entry per spectrum). This has advantages on the performance to extract the peaks data from the database, but does not allow to filter individual peaks by their m/z or intensity values directly in the database. As an alternative (using blob = FALSE
) it is also possible to store the individual m/z and intensity values in separate columns of the database table. This long table format results however in considerably larger databases (with potentially poorer performance). Note also that the code and backend is optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is however also supported (also portable, self-contained SQLite databases). In fact, performance for MsBackendSql databases with peaks data stored as BLOB data type is similar for SQLite and MySQL/MariaDB databases.
The MsBackendSql package provides two backends to interact with such databases: the MsBackendSql
class and the MsBackendOfflineSql
class, that inherits all properties and functions from the former, but does not store the connection to the database within the object. The MsBackendOfflineSql
object thus supports parallel processing and allows to save/load the object (e.g. usingsave
and saveRDS
). The MsBackendOfflineSql
might therefore be used as the preferred backend to SQL databases for most applications.
To access the data in the database we create below a Spectra
object providing the database connection information in the constructor call and specifying to use the MsBackendOfflineSql
as backend (parameter source
). We stored the data to a SQLite database, thus we provide the database name (SQLite database file name) and the SQLite DBI driver with parameters dbname
and drv
. Which parameters are required to connect to the database depends on the SQL database and the used driver. For a MySQL/MariaDB database we would use the MariaDB()
driver and would have to provide the database name, user name, password as well as the host name and port through which the database is accessible.
sps <- Spectra(dbname = dbfile, source = MsBackendOfflineSql(), drv = SQLite())
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 35 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /tmp/RtmprpBYzG/file21750b75de4e07
Spectra
objects allow also to change the backend to any other backend (extending MsBackend
) using the setBackend()
function. Below we use this function to first load all data into memory by changing from theMsBackendOfflineSql
to a MsBackendMemory
.
sps_mem <- setBackend(sps, MsBackendMemory())
sps_mem
## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 35 more variables/columns.
## Processing:
## Switch backend from MsBackendOfflineSql to MsBackendMemory [Tue Apr 15 19:24:42 2025]
With this function it is also possible to change from any backend to aMsBackendOfflineSql
(or MsBackendSql
) in which case a new database is created and all data from the originating backend is stored in this database. To change the backend to an MsBackendOfflineSql
we need to provide the connection information to the SQL database as additional parameters. These parameters are the same that need to be passed to a dbConnect()
call to establish the connection to the database. These parameters include the database driver (parameter drv
), the database name and eventually the user name, host etc (see?dbConnect
for more information). In the simple example below we store the data into a SQLite database and thus only need to provide the database name, which corresponds SQLite database file. In our example we store the data into a temporary file. Optionally, setBackend()
supports also the parameters blob
and peaksDataStorage
described above for the createMsBackendSqlDatabase()
function.
sps2 <- setBackend(sps_mem, MsBackendOfflineSql(), drv = SQLite(),
dbname = tempfile())
sps2
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 35 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /tmp/RtmprpBYzG/file21750b5a315fe6
## Processing:
## Switch backend from MsBackendOfflineSql to MsBackendMemory [Tue Apr 15 19:24:42 2025]
## Switch backend from MsBackendMemory to MsBackendOfflineSql [Tue Apr 15 19:24:43 2025]
Similar to any other Spectra
object we can retrieve the available spectra variables using the spectraVariables()
function.
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "peaksCount"
## [19] "totIonCurrent" "basePeakMZ"
## [21] "basePeakIntensity" "ionisationEnergy"
## [23] "lowMZ" "highMZ"
## [25] "mergedScan" "mergedResultScanNum"
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
## [29] "injectionTime" "filterString"
## [31] "spectrumId" "ionMobilityDriftTime"
## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
## [35] "electronBeamEnergy" "spectrum_id_"
The MS peak data can be accessed using either the mz()
, intensity()
orpeaksData()
functions. Below we extract the peaks matrix of the 5th spectrum and display the first 6 rows.
peaksData(sps)[[5]] |>
head()
## mz intensity
## [1,] 105.0347 0
## [2,] 105.0362 164
## [3,] 105.0376 0
## [4,] 105.0391 0
## [5,] 105.0405 328
## [6,] 105.0420 0
All data (peaks data or spectra variables) are always retrieved on-the-fly from the database resulting thus in a minimal memory footprint for the Spectra
object.
print(object.size(sps), units = "KB")
## 114.6 Kb
The backend supports also adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.
sps$rtime <- sps$rtime + 10
Such operations do however not change the data in the database (which is always considered read-only) but are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable.
print(object.size(sps), units = "KB")
## 129.2 Kb
Such $<-
operations can also be used to cache spectra variables (temporarily) in memory which can eventually improve performance. Below we test the time it takes to extract the MS level from each spectrum from the database, then cache the MS levels in memory using $msLevel <-
and test the timing to extract these cached variable.
system.time(msLevel(sps))
## user system elapsed
## 0.015 0.000 0.015
sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))
## user system elapsed
## 0.007 0.000 0.007
We can also use the reset()
function to reset the data to its original state (this will cause any local spectra variables to be deleted and the backend to be initialized with the original data in the database).
sps <- reset(sps)
Performance comparison with other backends
The need to retrieve any spectra data on-the-fly from the database has an impact on the performance of data access functions of Spectra
objects usingMsBackendSql
/MsBackendOfflineSql
backends. To evaluate this we compare below the performance of the MsBackendSql
to other Spectra
backends, specifically, the MsBackendMzR
which is the default backend to read and represent raw MS data, and the MsBackendMemory
backend that keeps all MS data in memory (and is thus not suggested for larger MS experiments). Similar to the MsBackendMzR
, also the MsBackendSql
keeps only a limited amount of data in memory. These_on-disk_ backends need thus to retrieve spectra and MS peaks data on-the-fly from either the original raw data files (in the case of the MsBackendMzR
) or from the SQL database (in the case of the MsBackendSql
). The in-memory backendMsBackendMemory
is supposed to provide the fastest data access since all data is kept in memory.
Below we thus create Spectra
objects from the same data but using the different backends.
con <- dbConnect(SQLite(), dbfile)
sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())
At first we compare the memory footprint of the 3 backends.
print(object.size(sps), units = "KB")
## 112.9 Kb
print(object.size(sps_mzr), units = "KB")
## 401.4 Kb
print(object.size(sps_im), units = "KB")
## 54509.1 Kb
The MsBackendSql
has the lowest memory footprint of all 3 backends because it does not keep any data in memory. The MsBackendMzR
keeps all spectra variables, except the MS peaks data, in memory and has thus a larger size. TheMsBackendMemory
keeps all data (including the MS peaks data) in memory and has thus the largest size in memory.
Next we compare the performance to extract the MS level for each spectrum from the 4 different Spectra
objects.
library(microbenchmark)
microbenchmark(msLevel(sps),
msLevel(sps_mzr),
msLevel(sps_im))
## Unit: microseconds
## expr min lq mean median uq max
## msLevel(sps) 9075.515 9492.4720 10174.27107 9806.2070 10968.7150 12954.522
## msLevel(sps_mzr) 626.023 663.1530 892.68709 709.9900 784.5570 8982.692
## msLevel(sps_im) 14.818 22.6545 39.44521 41.3245 51.4205 73.248
## neval cld
## 100 a
## 100 b
## 100 c
Extracting MS levels is thus slowest for the MsBackendSql
, which is not surprising because both other backends keep this data in memory while theMsBackendSql
needs to retrieve it from the database.
We next compare the performance to access the full peaks data from eachSpectra
object.
microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
peaksData(sps_mzr, BPPARAM = SerialParam()),
peaksData(sps_im, BPPARAM = SerialParam()),
times = 10)
## Unit: milliseconds
## expr min lq mean
## peaksData(sps, BPPARAM = SerialParam()) 63.534895 89.317356 199.122897
## peaksData(sps_mzr, BPPARAM = SerialParam()) 695.335248 735.872955 1063.021581
## peaksData(sps_im, BPPARAM = SerialParam()) 1.175141 1.282965 4.648899
## median uq max neval cld
## 124.245089 147.784205 667.32555 10 a
## 786.181355 1357.766999 1834.46520 10 b
## 1.655021 2.076847 31.48762 10 a
As expected, the MsBackendMemory
has the fasted access to the full peaks data. The MsBackendSql
outperforms however the MsBackendMzR
providing faster access to the m/z and intensity values.
Performance can be improved for the MsBackendMzR
using parallel processing. Note that the MsBackendSql
does not support parallel processing and thus parallel processing is (silently) disabled in functions such as peaksData()
.
m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
peaksData(sps_mzr, BPPARAM = m2),
peaksData(sps_im, BPPARAM = m2),
times = 10)
## Unit: microseconds
## expr min lq mean median
## peaksData(sps, BPPARAM = m2) 57089.097 78698.172 149405.337 97000.369
## peaksData(sps_mzr, BPPARAM = m2) 707535.467 762801.732 1144929.600 1018310.785
## peaksData(sps_im, BPPARAM = m2) 599.237 956.972 1531.655 1397.655
## uq max neval cld
## 102026.638 679687.762 10 a
## 1481558.941 1973334.180 10 b
## 2038.081 3002.343 10 a
We next compare the performance of subsetting operations.
microbenchmark(filterRt(sps, rt = c(50, 100)),
filterRt(sps_mzr, rt = c(50, 100)),
filterRt(sps_im, rt = c(50, 100)))
## Unit: microseconds
## expr min lq mean median
## filterRt(sps, rt = c(50, 100)) 3084.820 3364.604 3885.9162 3583.426
## filterRt(sps_mzr, rt = c(50, 100)) 1987.844 2234.152 2612.4719 2434.856
## filterRt(sps_im, rt = c(50, 100)) 635.672 726.579 799.8071 768.355
## uq max neval cld
## 3923.035 23471.827 100 a
## 2563.178 14831.049 100 b
## 831.446 2828.657 100 c
The two on-disk backends MsBackendSql
and MsBackendMzR
show a comparable performance for this operation. This filtering does involves access to a spectra variables (the retention time in this case) which, for the MsBackendSql
needs first to be retrieved from the backend. The MsBackendSql
backend allows however also to cache spectra variables (i.e. they are stored within theMsBackendSql
object). Any access to such cached spectra variables can eventually be faster because no dedicated SQL query is needed.
To evaluate the performance of a pure subsetting operation we first define the indices of 10 random spectra and subset the Spectra
objects to these.
idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
sps_mzr[idx],
sps_im[idx])
## Unit: microseconds
## expr min lq mean median uq max neval
## sps[idx] 203.498 221.6815 243.7263 250.3315 260.0285 347.724 100
## sps_mzr[idx] 1028.232 1036.8395 1057.6455 1043.0985 1055.4180 1973.849 100
## sps_im[idx] 307.361 327.2190 357.7035 343.9675 361.0475 1631.650 100
## cld
## a
## b
## c
Here the MsBackendSql
outperforms the other backends because it does not keep any data in memory and hence does not need to subset these. The two other backends need to subset the data they keep in memory which is in both cases a data frame with either a reduced set of spectra variables or the full MS data.
At last we compare also the extraction of the peaks data from the such subsetSpectra
objects.
sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]
microbenchmark(peaksData(sps_10),
peaksData(sps_mzr_10),
peaksData(sps_im_10),
times = 10)
## Unit: microseconds
## expr min lq mean median uq
## peaksData(sps_10) 1347.151 2216.415 2884.6083 2817.9625 3425.109
## peaksData(sps_mzr_10) 131915.979 134734.611 139034.6948 139896.5860 142934.359
## peaksData(sps_im_10) 315.723 343.110 572.9997 521.1625 748.791
## max neval cld
## 4926.594 10 a
## 146907.296 10 b
## 997.407 10 a
The MsBackendSql
outperforms the MsBackendMzR
while, not unexpectedly, theMsBackendMemory
provides fasted access.
Considerations for database systems/servers
The backends from the MsBackendSql package use standard SQL calls to retrieve MS data from the database and hence any SQL database system (for which an R package is available) is supported. SQLite-based databases would represent the easiest and most user friendly solution since no database server administration and user management is required. Indeed, performance of SQLite is very high, even for very large data sets. Server-based databases on the other hand have the advantage to enable a centralized storage and control of MS data (inclusive user management etc). Also, such server systems would also allow data set or server-specific configurations to improve performance.
A comparison between a SQLite-based with a MariaDB-based MsBackendSql database for a large data set comprising over 8,000 samples and over 15,000,000 spectra is availablehere. In brief, performance to extract data was comparable and for individual spectra variables even faster for the SQLite database. Only when more complex SQL queries were involved (combining several primary keys or data fields) the more advanced MariaDB database outperformed SQLite.
Other properties of the MsBackendSql
The MsBackendSql
backend does not support parallel processing since the database connection can not be shared across the different (parallel) processes. Thus, all methods on Spectra
objects that use a MsBackendSql
will automatically (and silently) disable parallel processing even if a dedicated parallel processing setup was passed along with the BPPARAM
method.
Some functions on Spectra
objects require to load the MS peak data (i.e., m/z and intensity values) into memory. For very large data sets (or computers with limited hardware resources) such function calls can cause out-of-memory errors. One example is the lengths()
function that determines the number of peaks per spectrum by loading the peak matrix first into memory. Such functions should ideally be called using the peaksapply()
function with parameterchunkSize
(e.g., peaksapply(sps, lengths, chunkSize = 5000L)
). Instead of processing the full data set, the data will be first split into chunks of sizechunkSize
that are stepwise processed. Hence, only data from chunkSize
spectra is loaded into memory in one iteration.
Summary
The MsBackendSql
provides an MS data representations and storage mode with a minimal memory footprint (in R) that is still comparably efficient for standard processing and subsetting operations. This backend is specifically useful for very large MS data sets, that could even be hosted on remote (MySQL/MariaDB) servers. A potential use case for this backend could thus be to set up a central storage place for MS experiments with data analysts connecting remotely to this server to perform initial data exploration and filtering. After subsetting to a smaller data set of interest, users could then retrieve/download this data by changing the backend to e.g. a MsBackendMemory
, which would result in a_download_ of the full data to the user computer’s memory.
Session information
sessionInfo()
## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] microbenchmark_1.5.0 RSQLite_2.3.9 MsBackendSql_1.8.0
## [4] Spectra_1.18.0 BiocParallel_1.42.0 S4Vectors_0.46.0
## [7] BiocGenerics_0.54.0 generics_0.1.3 BiocStyle_2.36.0
##
## loaded via a namespace (and not attached):
## [1] sandwich_3.1-1 sass_0.4.10 MsCoreUtils_1.20.0
## [4] lattice_0.22-7 stringi_1.8.7 hms_1.1.3
## [7] digest_0.6.37 grid_4.5.0 evaluate_1.0.3
## [10] bookdown_0.43 mvtnorm_1.3-3 fastmap_1.2.0
## [13] blob_1.2.4 Matrix_1.7-3 jsonlite_2.0.0
## [16] ProtGenerics_1.40.0 progress_1.2.3 mzR_2.42.0
## [19] DBI_1.2.3 survival_3.8-3 multcomp_1.4-28
## [22] BiocManager_1.30.25 TH.data_1.1-3 codetools_0.2-20
## [25] jquerylib_0.1.4 cli_3.6.4 rlang_1.1.6
## [28] crayon_1.5.3 Biobase_2.68.0 splines_4.5.0
## [31] bit64_4.6.0-1 cachem_1.1.0 yaml_2.3.10
## [34] tools_4.5.0 parallel_4.5.0 memoise_2.0.1
## [37] ncdf4_1.24 fastmatch_1.1-6 vctrs_0.6.5
## [40] R6_2.6.1 zoo_1.8-14 lifecycle_1.0.4
## [43] fs_1.6.6 IRanges_2.42.0 bit_4.6.0
## [46] clue_0.3-66 MASS_7.3-65 cluster_2.1.8.1
## [49] pkgconfig_2.0.3 bslib_0.9.0 data.table_1.17.0
## [52] Rcpp_1.0.14 xfun_0.52 knitr_1.50
## [55] htmltools_0.5.8.1 rmarkdown_2.29 compiler_4.5.0
## [58] prettyunits_1.2.0 MetaboCoreUtils_1.16.0