Performance comparison SQLite vs MariaDB · Issue #15 · rformassspectrometry/MsBackendSql (original) (raw)

MS data can be stored with the MsBackendSql in any SQL database system supported by R/DBI (i.e. for which a dedicated R package is available). Here I compare performance of accessing MS data stored in either a SQLite and MariaDB database. Some properties:

Both databases (SQLite and MariaDB) stored on the same hard disk/partition (internal nVME disk, thus high data I/O is expected).
LC-MS data from 8,804 samples (mzML) files stored to the databases: in total 15,151,673 spectra.
Size of the SQLite database: 825GB
Size of the MariaDB database: 836GB
MariaDB database uses the Aria storage engine.

mse_maria and mse_sqlite below are two MsExperiment objects with the MS data represented by a MsBackendOfflineSQL backend.

#' Accessing spectra variables: microbenchmark(msLevel(spectra(mse_maria)), msLevel(spectra(mse_sqlite)), times = 7) Unit: seconds expr min lq mean median uq max neval cld msLevel(spectra(mse_maria)) 33.54925 33.76008 35.02384 34.78424 35.98299 37.34726 7 a msLevel(spectra(mse_sqlite)) 21.64151 21.82196 22.85622 21.97240 22.67517 27.38539 7 b

SQLite is thus about 10 seconds faster extracting MS levels for all spectra.

#' unique MS levels microbenchmark(uniqueMsLevels(spectra(mse_maria)), uniqueMsLevels(spectra(mse_sqlite)), times = 7) Unit: milliseconds expr min lq mean median uq uniqueMsLevels(spectra(mse_maria)) 1.542222 1.637237 2.053933 1.667836 2.557647 uniqueMsLevels(spectra(mse_sqlite)) 256.162693 256.191937 258.977613 257.817538 260.077294 max neval cld 2.777705 7 a 266.324599 7 b

uniqueMsLevels uses a select distinct... call to extract unique MS levels. MariaDB is here by far faster.

#' Filtering the dataset based on retention times: filterRt 200, 300 microbenchmark(filterRt(spectra(mse_maria), rt = c(200, 300)), filterRt(spectra(mse_sqlite), rt = c(200, 300)), times = 7) Unit: seconds expr min lq mean median uq filterRt(spectra(mse_maria), rt = c(200, 300)) 2.445303 2.470176 2.477086 2.478997 2.482781 filterRt(spectra(mse_sqlite), rt = c(200, 300)) 2.242923 2.255827 2.556260 2.264236 2.324919 max neval cld 2.509388 7 a 4.225171 7 a

About the same performance from both. filterRt uses a SQL-based filtering on the "rtime" spectra variable, i.e. performs the filtering within the database.

Next we subset the data to spectra from 10 random samples and evaluate also access to this data subset. Note that in general, for data analysis, the MS data will be processed per sample.

#' Access data from random 10 samples. set.seed(123) idx <- sample(seq_along(mse_maria), 10) mse_maria_sub <- mse_maria[idx] mse_sqlite_sub <- mse_sqlite[idx]

microbenchmark(msLevel(spectra(mse_maria_sub)), msLevel(spectra(mse_sqlite_sub)), times = 7) Unit: milliseconds expr min lq mean median uq max neval cld msLevel(spectra(mse_maria_sub)) 45.60593 45.79210 46.42101 46.29717 46.92385 47.61207 7 a msLevel(spectra(mse_sqlite_sub)) 24.49343 25.12833 25.23884 25.17525 25.47306 25.80040 7 b

Again, accessing a single spectra variables is faster with SQLite.

#' Filtering by retention time in the data subset microbenchmark(filterRt(spectra(mse_maria_sub), rt = c(200, 300)), filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)), times = 7)

Unit: milliseconds

expr min lq mean median

filterRt(spectra(mse_maria_sub), rt = c(200, 300)) 37.27201 38.1075 39.84782 39.81546

filterRt(spectra(mse_sqlite_sub), rt = c(200, 300)) 2320.30354 2324.0109 2326.98962 2325.46668

uq max neval cld

41.74291 42.14641 7 a

2328.67538 2337.78457 7 b

Filtering by retention time within the data subset is much faster using the MariaDB database.

#' Accessing actual peak values: m/z values microbenchmark(mz(spectra(mse_maria_sub)), mz(spectra(mse_sqlite_sub)), times = 7) Unit: seconds expr min lq mean median uq max neval cld mz(spectra(mse_maria_sub)) 1.016480 1.284119 1.378519 1.500377 1.517393 1.529748 7 a mz(spectra(mse_sqlite_sub)) 1.037934 1.100142 1.314109 1.101201 1.108407 2.642530 7 a

Performance of accessing peaks data from the data subsets is about the same. At last we compare the performance of a frequently used task for LC-MS data analysis (with the xcms package): extracting the MS data in chromatographic representation. Below we use chromatogram to extract base peak chromatograms of the MS data per sample.

library(xcms) register(SerialParam()) microbenchmark(chromatogram(mse_maria_sub), chromatogram(mse_sqlite_sub), times = 7) Unit: seconds expr min lq mean median uq max neval cld chromatogram(mse_maria_sub) 2.703687 2.812147 3.058470 3.208331 3.255489 3.362001 7 a chromatogram(mse_sqlite_sub) 2.373239 2.477671 2.750986 2.609907 3.017549 3.283317 7 a

Performance is comparable. At last we combine that also with a filter for retention times.

microbenchmark(chromatogram(mse_maria_sub, rt = c(25, 40)), chromatogram(mse_sqlite_sub, rt = c(25, 40)), times = 7) Unit: milliseconds expr min lq mean median chromatogram(mse_maria_sub, rt = c(20, 100)) 651.9479 681.5439 708.7386 725.0281 chromatogram(mse_sqlite_sub, rt = c(20, 100)) 9999.5411 10004.3932 10056.3023 10027.4849 uq max neval cld 736.276 748.5543 7 a 10097.284 10163.7361 7 b

Here the MariaDB database clearly outperforms the SQLite database. The used SQL query combines both the primary keys of the spectra for the data subset and the retention times of these spectra.

Summary

For most operations both SQLite and MariaDB database engines are about equally performant.
For data access involving more complex queries (i.e. that combine retention time values and primary keys such as for filtering spectra within a subset of samples from the full data set) MariaDB has clear advantages while for plain access of individual spectra variables SQLite is faster.