GitHub - PeteHaitch/DelayedMatrixStats: A port of the matrixStats API to work with DelayedMatrix objects from the DelayedArray package (original) (raw)
DelayedMatrixStats
DelayedMatrixStats is a port of thematrixStats API to work with DelayedMatrix objects from theDelayedArraypackage.
For a DelayedMatrix, x
, the simplest way to apply a function, f()
, from matrixStats ismatrixStats::f(as.matrix(x))
. However, this “_realizes_” x
in memory as a base::matrix, which typically defeats the entire purpose of using a DelayedMatrix for storing the data.
The DelayedArray package already implements a clever strategy called “block-processing” for certain common “matrix stats” operations (e.g. colSums()
, rowSums()
). This is a good start, but not all of thematrixStats API is currently supported. Furthermore, certain operations can be optimized with additional information about x
. I’ll refer to these “seed-aware” implementations.
Installation
You can install DelayedMatrixStats from Bioconductor with:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("DelayedMatrixStats")
Example
This example compares two ways of computing column sums of a_DelayedMatrix_ object:
DelayedMatrix::colSums()
: The ‘block-processing strategy’, implemented in the DelayedArray package. The block-processing strategy works for any DelayedMatrix object, regardless of the type of seed.DelayedMatrixStats::colSums2()
: The ‘seed-aware’ strategy, implemented in the DelayedMatrixStats package. The seed-aware implementation is optimized for both speed and memory but only for_DelayedMatrix_ objects with certain types of seed.
library(DelayedMatrixStats) library(sparseMatrixStats) library(microbenchmark) library(profmem)
set.seed(666)
Fast column sums of DelayedMatrix with matrix seed
dense_matrix <- DelayedArray(matrix(runif(20000 * 600), nrow = 20000, ncol = 600)) class(seed(dense_matrix)) #> [1] "matrix" "array" dense_matrix #> <20000 x 600> DelayedMatrix object of type "double": #> [,1] [,2] [,3] ... [,599] [,600] #> [1,] 0.7743685 0.6601787 0.4098798 . 0.89118118 0.05776471 #> [2,] 0.1972242 0.8436035 0.9198450 . 0.31799523 0.63099417 #> [3,] 0.9780138 0.2017589 0.4696158 . 0.31783791 0.02830454 #> [4,] 0.2013274 0.8797239 0.6474768 . 0.55217184 0.09678816 #> [5,] 0.3612444 0.8158778 0.5928599 . 0.08530977 0.39224147 #> ... . . . . . . #> [19996,] 0.19490291 0.07763570 0.56391725 . 0.09703424 0.62659353 #> [19997,] 0.61182993 0.01910121 0.04046034 . 0.59708388 0.88389731 #> [19998,] 0.12932744 0.21155070 0.19344085 . 0.51682032 0.13378223 #> [19999,] 0.18985573 0.41716539 0.35110782 . 0.62939661 0.94601427 #> [20000,] 0.87889047 0.25308041 0.54666920 . 0.81630322 0.73272217 microbenchmark(DelayedArray::colSums(dense_matrix), DelayedMatrixStats::colSums2(dense_matrix), times = 10) #> Warning in microbenchmark(DelayedArray::colSums(dense_matrix), DelayedMatrixStats::colSums2(dense_matrix), : #> less accurate nanosecond times to avoid potential integer overflows #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(dense_matrix) 36.26811 40.05528 87.89183 42.91652 49.70672 278.0928 10 #> DelayedMatrixStats::colSums2(dense_matrix) 12.07319 12.24256 12.50835 12.53206 12.70734 12.8938 10 profmem::total(profmem::profmem(DelayedArray::colSums(dense_matrix))) #> [1] 96106072 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(dense_matrix))) #> [1] 6064
Fast, low-memory column sums of DelayedMatrix with sparse matrix seed
sparse_matrix <- seed(dense_matrix) zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix)) sparse_matrix[zero_idx] <- 0 sparse_matrix <- DelayedArray(Matrix::Matrix(sparse_matrix, sparse = TRUE)) class(seed(sparse_matrix)) #> [1] "dgCMatrix" #> attr(,"package") #> [1] "Matrix" sparse_matrix #> <20000 x 600> sparse DelayedMatrix object of type "double": #> [,1] [,2] [,3] ... [,599] [,600] #> [1,] 0.7743685 0.0000000 0.0000000 . 0.89118118 0.00000000 #> [2,] 0.1972242 0.0000000 0.9198450 . 0.00000000 0.00000000 #> [3,] 0.9780138 0.0000000 0.4696158 . 0.31783791 0.00000000 #> [4,] 0.0000000 0.8797239 0.6474768 . 0.55217184 0.00000000 #> [5,] 0.3612444 0.0000000 0.0000000 . 0.08530977 0.39224147 #> ... . . . . . . #> [19996,] 0.1949029 0.0776357 0.0000000 . 0.09703424 0.00000000 #> [19997,] 0.0000000 0.0000000 0.0000000 . 0.00000000 0.88389731 #> [19998,] 0.0000000 0.2115507 0.1934408 . 0.00000000 0.00000000 #> [19999,] 0.1898557 0.0000000 0.3511078 . 0.62939661 0.94601427 #> [20000,] 0.8788905 0.2530804 0.0000000 . 0.00000000 0.73272217 microbenchmark(DelayedArray::colSums(sparse_matrix), DelayedMatrixStats::colSums2(sparse_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max #> DelayedArray::colSums(sparse_matrix) 145.485138 147.451908 199.555520 151.754715 173.112209 384.247039 #> DelayedMatrixStats::colSums2(sparse_matrix) 5.284654 5.392238 5.455612 5.449802 5.522536 5.643978 #> neval #> 10 #> 10 profmem::total(profmem::profmem(DelayedArray::colSums(sparse_matrix))) #> [1] 249647440 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(sparse_matrix))) #> [1] 4848
Fast column sums of DelayedMatrix with Rle-based seed
rle_matrix <- RleArray(Rle(sample(2L, 200000 * 6 / 10, replace = TRUE), 100), dim = c(2000000, 6)) class(seed(rle_matrix)) #> [1] "SolidRleArraySeed" #> attr(,"package") #> [1] "DelayedArray" rle_matrix #> <2000000 x 6> RleMatrix object of type "integer": #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] 2 2 1 1 1 2 #> [2,] 2 2 1 1 1 2 #> [3,] 2 2 1 1 1 2 #> [4,] 2 2 1 1 1 2 #> [5,] 2 2 1 1 1 2 #> ... . . . . . . #> [1999996,] 1 2 2 1 1 1 #> [1999997,] 1 2 2 1 1 1 #> [1999998,] 1 2 2 1 1 1 #> [1999999,] 1 2 2 1 1 1 #> [2000000,] 1 2 2 1 1 1 microbenchmark(DelayedArray::colSums(rle_matrix), DelayedMatrixStats::colSums2(rle_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(rle_matrix) 393.768059 397.620132 448.117179 412.378390 476.314179 605.87922 10 #> DelayedMatrixStats::colSums2(rle_matrix) 1.946147 2.239174 8.924917 2.819098 3.871302 61.91156 10 profmem::total(profmem::profmem(DelayedArray::colSums(rle_matrix))) #> [1] 168003192 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(rle_matrix))) #> [1] 1968
Benchmarking
An extensive set of benchmarks is under development athttp://peterhickey.org/BenchmarkingDelayedMatrixStats/.
API coverage
- ✔ = Implemented in DelayedMatrixStats
- ☑️ = Implemented inDelayedArray orsparseMatrixStats
- ❌: = Not yet implemented
Method | Block processing | base::matrix optimized | Matrix::dgCMatrix optimized | Matrix::lgCMatrix optimized | DelayedArray::RleArray (SolidRleArraySeed) optimized | DelayedArray::RleArray (ChunkedRleArraySeed) optimized | HDF5Array::HDF5Matrix optimized | base::data.frame optimized | S4Vectors::DataFrame optimized |
---|---|---|---|---|---|---|---|---|---|
colAlls() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colAnyMissings() | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colAnyNAs() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colAnys() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colAvgsPerRowSet() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCollapse() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCounts() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCummaxs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCummins() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCumprods() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colCumsums() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colIQRDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colIQRs() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colLogSumExps() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colMadDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colMads() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colMaxs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colMeans2() | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
colMedians() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colMins() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colOrderStats() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colProds() | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
colQuantiles() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colRanges() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colRanks() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colSdDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colSds() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colsum() | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colSums2() | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ |
colTabulates() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colVarDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colVars() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colWeightedMads() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colWeightedMeans() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colWeightedMedians() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
colWeightedSds() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
colWeightedVars() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowAlls() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowAnyMissings() | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowAnyNAs() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowAnys() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowAvgsPerColSet() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCollapse() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCounts() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCummaxs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCummins() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCumprods() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowCumsums() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowIQRDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowIQRs() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowLogSumExps() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMadDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMads() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMaxs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMeans2() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMedians() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowMins() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowOrderStats() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowProds() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowQuantiles() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowRanges() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowRanks() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowSdDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowSds() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowsum() | ☑️ | ❌ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowSums2() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowTabulates() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowVarDiffs() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowVars() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowWeightedMads() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowWeightedMeans() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowWeightedMedians() | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowWeightedSds() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |
rowWeightedVars() | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |