GitHub - PeteHaitch/DelayedMatrixStats: A port of the matrixStats API to work with DelayedMatrix objects from the DelayedArray package (original) (raw)

DelayedMatrixStats

DelayedMatrixStats is a port of thematrixStats API to work with DelayedMatrix objects from theDelayedArraypackage.

For a DelayedMatrix, x, the simplest way to apply a function, f(), from matrixStats ismatrixStats::f(as.matrix(x)). However, this “_realizes_” x in memory as a base::matrix, which typically defeats the entire purpose of using a DelayedMatrix for storing the data.

The DelayedArray package already implements a clever strategy called “block-processing” for certain common “matrix stats” operations (e.g. colSums(), rowSums()). This is a good start, but not all of thematrixStats API is currently supported. Furthermore, certain operations can be optimized with additional information about x. I’ll refer to these “seed-aware” implementations.

Installation

You can install DelayedMatrixStats from Bioconductor with:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("DelayedMatrixStats")

Example

This example compares two ways of computing column sums of a_DelayedMatrix_ object:

  1. DelayedMatrix::colSums(): The ‘block-processing strategy’, implemented in the DelayedArray package. The block-processing strategy works for any DelayedMatrix object, regardless of the type of seed.
  2. DelayedMatrixStats::colSums2(): The ‘seed-aware’ strategy, implemented in the DelayedMatrixStats package. The seed-aware implementation is optimized for both speed and memory but only for_DelayedMatrix_ objects with certain types of seed.

library(DelayedMatrixStats) library(sparseMatrixStats) library(microbenchmark) library(profmem)

set.seed(666)

Fast column sums of DelayedMatrix with matrix seed

dense_matrix <- DelayedArray(matrix(runif(20000 * 600), nrow = 20000, ncol = 600)) class(seed(dense_matrix)) #> [1] "matrix" "array" dense_matrix #> <20000 x 600> DelayedMatrix object of type "double": #> [,1] [,2] [,3] ... [,599] [,600] #> [1,] 0.7743685 0.6601787 0.4098798 . 0.89118118 0.05776471 #> [2,] 0.1972242 0.8436035 0.9198450 . 0.31799523 0.63099417 #> [3,] 0.9780138 0.2017589 0.4696158 . 0.31783791 0.02830454 #> [4,] 0.2013274 0.8797239 0.6474768 . 0.55217184 0.09678816 #> [5,] 0.3612444 0.8158778 0.5928599 . 0.08530977 0.39224147 #> ... . . . . . . #> [19996,] 0.19490291 0.07763570 0.56391725 . 0.09703424 0.62659353 #> [19997,] 0.61182993 0.01910121 0.04046034 . 0.59708388 0.88389731 #> [19998,] 0.12932744 0.21155070 0.19344085 . 0.51682032 0.13378223 #> [19999,] 0.18985573 0.41716539 0.35110782 . 0.62939661 0.94601427 #> [20000,] 0.87889047 0.25308041 0.54666920 . 0.81630322 0.73272217 microbenchmark(DelayedArray::colSums(dense_matrix), DelayedMatrixStats::colSums2(dense_matrix), times = 10) #> Warning in microbenchmark(DelayedArray::colSums(dense_matrix), DelayedMatrixStats::colSums2(dense_matrix), : #> less accurate nanosecond times to avoid potential integer overflows #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(dense_matrix) 36.26811 40.05528 87.89183 42.91652 49.70672 278.0928 10 #> DelayedMatrixStats::colSums2(dense_matrix) 12.07319 12.24256 12.50835 12.53206 12.70734 12.8938 10 profmem::total(profmem::profmem(DelayedArray::colSums(dense_matrix))) #> [1] 96106072 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(dense_matrix))) #> [1] 6064

Fast, low-memory column sums of DelayedMatrix with sparse matrix seed

sparse_matrix <- seed(dense_matrix) zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix)) sparse_matrix[zero_idx] <- 0 sparse_matrix <- DelayedArray(Matrix::Matrix(sparse_matrix, sparse = TRUE)) class(seed(sparse_matrix)) #> [1] "dgCMatrix" #> attr(,"package") #> [1] "Matrix" sparse_matrix #> <20000 x 600> sparse DelayedMatrix object of type "double": #> [,1] [,2] [,3] ... [,599] [,600] #> [1,] 0.7743685 0.0000000 0.0000000 . 0.89118118 0.00000000 #> [2,] 0.1972242 0.0000000 0.9198450 . 0.00000000 0.00000000 #> [3,] 0.9780138 0.0000000 0.4696158 . 0.31783791 0.00000000 #> [4,] 0.0000000 0.8797239 0.6474768 . 0.55217184 0.00000000 #> [5,] 0.3612444 0.0000000 0.0000000 . 0.08530977 0.39224147 #> ... . . . . . . #> [19996,] 0.1949029 0.0776357 0.0000000 . 0.09703424 0.00000000 #> [19997,] 0.0000000 0.0000000 0.0000000 . 0.00000000 0.88389731 #> [19998,] 0.0000000 0.2115507 0.1934408 . 0.00000000 0.00000000 #> [19999,] 0.1898557 0.0000000 0.3511078 . 0.62939661 0.94601427 #> [20000,] 0.8788905 0.2530804 0.0000000 . 0.00000000 0.73272217 microbenchmark(DelayedArray::colSums(sparse_matrix), DelayedMatrixStats::colSums2(sparse_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max #> DelayedArray::colSums(sparse_matrix) 145.485138 147.451908 199.555520 151.754715 173.112209 384.247039 #> DelayedMatrixStats::colSums2(sparse_matrix) 5.284654 5.392238 5.455612 5.449802 5.522536 5.643978 #> neval #> 10 #> 10 profmem::total(profmem::profmem(DelayedArray::colSums(sparse_matrix))) #> [1] 249647440 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(sparse_matrix))) #> [1] 4848

Fast column sums of DelayedMatrix with Rle-based seed

rle_matrix <- RleArray(Rle(sample(2L, 200000 * 6 / 10, replace = TRUE), 100), dim = c(2000000, 6)) class(seed(rle_matrix)) #> [1] "SolidRleArraySeed" #> attr(,"package") #> [1] "DelayedArray" rle_matrix #> <2000000 x 6> RleMatrix object of type "integer": #> [,1] [,2] [,3] [,4] [,5] [,6] #> [1,] 2 2 1 1 1 2 #> [2,] 2 2 1 1 1 2 #> [3,] 2 2 1 1 1 2 #> [4,] 2 2 1 1 1 2 #> [5,] 2 2 1 1 1 2 #> ... . . . . . . #> [1999996,] 1 2 2 1 1 1 #> [1999997,] 1 2 2 1 1 1 #> [1999998,] 1 2 2 1 1 1 #> [1999999,] 1 2 2 1 1 1 #> [2000000,] 1 2 2 1 1 1 microbenchmark(DelayedArray::colSums(rle_matrix), DelayedMatrixStats::colSums2(rle_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(rle_matrix) 393.768059 397.620132 448.117179 412.378390 476.314179 605.87922 10 #> DelayedMatrixStats::colSums2(rle_matrix) 1.946147 2.239174 8.924917 2.819098 3.871302 61.91156 10 profmem::total(profmem::profmem(DelayedArray::colSums(rle_matrix))) #> [1] 168003192 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(rle_matrix))) #> [1] 1968

Benchmarking

An extensive set of benchmarks is under development athttp://peterhickey.org/BenchmarkingDelayedMatrixStats/.

API coverage

Method Block processing base::matrix optimized Matrix::dgCMatrix optimized Matrix::lgCMatrix optimized DelayedArray::RleArray (SolidRleArraySeed) optimized DelayedArray::RleArray (ChunkedRleArraySeed) optimized HDF5Array::HDF5Matrix optimized base::data.frame optimized S4Vectors::DataFrame optimized
colAlls()
colAnyMissings()
colAnyNAs()
colAnys()
colAvgsPerRowSet()
colCollapse()
colCounts()
colCummaxs()
colCummins()
colCumprods()
colCumsums()
colDiffs()
colIQRDiffs()
colIQRs()
colLogSumExps()
colMadDiffs()
colMads()
colMaxs()
colMeans2()
colMedians()
colMins()
colOrderStats()
colProds()
colQuantiles()
colRanges()
colRanks()
colSdDiffs()
colSds()
colsum() ☑️
colSums2()
colTabulates()
colVarDiffs()
colVars()
colWeightedMads()
colWeightedMeans()
colWeightedMedians()
colWeightedSds()
colWeightedVars()
rowAlls()
rowAnyMissings()
rowAnyNAs()
rowAnys()
rowAvgsPerColSet()
rowCollapse()
rowCounts()
rowCummaxs()
rowCummins()
rowCumprods()
rowCumsums()
rowDiffs()
rowIQRDiffs()
rowIQRs()
rowLogSumExps()
rowMadDiffs()
rowMads()
rowMaxs()
rowMeans2()
rowMedians()
rowMins()
rowOrderStats()
rowProds()
rowQuantiles()
rowRanges()
rowRanks()
rowSdDiffs()
rowSds()
rowsum() ☑️
rowSums2()
rowTabulates()
rowVarDiffs()
rowVars()
rowWeightedMads()
rowWeightedMeans()
rowWeightedMedians()
rowWeightedSds()
rowWeightedVars()