Bioconductor Code: DelayedMatrixStats (original) (raw)

# DelayedMatrixStats \*\*DelayedMatrixStats\*\* is a port of the \[\*\*matrixStats\*\*\](https://CRAN.R-project.org/package=matrixStats) API to work with \*DelayedMatrix\* objects from the \[\*\*DelayedArray\*\*\](http://bioconductor.org/packages/DelayedArray/) package. For a \*DelayedMatrix\*, \`x\`, the simplest way to apply a function, \`f()\`, from \*\*matrixStats\*\* is\`matrixStats::f(as.matrix(x))\`. However, this “\*realizes\*” \`x\` in memory as a \*base::matrix\*, which typically defeats the entire purpose of using a \*DelayedMatrix\* for storing the data. The \*\*DelayedArray\*\* package already implements a clever strategy called “block-processing” for certain common “matrix stats” operations (e.g. \`colSums()\`, \`rowSums()\`). This is a good start, but not all of the \*\*matrixStats\*\* API is currently supported. Furthermore, certain operations can be optimized with additional information about \`x\`. I’ll refer to these “seed-aware” implementations. ## Installation You can install \*\*DelayedMatrixStats\*\* from Bioconductor with: \`\`\` r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DelayedMatrixStats") \`\`\` ## Example This example compares two ways of computing column sums of a \*DelayedMatrix\* object: 1\. \`DelayedMatrix::colSums()\`: The ‘block-processing strategy’, implemented in the \*\*DelayedArray\*\* package. The block-processing strategy works for any \*DelayedMatrix\* object, regardless of the type of \*seed\*. 2\. \`DelayedMatrixStats::colSums2()\`: The ‘seed-aware’ strategy, implemented in the \*\*DelayedMatrixStats\*\* package. The seed-aware implementation is optimized for both speed and memory but only for \*DelayedMatrix\* objects with certain types of \*seed\*. \`\`\` r library(DelayedMatrixStats) library(sparseMatrixStats) library(microbenchmark) library(profmem) \`\`\` \`\`\` r set.seed(666) # Fast column sums of DelayedMatrix with matrix seed dense\_matrix <- DelayedArray(matrix(runif(20000 \* 600), nrow = 20000, ncol = 600)) class(seed(dense\_matrix)) #> \[1\] "matrix" "array" dense\_matrix #> <20000 x 600> DelayedMatrix object of type "double": #> \[,1\] \[,2\] \[,3\] ... \[,599\] \[,600\] #> \[1,\] 0.7743685 0.6601787 0.4098798 . 0.89118118 0.05776471 #> \[2,\] 0.1972242 0.8436035 0.9198450 . 0.31799523 0.63099417 #> \[3,\] 0.9780138 0.2017589 0.4696158 . 0.31783791 0.02830454 #> \[4,\] 0.2013274 0.8797239 0.6474768 . 0.55217184 0.09678816 #> \[5,\] 0.3612444 0.8158778 0.5928599 . 0.08530977 0.39224147 #> ... . . . . . . #> \[19996,\] 0.19490291 0.07763570 0.56391725 . 0.09703424 0.62659353 #> \[19997,\] 0.61182993 0.01910121 0.04046034 . 0.59708388 0.88389731 #> \[19998,\] 0.12932744 0.21155070 0.19344085 . 0.51682032 0.13378223 #> \[19999,\] 0.18985573 0.41716539 0.35110782 . 0.62939661 0.94601427 #> \[20000,\] 0.87889047 0.25308041 0.54666920 . 0.81630322 0.73272217 microbenchmark(DelayedArray::colSums(dense\_matrix), DelayedMatrixStats::colSums2(dense\_matrix), times = 10) #> Warning in microbenchmark(DelayedArray::colSums(dense\_matrix), DelayedMatrixStats::colSums2(dense\_matrix), : #> less accurate nanosecond times to avoid potential integer overflows #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(dense\_matrix) 36.26811 40.05528 87.89183 42.91652 49.70672 278.0928 10 #> DelayedMatrixStats::colSums2(dense\_matrix) 12.07319 12.24256 12.50835 12.53206 12.70734 12.8938 10 profmem::total(profmem::profmem(DelayedArray::colSums(dense\_matrix))) #> \[1\] 96106072 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(dense\_matrix))) #> \[1\] 6064 # Fast, low-memory column sums of DelayedMatrix with sparse matrix seed sparse\_matrix <- seed(dense\_matrix) zero\_idx <- sample(length(sparse\_matrix), 0.6 \* length(sparse\_matrix)) sparse\_matrix\[zero\_idx\] <- 0 sparse\_matrix <- DelayedArray(Matrix::Matrix(sparse\_matrix, sparse = TRUE)) class(seed(sparse\_matrix)) #> \[1\] "dgCMatrix" #> attr(,"package") #> \[1\] "Matrix" sparse\_matrix #> <20000 x 600> sparse DelayedMatrix object of type "double": #> \[,1\] \[,2\] \[,3\] ... \[,599\] \[,600\] #> \[1,\] 0.7743685 0.0000000 0.0000000 . 0.89118118 0.00000000 #> \[2,\] 0.1972242 0.0000000 0.9198450 . 0.00000000 0.00000000 #> \[3,\] 0.9780138 0.0000000 0.4696158 . 0.31783791 0.00000000 #> \[4,\] 0.0000000 0.8797239 0.6474768 . 0.55217184 0.00000000 #> \[5,\] 0.3612444 0.0000000 0.0000000 . 0.08530977 0.39224147 #> ... . . . . . . #> \[19996,\] 0.1949029 0.0776357 0.0000000 . 0.09703424 0.00000000 #> \[19997,\] 0.0000000 0.0000000 0.0000000 . 0.00000000 0.88389731 #> \[19998,\] 0.0000000 0.2115507 0.1934408 . 0.00000000 0.00000000 #> \[19999,\] 0.1898557 0.0000000 0.3511078 . 0.62939661 0.94601427 #> \[20000,\] 0.8788905 0.2530804 0.0000000 . 0.00000000 0.73272217 microbenchmark(DelayedArray::colSums(sparse\_matrix), DelayedMatrixStats::colSums2(sparse\_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max #> DelayedArray::colSums(sparse\_matrix) 145.485138 147.451908 199.555520 151.754715 173.112209 384.247039 #> DelayedMatrixStats::colSums2(sparse\_matrix) 5.284654 5.392238 5.455612 5.449802 5.522536 5.643978 #> neval #> 10 #> 10 profmem::total(profmem::profmem(DelayedArray::colSums(sparse\_matrix))) #> \[1\] 249647440 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(sparse\_matrix))) #> \[1\] 4848 # Fast column sums of DelayedMatrix with Rle-based seed rle\_matrix <- RleArray(Rle(sample(2L, 200000 \* 6 / 10, replace = TRUE), 100), dim = c(2000000, 6)) class(seed(rle\_matrix)) #> \[1\] "SolidRleArraySeed" #> attr(,"package") #> \[1\] "DelayedArray" rle\_matrix #> <2000000 x 6> RleMatrix object of type "integer": #> \[,1\] \[,2\] \[,3\] \[,4\] \[,5\] \[,6\] #> \[1,\] 2 2 1 1 1 2 #> \[2,\] 2 2 1 1 1 2 #> \[3,\] 2 2 1 1 1 2 #> \[4,\] 2 2 1 1 1 2 #> \[5,\] 2 2 1 1 1 2 #> ... . . . . . . #> \[1999996,\] 1 2 2 1 1 1 #> \[1999997,\] 1 2 2 1 1 1 #> \[1999998,\] 1 2 2 1 1 1 #> \[1999999,\] 1 2 2 1 1 1 #> \[2000000,\] 1 2 2 1 1 1 microbenchmark(DelayedArray::colSums(rle\_matrix), DelayedMatrixStats::colSums2(rle\_matrix), times = 10) #> Unit: milliseconds #> expr min lq mean median uq max neval #> DelayedArray::colSums(rle\_matrix) 393.768059 397.620132 448.117179 412.378390 476.314179 605.87922 10 #> DelayedMatrixStats::colSums2(rle\_matrix) 1.946147 2.239174 8.924917 2.819098 3.871302 61.91156 10 profmem::total(profmem::profmem(DelayedArray::colSums(rle\_matrix))) #> \[1\] 168003192 profmem::total(profmem::profmem(DelayedMatrixStats::colSums2(rle\_matrix))) #> \[1\] 1968 \`\`\` ## Benchmarking An extensive set of benchmarks is under development at . ## API coverage - ✔ = Implemented in \*\*DelayedMatrixStats\*\* - ☑️ = Implemented in \[\*\*DelayedArray\*\*\](http://bioconductor.org/packages/DelayedArray/) or \[\*\*sparseMatrixStats\*\*\](http://bioconductor.org/packages/sparseMatrixStats/) - ❌: = Not yet implemented | Method | Block processing | \*base::matrix\* optimized | \*Matrix::dgCMatrix\* optimized | \*Matrix::lgCMatrix\* optimized | \*DelayedArray::RleArray\* (\*SolidRleArraySeed\*) optimized | \*DelayedArray::RleArray\* (\*ChunkedRleArraySeed\*) optimized | \*HDF5Array::HDF5Matrix\* optimized | \*base::data.frame\* optimized | \*S4Vectors::DataFrame\* optimized | |:-----------------------|:-----------------|:-------------------------|:------------------------------|:------------------------------|:---------------------------------------------------------|:-----------------------------------------------------------|:----------------------------------|:-----------------------------|:---------------------------------| | \`colAlls()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colAnyMissings()\` | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colAnyNAs()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colAnys()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colAvgsPerRowSet()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCollapse()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCounts()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCummaxs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCummins()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCumprods()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colCumsums()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colIQRDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colIQRs()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colLogSumExps()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colMadDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colMads()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colMaxs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colMeans2()\` | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | | \`colMedians()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colMins()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colOrderStats()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colProds()\` | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | | \`colQuantiles()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colRanges()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colRanks()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colSdDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colSds()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colsum()\` | ☑️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colSums2()\` | ✔ | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | | \`colTabulates()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colVarDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colVars()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colWeightedMads()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colWeightedMeans()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colWeightedMedians()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colWeightedSds()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`colWeightedVars()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowAlls()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowAnyMissings()\` | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowAnyNAs()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowAnys()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowAvgsPerColSet()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCollapse()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCounts()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCummaxs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCummins()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCumprods()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowCumsums()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowIQRDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowIQRs()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowLogSumExps()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMadDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMads()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMaxs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMeans2()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMedians()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowMins()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowOrderStats()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowProds()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowQuantiles()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowRanges()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowRanks()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowSdDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowSds()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowsum()\` | ☑️ | ❌ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowSums2()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowTabulates()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowVarDiffs()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowVars()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowWeightedMads()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowWeightedMeans()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowWeightedMedians()\` | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowWeightedSds()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ | | \`rowWeightedVars()\` | ✔ | ✔ | ✔ | ✔ | ❌ | ❌ | ❌ | ❌ | ❌ |