Introduction to BatchtoolsParam
Contents
- 1 Introduction
- 2 Quick start
- 3 BatchtoolsParam interface
- 4 Defining templates
- 5 Use cases
- 6 Session info
Introduction
The BatchtoolsParam class is an interface to the batchtools package from within BiocParallel, for computing on a high-performance cluster managed by a scheduler such as SGE, TORQUE, LSF, SLURM, or OpenLava.
Quick start
library(BiocParallel)
This example demonstrates the easiest way to launch jobs using batchtools. The first step is to create a BatchtoolsParam object. You can then compute with bplapply, which returns the results as a list. Here piApprox is applied 10 times, each time to 10e5 (one million) random points.
## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}
piApprox(1000)
## [1] 3.108
## Apply piApprox over 10 replicates
param <- BatchtoolsParam()
result <- bplapply(rep(10e5, 10), piApprox, BPPARAM=param)
mean(unlist(result))
## [1] 3.143465
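Before submitting to a scheduler, it can help to check the same code on a local machine. The sketch below assumes the batchtools package is installed and uses the "socket" cluster type with 2 workers; both values, and the smaller problem size, are illustrative choices rather than recommendations.

```r
library(BiocParallel)

piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}

## Run locally on 2 socket workers instead of a job scheduler
param <- BatchtoolsParam(workers = 2, cluster = "socket")
result <- bplapply(rep(1e5, 4), piApprox, BPPARAM = param)

## Each element of 'result' is one estimate of pi; average them
mean(unlist(result))
```

Once this works locally, changing the cluster argument (and supplying a template) is the main change needed to move the same call to a scheduler.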
Defining templates
The job submission template controls how the job is processed by the job scheduler on the cluster. The format of the template differs between job schedulers. Let’s look at the default SLURM template as an example:
fname <- batchtoolsTemplate("slurm")
## using default 'slurm' template in batchtools.
cat(readLines(fname), sep="\n")
## #!/bin/bash
##
## ## Job Resource Interface Definition
## ##
## ## ntasks [integer(1)]: Number of required tasks,
## ## Set larger than 1 if you want to further parallelize
## ## with MPI within your job.
## ## ncpus [integer(1)]: Number of required cpus per task,
## ## Set larger than 1 if you want to further parallelize
## ## with multicore/parallel within each task.
## ## walltime [integer(1)]: Walltime for this job, in seconds.
## ## Must be at least 60 seconds for Slurm to work properly.
## ## memory [integer(1)]: Memory in megabytes for each cpu.
## ## Must be at least 100 (when I tried lower values my
## ## jobs did not start at all).
## ##
## ## Default resources can be set in your .batchtools.conf.R by defining the variable
## ## 'default.resources' as a named list.
##
## <%
## # relative paths are not handled well by Slurm
## log.file = fs::path_expand(log.file)
## -%>
##
##
## #SBATCH --job-name=<%= job.name %>
## #SBATCH --output=<%= log.file %>
## #SBATCH --error=<%= log.file %>
## #SBATCH --time=<%= ceiling(resources$walltime / 60) %>
## #SBATCH --ntasks=1
## #SBATCH --cpus-per-task=<%= resources$ncpus %>
## #SBATCH --mem-per-cpu=<%= resources$memory %>
## <%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
## <%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
##
## ## Initialize work environment like
## ## source /etc/profile
## ## module add ...
##
## ## Export value of DEBUGME environemnt var to slave
## export DEBUGME=<%= Sys.getenv("DEBUGME") %>
##
## <%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
## <%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
## <%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
##
## ## Run R:
## ## we merge R output with stdout from SLURM, which gets then logged via --output option
## Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
The <%= %> blocks are filled in automatically when a job is submitted; those that refer to resources take their values from the elements of the resources argument to the BatchtoolsParam constructor. Failing to specify critical parameters properly (e.g., wall time or memory limits that are too low) can cause jobs to crash, often with cryptic error messages. We suggest setting parameters explicitly to provide robustness to changes in system defaults. Note that the <%= %> blocks themselves do not usually need to be modified in the template.
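As a minimal sketch of setting resources explicitly, the code below builds a resources list whose names match the placeholders in the default SLURM template shown above (walltime, ncpus, memory). The values are hypothetical, and the check for the sbatch command is an assumption of this sketch, used so that the constructor is only called on a machine where SLURM appears to be installed.

```r
library(BiocParallel)

## Illustrative values only; tune them to the limits of your cluster.
resources <- list(
    walltime = 3600,  # seconds; the template converts this to minutes
    ncpus = 1,        # cpus per task
    memory = 2048     # megabytes per cpu
)

## Assumption: SLURM is usable when the 'sbatch' command is on the PATH.
if (nzchar(Sys.which("sbatch"))) {
    param <- BatchtoolsParam(
        workers = 10,
        cluster = "slurm",
        template = batchtoolsTemplate("slurm"),
        resources = resources
    )
    param
}
```

With these values the template line `#SBATCH --time=<%= ceiling(resources$walltime / 60) %>` would request 60 minutes.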
The part of the template that is most likely to require explicit customization is the last line, which contains the call to Rscript. A more customized call may be necessary if the R installation is not standard, e.g., if multiple versions of R have been installed on a cluster. For example, one might use instead:
echo 'batchtools::doJobCollection("<%= uri %>")' |
ArbitraryRcommand --no-save --no-echo
If such customization is necessary, we suggest making a local copy of the template, modifying it as required, and then constructing a BatchtoolsParam object with the modified template using the template argument. However, we find that the default templates accessible with batchtoolsTemplate are satisfactory in most cases.
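One way to make such a local copy is sketched below. The file name my-slurm.tmpl is hypothetical, and pinning the final line to the Rscript of the currently running R installation is just one possible customization, not a required one.

```r
library(BiocParallel)

## Copy the default SLURM template to an editable local file.
## "my-slurm.tmpl" is a hypothetical file name.
local_template <- file.path(getwd(), "my-slurm.tmpl")
file.copy(batchtoolsTemplate("slurm"), local_template, overwrite = TRUE)

## Replace the bare 'Rscript' on the final line with the full path of the
## Rscript that belongs to the R installation running this code.
rscript <- file.path(R.home("bin"), "Rscript")
lines <- readLines(local_template)
is_call <- startsWith(lines, "Rscript ")
lines[is_call] <- sub("Rscript", rscript, lines[is_call], fixed = TRUE)
writeLines(lines, local_template)

## On a machine where SLURM is available, pass the copy to the constructor:
## param <- BatchtoolsParam(cluster = "slurm", template = local_template)
```

On a cluster, the copy should live on a file system that is visible to the machine submitting the jobs.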
Use cases
As an example of a BatchtoolsParam job run on an SGE cluster, we use the same piApprox function as defined earlier. The example runs the function on 5 workers and submits 100 jobs to the SGE cluster.
Example of SGE with minimal code:
library(BiocParallel)
## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}
template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)
## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox, BPPARAM=param)
Example of SGE demonstrating some of the BatchtoolsParam methods:
library(BiocParallel)
## Pi approximation
piApprox <- function(n) {
    nums <- matrix(runif(2 * n), ncol = 2)
    d <- sqrt(nums[, 1]^2 + nums[, 2]^2)
    4 * mean(d <= 1)
}
template <- system.file(
    package = "BiocParallel",
    "unitTests", "test_script", "test-sge-template.tmpl"
)
param <- BatchtoolsParam(workers=5, cluster="sge", template=template)
## start param
bpstart(param)
## Display param
param
## To show the registered backend
bpbackend(param)
## Register the param
register(param)
## Check the registered param
registered()
## Run parallel job
result <- bplapply(rep(10e5, 100), piApprox)
bpstop(param)
Session info
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.43.3 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] base64url_1.4 jsonlite_2.0.0 compiler_4.5.0
## [4] BiocManager_1.30.25 crayon_1.5.3 parallel_4.5.0
## [7] jquerylib_0.1.4 progress_1.2.3 yaml_2.3.10
## [10] fastmap_1.2.0 R6_2.6.1 batchtools_0.9.17
## [13] knitr_1.50 backports_1.5.0 checkmate_2.3.2
## [16] tibble_3.2.1 bookdown_0.43 pillar_1.10.2
## [19] bslib_0.9.0 rlang_1.1.6 cachem_1.1.0
## [22] stringi_1.8.7 xfun_0.52 fs_1.6.6
## [25] sass_0.4.10 debugme_1.2.0 cli_3.6.5
## [28] magrittr_2.0.3 withr_3.0.2 digest_0.6.37
## [31] rappdirs_0.3.3 hms_1.1.3 lifecycle_1.0.4
## [34] prettyunits_1.2.0 vctrs_0.6.5 glue_1.8.0
## [37] evaluate_1.0.3 data.table_1.17.4 codetools_0.2-20
## [40] rmarkdown_2.29 tools_4.5.0 pkgconfig_2.0.3
## [43] htmltools_0.5.8.1 brew_1.0-10