Common Workflow Language (CWL) - bcbio-nextgen 1.2.9 documentation

CWL functionality is not supported as of bcbio 1.2.8.

bcbio runs with Common Workflow Language (CWL) compatible parallelization software. bcbio generates a CWL workflow from a standard bcbio sample YAML description file and any tool that supports CWL input can run the workflow. CWL-based tools do the work of managing files and workflows, and bcbio performs the biological analysis using either a Docker container or a local installation.

Current status¶

bcbio creates CWL for alignment, small variant calls (SNPs and indels), coverage assessment, HLA typing, quality control and structural variant calling. It generates a CWL v1.0.2 compatible workflow. The actual biological code execution during runs works with either a bcbio Docker container or a local installation of bcbio.

The implementation includes bcbio’s approaches to splitting and batching analyses. In the top-level workflow, we parallelize by samples. Using sub-workflows, we split fastq inputs into sections for parallel alignment over multiple machines, followed by merging. We also use sub-workflows, along with CWL records, to batch multiple samples and run them in parallel. This enables pooled and tumor/normal cancer calling with parallelization by chromosome regions based on coverage calculations.
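As a rough illustration of the batching idea, the sketch below groups sample records by batch and separates tumors from normals so each batch could be handed to an independent job. The record layout and the function name `pair_tumor_normal` are assumptions made for this example; they are not bcbio's internal representation.

```python
from collections import defaultdict


def pair_tumor_normal(samples):
    """Return {batch: (tumor_names, normal_names)} for a list of sample dicts."""
    batches = defaultdict(lambda: ([], []))
    for sample in samples:
        tumors, normals = batches[sample["batch"]]
        if sample["phenotype"] == "tumor":
            tumors.append(sample["description"])
        else:
            normals.append(sample["description"])
    return dict(batches)


# Hypothetical sample records; each batch can then run as an independent job.
samples = [
    {"description": "sample1", "batch": "b1", "phenotype": "tumor"},
    {"description": "sample2", "batch": "b1", "phenotype": "normal"},
    {"description": "sample3", "batch": "b2", "phenotype": "tumor"},
    {"description": "sample4", "batch": "b2", "phenotype": "normal"},
]
for batch, (tumors, normals) in pair_tumor_normal(samples).items():
    print(batch, tumors, normals)
```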

[Figure: CWL example]

bcbio supports these CWL-compatible tools:

Cromwell – multicore local runs and distributed runs on HPC systems with shared filesystems and schedulers like SLURM, SGE and PBSPro.
Arvados – a hosted platform that runs on top of parallel cloud environments. We include an example below of running on the public Curoverse instance running on Microsoft Azure.
DNAnexus – a hosted platform running distributed jobs on cloud environments, working with both AWS and Azure.
Seven Bridges – parallel distributed analyses on the Seven Bridges platform and Cancer Genomics Cloud.
Toil – parallel local and distributed cluster runs on schedulers like SLURM, SGE and PBSPro.
rabix bunny – multicore local runs.
cwltool – a single core analysis engine, primarily used for testing.

We plan to continue expanding CWL support to include more components of bcbio, and also need to evaluate the workflow on larger, real-life analyses. This includes supporting additional CWL runners. We’re working on evaluating Galaxy/Planemo for integration with the Galaxy community.

Installation¶

bcbio-vm installs all dependencies required to generate CWL and run bcbio, along with supported CWL runners. There are two install choices, depending on your usage of bcbio: running CWL with an existing local bcbio install, or running with containers.

Install bcbio-vm with a local bcbio¶

To run bcbio without using containers, first install bcbio and make it available in your path. You’ll need both the bcbio code and tools. To only run the tests and bcbio validations, you don’t need a full data installation, so you can install with --nodata.

To then install bcbio-vm, add the --cwl flag to the install:

bcbio_nextgen.py upgrade --cwl

Adding this to any future upgrades will also update the bcbio-vm wrapper code and tools.

When you begin running your own analysis and need the data available, pre-prepare your bcbio data directory with bcbio_nextgen.py upgrade --data --cwl.

Install bcbio-vm with containers¶

If you don’t have an existing local bcbio installation and want to run with CWL using the tools and data embedded in containers, you can do a standalone install of just bcbio-vm. To install using Miniconda and bioconda packages on Linux:

export TARGETDIR=~/install/bcbio-vm/anaconda
export BINDIR=/usr/local/bin
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $TARGETDIR
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen-vm
mkdir -p $BINDIR
ln -s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
ln -s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
ln -s $TARGETDIR/bin/python $BINDIR/bcbiovm_python

In the above commands, the bcbio-vm install goes in $TARGETDIR. The example is in your home directory but set it anywhere you have space. Also, as an alternative to symbolic linking to a $BINDIR, you can add the install bin directory to your PATH:

export PATH=$TARGETDIR/bin:$PATH

This install includes bcbio-nextgen libraries, used in generating CWL and orchestrating runs, but is not a full bcbio installation. It requires Docker to be present on your system. This is all you need to get started running examples, since the CWL runners will pull in Docker containers with the bcbio tools.

Getting started¶

To make it easy to get started, we have pre-built CWL descriptions that use test data. These run in under 5 minutes on a local machine and don’t require a bcbio installation if you have Docker available on your machine:

Download and unpack the test repository:
wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
tar -xzvpf test_bcbio_cwl.tar.gz
cd test_bcbio_cwl-master/somatic
Run the analysis using either Cromwell, Rabix bunny or Toil. If you have Docker available on your machine, the runner will download the correct bcbio container and you don’t need to install anything else to get started. If you have an old version of the container, you can update to the latest with docker pull quay.io/bcbio/bcbio-vc. There are shell scripts that provide the command lines for running:
bash run_cromwell.sh
bash run_bunny.sh
bash run_toil.sh
Or you can run directly using the bcbio_vm.py wrappers:
bcbio_vm.py cwlrun cromwell somatic-workflow
bcbio_vm.py cwlrun toil somatic-workflow
bcbio_vm.py cwlrun bunny somatic-workflow
These wrappers automatically handle temporary directories, permissions, logging and restarts. If you are running without Docker and using a local installation of bcbio, add --no-container to the commands in the shell scripts.

Running with Toil¶

The Toil pipeline management system runs CWL workflows in parallel on a local machine, on a cluster or at AWS.

To run a bcbio CWL workflow locally with Toil using Docker:

bcbio_vm.py cwlrun toil sample-workflow

If you want to run from a locally installed bcbio, add --no-container to the command line.

To run distributed on a Slurm cluster:

bcbio_vm.py cwlrun toil sample-workflow -- --batchSystem slurm
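If you launch runs from a script, a small helper can assemble the same command lines. The sketch below only mirrors the two invocations shown above, placing runner options after `--` as in the Slurm example; the helper names `cwlrun_command` and `run` are made up for illustration and are not part of bcbio-vm.

```python
import subprocess


def cwlrun_command(runner, workflow_dir, runner_args=()):
    """Build the bcbio_vm.py cwlrun argument list shown above."""
    cmd = ["bcbio_vm.py", "cwlrun", runner, workflow_dir]
    if runner_args:
        # Options for the runner itself follow "--", as in the Slurm example.
        cmd += ["--", *runner_args]
    return cmd


def run(cmd):
    """Execute the command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


print(" ".join(cwlrun_command("toil", "sample-workflow", ["--batchSystem", "slurm"])))
```

Calling `run(cwlrun_command("toil", "sample-workflow"))` would start the local Docker-based run, assuming bcbio_vm.py is on your PATH.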

Running on Arvados¶

bcbio-generated CWL workflows run on Arvados, and these instructions detail how to run on the Arvados public instance. Arvados cwl-runner comes pre-installed with bcbio-vm. We have a publicly accessible project, called bcbio_resources, that contains the latest Docker images, test data and genome references you can use for runs.

Retrieve API keys from the Arvados public instance. Login, then go to ‘User Icon-> Personal Token’. Copy and paste the commands given there into your shell. You’ll specifically need to set ARVADOS_API_HOST and ARVADOS_API_TOKEN.
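Before starting a run, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch of such a check; the function name `missing_arvados_settings` is hypothetical, and only the two variable names come from the text above.

```python
import os

REQUIRED = ("ARVADOS_API_HOST", "ARVADOS_API_TOKEN")


def missing_arvados_settings(env=None):
    """Return the names of required Arvados variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_arvados_settings()
if missing:
    print("Set these before running:", ", ".join(missing))
```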

To run an analysis:

Create a new project from the web interface (Projects -> Add a new project). Note the project ID from the URL of the project (an identifier like qr1hi-j7d0g-7t73h4hrau3l063).
Upload reference data to Arvados Keep. Note the genome collection UUID. You can also use the existing genomes pre-installed in the bcbio_resources project if using the public Arvados playground:
arv-put --name testdata_genomes --project-uuid $PROJECT_ID testdata/genomes/hg19
Upload input data to Arvados Keep. Note the collection UUID:
arv-put --name testdata_inputs --project-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material
Create an Arvados section in a bcbio_system.yaml file specifying locations to look for reference and input data. input can be one or more collections containing files or associated files in the original sample YAML:
arvados:
  reference: qr1hi-4zz18-kuz1izsj3wkfisq
  input: [qr1hi-j7d0g-h691y6104tlg8b4]
resources:
  default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}
Generate the CWL to run your samples. If you’re using multiple input files with a CSV metadata file and template, start by creating a configuration file:
bcbio_vm.py template --systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv
To generate the CWL from the system and sample configuration files:
bcbio_vm.py cwl --systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml
In most cases, Arvados should directly pick up the Docker images you need from the public bcbio_resources project in your instance. If you need to add them to your project manually, you can copy the latest bcbio Docker image into your project from bcbio_resources using arv-copy. You’ll need to find the UUID of quay.io/bcbio/bcbio-vc and arvados/jobs:
arv-copy $JOBS_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
arv-copy $BCBIO_VC_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
or import local Docker images to your Arvados project:
docker pull arvados/jobs:1.0.20180216164101
arv-keepdocker --project $PROJECT_ID -- arvados/jobs 1.0.20180216164101
docker pull quay.io/bcbio/bcbio-vc
arv-keepdocker --project $PROJECT_ID -- quay.io/bcbio/bcbio-vc latest
Run the CWL on the Arvados public cloud using the Arvados cwl-runner:
bcbio_vm.py cwlrun arvados arvados_testcwl-workflow -- --project-uuid $PROJECT_ID
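The steps above pass around several Arvados identifiers (project IDs and collection UUIDs), and it is easy to paste the wrong kind into a configuration. The sketch below checks the general shape of an identifier and labels it. The mapping of the middle segment to a type (`j7d0g` for projects, `4zz18` for collections) reflects identifiers as commonly seen on Arvados and is treated here as an assumption, not an exhaustive list; the function name is hypothetical.

```python
import re

# Type infixes as commonly seen in Arvados identifiers (assumed, not exhaustive).
UUID_TYPES = {"j7d0g": "project", "4zz18": "collection"}
UUID_RE = re.compile(r"^[a-z0-9]{5}-([a-z0-9]{5})-[a-z0-9]{15}$")


def arvados_uuid_kind(uuid):
    """Return 'project', 'collection', 'other', or None if the shape is wrong."""
    match = UUID_RE.match(uuid)
    if match is None:
        return None
    return UUID_TYPES.get(match.group(1), "other")


print(arvados_uuid_kind("qr1hi-j7d0g-7t73h4hrau3l063"))  # project
print(arvados_uuid_kind("qr1hi-4zz18-kuz1izsj3wkfisq"))  # collection
```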

Running on DNAnexus¶

bcbio runs on the DNAnexus platform by converting bcbio-generated CWL into DNAnexus workflows and apps using dx-cwl. This describes the process using the bcbio workflow app (bcbio-run-workflow) and bcbio workflow applet (bcbio_resources:/applets/bcbio-run-workflow) in the public bcbio_resources project; both are regularly updated and maintained on the DNAnexus platform. Secondarily, we also show how to install and create workflows locally for additional control and debugging.

Set some useful environment variables:

$PNAME – The name of the project you’re analyzing. For convenience here we keep this the same for your local files and remote DNAnexus project, although that does not have to be true.
$DX_AUTH_TOKEN – The DNAnexus authorization token for access, used for the dx command line tool and bcbio scripts.
$DX_PROJECT_ID – The DNAnexus GUID identifier for your project (similar to project-F8Q7fJj0XFJJ3XbBPQYXP4B9). You can get this from dx env after creating/selecting a project in steps 1 and 2.

Create an analysis project:
Upload sample data to the project:
dx select $PNAME
dx upload -p --path /data/input *.bam
Create a bcbio system YAML file with projects, locations of files and desired core and memory usage for jobs. bcbio uses the core and memory specifications to determine machine instance types to use:
dnanexus:
  project: PNAME
  ref:
    project: bcbio_resources
    folder: /reference_genomes
  inputs:
    - /data/input
    - /data/input/regions
resources:
  default: {cores: 8, memory: 3000M, jvm_opts: [-Xms1g, -Xmx3000m]}
Create a bcbio sample CSV file referencing samples to run. The files can be relative to the inputs directory specified above; bcbio will search recursively for files, so you don’t need to specify full paths if your file names are unique. Start with a sample specification:
samplename,description,batch,phenotype
file1.bam,sample1,b1,tumor
file2.bam,sample2,b1,normal
file3.bam,sample3,b2,tumor
file4.bam,sample4,b2,normal
Pick a template file that describes the bcbio configuration variables. You can define parameters either globally (in the template file) or by sample (in the CSV) using the standard bcbio templating. An example template for GATK4 germline variant calling is:
details:
- algorithm:
    aligner: bwa
    variantcaller: gatk-haplotype
  analysis: variant2
  genome_build: hg38

Supply the three inputs (bcbio_system.yaml, project.csv and template.yaml) to either the bcbio-run-workflow app or applet. This example uses a specific version of the bcbio app for full reproducibility; any future re-runs will always use the exact same versioned tools and workflows. You can do this using the web interface or via the command line with a small script like:
TEMPLATE=germline
APP_VERSION=0.0.2
FOLDER=/bcbio/$PNAME
dx select "$PROJECT"
dx mkdir -p $FOLDER
for F in $TEMPLATE-template.yaml $PNAME.csv bcbio_system-dnanexus.yaml
do
    dx rm -a /$FOLDER/$F || true
    dx upload --path /$FOLDER/ $F
done
dx ls $FOLDER
dx rm -a -r /$FOLDER/dx-cwl-run || true
dx run bcbio-run-workflow/$APP_VERSION -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
Alternatively if you want the latest bcbio code, change the final command to use the applet. Everything else in the script is identical:
dx run bcbio_resources:/applets/bcbio-run-workflow -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
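Because bcbio searches the input folders recursively, the sample CSV in the steps above only works with bare file names when those names are unique. The sketch below is one way to check a CSV for repeated base names before uploading it. It assumes the `samplename` column holds file names or paths as in the example CSV, and the function name `duplicate_file_names` is hypothetical.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath


def duplicate_file_names(csv_text):
    """Return base file names that occur more than once in the samplename column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = Counter(PurePosixPath(row["samplename"]).name for row in reader)
    return sorted(name for name, count in names.items() if count > 1)


# Hypothetical CSV in which two rows resolve to the same base file name.
sample_csv = """samplename,description,batch,phenotype
file1.bam,sample1,b1,tumor
file2.bam,sample2,b1,normal
runA/file1.bam,sample3,b2,tumor
"""
print(duplicate_file_names(sample_csv))  # ['file1.bam']
```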

The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus workflow, and submit it to the platform. The workflow runs as a standard DNAnexus workflow and you can monitor it through the command line (with dx find executions --root job-YOURJOBID and dx watch) or the web interface (Monitor tab).

If you prefer not to use the DNAnexus app, you can also submit jobs locally by installing bcbio-vm on your local machine. This can also be useful to test generation of CWL and manually ensure identification of all your samples and associated files on the DNAnexus platform.

Follow the Automated sample configuration workflow to generate a full configuration, and generate a CWL description of the workflow:
TEMPLATE=germline
rm -rf $PNAME $PNAME-workflow
bcbio_vm.py template --systemconfig bcbio_system-dnanexus.yaml $TEMPLATE-template.yaml $PNAME.csv
bcbio_vm.py cwl --systemconfig bcbio_system-dnanexus.yaml $PNAME/config/$PNAME.yaml
Determine project information and login credentials. You’ll want to note the Auth token used and Current workspace project ID:
Compile the CWL workflow into a DNAnexus workflow:
dx-cwl compile-workflow $PNAME-workflow/main-$PNAME.cwl \
--project PROJECT_ID --token $DX_AUTH_TOKEN \
--rootdir $FOLDER/dx-cwl-run
Upload sample information from generated CWL and run workflow:
FOLDER=/bcbio/$PNAME
dx mkdir -p $DX_PROJECT_ID:$FOLDER/$PNAME-workflow
dx upload -p --path $DX_PROJECT_ID:$FOLDER/$PNAME-workflow $PNAME-workflow/main-$PNAME-samples.json
dx-cwl run-workflow $FOLDER/dx-cwl-run/main-$PNAME/main-$PNAME \
$FOLDER/$PNAME-workflow/main-$PNAME-samples.json \
--project PROJECT_ID --token $DX_AUTH_TOKEN \
--rootdir $FOLDER/dx-cwl-run

Running on Seven Bridges¶

bcbio runs on Seven Bridges, including the main platform and specialized data sources like the Cancer Genomics Cloud and Cavatica. Seven Bridges uses generated CWL directly, and bcbio has utilities to query your remote data on the platform and prepare CWL for direct submission.

Since Seven Bridges is available on multiple platforms and data access points, we authenticate with a configuration file in $HOME/.sevenbridges/credentials with potentially multiple profiles defining API access URLs and authentication keys. We reference the specified credentials when setting up a bcbio_system-sbg.yaml file to ensure correct authentication.
Upload your inputs and bcbio reference data using the Seven Bridges command line uploader. We plan to host standard bcbio reference data in a public project so you should only need to upload your project specific data:
sbg-uploader.sh -p chapmanb/bcbio-test --folder inputs --preserve-folder fastq_files regions
Create bcbio_system-sbg.yaml file defining locations of inputs:
sbgenomics:
  profile: default
  project: chapmanb/bcbio-test
  inputs:
    - /testdata/100326_FC6107FAAXX
    - /testdata/automated
    - /testdata/genomes
    - /testdata/reference_material
resources:
  default:
    cores: 2
    memory: 3G
    jvm_opts: [-Xms750m, -Xmx3000m]
Follow the Automated sample configuration workflow to generate a full configuration, and generate a CWL description of the workflow:
PNAME=somatic
bcbio_vm.py template --systemconfig=bcbio_system-sbg.yaml ${PNAME}_template.yaml $PNAME.csv
bcbio_vm.py cwl --systemconfig=bcbio_system-sbg.yaml $PNAME/config/$PNAME.yaml
Run the job on the Seven Bridges platform:
PNAME=somatic
SBG_PROJECT=bcbio-test
bcbio_vm.py cwlrun sbg ${PNAME}-workflow -- --project ${SBG_PROJECT}
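The profile named in bcbio_system-sbg.yaml has to exist in the credentials file mentioned in the steps above. The sketch below reads one profile from an INI-style text. The file layout, the `api_endpoint` and `auth_token` key names, and the endpoint URL are assumptions based on the usual Seven Bridges credentials format, with placeholder values; the function name `load_profile` is hypothetical.

```python
import configparser

# Assumed layout of $HOME/.sevenbridges/credentials; values are placeholders.
EXAMPLE = """
[default]
api_endpoint = https://api.sbgenomics.com/v2
auth_token = REPLACE_WITH_TOKEN
"""


def load_profile(text, profile="default"):
    """Return the settings for one profile, or raise KeyError if it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found")
    return dict(parser[profile])


print(sorted(load_profile(EXAMPLE)))  # ['api_endpoint', 'auth_token']
```

To check a real file, read its contents (for example with `pathlib.Path.home() / ".sevenbridges" / "credentials"`) and pass the text to `load_profile` together with the profile name from your system YAML.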

Development notes¶

bcbio generates a Common Workflow Language description. Internally, bcbio represents the files and information related to processing as a comprehensive dictionary. This world object describes the state of a run and associated files, and new processing steps update or add information to it. The world object is roughly equivalent to CWL’s JSON-based input object, but CWL enforces additional annotations to identify files and models new inputs/outputs at each step. The work in bcbio is to move from our laissez-faire approach to the more structured CWL model.
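To make the difference concrete, here is a minimal sketch of turning a nested dictionary into a flat, CWL-style input object with explicit file annotations. The tiny `world` dictionary, the `__` key separator and both helper names are assumptions for illustration; only the `{"class": "File", "path": ...}` shape comes from the CWL input object format.

```python
def flatten_world(world, sep="__", prefix=""):
    """Flatten nested dictionaries into a single level of joined keys."""
    flat = {}
    for key, value in world.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_world(value, sep, name))
        else:
            flat[name] = value
    return flat


def as_cwl_file(path):
    """Annotate a path the way CWL input objects mark files."""
    return {"class": "File", "path": path}


# Hypothetical, heavily reduced world object.
world = {
    "description": "sample1",
    "config": {"algorithm": {"aligner": "bwa"}},
    "align_bam": "sample1.bam",
}
inputs = flatten_world(world)
inputs["align_bam"] = as_cwl_file(inputs["align_bam"])
print(inputs)
```

The explicit file annotation is the part a free-form dictionary does not require: a CWL runner needs to know which values are files so it can stage them for each step.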

The generated CWL workflow is in run_info-cwl-workflow:

main-*.cwl – the top level CWL file describing the workflow steps
main*-samples.json – the flattened bcbio world structure represented as CWL inputs
wf-*.cwl – CWL sub-workflows, describing sample level parallel processing of a section of the workflow, with potential internal parallelization.
steps/*.cwl – CWL descriptions of sections of code run inside bcbio. Each of these is a potential parallelization point, and together they make up the nodes in the workflow.
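When inspecting a generated workflow directory, the naming patterns above are enough to sort files by role. The sketch below applies them with `fnmatch`; the example file names and the function name `classify` are hypothetical.

```python
from fnmatch import fnmatch

# Patterns from the list above, checked in order.
ROLES = [
    ("steps/*.cwl", "step"),
    ("wf-*.cwl", "sub-workflow"),
    ("main*-samples.json", "inputs"),
    ("main-*.cwl", "top-level workflow"),
]


def classify(relative_path):
    """Return the role of a file inside the generated workflow directory."""
    for pattern, role in ROLES:
        if fnmatch(relative_path, pattern):
            return role
    return "other"


# Hypothetical file names following the documented patterns.
for path in ["main-somatic.cwl", "main-somatic-samples.json",
             "wf-variantcall.cwl", "steps/alignment.cwl"]:
    print(path, "->", classify(path))
```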

To help with defining the outputs at each step, there is a WorldWatcher object that can output changed files and world dictionary objects between steps in the pipeline when running bcbio in the standard way. The variant pipeline has examples using it. This is useful when preparing the CWL definitions of inputs and outputs for new steps in the bcbio CWL step definitions.
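The sketch below is not the WorldWatcher implementation; it only illustrates the underlying idea of comparing the world dictionary before and after a step to see which keys a CWL step definition would need to declare as outputs. The dictionaries and the function name `changed_keys` are hypothetical.

```python
def changed_keys(before, after):
    """Return keys that were added or whose values differ between two states."""
    return sorted(k for k in after if k not in before or before[k] != after[k])


# Hypothetical world states before and after an alignment step.
before = {"description": "sample1", "files": ["sample1_1.fq.gz"]}
after = {"description": "sample1", "files": ["sample1_1.fq.gz"],
         "align_bam": "sample1.bam"}
print(changed_keys(before, after))  # ['align_bam']
```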

ToDo¶

Support the full variant calling workflow with additional steps like ensemble calling, heterogeneity detection and disambiguation.
Port RNA-seq and small RNA workflows to CWL.