Common Workflow Language (CWL): bcbio-nextgen 1.2.9 documentation


CWL functionality is not supported as of bcbio 1.2.8.

bcbio runs with Common Workflow Language (CWL) compatible parallelization software. bcbio generates a CWL workflow from a standard bcbio sample YAML description file, and any tool that supports CWL input can run the workflow. CWL-based tools do the work of managing files and workflows, and bcbio performs the biological analysis using either a Docker container or a local installation.

Current status

bcbio creates CWL for alignment, small variant calls (SNPs and indels), coverage assessment, HLA typing, quality control and structural variant calling. It generates a CWL v1.0.2 compatible workflow. The biological code executed during runs works with either a bcbio Docker container or a local installation of bcbio.

The implementation includes bcbio’s approaches to splitting and batching analyses. At the top-level workflow, we parallelize by samples. Using sub-workflows, we split fastq inputs into sections for parallel alignment over multiple machines, followed by merging. We also use sub-workflows, along with CWL records, to batch multiple samples and run them in parallel. This enables pooled and tumor/normal cancer calling with parallelization by chromosome regions based on coverage calculations.

[Figure: CWL example]

bcbio supports several CWL-compatible tools, including Cromwell, Rabix bunny, Toil, Arvados, DNAnexus and Seven Bridges.

We plan to continue to expand CWL support to include more components of bcbio, and also need to evaluate the workflow on larger, real-life analyses. This includes supporting additional CWL runners. We’re working on evaluating Galaxy/Planemo for integration with the Galaxy community.

Installation

bcbio-vm installs all dependencies required to generate CWL and run bcbio, along with supported CWL runners. There are two install choices, depending on your usage of bcbio: running CWL with an existing local bcbio install, or running with containers.

Install bcbio-vm with a local bcbio

To run bcbio without using containers, first install bcbio and make it available in your path. You’ll need both the bcbio code and tools. To only run the tests and bcbio validations, you don’t need a full data installation, so you can install with --nodata.

To then install bcbio-vm, add the --cwl flag to the install:

bcbio_nextgen.py upgrade --cwl

Adding this to any future upgrades will also update the bcbio-vm wrapper code and tools.

When you begin running your own analysis and need the data available, pre-prepare your bcbio data directory with bcbio_nextgen.py upgrade --data --cwl.

Install bcbio-vm with containers

If you don’t have an existing local bcbio installation and want to run with CWL using the tools and data embedded in containers, you can do a standalone install of just bcbio-vm. To install using Miniconda and bioconda packages on Linux:

export TARGETDIR=~/install/bcbio-vm/anaconda
export BINDIR=/usr/local/bin
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $TARGETDIR
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen
$TARGETDIR/bin/conda install --yes -c conda-forge -c bioconda python=3 bcbio-nextgen-vm
mkdir -p $BINDIR
ln -s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
ln -s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
ln -s $TARGETDIR/bin/python $BINDIR/bcbiovm_python

In the above commands, the bcbio-vm install goes in $TARGETDIR. The example is in your home directory but set it anywhere you have space. Also, as an alternative to symbolic linking to a $BINDIR, you can add the install bin directory to your PATH:

export PATH=$TARGETDIR/bin:$PATH

This install includes bcbio-nextgen libraries, used in generating CWL and orchestrating runs, but is not a full bcbio installation. It requires Docker to be present on your system. This is all you need to get started running examples, since the CWL runners will pull in Docker containers with the bcbio tools.
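
Before running the examples, it can help to confirm that the commands described above are reachable from your shell. The following is a minimal sketch and not part of bcbio: the helper name check_tools is an assumption introduced here, and it only tests that commands are on your PATH, not that they work.

```shell
# check_tools (hypothetical helper): report which of the named commands
# are on PATH. Returns non-zero if any of them is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            rc=1
        fi
    done
    return "$rc"
}

# The container-based install relies on the bcbio-vm wrapper and Docker.
check_tools bcbio_vm.py docker || echo "Install or link the missing tools before continuing" >&2
```

If bcbio_vm.py is reported missing after the install, check the symbolic links in $BINDIR or the PATH setting shown above.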

Getting started

To make it easy to get started, we have pre-built CWL descriptions that use test data. These run in under 5 minutes on a local machine and don’t require a bcbio installation if you have Docker available on your machine:

  1. Download and unpack the test repository:
    wget -O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
    tar -xzvpf test_bcbio_cwl.tar.gz
    cd test_bcbio_cwl-master/somatic
  2. Run the analysis using either Cromwell, Rabix bunny or Toil. If you have Docker available on your machine, the runner will download the correct bcbio container and you don’t need to install anything else to get started. If you have an old version of the container, update to the latest with docker pull quay.io/bcbio/bcbio-vc. There are shell scripts that provide the command lines for running:
    bash run_cromwell.sh
    bash run_bunny.sh
    bash run_toil.sh
    Or you can run directly using the bcbio_vm.py wrappers:
    bcbio_vm.py cwlrun cromwell somatic-workflow
    bcbio_vm.py cwlrun toil somatic-workflow
    bcbio_vm.py cwlrun bunny somatic-workflow
    These wrappers automatically handle temporary directories, permissions, logging and restarts. To run without Docker using a local installation of bcbio, add --no-container to the commands in the shell scripts.
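
The wrapper commands in step 2 differ only in the runner name and in whether --no-container is added. A minimal sketch of composing them follows. The helper cwlrun_cmd is hypothetical, and placing --no-container after the workflow name is an assumption, so check bcbio_vm.py cwlrun --help on your install.

```shell
# cwlrun_cmd (hypothetical helper): print a bcbio_vm.py cwlrun command line.
#   $1 runner (cromwell, toil or bunny)
#   $2 workflow directory
#   $3 "local" to use a local bcbio install instead of Docker (optional)
# Assumption: --no-container is accepted after the workflow name.
cwlrun_cmd() {
    runner=$1
    workflow=$2
    mode=${3:-docker}
    cmd="bcbio_vm.py cwlrun $runner $workflow"
    if [ "$mode" = "local" ]; then
        cmd="$cmd --no-container"
    fi
    echo "$cmd"
}

cwlrun_cmd toil somatic-workflow
cwlrun_cmd cromwell somatic-workflow local
```

The function only prints the command, so you can inspect it before running it yourself.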

Running with Toil

The Toil pipeline management system runs CWL workflows in parallel on a local machine, on a cluster or at AWS.

To run a bcbio CWL workflow locally with Toil using Docker:

bcbio_vm.py cwlrun toil sample-workflow

If you want to run from a locally installed bcbio, add --no-container to the command line.

To run distributed on a Slurm cluster:

bcbio_vm.py cwlrun toil sample-workflow -- --batchSystem slurm

Running on Arvados

bcbio-generated CWL workflows run on Arvados, and these instructions detail how to run on the Arvados public instance. Arvados cwl-runner comes pre-installed with bcbio-vm. We have a publicly accessible project, called bcbio_resources, that contains the latest Docker images, test data and genome references you can use for runs.

Retrieve API keys from the Arvados public instance. Login, then go to ‘User Icon-> Personal Token’. Copy and paste the commands given there into your shell. You’ll specifically need to set ARVADOS_API_HOST and ARVADOS_API_TOKEN.
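
Since the later commands depend on these two variables, a quick check before continuing can save a failed upload. This is a minimal sketch. The helper require_env is hypothetical and not part of bcbio or Arvados.

```shell
# require_env (hypothetical helper): report any named environment variables
# that are unset or empty. Returns non-zero if at least one is missing.
require_env() {
    rc=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            rc=1
        fi
    done
    return "$rc"
}

require_env ARVADOS_API_HOST ARVADOS_API_TOKEN || echo "Paste the commands from the Personal Token page first" >&2
```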

To run an analysis:

  1. Create a new project from the web interface (Projects -> Add a new project). Note the project ID from the URL of the project (an identifier like qr1hi-j7d0g-7t73h4hrau3l063).
  2. Upload reference data to Arvados Keep. Note the genome collection UUID. You can also use the existing genomes pre-installed in the bcbio_resources project if using the public Arvados playground:
    arv-put --name testdata_genomes --project-uuid $PROJECT_ID testdata/genomes/hg19
  3. Upload input data to Arvados Keep. Note the collection UUID:
    arv-put --name testdata_inputs --project-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material
  4. Create an Arvados section in a bcbio_system.yaml file specifying locations to look for reference and input data. input can be one or more collections containing files or associated files in the original sample YAML:
    arvados:
      reference: qr1hi-4zz18-kuz1izsj3wkfisq
      input: [qr1hi-j7d0g-h691y6104tlg8b4]
    resources:
      default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}
  5. Generate the CWL to run your samples. If you’re using multiple input files with a CSV metadata file and template, start by creating a configuration file:
    bcbio_vm.py template --systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv
    To generate the CWL from the system and sample configuration files:
    bcbio_vm.py cwl --systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml
  6. In most cases, Arvados should directly pick up the Docker images you need from the public bcbio_resources project in your instance. If you need to add them to your project manually, you can copy the latest bcbio Docker image into your project from bcbio_resources using arv-copy. You’ll need to find the UUIDs of quay.io/bcbio/bcbio-vc and arvados/jobs:
    arv-copy $JOBS_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
    arv-copy $BCBIO_VC_ID --project-uuid $PROJECT_ID --src qr1hi --dst qr1hi
    or import local Docker images to your Arvados project:
    docker pull arvados/jobs:1.0.20180216164101
    arv-keepdocker --project $PROJECT_ID -- arvados/jobs 1.0.20180216164101
    docker pull quay.io/bcbio/bcbio-vc
    arv-keepdocker --project $PROJECT_ID -- quay.io/bcbio/bcbio-vc latest
  7. Run the CWL on the Arvados public cloud using the Arvados cwl-runner:
    bcbio_vm.py cwlrun arvados arvados_testcwl-workflow -- --project-uuid $PROJECT_ID
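
The system configuration in step 4 depends on the collection UUIDs you noted in steps 2 and 3. A minimal sketch of generating that file from them follows. The helper arvados_config is hypothetical, and the resource values are copied from the example in step 4 rather than being recommendations.

```shell
# arvados_config (hypothetical helper): print the arvados and resources
# sections from step 4 for the given collections.
#   $1 reference collection UUID
#   $2 input collection UUID
arvados_config() {
    cat <<EOF
arvados:
  reference: $1
  input: [$2]
resources:
  default: {cores: 4, memory: 2G, jvm_opts: [-Xms750m, -Xmx2500m]}
EOF
}

# Example using the UUIDs shown in step 4; substitute your own.
arvados_config qr1hi-4zz18-kuz1izsj3wkfisq qr1hi-j7d0g-h691y6104tlg8b4
```

Redirect the output to bcbio_system_arvados.yaml to use it with the template and cwl commands in step 5.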

Running on DNAnexus

bcbio runs on the DNAnexus platform by converting bcbio-generated CWL into DNAnexus workflows and apps using dx-cwl. This section describes the process using the bcbio workflow app (bcbio-run-workflow) and bcbio workflow applet (bcbio_resources:/applets/bcbio-run-workflow) in the public bcbio_resources project. Both are regularly updated and maintained on the DNAnexus platform. We also show how to install and create workflows locally for additional control and debugging.

Set some useful environmental variables:
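
The commands in the steps below refer to $PNAME, $DX_AUTH_TOKEN and $DX_PROJECT_ID. A minimal sketch of setting them follows. Every value is a placeholder assumed for illustration, not a real identifier, and you should substitute your own.

```shell
# Placeholder values only (assumptions for illustration); substitute your own.
PNAME=my_analysis                    # hypothetical project name
DX_AUTH_TOKEN=your-dnanexus-token    # hypothetical; use your real access token
DX_PROJECT_ID=project-xxxx           # hypothetical; the ID of your DNAnexus project
export PNAME DX_AUTH_TOKEN DX_PROJECT_ID
```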

  1. Create an analysis project:
  2. Upload sample data to the project:
    dx select $PNAME
    dx upload -p --path /data/input *.bam
  3. Create a bcbio system YAML file with projects, locations of files and desired core and memory usage for jobs. bcbio uses the core and memory specifications to determine machine instance types to use:
    dnanexus:
      project: PNAME
      ref:
        project: bcbio_resources
        folder: /reference_genomes
      inputs:
        - /data/input
        - /data/input/regions
    resources:
      default: {cores: 8, memory: 3000M, jvm_opts: [-Xms1g, -Xmx3000m]}
  4. Create a bcbio sample CSV file referencing samples to run. The files can be relative to the inputs directory specified above; bcbio will search recursively for files, so you don’t need to specify full paths if your file names are unique. Start with a sample specification:
    samplename,description,batch,phenotype
    file1.bam,sample1,b1,tumor
    file2.bam,sample2,b1,normal
    file3.bam,sample3,b2,tumor
    file4.bam,sample4,b2,normal
  5. Pick a template file that describes the bcbio configuration variables. You can define parameters either globally (in the template file) or by sample (in the CSV) using the standard bcbio templating. An example template for GATK4 germline variant calling is:
    details:
  6. Supply the three inputs (bcbio_system.yaml, project.csv and template.yaml) to either the bcbio-run-workflow app or applet. This example uses a specific version of the bcbio app for full reproducibility; any future re-runs will always use the exact same versioned tools and workflows. You can do this using the web interface or via the command line with a small script like:
    TEMPLATE=germline
    APP_VERSION=0.0.2
    FOLDER=/bcbio/$PNAME
    dx select "$PROJECT"
    dx mkdir -p $FOLDER
    for F in $TEMPLATE-template.yaml $PNAME.csv bcbio_system-dnanexus.yaml
    do
        dx rm -a /$FOLDER/$F || true
        dx upload --path /$FOLDER/ $F
    done
    dx ls $FOLDER
    dx rm -a -r /$FOLDER/dx-cwl-run || true
    dx run bcbio-run-workflow/$APP_VERSION -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
    Alternatively if you want the latest bcbio code, change the final command to use the applet. Everything else in the script is identical:
    dx run bcbio_resources:/applets/bcbio-run-workflow -iyaml_template=/$FOLDER/$TEMPLATE-template.yaml -isample_spec=/$FOLDER/$PNAME.csv -isystem_configuration=/$FOLDER/bcbio_system-dnanexus.yaml -ioutput_folder=/$FOLDER/dx-cwl-run
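
Before submitting, it can be useful to check that the sample CSV from step 4 pairs samples the way you intend. The following is a minimal sketch, assuming the four-column layout shown in that step (samplename, description, batch, phenotype). The helper summarize_batches is hypothetical and not part of bcbio.

```shell
# summarize_batches (hypothetical helper): count tumor and normal samples
# per batch in a bcbio sample CSV, read from a file argument or from stdin.
summarize_batches() {
    awk -F, '
        NR == 1 { next }    # skip the header row
        NF >= 4 { count[$3 "," $4]++; batches[$3] = 1 }
        END {
            for (b in batches)
                printf "%s tumor=%d normal=%d\n", b, count[b ",tumor"], count[b ",normal"]
        }
    ' "$@" | sort
}

# Demo with the sample specification from step 4.
summarize_batches <<EOF
samplename,description,batch,phenotype
file1.bam,sample1,b1,tumor
file2.bam,sample2,b1,normal
file3.bam,sample3,b2,tumor
file4.bam,sample4,b2,normal
EOF
```

For your own project, run summarize_batches $PNAME.csv. A batch that reports zero for either phenotype is worth a second look before launching a tumor/normal analysis.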

The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus workflow, and submit it to the platform. The workflow runs as a standard DNAnexus workflow and you can monitor it through the command line (with dx find executions --root job-YOURJOBID and dx watch) or the web interface (Monitor tab).

If you prefer not to use the DNAnexus app, you can also submit jobs locally by installing bcbio-vm on your local machine. This can also be useful to test generation of CWL and manually ensure identification of all your samples and associated files on the DNAnexus platform.

  1. Follow the Automated sample configuration workflow to generate a full configuration, and generate a CWL description of the workflow:
    TEMPLATE=germline
    rm -rf $PNAME $PNAME-workflow
    bcbio_vm.py template --systemconfig bcbio_system-dnanexus.yaml $TEMPLATE-template.yaml $PNAME.csv
    bcbio_vm.py cwl --systemconfig bcbio_system-dnanexus.yaml $PNAME/config/$PNAME.yaml
  2. Determine project information and login credentials. You’ll want to note the Auth token used and Current workspace project ID:
  3. Compile the CWL workflow into a DNAnexus workflow:
    dx-cwl compile-workflow $PNAME-workflow/main-$PNAME.cwl \
    --project PROJECT_ID --token $DX_AUTH_TOKEN \
    --rootdir $FOLDER/dx-cwl-run
  4. Upload sample information from generated CWL and run workflow:
    FOLDER=/bcbio/$PNAME
    dx mkdir -p $DX_PROJECT_ID:$FOLDER/$PNAME-workflow
    dx upload -p --path $DX_PROJECT_ID:$FOLDER/$PNAME-workflow $PNAME-workflow/main-$PNAME-samples.json
    dx-cwl run-workflow $FOLDER/dx-cwl-run/main-$PNAME/main-$PNAME \
    $FOLDER/$PNAME-workflow/main-$PNAME-samples.json \
    --project PROJECT_ID --token $DX_AUTH_TOKEN \
    --rootdir $FOLDER/dx-cwl-run

Running on Seven Bridges

bcbio runs on Seven Bridges, including the main platform and specialized data sources like the Cancer Genomics Cloud and Cavatica. Seven Bridges uses generated CWL directly, and bcbio has utilities to query your remote data on the platform and prepare CWL for direct submission.

  1. Since Seven Bridges is available on multiple platforms and data access points, we authenticate with a configuration file in $HOME/.sevenbridges/credentials with potentially multiple profiles defining API access URLs and authentication keys. We reference the specified credentials when setting up a bcbio_system-sbg.yaml file to ensure correct authentication.
  2. Upload your inputs and bcbio reference data using the Seven Bridges command line uploader. We plan to host standard bcbio reference data in a public project so you should only need to upload your project specific data:
    sbg-uploader.sh -p chapmanb/bcbio-test --folder inputs --preserve-folder fastq_files regions
  3. Create bcbio_system-sbg.yaml file defining locations of inputs:
    sbgenomics:
      profile: default
      project: chapmanb/bcbio-test
      inputs:
        - /testdata/100326_FC6107FAAXX
        - /testdata/automated
        - /testdata/genomes
        - /testdata/reference_material
    resources:
      default:
        cores: 2
        memory: 3G
        jvm_opts: [-Xms750m, -Xmx3000m]
  4. Follow the Automated sample configuration workflow to generate a full configuration, and generate a CWL description of the workflow:
    PNAME=somatic
    bcbio_vm.py template --systemconfig=bcbio_system-sbg.yaml ${PNAME}_template.yaml $PNAME.csv
    bcbio_vm.py cwl --systemconfig=bcbio_system-sbg.yaml $PNAME/config/$PNAME.yaml
  5. Run the job on the Seven Bridges platform:
    PNAME=somatic
    SBG_PROJECT=bcbio-test
    bcbio_vm.py cwlrun sbg ${PNAME}-workflow -- --project ${SBG_PROJECT}
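
The credentials file mentioned in step 1 holds one block per profile, and the profile name is what bcbio_system-sbg.yaml refers to. A minimal sketch of producing such a block follows. The helper sbg_profile is hypothetical, and the INI-style layout with api_endpoint and auth_token keys is an assumption based on the Seven Bridges API client convention, so confirm it against the Seven Bridges documentation for your platform.

```shell
# sbg_profile (hypothetical helper): print one profile block for
# $HOME/.sevenbridges/credentials.
#   $1 profile name   $2 API endpoint URL   $3 authentication token
# Assumption: INI-style layout with api_endpoint and auth_token keys.
sbg_profile() {
    printf '[%s]\napi_endpoint = %s\nauth_token = %s\n' "$1" "$2" "$3"
}

# Placeholder token; the endpoint is assumed to be the main platform API URL.
sbg_profile default https://api.sbgenomics.com/v2 YOUR_TOKEN_HERE
```

Append the output to $HOME/.sevenbridges/credentials and restrict the file to your own user, since it contains an access token.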

Development notes

bcbio generates a Common Workflow Language description. Internally, bcbio represents the files and information related to processing as a comprehensive dictionary. This world object describes the state of a run and associated files, and new processing steps update or add information to it. The world object is roughly equivalent to CWL’s JSON-based input object, but CWL enforces additional annotations to identify files and models new inputs/outputs at each step. The work in bcbio is to move from our laissez-faire approach to the more structured CWL model.

The generated CWL workflow is in run_info-cwl-workflow:

To help with defining the outputs at each step, there is a WorldWatcher object that can output changed files and world dictionary objects between steps in the pipeline when running bcbio in the standard way. The variant pipeline has examples using it. This is useful when preparing the CWL definitions of inputs and outputs for new steps in the bcbio CWL step definitions.

ToDo