Development — bcbio-nextgen 1.2.9 documentation


This section provides useful concepts for getting started digging into the code and contributing new functionality. We welcome contributors and hope these notes help make it easier to get started.

bcbio dev installation

When developing, you want to avoid breaking your production bcbio instance. Use the installer script to create a separate bcbio instance without downloading any data. Before installing the second bcbio instance, inspect your PATH and PYTHONPATH variables and remove any references to the production bcbio Python; avoid mixing bcbio instances in the PATH. Also check the ~/.conda/environments.txt and ~/.condarc config files and the CONDA_EXE and CONDA_PYTHON_EXE environment variables: a reference to the production package cache (/n/app/bcbio/dev/anaconda/pkgs) leads to bcbio/tools/bin not being populated in some installs. A quick way to check for such references is shown below.
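For example, an illustrative way to look for leftover references to the production install (paths and patterns are examples; adjust to your site):

# look for production bcbio entries on the PATH and PYTHONPATH
echo $PATH | tr ':' '\n' | grep -i bcbio
echo $PYTHONPATH | tr ':' '\n' | grep -i bcbio
# check conda configuration files and environment variables for production references
grep -i bcbio ~/.conda/environments.txt ~/.condarc 2>/dev/null
env | grep -E 'CONDA_EXE|CONDA_PYTHON_EXE'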

To install in ${HOME}/local/share/bcbio (your location might be different, make sure you have ~30GB of disk quota there):

wget https://raw.githubusercontent.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py ${HOME}/local/share/bcbio --tooldir=${HOME}/local --nodata --isolate

Make soft links to the data from your production bcbio instance (your installation path could be different from /n/app/bcbio):

ln -s /n/app/bcbio/biodata/genomes/ ${HOME}/local/share/genomes
ln -s /n/app/bcbio/biodata/galaxy/tool-data ${HOME}/local/share/bcbio/galaxy/tool-data

Create .bcbio_devel_profile to clear PATH from the production bcbio executables and reference the development version:

# use everything you need except the production bcbio
export PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:
export PATH=${HOME}/local/share/bcbio/anaconda/bin:${HOME}/local/bin:$PATH
export CONDA_EXE=${HOME}/local/share/bcbio/anaconda/bin/conda
export CONDA_PYTHON_EXE=${HOME}/local/share/bcbio/anaconda/bin/python

Or directly call the testing bcbio: ${HOME}/local/share/bcbio/anaconda/bin/bcbio_nextgen.py.

Injecting bcbio code into bcbio installation

To install from your bcbio-nextgen source tree for testing do:

# make sure you are using the development bcbio instance
which bcbio_python
# local git folder
cd ~/code/bcbio-nextgen
bcbio_python setup.py install

One tricky part that we don’t yet know how to work around is that pip and standard setup.py install have different ideas about how to write Python eggs. setup.py install will create an isolated python egg directory like bcbio_nextgen-1.1.5-py3.6.egg, while pip creates an egg pointing to a top level bcbio directory. Where this gets tricky is that the top level bcbio directory takes precedence. The best way to work around this problem is to manually remove the current pip installed bcbio-nextgen code (rm -rf /path/to/anaconda/lib/python3.6/site-packages/bcbio*) before managing it manually with bcbio_python setup.py install. We’d welcome tips about ways to force consistent installation across methods.
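For example, the workaround looks like this (the anaconda path and Python version are illustrative; match them to your development install):

# remove the pip-installed copy so the setup.py egg takes precedence
rm -rf ${HOME}/local/share/bcbio/anaconda/lib/python3.6/site-packages/bcbio*
# then reinstall from your source tree
cd ~/code/bcbio-nextgen
bcbio_python setup.py install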

Testing

The test suite exercises the scripts driving the analysis, so it is a good starting point for ensuring correct installation. Tests use the pytest framework. The tests are available in the bcbio source code:

git clone https://github.com/bcbio/bcbio-nextgen.git

There is a small wrapper script, ./run_tests.sh, that finds pytest and the other dependencies pre-installed with bcbio; you can use it to run tests.

You can use this to run specific test targets:

./run_tests.sh cancer
./run_tests.sh rnaseq
./run_tests.sh devel
./run_tests.sh docker

Optionally, you can run pytest directly from the bcbio install to tweak more options. It will be in /path/to/bcbio/anaconda/bin/pytest. Pass -s to pytest to see the stdout log, and -v to make pytest output more verbose. The -x flag will stop the run at the first failure and --lf will run only the tests that failed on the last run. Sometimes it is useful to drop into the debugger on failure, which you can do by passing -s --pdb. The tests are marked with labels which you can use to run a specific subset of the tests using the -m argument:
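For example (the docker marker name here mirrors the run_tests.sh target above and is illustrative; check the markers defined in the test suite for the full list):

/path/to/bcbio/anaconda/bin/pytest -v -s -m docker tests/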

To run unit tests:

To run integration pipeline tests:

To run tests which use bcbio_vm:
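As a sketch of those three invocations (tests/integration is referenced later in this section; tests/unit and tests/bcbio_vm are assumed analogous directory names in the source checkout):

pytest tests/unit
pytest tests/integration
pytest tests/bcbio_vm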

To see the test coverage, add the --cov=bcbio argument to pytest.

By default the test suite will use your installed system configuration for running tests, substituting the test genome information instead of using full genomes. If you need a specific testing environment, copy tests/data/automated/post_process-sample.yaml to tests/data/automated/post_process.yaml to provide a test-only configuration.
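For example, to create the test-only configuration from the provided sample file:

cp tests/data/automated/post_process-sample.yaml tests/data/automated/post_process.yaml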

Setting the environment variable BCBIO_TEST_DIR puts the test output in a different directory; specify the full path to the new directory. For example:

export BCBIO_TEST_DIR=$(pwd)/output

will put the output in the output subdirectory of your current working directory.

The test directory can be kept around after running by passing the --keep-test-dir flag.

Repeat a failed test:

export BCBIO_TEST_DIR=/path/to/test;
pytest -s -x --keep-test-dir tests/integration/test_automated_analysis.py::failed_test

New release checklist

not working for now

Goals

bcbio-nextgen provides best-practice pipelines for automated analysis of high throughput sequencing data with the goal of being:

During development we seek to maximize functionality and usefulness, while avoiding complexity. Since these goals are sometimes in conflict, it’s useful to understand the design approaches:

Style guide

General:

Python:

Modules

The most useful modules inside bcbio, ordered by likely interest:

GitHub

bcbio-nextgen uses GitHub for code development, and we welcome pull requests. GitHub makes it easy to establish custom forks of the code and contribute those back. The Biopython documentation has great information on using git and GitHub for a community-developed project. In short, make a fork of the bcbio code by clicking the Fork button in the upper right corner of the GitHub page, commit your changes to this custom fork and keep it up to date with the main bcbio repository as you develop. The GitHub help pages have detailed information on keeping your fork updated with the main GitHub repository (e.g. https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork). After committing changes, click New Pull Request from your fork when you’d like to submit your changes for integration in bcbio.
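For example, a standard way to keep a fork in sync (the upstream remote name is just a convention; the repository URL matches the clone used in the Testing section above):

# add the main bcbio repository as a remote and merge its changes into your fork
git remote add upstream https://github.com/bcbio/bcbio-nextgen.git
git fetch upstream
git checkout master
git merge upstream/master
git push origin master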

Documentation

To build this documentation locally and see how it looks, install the dependencies:

cd docs
conda install --file requirements-local.txt --file requirements.txt

and running:
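Assuming the docs use the standard Sphinx Makefile (an assumption based on the docs/_build/html output path mentioned below):

make html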

The documentation will be built under docs/_build/html; open index.html with your browser to load your local build.

Adding new organisms

While bcbio-nextgen and supporting tools receive the most testing and development on human or human-like diploid organisms, the algorithms are generic and we strive to support the wide diversity of organisms used in your research. We welcome contributors interested in setting up and maintaining support for their particular research organism, and this section defines the steps in integrating a new genome. We also welcome suggestions and implementations that improve this process.

Set up CloudBioLinux to automatically download and prepare the genome:

Add the organism to the supported installs within bcbio (in two places):

Test installation of genomes by pointing to your local cloudbiolinux edits during a data installation:

mkdir -p tmpbcbio-install
ln -s ~/bio/cloudbiolinux tmpbcbio-install
bcbio_nextgen.py upgrade --data --genomes DBKEY

Add configuration information to bcbio-nextgen by creating a config/genomes/DBKEY-resources.yaml file. Copy an existing minimal template like canFam3 and edit with pointers to snpEff and other genome resources. The VEP database directory has Ensembl names. SnpEff has a command to list available databases:
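For example, snpEff's databases subcommand lists the available databases (the path to the bcbio-installed snpEff and the grep pattern are illustrative):

/path/to/bcbio/anaconda/bin/snpEff databases | grep -i canis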

Finally, send pull requests for CloudBioLinux and bcbio-nextgen and we’ll happily integrate the new genome.

This will provide basic integration with bcbio and allow running a minimal pipeline with alignment and quality control. We also have utility scripts in CloudBioLinux to help with preparing dbSNP (utils/prepare_dbsnp.py) and RNA-seq (utils/prepare_tx_gff.py) resources for some genomes. For instance, to prepare RNA-seq transcripts for mm9:

bcbio_python prepare_tx_gff.py --genome-dir /path/to/bcbio/genomes Mmusculus mm9

We are still working on ways to best include these as part of the standard build and install since they either require additional tools to run locally, or require preparing copies in S3 buckets.

Enabling new MultiQC modules

MultiQC modules can be turned on in bcbio/qc/multiqc.py. bcbio collects the files to be used rather than searching through the work directory to support CWL workflows. Quality control files can be added by using the datadict.update_summary_qc function which adds the files in the appropriate place in the data dict. For example, here is how to add the quality control reports from bismark methylation calling:

data = dd.update_summary_qc(data, "bismark", base=biasm_file)
data = dd.update_summary_qc(data, "bismark", base=data["bam_report"])
data = dd.update_summary_qc(data, "bismark", base=splitting_report)

Files that can be added for each tool in MultiQC can be found in the MultiQC module documentation.

Standard function arguments

names

This dictionary provides lane and other BAM run group naming information used to correctly build BAM files. We use the rg attribute as the ID within a BAM file:

{'lane': '7_100326_FC6107FAAXX', 'pl': 'illumina', 'pu': '7_100326_FC6107FAAXX', 'rg': '7', 'sample': 'Test1'}

data

The data dictionary is a large dictionary representing processing, configuration and files associated with a sample. The standard workflow is to pass this dictionary between functions, updating it with the files produced by each additional processing step. Populating this dictionary only with standard types allows serialization to JSON for distributed processing.

The dictionary is dynamic throughout the workflow depending on the step, but some of the most useful key/values available throughout are:

It also contains information about the genome build, sample name and reference genome file throughout. Here’s an example of these inputs:

{'config': {'algorithm': {'aligner': 'bwa',
                          'callable_regions': 'analysis_blocks.bed',
                          'coverage_depth': 'low',
                          'coverage_interval': 'regional',
                          'mark_duplicates': 'samtools',
                          'nomap_split_size': 50,
                          'nomap_split_targets': 20,
                          'num_cores': 1,
                          'platform': 'illumina',
                          'quality_format': 'Standard',
                          'realign': 'gkno',
                          'recalibrate': 'gatk',
                          'save_diskspace': True,
                          'upload_fastq': False,
                          'validate': '../reference_material/7_100326_FC6107FAAXX-grade.vcf',
                          'variant_regions': '../data/automated/variant_regions-bam.bed',
                          'variantcaller': 'freebayes'},
            'resources': {'bcbio_variation': {'dir': '/usr/share/java/bcbio_variation'},
                          'bowtie': {'cores': None},
                          'bwa': {'cores': 4},
                          'cortex': {'dir': '/install/CORTEX_release_v1.0.5.14'},
                          'cram': {'dir': '/usr/share/java/cram'},
                          'gatk': {'cores': 2,
                                   'dir': '/usr/share/java/gatk',
                                   'jvm_opts': ['-Xms750m', '-Xmx2000m'],
                                   'version': '2.4-9-g532efad'},
                          'gemini': {'cores': 4},
                          'novoalign': {'cores': 4, 'memory': '4G', 'options': ['-o', 'FullNW']},
                          'picard': {'cores': 1, 'dir': '/usr/share/java/picard'},
                          'snpEff': {'dir': '/usr/share/java/snpeff', 'jvm_opts': ['-Xms750m', '-Xmx3g']},
                          'stampy': {'dir': '/install/stampy-1.0.18'},
                          'tophat': {'cores': None},
                          'varscan': {'dir': '/usr/share/java/varscan'},
                          'vcftools': {'dir': '~/install/vcftools_0.1.9'}}},
 'genome_resources': {'aliases': {'ensembl': 'human', 'human': True, 'snpeff': 'hg19'},
                      'rnaseq': {'transcripts': '/path/to/rnaseq/ref-transcripts.gtf',
                                 'transcripts_mask': '/path/to/rnaseq/ref-transcripts-mask.gtf'},
                      'variation': {'dbsnp': '/path/to/variation/dbsnp_132.vcf',
                                    'train_1000g_omni': '/path/to/variation/1000G_omni2.5.vcf',
                                    'train_hapmap': '/path/to/hg19/variation/hapmap_3.3.vcf',
                                    'train_indels': '/path/to/variation/Mills_Devine_2hit.indels.vcf'},
                      'version': 1},
 'dirs': {'fastq': 'input fastq directory',
          'galaxy': 'directory with galaxy loc and other files',
          'work': 'base work directory'},
 'metadata': {'batch': 'TestBatch1'},
 'genome_build': 'hg19',
 'name': ('', 'Test1'),
 'sam_ref': '/path/to/hg19.fa'}

Processing also injects other useful key/value pairs. Here’s an example of additional information supplied during a variant calling workflow:

{'prep_recal': 'Test1/7_100326_FC6107FAAXX-sort.grp',
 'summary': {'metrics': [('Reference organism', 'hg19', ''),
                         ('Total', '39,172', '76bp paired'),
                         ('Aligned', '39,161', '(100.0\%)'),
                         ('Pairs aligned', '39,150', '(99.9\%)'),
                         ('Pair duplicates', '0', '(0.0\%)'),
                         ('Insert size', '152.2', '+/- 31.4')],
             'pdf': '7_100326_FC6107FAAXX-sort-prep-summary.pdf',
             'project': 'project-summary.yaml'},
 'validate': {'concordant': 'Test1-ref-eval-concordance.vcf',
              'discordant': 'Test1-eval-ref-discordance-annotate.vcf',
              'grading': 'validate-grading.yaml',
              'summary': 'validate-summary.csv'},
 'variants': [{'population': {'db': 'gemini/TestBatch1-freebayes.db', 'vcf': None},
               'validate': None,
               'variantcaller': 'freebayes',
               'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf'}],
 'vrn_file': '7_100326_FC6107FAAXX-sort-variants-gatkann-filter-effects.vcf',
 'work_bam': '7_100326_FC6107FAAXX-sort-prep.bam'}

Parallelization framework

bcbio-nextgen supports parallel runs on local machines using multiple cores and distributed runs on a cluster using IPython, through a general framework.

The first parallelization step starts up a set of resources for processing. On a cluster this spawns an IPython parallel controller and a set of engines for processing. The prun (parallel run) start function is the entry point to spawning the cluster, and the main argument is a parallel dictionary which contains arguments to the engine processing command. Here is an example input from an IPython parallel run:

{'cores': 12,
 'type': 'ipython',
 'progs': ['aligner', 'gatk'],
 'ensure_mem': {'star': 30, 'tophat': 8, 'tophat2': 8},
 'module': 'bcbio.distributed',
 'queue': 'batch',
 'scheduler': 'torque',
 'resources': [],
 'retries': 0,
 'tag': '',
 'timeout': 15}

The cores and type arguments must be present, identifying the total cores to use and type of processing, respectively. Following that are arguments to help identify the resources to use. progs specifies the programs used, here the aligner, which bcbio looks up from the input sample file, and gatk. ensure_mem is an optional argument that specifies minimum memory requirements to programs if used in the workflow. The remaining arguments are all specific to IPython to help it spin up engines on the appropriate computing cluster.

A shared component of all processing runs is the identification of used programs from the progs argument. The run creation process looks up required memory and CPU resources for each program from the Resources section of your bcbio_system.yaml file. It combines these resources into required memory and cores using the logic described in the Memory management section of the parallel documentation. Passing these requirements to the cluster creation process ensures the available machines match program requirements.
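An illustrative Resources snippet from a bcbio_system.yaml file (program names and values here are examples, not recommendations):

resources:
  default:
    cores: 16
    memory: 3G
    jvm_opts: ["-Xms750m", "-Xmx3500m"]
  gatk:
    jvm_opts: ["-Xms500m", "-Xmx3500m"]
  star:
    cores: 16
    memory: 2G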

bcbio-nextgen’s pipeline.main code contains examples of starting and using a set of available processing engines. This example starts up machines that use samtools, gatk and cufflinks, then runs an RNA-seq expression analysis:

with prun.start(_wprogs(parallel, ["samtools", "gatk", "cufflinks"]),
                samples, config, dirs, "rnaseqcount") as run_parallel:
    samples = rnaseq.estimate_expression(samples, run_parallel)

The pipelines often reuse a single set of machines for multiple distributed functions to avoid the overhead of starting up and tearing down machines and clusters.

The run_parallel function returned from prun.start enables running jobs in parallel on the created machines. The ipython wrapper code contains examples of implementing this. It is a simple function that takes two arguments: the name of the function to run and a set of multiple arguments to pass to that function:
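An illustrative call following that pattern (the function name and argument structure are examples, not a guaranteed bcbio API):

# distribute the "process_alignment" task over each sample; every item passed
# must be JSON-serializable (strings, lists, dictionaries)
samples = run_parallel("process_alignment", [[data] for data in samples])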

The items arguments need to be strings, lists and dictionaries to allow serialization to JSON format. The internals of the run function take care of running all of the code in parallel and returning the results back to the caller function.

In this setup, the main processing code is fully independent from the parallel method used, so running on a single multicore machine or in parallel on a cluster returns identical results and requires no changes to the logical code defining the pipeline.

During re-runs, we avoid the expense of spinning up processing clusters for completed tasks using simple checkpoint files in the checkpoints_parallel directory. The prun.start wrapper writes these on completion of processing for a group of tasks with the same parallel architecture, and on subsequent runs will go through these on the local machine instead of parallelizing. The processing code supports these quick re-runs by checking for and avoiding re-running of tasks when it finds output files.

Plugging new parallelization approaches into this framework involves writing interface code that handles two steps. First, create a cluster of ready-to-run machines given the parallel function with expected core and memory utilization.

Second, implement a run_parallel function that handles using these resources to distribute jobs and return results. The multicore wrapper and ipython wrapper are useful starting points for understanding the current implementations.
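A minimal sketch of what such an interface could look like (all names and signatures here are illustrative, not the exact bcbio internals):

import contextlib

@contextlib.contextmanager
def start(parallel, items, config, dirs=None, name=None):
    """Spin up compute resources sized from parallel['cores'] and program
    requirements, yield a run_parallel function, then tear everything down."""
    pool = _create_resources(parallel, config)  # hypothetical helper: start cluster/engines
    try:
        def run_parallel(fn_name, items):
            # hypothetical helper: distribute fn_name over JSON-serializable items
            # and gather the results back to the caller
            return _submit_and_collect(pool, fn_name, items)
        yield run_parallel
    finally:
        _shutdown(pool)  # hypothetical helper: release engines/cluster resources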