Installation - bcbio-nextgen 1.2.9 documentation

Fresh installation (HPC cluster, server, AMI instance, Linux only)

2. Install data

Bcbio needs reference files, indices, and databases. It is possible to install the bcbio package and data at once, but we recommend splitting these steps and installing the data separately:

bcbio_nextgen.py upgrade -u skip --genomes hg38 --aligners bwa

This command installs the hg38 human reference genome and the bwa aligner index, which is the bare minimum required to run germline or somatic variant calling pipelines.

Installation notes

Installation parameters

Run

bcbio_nextgen.py upgrade --help

to see all supported installation options:

$ bcbio_nextgen.py upgrade --help
usage: bcbio_nextgen.py upgrade [-h] [--cores CORES] [--tooldir TOOLDIR] [--tools]
                                [-u {stable,development,system,deps,skip}]
                                [--toolconf TOOLCONF] [--revision REVISION]
                                [--toolplus TOOLPLUS]
                                [--datatarget {variation,rnaseq,smallrna,gemini,vep,dbnsfp,dbscsnv,battenberg,kraken,ericscript,gnomad}]
                                [--genomes {GRCh37,hg19,hg38,hg38-noalt,mm10,mm9,rn6,rn5,canFam3,dm3,galGal4,phix,pseudomonas_aeruginosa_ucbpp_pa14,sacCer3,TAIR10,WBcel235,xenTro3,GRCz10,GRCz11,Sscrofa11.1,BDGP6}]
                                [--aligners {bwa,rtg,hisat2,bbmap,bowtie,bowtie2,minimap2,novoalign,twobit,bismark,snap,star,seq}]
                                [--data] [--cwl] [--isolate]
                                [--distribution {ubuntu,debian,centos,scientificlinux,macosx}]

optional arguments:
  -h, --help            show this help message and exit
  --cores CORES         Number of cores to use if local indexing is necessary.
  --tooldir TOOLDIR     Directory to install 3rd party software tools. Leave unspecified for no tools
  --tools               Boolean argument specifying upgrade of tools. Uses previously saved install directory
  -u {stable,development,system,deps,skip}, --upgrade {stable,development,system,deps,skip}
                        Code version to upgrade
  --toolconf TOOLCONF   YAML configuration file of tools to install
  --revision REVISION   Specify a git commit hash or tag to install
  --toolplus TOOLPLUS   Specify additional tool categories to install
  --datatarget {variation,rnaseq,smallrna,gemini,vep,dbnsfp,dbscsnv,battenberg,kraken,ericscript,gnomad}
                        Data to install. Allows customization or install of extra data.
  --genomes {GRCh37,hg19,hg38,hg38-noalt,mm10,mm9,rn6,rn5,canFam3,dm3,galGal4,phix,pseudomonas_aeruginosa_ucbpp_pa14,sacCer3,TAIR10,WBcel235,xenTro3,GRCz10,GRCz11,Sscrofa11.1,BDGP6}
                        Genomes to download
  --aligners {bwa,rtg,hisat2,bbmap,bowtie,bowtie2,minimap2,novoalign,twobit,bismark,snap,star,seq}
                        Aligner indexes to download
  --data                Upgrade data dependencies
  --cwl                 Install code and data for running CWL workflows
  --isolate             Created an isolated installation without PATH updates
  --distribution {ubuntu,debian,centos,scientificlinux,macosx}
                        Operating system distribution
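Because an unsupported genome name only fails once the installer starts, it can be handy to check a name against the --genomes choices first. The following is a minimal sketch: the list is copied from the help output above, and the function name is_supported_genome is hypothetical, not part of bcbio.

```shell
# Genome builds copied from the --genomes choices in the help output above.
SUPPORTED_GENOMES="GRCh37 hg19 hg38 hg38-noalt mm10 mm9 rn6 rn5 canFam3 dm3 galGal4 phix pseudomonas_aeruginosa_ucbpp_pa14 sacCer3 TAIR10 WBcel235 xenTro3 GRCz10 GRCz11 Sscrofa11.1 BDGP6"

# Return 0 when the given name is one of the listed builds, 1 otherwise.
# is_supported_genome is a hypothetical helper, not a bcbio command.
is_supported_genome() {
    for g in $SUPPORTED_GENOMES; do
        if [ "$g" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

if is_supported_genome "hg38"; then
    echo "hg38 is listed"
fi
```

A name that is not in the list would need to be installed as a custom genome instead.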

Some useful arguments are:

Upgrade

bcbio 1.2.9 has major changes in the conda environments. Please consider installing the bcbio 1.2.9 code and tools from scratch rather than upgrading from 1.2.8. You can re-use the data installation from bcbio <= 1.2.8, but the snpeff databases have to be re-installed.

We use the same automated installation process for performing upgrades of tools, software and data in place. Since there are multiple targets and we want to avoid upgrading anything unexpectedly, we have specific arguments for each. Generally, you’d want to upgrade the code, tools and data together with:

bcbio_nextgen.py upgrade -u stable --tools --data
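Since each target has its own argument, the combined command can be split into one command per target. The sketch below only prints the commands rather than running them. It assumes that "-u skip" leaves the bcbio code untouched (as in the data installation command earlier on this page); the helper name upgrade_cmd is hypothetical.

```shell
# Print (not run) the upgrade command for a single target.
# Assumption: "-u skip" leaves the bcbio code itself untouched.
# upgrade_cmd is a hypothetical helper, not part of bcbio.
upgrade_cmd() {
    case "$1" in
        code)  echo "bcbio_nextgen.py upgrade -u stable" ;;
        tools) echo "bcbio_nextgen.py upgrade -u skip --tools" ;;
        data)  echo "bcbio_nextgen.py upgrade -u skip --data" ;;
        all)   echo "bcbio_nextgen.py upgrade -u stable --tools --data" ;;
        *)     echo "unknown target: $1" >&2; return 1 ;;
    esac
}

upgrade_cmd data
```

Printing the command first makes it easy to review exactly what will be upgraded before committing to a long-running step.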

Tune the upgrade with these options:

Customizing data installation

bcbio supports the following genome references; 12 of them have additional data downloads. If you need a reference that is absent from the list, you may install it as a custom genome.

bcbio installs associated data files for sequence processing, and you’re able to customize this to install larger files or change the defaults. Use the --datatarget flag (potentially multiple times) to customize or add new targets.
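As a sketch of passing --datatarget several times, the helper below assembles and prints an install command with one flag per target. It assumes that --datatarget can be combined with -u skip and --genomes in a single call; the helper name datatarget_cmd is hypothetical, and the command is printed, not executed.

```shell
# Print (not run) an install command with one --datatarget flag per requested target.
# Usage: datatarget_cmd GENOME TARGET [TARGET...]
# datatarget_cmd is a hypothetical helper, not part of bcbio.
datatarget_cmd() {
    genome=$1
    shift
    cmd="bcbio_nextgen.py upgrade -u skip --genomes $genome"
    for target in "$@"; do
        cmd="$cmd --datatarget $target"
    done
    printf '%s\n' "$cmd"
}

datatarget_cmd hg38 variation vep
```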

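By default, bcbio installs data files for variation, rnaseq and smallrna, but you can sub-select a single one of these if you don’t require other analyses. The available targets, as listed under --datatarget in the upgrade --help output, are variation, rnaseq, smallrna, gemini, vep, dbnsfp, dbscsnv, battenberg, kraken, ericscript and gnomad.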

For somatic analyses, bcbio includes COSMIC v68 for hg19 and GRCh37 only. Due to license restrictions, we cannot include updated versions of this dataset or hg38 support with the installer. To prepare these datasets yourself, you can use a utility script shipped with cloudbiolinux that downloads, sorts and merges the VCFs, then copies them into your bcbio installation:

export COSMIC_USER="you@example.org"
export COSMIC_PASS="your_cosmic_password"
bcbio_python prepare_cosmic.py 89 /path/to/bcbio

Here, /path/to/bcbio/ is the directory one level above the genomes directory. The script removes variants marked as SNP in COSMIC, leaving only somatic variants. From version to version, a minor portion of variants gets re-classified. For example, 3,779 variants were SNPs in cosmic-90 and became mutations in cosmic-92, while 656 variants were mutations in cosmic-90 and became SNPs in cosmic-92.
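A small pre-flight check can catch missing credentials before the download starts. This sketch assumes that prepare_cosmic.py reads COSMIC_USER and COSMIC_PASS from the environment, as the export lines suggest; the function name cosmic_credentials_set is hypothetical.

```shell
# Return 0 only when both COSMIC credentials are set and non-empty.
# cosmic_credentials_set is a hypothetical helper, not part of cloudbiolinux.
cosmic_credentials_set() {
    [ -n "${COSMIC_USER:-}" ] && [ -n "${COSMIC_PASS:-}" ]
}

if cosmic_credentials_set; then
    echo "COSMIC credentials found; ready to run prepare_cosmic.py"
else
    echo "Set COSMIC_USER and COSMIC_PASS before running prepare_cosmic.py" >&2
fi
```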

System requirements

bcbio-nextgen provides a wrapper around external tools and data, so the actual tools used drive the system requirements. For small projects, it should install on workstations or laptops with a couple of GB of memory, and then scale as needed on clusters or multicore machines.

The disk space requirement for the tools, including all system packages, is about 22GB (or more, depending on the type of file system). Biological data requirements depend on the genomes and aligner indices used, but a suggested install with GRCh37 and bowtie2/bwa indexes uses approximately 35GB of storage during preparation and ~25GB after:

$ du -shc genomes/Hsapiens/GRCh37/*
3.8G  bowtie2
5.1G  bwa
3.0G  rnaseq-2014-05-02
3.0G  seq
340M  snpeff
4.2G  variation
4.4G  vep
23.5G total
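Checking free space before starting can save a failed multi-hour download. The sketch below compares the available space on a filesystem with a required amount. The 57GB figure is an assumption: it simply adds the 22GB for tools to the 35GB peak for the GRCh37 data and presumes both live on the same filesystem. The function names are hypothetical.

```shell
# Free space, in whole GB, on the filesystem holding the given directory.
# "df -Pk" prints POSIX-format output; column 4 is the available space in 1024-byte blocks.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when the available space (GB) meets the requirement (GB).
has_enough_space() {
    available_gb=$1
    required_gb=$2
    [ "$available_gb" -ge "$required_gb" ]
}

# Assumption: 22 GB for tools plus 35 GB peak for GRCh37 data on one filesystem.
REQUIRED_GB=57

if has_enough_space "$(free_gb .)" "$REQUIRED_GB"; then
    echo "enough free space in the current directory"
else
    echo "less than ${REQUIRED_GB} GB free in the current directory" >&2
fi
```

Replace "." with the intended installation directory, and adjust REQUIRED_GB for other genomes or additional data targets.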

Troubleshooting

Proxy or firewall problems

Some steps retrieve third-party tools from GitHub, which can fail if you’re behind a proxy or if git ports are blocked. To instruct git to use https:// globally instead of git://:

git config --global url.https://github.com/.insteadOf git://github.com/
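To illustrate what this setting does, the sketch below applies the same prefix substitution to a single URL with sed. It only emulates the rewrite for demonstration; git performs the substitution internally once the configuration is set. The repository path example/repo.git and the function name are hypothetical.

```shell
# Emulate, for one URL, the prefix substitution that the insteadOf rule asks git to make.
# rewrite_git_url is a hypothetical helper used for illustration only.
rewrite_git_url() {
    printf '%s\n' "$1" | sed 's#^git://github\.com/#https://github.com/#'
}

rewrite_git_url "git://github.com/example/repo.git"
```

URLs that do not start with git://github.com/ pass through unchanged, which matches the scope of the configured rule.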

GATK or Java Errors

Most software tools used by bcbio require Java 1.8. bcbio distributes an OpenJDK Java build and uses it so you don’t need to install anything. Older versions of GATK (< 3.6) and MuTect require a locally installed Java 1.7. If you have version incompatibilities, you’ll see errors like:

Unsupported major.minor version 51.0

Fixing this requires either installing Java 1.7 for old GATK and MuTect or avoiding pointing to an incorrect java (unset JAVA_HOME). You can also tweak the java used by bcbio, described in the Automated installation section.
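The number in the error message is a Java class-file version, which runs 44 ahead of the Java release number (50 is Java 6, 51 is Java 7, 52 is Java 8). A short sketch for decoding it, with a hypothetical function name:

```shell
# Map a class-file major version (the "51" in "51.0") to the Java release that produces it.
# Class-file major versions run 44 ahead of the release number: 51 is Java 7, 52 is Java 8.
java_release_for_class_major() {
    echo $(( $1 - 44 ))
}

echo "class version 51.0 corresponds to Java $(java_release_for_class_major 51)"
```

Comparing the decoded release with the output of java -version shows which side of the mismatch needs to change.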

ImportErrors

Import errors with tracebacks containing Python libraries outside of the bcbio distribution (/path/to/bcbio/anaconda) are often due to other conflicting Python installations. bcbio tries to isolate itself as much as possible, but external libraries can get included during installation due to the PYTHONHOME or PYTHONPATH environment variables or local site libraries. These commands will temporarily unset those to get bcbio installed, after which it should ignore them automatically:

unset PYTHONHOME
unset PYTHONPATH
export PYTHONNOUSERSITE=1

Finally, having a .pydistutils.cfg file in your home directory can mess with where the libraries get installed. If you have this file in your home directory, temporarily renaming it to something else may fix your installation issue.
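The checks described in this section can be collected into one diagnostic. This is a minimal sketch with a hypothetical function name; it only reports the settings named above and does not change anything.

```shell
# Print the names of settings that may pull outside Python libraries into the install.
# The home directory is passed in so the check can be pointed at any account.
# python_env_conflicts is a hypothetical helper, not part of bcbio.
python_env_conflicts() {
    home_dir=$1
    [ -n "${PYTHONHOME:-}" ] && echo "PYTHONHOME"
    [ -n "${PYTHONPATH:-}" ] && echo "PYTHONPATH"
    [ -f "$home_dir/.pydistutils.cfg" ] && echo ".pydistutils.cfg"
    return 0
}

python_env_conflicts "$HOME"
```

Empty output means none of the three known culprits is present for that account.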

Manual process

The manual process does not allow the in-place updates and management of third party tools that the automated installer makes possible. It’s a more error-prone and labor intensive process. If you find you can’t use the installer we’d love to hear why to make it more amenable to your system. If you’d like to develop against a bcbio installation, see the documentation on setting up a development environment.

Data requirements

In addition to existing bioinformatics software the pipeline requires associated data files for reference genomes, including pre-built indexes for aligners. The CloudBioLinux toolkit again provides an automated way to download and prepare these reference genomes:

fab -f data_fabfile.py -H localhost -c your_fabricrc.txt install_data_s3:your_biodata.yaml

The biodata.yaml file contains information about what genomes to download. The fabricrc.txt describes where to install the genomes by adjusting the data_files variable. This creates a tree structure that includes a set of Galaxy-style location files to describe locations of indexes:

├── galaxy
│   ├── tool-data
│   │   ├── alignseq.loc
│   │   ├── bowtie_indices.loc
│   │   ├── bwa_index.loc
│   │   ├── sam_fa_indices.loc
│   │   └── twobit.loc
│   └── tool_data_table_conf.xml
├── genomes
│   ├── Hsapiens
│   │   ├── GRCh37
│   │   └── hg19
│   └── phiX174
│       └── phix
└── liftOver

Individual genome directories contain indexes for aligners in individual sub-directories prefixed by the aligner name. This structured scheme helps manage aligners that don’t have native Galaxy .loc files. The automated installer downloads and sets this up for you:

`-- phix
    |-- bowtie
    |   |-- phix.1.ebwt
    |   |-- phix.2.ebwt
    |   |-- phix.3.ebwt
    |   |-- phix.4.ebwt
    |   |-- phix.rev.1.ebwt
    |   `-- phix.rev.2.ebwt
    |-- bowtie2
    |   |-- phix.1.bt2
    |   |-- phix.2.bt2
    |   |-- phix.3.bt2
    |   |-- phix.4.bt2
    |   |-- phix.rev.1.bt2
    |   `-- phix.rev.2.bt2
    |-- bwa
    |   |-- phix.fa.amb
    |   |-- phix.fa.ann
    |   |-- phix.fa.bwt
    |   |-- phix.fa.pac
    |   |-- phix.fa.rbwt
    |   |-- phix.fa.rpac
    |   |-- phix.fa.rsa
    |   `-- phix.fa.sa
    |-- novoalign
    |   `-- phix
    |-- seq
    |   |-- phix.dict
    |   |-- phix.fa
    |   `-- phix.fa.fai
    `-- ucsc
        `-- phix.2bit
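Given this layout, a quick way to see which indexes a genome build already has is to list its sub-directories. The helpers below are a sketch with hypothetical names; they rely only on the directory convention shown above.

```shell
# List the sub-directories of a genome build directory, one per line.
# In the layout above these are the aligner index folders plus seq and ucsc.
list_genome_subdirs() {
    genome_dir=$1
    for d in "$genome_dir"/*/; do
        # Skip the unexpanded pattern when the directory has no sub-directories.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# Succeed when the genome directory holds a folder for the named aligner.
has_aligner_index() {
    [ -d "$1/$2" ]
}

# Example (path is illustrative):
#   list_genome_subdirs /path/to/bcbio/genomes/phiX174/phix
#   has_aligner_index /path/to/bcbio/genomes/phiX174/phix bwa && echo "bwa index present"
```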

Maintain many bcbio installations

It is often asked how to reproduce older bcbio analyses when every update changes a lot in tools and in bcbio code. One solution is to use modules in an HPC environment: https://www.admin-magazine.com/HPC/Articles/Environment-Modules. You can have a bcbio/version module for every bcbio snapshot you need. Each would consume <50G, and a single large genomes folder could be symlinked to all of them. Data in genomes changes at a much slower pace than bcbio code and tools.
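The symlinking step could be sketched as follows. Both directory arguments are placeholders for your own layout, the function name is hypothetical, and the guard refuses to touch an existing real genomes directory so that no data is hidden or overwritten.

```shell
# Link one shared genomes folder into a versioned bcbio install directory.
# link_shared_genomes is a hypothetical helper; both paths are supplied by the caller.
link_shared_genomes() {
    shared_genomes=$1
    install_dir=$2
    mkdir -p "$install_dir"
    # Do not replace a real (non-symlink) genomes directory.
    if [ -e "$install_dir/genomes" ] && [ ! -L "$install_dir/genomes" ]; then
        echo "refusing to replace real directory $install_dir/genomes" >&2
        return 1
    fi
    # -s makes a symbolic link; -f and -n let a re-run replace an existing link in place.
    ln -sfn "$shared_genomes" "$install_dir/genomes"
}

# Example (paths are illustrative):
#   link_shared_genomes /shared/bcbio/genomes /apps/bcbio/1.2.9
```

Each bcbio/version module can then point at its own install directory while all of them resolve genomes to the same shared data.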