Configuration — bcbio-nextgen 1.2.9 documentation


Project structure

bcbio encourages a project structure:

project/
├── config
├── final
├── input
└── work

with the project.yaml configuration in the config directory, the input files (fastq, bam, bed) in the input directory, the outputs of the pipeline in the final directory, and the actual processing done in the work directory.

Typical bcbio run:
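As a sketch (the core count is illustrative), a typical run executes from the work directory and points at the project configuration:

```
cd project/work
bcbio_nextgen.py ../config/project.yaml -n 8
```

This leaves intermediate processing in work and copies the final outputs to final, matching the structure above.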

System and sample configuration files

Two configuration files, in easy-to-write YAML format, specify details about your system and the samples to run:

Commented system and sample example files are available in the config directory.

Sample configuration

Sample information

The sample configuration file defines details of each sample to process:

details:
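As an illustrative sketch (field names follow the conventions used elsewhere in these docs; the values are hypothetical), a minimal details entry might look like:

```yaml
details:
  - analysis: variant2
    description: Example1
    genome_build: GRCh37
    files: [/path/to/sample1_1.fq.gz, /path/to/sample1_2.fq.gz]
    metadata:
      batch: batch1
      sex: female
```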

Automated sample configuration

bcbio-nextgen provides a utility to create configuration files for multiple sample inputs using a base template. Start with one of the best-practice templates, or define your own, then apply to multiple samples using the template workflow command:

bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

To make it easier to define your own project specific template, an optional first step is to download and edit a local template. First retrieve a standard template:

bcbio_nextgen.py -w template freebayes-variant project1

This pulls the current GATK best practice variant calling template into your project directory in project1/config/project1-template.yaml. Manually edit this file to define your options, then run the full template creation for your samples, pointing to this custom configuration file:

bcbio_nextgen.py -w template project1/config/project1-template.yaml project1.csv folder/*

If your sample folder contains additional BAM or FASTQ files you do not wish to include in the sample YAML configuration, you can restrict the output to only include samples in the metadata CSV with --only-metadata. The output will print warnings about samples not present in the metadata file, then leave these out of the final output YAML:

bcbio_nextgen.py -w template --only-metadata project1/config/project1-template.yaml project1.csv folder/*

Multiple files per sample

If you have multiple FASTQ or BAM files for each sample, you can use bcbio_prepare_samples.py. Its main parameters are --csv (the sample sheet) and --out (the merged output directory). The CSV file should look like:

samplename,description,batch,phenotype,sex,variant_regions
file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed

An example of usage is:

bcbio_prepare_samples.py --out merged --csv project1.csv

The script will create sample1.fastq and sample2.fastq in the merged folder, along with a new CSV file in the same folder as the input CSV: project1-merged.csv. This can then be used with bcbio:

bcbio_nextgen.py -w template project1/config/project1-template.yaml project1-merged.csv merged/*fastq

The new CSV file will look like:

samplename,description,batch,phenotype,sex,variant_regions
sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed

It supports parallelization in the same way bcbio_nextgen.py does:

python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py --out merged --csv project1.csv -t ipython -q queue_name -s lsf -n 1

See more examples at parallelize pipeline.

In case of paired reads, the CSV file should contain all files:

samplename,description,batch,phenotype,sex,variant_regions
file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed
file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed

The script will try to guess the paired files the same way that bcbio_nextgen.py -w template does. It detects paired files when the only difference between two filenames is _R1/_R2, -1/-2, _1/_2 or .1/.2.
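As an illustration only (not bcbio's actual implementation), the heuristic can be sketched in a few lines of shell: strip the read-pair token before the extension and compare what remains.

```shell
# Hypothetical sketch of paired-file detection: two files pair up when their
# names differ only by a read-pair token (_R1/_R2, -1/-2, _1/_2 or .1/.2).
stem() { echo "$1" | sed -E 's/(_R[12]|_[12]|-[12]|\.[12])(\.fastq|\.fq)$/\2/'; }

if [ "$(stem file1_R1.fastq)" = "$(stem file1_R2.fastq)" ]; then
    echo "paired"   # prints "paired"
fi
```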

The output CSV, which is compatible with bcbio, will look like:

samplename,description,batch,phenotype,sex,variant_regions
sample1,sample1,batch1,normal,female,/path/to/regions.bed

Algorithm parameters

Alignment

Read trimming

Alignment postprocessing

Coverage information

Analysis regions

These BED files define the regions of the genome to analyze and report on. variant_regions adjusts regions for small variant (SNP and indel) calling. sv_regions defines regions for structural variant calling if different than variant_regions. For coverage-based quality control metrics, we first use coverage if specified, then sv_regions if specified, then variant_regions. See the section on input file preparation for tips on ensuring chromosome naming in these files match your reference genome. bcbio pre-installs some standard BED files for human analyses. Reference these using the naming schemes described in the reference data repository.
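As a sketch, these regions are configured per sample under the algorithm section (the BED paths are placeholders):

```yaml
algorithm:
  variant_regions: /path/to/variant_regions.bed
  sv_regions: /path/to/sv_regions.bed
  coverage: /path/to/coverage.bed
```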

Standard

This pipeline implements alignment and QC tools. It also runs qsignature to detect possibly duplicated or mislabeled samples, using SNP signatures to build a distance matrix that makes it easy to group samples. The project YAML file will show the total number of samples analyzed, the number of very similar samples, and samples that could be duplicates.

We will assume that you installed bcbio-nextgen with the automated installer, and so your default bcbio_system.yaml file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them.

Let’s say that you have a single paired-end control lane, prepared with the Illumina TruSeq Kit from a human. Here is what a well-formed sample YAML file for that RNA-seq experiment would look like:

fc_date: '070113'
fc_name: control_experiment
upload:
  dir: final
details:
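Filled out, the details section for this sample might look like the following sketch (paths are placeholders; fields match the description below):

```yaml
details:
  - files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq]
    description: control
    genome_build: GRCh37
    analysis: RNA-seq
```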

fc_date and fc_name will be combined to form a prefix for naming intermediate files, and can be set to whatever you like. upload is explained in the configuration documentation; the above directs bcbio-nextgen to put the output files from the pipeline into the final directory. Under details is a list of sections, each describing a sample to process. You can set many parameters under each section, but most of the time just setting a few like the above is all that is necessary. analysis tells bcbio-nextgen to run the best-practice RNA-seq pipeline on this sample.

In the above, since there are two files, control_1.fastq and control_2.fastq, they will automatically be run as paired-end data. If you have single-end data you can just supply one file and it will run as single-end. The description field will eventually be used to rename the files, so make it evocative since you will be looking at it a lot later. genome_build is self-explanatory.

Sometimes you need a little more flexibility than the standard pipeline provides, and the algorithm section has many options to fine-tune its behavior. quality_format tells bcbio-nextgen what quality format your FASTQ inputs use; if your samples were sequenced any time after 2009 or so, you probably want to set it to standard. Adapter read-through is a problem in RNA-seq libraries, so we want to trim off possible adapter sequences on the ends of reads: trim_reads is set to read_through, which also trims off poor-quality ends. Since your library is an RNA-seq library prepared with the TruSeq kit, the adapters to trim are the TruSeq adapters and possible polyA tails, so adapters is set to both of those. strandedness can be set if your library was prepared in a strand-specific manner, to firststrand, secondstrand or unstranded (the default).
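Put together, those algorithm options might look like this sketch (values taken from the description above; check the current docs for the exact accepted spellings):

```yaml
algorithm:
  quality_format: standard
  trim_reads: read_through
  adapters: [truseq, polya]
  strandedness: unstranded
```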

Parallelization

Multiple samples

Let's say you have a set of mouse samples to analyze, and each sample is a single lane of single-end RNA-seq reads prepared using the Nextera kit. There are two case and two control samples. Here is a sample configuration file for that analysis:

fc_date: '070113'
fc_name: mouse_analysis
upload:
  dir: final
details:
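Filled out, one control and one case entry might look like this sketch (the paths and the mm10 build are assumptions for illustration):

```yaml
details:
  - files: [/full/path/to/control_rep1.fastq]
    description: control_rep1
    genome_build: mm10
    analysis: RNA-seq
  - files: [/full/path/to/case_rep1.fastq]
    description: case_rep1
    genome_build: mm10
    analysis: RNA-seq
```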

More samples are added just by adding more entries under the details section. This is tedious and error-prone to do by hand, so there is an automated template system for common experiments. You could set up the previous experiment by making a mouse version of the illumina-rnaseq template file and saving it to a local file such as illumina-mouse-rnaseq.yaml. Then you can set up the sample file using the templating system:

bcbio_nextgen.py -w template illumina-mouse-rnaseq.yaml mouse_analysis /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq

If you had paired-end samples instead of single-end samples, you can still use the template system as long as the forward and reverse read filenames are the same, barring a _1 and _2. For example, control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system.

Quality control

Post-processing

archive Specify targets for long-term archival. cram removes fastq read names and does 8-bin quality score compression of BAM files into CRAM format. cram-lossless generates CRAM files without changes to quality scores or fastq names. Default: [] – no archiving. Lossy CRAM has some issues, lossless CRAM provides good compression relative to BAM, and many machines output binned quality values now, so cram-lossless is what we recommend you use.
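Following that recommendation, a minimal sketch of the setting:

```yaml
algorithm:
  archive: [cram-lossless]
```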

Changing bcbio defaults

bcbio provides some hooks to change default behavior by turning specific defaults on or off with tools_on and tools_off. Both can be lists with multiple options:
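For instance (the option names here are illustrative; check the documentation for the valid values of each):

```yaml
algorithm:
  tools_on: [qualimap_full]
  tools_off: [gemini, vardict_somatic_filter]
```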

Resources

The resources section allows customization of locations of programs and memory and compute resources to devote to them:

resources:
  bwa:
    cores: 12
    cmd: /an/alternative/path/to/bwa
  samtools:
    cores: 16
    memory: 2G
  gatk:
    jvm_opts: ["-Xms2g", "-Xmx4g"]
  mutect2_filter:
    options: ["--max-events-in-region", "2"]

Temporary directory

You can also use the resources section to specify system-specific parameters like global temporary directories:

resources:
  tmp:
    dir: /scratch

This is useful on cluster systems with large attached local storage, where you can avoid some shared filesystem IO by writing temporary files to the local disk. When setting this keep in mind that the global temporary disk must have enough space to handle intermediates. The space differs between steps but generally you’d need to have 2x the largest input file per sample and account for samples running simultaneously on multiple core machines.

To handle clusters that specify local scratch space with an environment variable, bcbio will resolve environment variables like:

resources:
  tmp:
    dir: $YOUR_SCRATCH_LOCATION

Sample or run specific resources

To override any of the global resource settings in a sample-specific manner, add a resources section within your sample YAML configuration. For example, to create a sample-specific temporary directory and pass a command line option to novoalign, write a sample resource specification like:
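A sketch of such a sample-level resource override (the novoalign option shown is illustrative, not a recommendation):

```yaml
details:
  - description: Example
    analysis: variant2
    resources:
      novoalign:
        options: ["-o", "FullNW"]
      tmp:
        dir: tmp/sampletmpdir
```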

To adjust resources for an entire run, you can add this resources specification at the top level of your sample YAML:

details:

Input file preparation

Input files for supplementing analysis, like variant_regions, need to match the specified reference genome. A common cause of confusion is the two chromosome naming schemes for human genome build 37: UCSC-style in hg19 (chr1, chr2) and Ensembl/NCBI style in GRCh37 (1, 2). To help avoid this confusion, in build 38 we only support the commonly agreed-on chr1, chr2 style.

It’s important to ensure that the chromosome naming in your input files match those in the reference genome selected. bcbio will try to detect this and provide helpful errors if you miss it.

To convert chromosome names, you can use Devon Ryan’s collection of chromosome mappings as an input to sed. For instance, to convert hg19 chr-style coordinates to GRCh37:

wget --no-check-certificate -qO- https://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \
  | awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' > remap.sed
sed -f remap.sed original.bed > final.bed
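The remap step can be exercised self-contained, with a tiny inline mapping standing in for the downloaded GRCh37_UCSC2ensembl.txt (the two contig entries are illustrative):

```shell
# Build a two-entry mapping file instead of downloading the full one.
printf 'chr1\t1\nchrM\tMT\n' > remap.txt
# Turn each mapping into a sed substitution anchored at line start.
awk '{if($1!=$2) print "s/^"$1"/"$2"/g"}' remap.txt > remap.sed
printf 'chr1\t100\t200\nchrM\t5\t50\n' > original.bed
sed -f remap.sed original.bed > final.bed
cat final.bed   # chr1 renamed to 1, chrM renamed to MT
```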

Genome configuration files

Each genome build has an associated buildname-resources.yaml configuration file which contains organism specific naming and resource files. bcbio-nextgen expects a resource file present next to the genome FASTA file. Example genome configuration files are available, and automatically installed for natively supported genomes. Create these by hand to support additional organisms or builds.

The major sections of the file are:

By default, we place the buildname-resources.yaml files next to the genome FASTA files in the reference directory. For custom setups, you specify an alternative directory in the resources section of your bcbio_system.yaml file:

resources:
  genome:
    dir: /path/to/resources/files

Reference genome files

For human genomes, we recommend using build 38 (hg38). This is fully supported and validated in bcbio, and corrects a lot of issues in the previous build 37. We use the 1000 genomes distribution, which includes HLAs and decoy sequences. For human build 37, GRCh37 and hg19, we use the 1000 genomes references provided in the GATK resource bundle. These differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1, 2, 3. They also differ slightly in content: GRCh37 has masked pseudoautosomal regions on chromosome Y, allowing reads from these regions to align to chromosome X.

You can use pre-existing data and reference indexes by pointing bcbio-nextgen at these resources. We use the Galaxy .loc files approach to describing the location of the sequence and index data, as described in Data requirements. This does not require a Galaxy installation since the installer sets up a minimal set of .loc files. It finds these by identifying the root galaxy directory, in which it expects a tool-data sub-directory with the .loc files. It can do this in two ways:

To manually make genomes available to bcbio-nextgen, edit the individual .loc files with locations to your reference and index genomes. You need to edit sam_fa_indices.loc to point at the FASTA files and then any genome indexes corresponding to aligners you’d like to use (for example: bwa_index.loc for bwa and bowtie2_indices.loc for bowtie2). The database key names used (like GRCh37 and mm10) should match those used in the genome_build of your sample input configuration file.
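As a hedged sketch, .loc entries are tab-separated lines mapping a database key to an on-disk path; the exact column layout varies by file, so check the commented header of each installed .loc file before editing:

```
# hypothetical sam_fa_indices.loc entry (tab-separated; layout is illustrative)
index	GRCh37	/path/to/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa
```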

To remove a reference genome, delete its directory bcbio/genomes/species/reference and remove all the records corresponding to that genome from bcbio/galaxy/tool-data/*.loc files.

genomes/Hsapiens/hg38/seq/hg38-resources.yaml specifies relative locations of the resources. To determine the absolute path, bcbio fetches a value from bcbio/galaxy/tool-data/sam_fa_indices.loc and uses it as the base directory for all resources. If there are several installations of bcbio data, it is important to have a separate tool-data for each.

Adding custom genomes

bcbio_setup_genome.py installs a custom genome for variant and bulk-RNA-seq analyses and updates the configuration files.

bcbio_setup_genome.py \
  -f genome.fa \
  -g annotation.gtf \
  -i bwa star seq \
  -n Celegans -b WBcel135 --buildversion WormBase_34

Arguments:

References for many species are available from Ensembl:

If you want to add smallRNA-seq data files, you will need to add the three-letter miRBase code for your genome (e.g. hsa for human) and the GTF file for the annotation of smallRNA data. You can use the same file as the transcriptome if no other is available.

bcbio_setup_genome.py -f genome.fa -g annotation.gtf -i bowtie2 star seq -n Celegans -b WBcel135 --species cel --srna_gtf another_annotation.gtf --buildversion WormBase_34

To use that genome, you just need to configure your YAML files as:
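A minimal sketch, matching the -b WBcel135 build name used above (the analysis type is an assumption for illustration):

```yaml
details:
  - analysis: RNA-seq
    genome_build: WBcel135
```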

The GTF file you provide for the bcbio_setup_genome.py script must have the following features:

  1. each entry must have a transcript_id and a gene_id
  2. for each transcript there must be entries where the feature field (field 3) is exon, with coordinates describing the start and end of the exon

for example, this is a snippet from a valid GTF file:

1 pseudogene gene 11869 14412 . + . gene_source "ensembl_havana"; gene_biotype "pseudogene"; gene_id "ENSG00000223972"; gene_name "DDX11L1";
1 processed_transcript transcript 11869 14409 . + . transcript_source "havana"; gene_id "ENSG00000223972"; gene_source "ensembl_havana"; transcript_name "DDX11L1-002"; gene_biotype "pseudogene"; transcript_id "ENST00000456328"; gene_name "DDX11L1";
1 processed_transcript exon 11869 12227 . + . exon_number "1"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00002234944"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
1 processed_transcript exon 12613 12721 . + . exon_number "2"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00003582793"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";
1 processed_transcript exon 13221 14409 . + . exon_number "3"; transcript_source "havana"; gene_id "ENSG00000223972"; exon_id "ENSE00002312635"; gene_source "ensembl_havana"; transcript_id "ENST00000456328"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; gene_name "DDX11L1";

Effects prediction

To perform variant calling and predict effects in a custom genome you’d have to manually download and link this into your installation. First find the snpEff genome build:

$ snpEff databases | grep Lactobacillus | grep pentosus
Lactobacillus_pentosus_dsm_20314    Lactobacillus_pentosus_dsm_20314    ENSEMBL_BFMPP_32_179    http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
Lactobacillus_pentosus_kca1    Lactobacillus_pentosus_kca1    ENSEMBL_BFMPP_32_179    http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip

then download to the appropriate location:

$ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
$ mkdir snpEff
$ cd snpEff
$ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ find . -name "Lactobacillus_pentosus_dsm_20314"
./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
$ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .

finally add to your genome configuration file (seq/Lactobacillus_pentosus-resources.yaml):

aliases:
  snpeff: Lactobacillus_pentosus_dsm_20314

For adding an organism not present in snpEff, please see this mailing list discussion.

Upload

The upload section of the sample configuration file describes where to put the final output files of the pipeline. At its simplest, you can configure bcbio-nextgen to upload results to a local directory, for example a folder shared amongst collaborators or a Dropbox account. You can also configure it to upload results automatically to a Galaxy instance, to Amazon S3 or to iRODS. Here is the simplest configuration, uploading to a local directory:

upload:
  dir: /local/filesystem/directory

General parameters, always required:

Galaxy parameters:

Here is an example configuration for uploading to a Galaxy instance. This assumes you have a shared mounted filesystem that your Galaxy instance can also access:

upload:
  method: galaxy
  dir: /path/to/shared/galaxy/filesystem/folder
  galaxy_url: http://url-to-galaxy-instance
  galaxy_api_key: YOURAPIKEY
  galaxy_library: data_library_to_upload_to

Your Galaxy universe_wsgi.ini configuration needs to have allow_library_path_paste = True set to enable uploads.

S3 parameters:

For S3 access credentials, set the standard environmental variables, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION or use IAM access roles with an instance profile on EC2 to give your instances permission to create temporary S3 access.

iRODS parameters:

Example configuration:

upload:
  method: irods
  dir: ../final
  folder: /irodsZone/your/path/
  resource: yourResourceName

Uploads to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured connection through the iinit command.

Globals

You can define files used multiple times in the algorithm section of your configuration in a top level globals dictionary. This saves copying and pasting across the configuration and makes it easier to manually adjust the configuration if inputs change:

globals:
  my_custom_locations: /path/to/file.bed
details:
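For example, a sample can then reference the global by name (a sketch; variant_regions is shown as the consumer of the global):

```yaml
globals:
  my_custom_locations: /path/to/file.bed
details:
  - description: sample1
    algorithm:
      variant_regions: my_custom_locations
```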

Logging

Bcbio creates 3 log files:

Persistence

Every pipeline has multiple steps. Bcbio saves intermediate results in the work directory. If a step finished successfully (the alignment BAM is generated, the variants VCF is calculated, the PureCN normal DB is generated) and the pipeline failed at a subsequent step, the finished steps will not be re-calculated when the pipeline is re-run. If you'd like to re-generate data for a particular step, simply remove the corresponding work/step folder; for example, remove work/gemini if you'd like to re-generate a gemini database or purecn normaldb.