
VAPr

Variant Annotation and Prioritization package

This package is aimed at providing a way of retrieving variant information using ANNOVAR and myvariant.info. In particular, it is suited for bioinformaticians interested in aggregating variant information into a single NoSQL database (MongoDB solely at the moment).

Documentation now live at: http://vapr.readthedocs.io/en/latest/

DOI: Efficient population-scale variant analysis and prioritization with VAPr

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details

Table of contents

  1. Background
    1.1. Data Models
  2. Getting Started
  3. Tutorial
    3.1. Workflow Overview
    3.2. VaprAnnotator - Tips on usage
    3.2.1 Arguments
    3.3. Core Methods
    3.3.1 Annovar
    3.3.2 Annotation
    3.3.3 Filtering
    3.3.4 Output Files

Background

VAPr was developed to simplify the steps required to get mutation data from a VCF file to a downstream analysis process. A query system was implemented allowing users to quickly slice the genomic variant (GV) data and select variants according to their characteristics, allowing researchers to focus their analysis only on the subset of data that contains meaningful information. Further, this query system allows the user to select the format in which the data can be retrieved. Most notably, CSV or VCF files can be retrieved from the database, allowing any researcher to quickly filter variants and retrieve them in commonly used formats. The package can also be installed and used without having to download ANNOVAR. In that case, variant data can be retrieved solely by MyVariant.info and rapidly parsed to the MongoDB instance.

Data Models

The annotation process identifies every unique variant in the union of variants found for the input samples; it then submits batches (of a user-specifiable size) of variant ids to MyVariant.info and stores the resulting annotation information to the local MongoDB. Subsequent filtering and output of the resulting annotations is done against the MongoDB rather than via additional calls to MyVariant.info, allowing the user to investigate multiple different filtering strategies on a given annotation run without additional overhead. Note that, by design, each run of annotate() performs new annotation calls to MyVariant.info rather than attempting to find potentially relevant past annotations in the MongoDB; this is because MyVariant.info is continually updated live, and we anticipate that users will want to receive the latest annotations each time they choose to annotate, rather than potentially “stale” annotations from past runs.

Intuitively, variant data could be stored in SQL-like databases, since annotation files are usually produced in VCF or CSV formats. However, a different approach proves more fruitful. As explained in our paper (currently under review), the abundance and diversity of genomic variant data cause SQL schemas to perform poorly for variant storage and querying. A single variant can have over 500 distinct fields and sub-fields, many of them nested, and the set of populated fields varies widely from variant to variant. Creating a pre-defined schema (as required by SQL-like engines) becomes practically impossible: representing such variants in a table format would result in highly sparse and inefficient storage. Representing a variant atomically instead, that is, as a standalone JSON object with no pre-defined schema, compresses the rich data into a more manageable format. A sample entry in the MongoDB database will look like this. The variety of data that can be retrieved results from the richness of the databases accessible through MyVariant.info. However, not every variant will have such data readily available. In some cases, the data will be restricted to what can be inferred from the VCF file and the annotation carried out with ANNOVAR. In that case, the entries found in the document will be the following:

    {'1000g2015aug_all': 0.00579073,
 '_id': ObjectId('5a0d4c5b59f987f13d76aa17'),
 'alt': 'A',
 'cadd': {'1000g': {'af': 0.01, 'afr': 0.002, 'amr': 0.01, 'eur': 0.02},
          '_license': 'http://goo.gl/bkpNhq',
          'esp': {'af': 0.017, 'afr': 0.005, 'eur': 0.023},
          'gerp': {'n': 3.47, 'rs': 350.8, 'rs_pval': 8.50723e-58, 's': 1.47},
          'phred': 19.55,
          'polyphen': {'cat': 'benign', 'val': 0.017},
          'sift': {'cat': 'tolerated', 'val': 0.43}},
 'chr': '1',
 'clinvar': {'_license': 'https://goo.gl/OaHML9',
             'rcv': [{'accession': 'RCV000017600',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C2751604'},
                                     'name': 'Epilepsy, juvenile myoclonic 7 '
                                             '(EJM7)',
                                     'synonyms': ['EPILEPSY, JUVENILE '
                                                  'MYOCLONIC, SUSCEPTIBILITY '
                                                  'TO, 7',
                                                  'EPILEPSY, IDIOPATHIC '
                                                  'GENERALIZED, SUSCEPTIBILITY '
                                                  'TO, 10; EPILEPSY, JUVENILE '
                                                  'MYOCLONIC, SUSCEPTIBILITY '
                                                  'TO, 7']}},
                     {'accession': 'RCV000017599',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C3150401'},
                                     'name': 'Generalized epilepsy with '
                                             'febrile seizures plus type 5 '
                                             '(GEFSP5)'}},
                     {'accession': 'RCV000022558',
                      'clinical_significance': 'risk factor',
                      'conditions': {'identifiers': {'medgen': 'C2751603',
                                                     'omim': '613060'},
                                     'name': 'Epilepsy, idiopathic generalized '
                                             '10 (EIG10)',
                                     'synonyms': ['EPILEPSY, IDIOPATHIC '
                                                  'GENERALIZED, SUSCEPTIBILITY '
                                                  'TO, 10']}}]},
 'dbsnp': {'_license': 'https://goo.gl/Ztr5rl', 'rsid': 'rs41307846'},
 'end': 1959699,
 'exonicfunc_knowngene': 'nonsynonymous SNV',
 'func_knowngene': 'exonic',
 'gene_knowngene': 'GABRD',
 'hgvs_id': 'chr1:g.1959699G>A',
 'ref': 'G',
 'samples': [{'AD': [17, 20],
              'genotype': '0/1',
              'genotype_likelihoods': [400.0, 0.0, 314.0],
              'genotype_subclass_by_class': {'heterozygous': 'reference'},
              'sample_id': 'S1'}],
 'start': 1959699,
 'wellderly': {'_license': 'https://goo.gl/e8OO17',
               'alleles': [{'allele': 'A', 'freq': 0.015},
                           {'allele': 'G', 'freq': 0.985}]}}

Getting started

These instructions will get you a copy of the package up and running on your local machine, and will enable you to run annotation jobs on any number of VCF files while storing the data in MongoDB. See the Workflow Overview section below for a summary of the steps involved.

Prerequisites

Python 3 and MongoDB

VAPr is written in Python and stores variant annotations in a NoSQL database, using a locally-installed instance of MongoDB. Installation instructions for MongoDB are available in its official documentation.

BCFtools

BCFtools will be used for VCF file merging between samples. To download and install:

wget https://github.com/samtools/bcftools/releases/download/1.6/bcftools-1.6.tar.bz2
tar -vxjf bcftools-1.6.tar.bz2
cd bcftools-1.6
make
make install
export PATH=/where/to/install/bin:$PATH

Refer here for installation debugging.

Tabix

Tabix and bgzip binaries are available through the HTSlib project:

wget https://github.com/samtools/htslib/releases/download/1.6/htslib-1.6.tar.bz2
tar -vxjf htslib-1.6.tar.bz2
cd htslib-1.6
make
make install
export PATH=/where/to/install/bin:$PATH

Refer here for installation debugging.

ANNOVAR

(It is possible to proceed without installing ANNOVAR. Variants will only be annotated with MyVariant.info. In that case, users can skip the next steps and go straight to the section Known Variant Annotation and Storage.)

Users who wish to annotate novel variants will also need to have a local installation of the popular command-line software ANNOVAR(1), which VAPr wraps with a Python interface. If you use ANNOVAR's functionality through VAPr, please remember to cite the ANNOVAR publication (see #1 in Citations)!

The base ANNOVAR program must be installed by each user individually, since its license agreement does not permit redistribution. Please visit the ANNOVAR download form here, ensure that you meet the requirements for a free license, and fill out the required form. You will then receive an email providing a link to the latest ANNOVAR release file. Download this file (which will usually have a name like annovar.latest.tar.gz) and place it in the location on your machine in which you would like the ANNOVAR program and its data to be installed; the databases will take around 25 GB of disk space, so make sure you have that space available!

VAPr

VAPr is available from PyPI. Once the above requirements have been installed, VAPr itself can be installed by just running:

pip install VAPr

Tutorial

See this jupyter notebook to create your first annotation job

Table of contents

Supplemental Information

Workflow Overview

Workflow

VaprAnnotator - Tips on Usage

An annotation project can be started by providing the API with a small set of information and then running the core methods provided to spawn annotation jobs. This is done in the following manner:

Import core module

from VAPr import vapr_core
import os

Start by specifying the project information

IN_PATH = "/path/to/vcf"
OUT_PATH = "/path/to/out"
ANNOVAR_PATH = "/path/to/annovar"
MONGODB = 'VariantDatabase'
COLLECTION = 'Cancer'

annotator = vapr_core.VaprAnnotator(input_dir=IN_PATH,
                                    output_dir=OUT_PATH,
                                    mongo_db_name=MONGODB,
                                    mongo_collection_name=COLLECTION,
                                    build_ver='hg19',
                                    vcfs_gzipped=False,
                                    annovar_install_path=ANNOVAR_PATH)

This will allow you to use any of the core methods in the package.

Downloading ANNOVAR Databases

If you plan to use ANNOVAR, please make sure to download the necessary ANNOVAR databases, as the ANNOVAR installation does not include them by default. The VaprAnnotator object has a method, download_annovar_databases(), that will download the necessary databases. Note: this command only needs to be run the first time you use VAPr.

annotator.download_annovar_databases()

dataset = annotator.annotate(num_processes=8)

Required Arguments

The first four arguments are required to run the full annotation pipeline. These are input_dir, output_dir, mongo_db_name, and mongo_collection_name.

Optional Arguments

      Sample_Names  
      /path/to/file1.vcf  
      /path/to/file2.vcf  
      /path/to/file3.vcf  

Table of contents

Core Methods

The VaprAnnotator object provides a variety of methods. These include:

The differences and nuances of each will be discussed next.

Annovar

download_annovar_databases

download_annovar_databases(): this function downloads the databases required to run Annovar to the .../annovar/humandb/ directory. It will download the databases according to the genome version specified. If your databases are out-of-date, re-running this command will download the latest version of them. If you currently have the required databases, you may get an error.

Args:

Required:

Optional:

Annotation

annotate

annotate(): this requires running Annovar beforehand, and will kick-start the main functionality of this package. Namely, it will collect all the variant data from Annovar annotations, combine it with data coming from MyVariant.info, and parse it to MongoDB, in the database and collection specified in project_data.

Args:

Required:

Optional:

annotate_lite(): execution skips the ANNOVAR annotation step. It will grab the HGVS ids from the VCF files and query the variant data from MyVariant.info. Note that some variants may end up with completely empty annotation data, and native VAPr queries cannot be run on the resulting collection.

Args:

Required:

Optional:
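The HGVS ids that annotate_lite() queries against MyVariant.info follow the pattern seen in the sample document above (hgvs_id: 'chr1:g.1959699G>A'). As an illustration only (not VAPr's internal code), such an id for a simple SNV can be built from the VCF fields like this:

```python
def snv_hgvs_id(chrom, pos, ref, alt):
    """Build an HGVS genomic id for a single-nucleotide variant,
    e.g. chr1:g.1959699G>A. Indels use a different HGVS syntax
    and are not handled by this sketch."""
    return "chr{}:g.{}{}>{}".format(chrom, pos, ref, alt)

print(snv_hgvs_id("1", 1959699, "G", "A"))  # chr1:g.1959699G>A
```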

Filtering Methods

Four different pre-made filters that allow for the retrieval of specific variants have been implemented. Refer to the [README.md](link to readme/filters) file for more information about the fields and thresholds used.

Usage

In order to use the filters, proceed as follows:

rare_deleterious_variants = dataset.get_rare_deleterious_variants()

This will return a list of dictionaries, where each dictionary contains the annotations of a single variant.

Available filters

Filter #1: Rare Deleterious Variants

get_rare_deleterious_variants(): this will retrieve all the variants in your collection matching the thresholds specified in the README.md file.

Args:

Required:

Optional:

Filter #2: Known Disease Variants

get_known_disease_variants(): this will retrieve all the variants in your collection matching the thresholds specified in the README.md file.

Args:

Required:

Optional:

Filter #3: Deleterious Compound Heterozygous Variants

get_deleterious_compound_heterozygous_variants(): this will retrieve all the variants in your collection matching the thresholds specified in the README.md file.

Args:

Required:

Optional:

Filter #4: De novo Variants

get_de_novo_variants(): this will retrieve all the variants in your collection matching the thresholds specified in the README.md file.

Args:

Required:

Create your own filter

As long as you have a MongoDB instance running, filtering can be performed through pymongo as shown by the code below. If you want a list rather than a cursor, simply add: filtered = list(filtered)

If you'd like to customize your filters, a good idea would be to look at the available fields to be filtered. Looking at the myvariant.info documentation, you can see all the fields that are available and can be used for filtering.

from pymongo import MongoClient

client = MongoClient()
db = getattr(client, mongodb_name)
collection = getattr(db, mongo_collection_name)

filtered = collection.find({"$and": [
    {"$or": [{"func_knowngene": "exonic"}, {"func_knowngene": "splicing"}]},
    {"cosmic70": {"$exists": True}},
    {"1000g2015aug_all": {"$lt": 0.05}}
]})
filtered = list(filtered)
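To see what this query selects without a running MongoDB instance, the same logic can be expressed in plain Python over a list of variant dicts (a sketch for illustration only):

```python
def matches_filter(variant):
    """Plain-Python equivalent of the Mongo query above: exonic or
    splicing, present in COSMIC, and rare in 1000 Genomes."""
    in_region = variant.get("func_knowngene") in ("exonic", "splicing")
    in_cosmic = "cosmic70" in variant
    freq = variant.get("1000g2015aug_all")
    # Mongo's {"$lt": 0.05} only matches documents where the field
    # exists; the "is not None" check mirrors that behavior.
    is_rare = freq is not None and freq < 0.05
    return in_region and in_cosmic and is_rare

variants = [
    {"func_knowngene": "exonic", "cosmic70": "ID1", "1000g2015aug_all": 0.005},
    {"func_knowngene": "intronic", "cosmic70": "ID2", "1000g2015aug_all": 0.005},
]
filtered = [v for v in variants if matches_filter(v)]
print(len(filtered))  # 1
```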

Output Files

Write Options #1: Unfiltered Variants CSV

write_unfiltered_annotated_csv(): All variants will be written to a CSV file.

Args:

Required:

Write Options #2: Filtered Variants CSV

write_filtered_annotated_csv(): List of filtered variants will be written to a CSV file.

Args:

Required:

Write Options #3: Unfiltered Variants VCF

write_unfiltered_annotated_vcf(): All variants will be written to a VCF file.

Args:

Required:

Optional:

Write Options #4: Filtered Variants VCF

write_filtered_annotated_vcf(): List of filtered variants will be written to a VCF file.

Args:

Required:

List of rare deleterious variants

filtered_variants = dataset.get_rare_deleterious_variants()

Write variants to vcf file

dataset.write_filtered_annotated_vcf(filtered_variants, output_dir + "/myfile.vcf")
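If you prefer to post-process filtered variants yourself rather than using the write methods, flattening the top-level fields of each variant dict into CSV rows is straightforward (a sketch using Python's standard csv module; the field names are taken from the sample document above):

```python
import csv
import io

def variants_to_csv(variants, fieldnames):
    """Write the chosen top-level fields of each variant dict to CSV,
    leaving cells blank for fields a variant lacks."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for v in variants:
        writer.writerow({k: v.get(k, "") for k in fieldnames})
    return buf.getvalue()

rows = variants_to_csv(
    [{"hgvs_id": "chr1:g.1959699G>A", "gene_knowngene": "GABRD"}],
    ["hgvs_id", "gene_knowngene", "1000g2015aug_all"],
)
print(rows)
```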

Citations