wANNOVAR: annotating genetic variants for personal genomes via the web

Author manuscript; available in PMC: 2013 Jul 1.

Abstract

Background

High-throughput DNA sequencing platforms have become widely available. As a result, personal genomes are increasingly being sequenced in research and clinical settings. However, the resulting massive amounts of variant data pose significant challenges to biologists and clinicians without bioinformatics skills.

Methods and results

We developed a web server called wANNOVAR to address the critical need for functional annotation of genetic variants from personal genomes. The server provides a simple and intuitive interface to help users determine the functional significance of variants. Its functions include annotating single nucleotide variants and insertions/deletions for their effects on genes, reporting their conservation levels (such as PhyloP and GERP++ scores), calculating their predicted functional importance scores (such as SIFT and PolyPhen scores), retrieving allele frequencies in public databases (such as the 1000 Genomes Project and NHLBI-ESP 5400 exomes), and implementing a ‘variants reduction’ protocol to identify a subset of potentially deleterious variants/genes. We illustrated how wANNOVAR can help draw biological insights from sequencing data by analysing genetic variants from two Mendelian diseases.

Conclusions

We conclude that wANNOVAR will help biologists and clinicians take advantage of the personal genome information to expedite scientific discoveries. The wANNOVAR server is available at http://wannovar.usc.edu, and will be continuously updated to reflect the latest annotation information.

INTRODUCTION

Over the past 5 years, massively parallel DNA sequencing platforms have become widely available [1]. As a result, variant data on genomes from healthy subjects and patients are being generated at an unprecedented rate. However, the development of bioinformatics tools for handling these data lags behind, creating a gap between the generation of massive data and the ability to fully exploit their biological content. To meet this urgent demand, we previously developed the ANNOVAR (ANNOtate VARiation) software for functional annotation of genetic variants from sequence data [2]. ANNOVAR efficiently uses up-to-date information to annotate genetic variants detected from diverse genomes with user-specified versions of genome builds. Although ANNOVAR has become one of the most widely used annotation tools for sequencing data, the requirement to type command line arguments makes it inaccessible to many biologists and clinicians who would otherwise benefit from its extensive functionality.

Therefore, we developed a web server called wANNOVAR to facilitate web-based personal genome annotation, using ANNOVAR as the backend annotation engine. Users simply need to submit a list of variants (even whole-exome or whole-genome variants), and wANNOVAR processes the submission and generates HTML-based result pages. It allows flexibility by permitting users to select customised filtering criteria and identify a subset of prioritised variants from thousands or even millions of input variants. Below, we describe the implementation of the wANNOVAR server and illustrate its utility using two high-throughput sequencing data sets on Mendelian diseases.

METHODS

The web server is composed of a web interface and a background program for executing annotation tasks. Our tests indicated that the server performed well under a light load of user queries. For example, annotating an exome with ~20 000 SNPs and indels takes only a few minutes on the server. The subroutines for handling user queries were written in Perl using the Common Gateway Interface module (CGI.pm). The static and dynamic HTML pages have been tested in different versions of the Internet Explorer, Firefox and Google Chrome browsers.

Input fields for the wANNOVAR server include a sample identifier, an email address, a variant file, the reference genome build, the gene definition system and optionally a disease model for running the ‘variants reduction’ pipeline. The default input format for the variant file is variant call format (VCF) [3], a text format that contains meta-information lines, a header line, and data lines, each containing information about a position in the genome. The server can also handle other input formats, including the ANNOVAR input format, the Complete Genomics ASM.tsv format and the GFF3-SOLiD format. Currently, the input file size is restricted to less than 200 MB, and the input file can be compressed in .gz or .zip format. If all input fields are correctly set, the server will return a webpage with a URL for the results page.
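The three kinds of VCF lines described above can be told apart by their prefixes. The following Python snippet is a minimal sketch of that distinction, not the server's actual parser (which is Perl-based); the function name `split_vcf_lines` and the toy record are assumptions made for illustration.

```python
import io

def split_vcf_lines(lines):
    """Split VCF text into meta-information lines, header columns, and data records."""
    meta, header, records = [], None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):      # meta-information line
            meta.append(line)
        elif line.startswith("#"):     # the header line (#CHROM POS ID REF ALT ...)
            header = line[1:].split("\t")
        else:                          # data line describing one genomic position
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            records.append({"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt})
    return meta, header, records

# Toy input: the coordinate and alleles are made up for illustration.
TOY_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "X\t100\t.\tA\tG\t50\tPASS\t.\n"
)
meta, header, records = split_vcf_lines(io.StringIO(TOY_VCF))
print(len(meta), header[:2], records[0])
```

Only the first five fixed columns are read here; genotype columns and INFO fields are ignored in this sketch.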

The results page contains a collection of functional annotations for variant calls. Users can download the ‘exome summary results’ or the ‘genome summary results’ as Excel-compatible files or tab-delimited files, or choose to view the annotation results in a table on the webpage. The annotations on all variants are grouped into several broad categories including gene annotation, variation databases, functional prediction and region annotations (table 1). Several functional prediction scores for exonic variants from the dbNSFP database [4], including SIFT [5], PolyPhen [6], LRT [7], MutationTaster [8] and PhyloP [9], are also provided in the wANNOVAR server to help users judge the functionality of variants using multiple sources of information. As previously described, wANNOVAR can perform a ‘variants reduction’ procedure to identify a subset of the most likely causal variants/genes for Mendelian diseases from a large list of variants on personal genomes [2]. For example, users can remove variants observed in public databases such as the 1000 Genomes Project [10], NHLBI-ESP 5400 exomes [11] and dbSNP [12] with a specific minor allele frequency cut-off. The server uses modified versions of dbSNP that exclude all SNPs flagged as ‘clinically associated’ by dbSNP. We provide several default pipelines for different disease models such as ‘rare recessive Mendelian disease’ and ‘rare dominant Mendelian disease’, but users can also use ‘advanced options’ to specify a custom filtering strategy (table 2).
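The ‘variants reduction’ procedure is essentially a cascade of filters, each applied to the survivors of the previous step. The Python snippet below is a minimal sketch of such a cascade under stated assumptions: the dictionary keys (`effect`, `maf_1000g`, `maf_esp`, `in_dbsnp`, `clinically_associated`), the function name and the six toy variants are all hypothetical, and the step order follows the default pipeline rows of table 2 rather than the server's actual code.

```python
PROTEIN_ALTERING = {"nonsynonymous", "stopgain", "splicing"}

def reduce_variants(variants, maf_cutoff=0.01):
    """Apply the filters in order and record how many variants survive each step."""
    steps = [
        ("protein-altering", lambda v: v["effect"] in PROTEIN_ALTERING),
        ("1000G MAF <= cutoff", lambda v: v.get("maf_1000g", 0.0) <= maf_cutoff),
        ("ESP MAF <= cutoff", lambda v: v.get("maf_esp", 0.0) <= maf_cutoff),
        # dbSNP entries flagged as clinically associated are NOT used for removal.
        ("not in filtered dbSNP",
         lambda v: not v.get("in_dbsnp", False) or v.get("clinically_associated", False)),
    ]
    kept = list(variants)
    counts = [("input", len(kept))]
    for label, keep in steps:
        kept = [v for v in kept if keep(v)]
        counts.append((label, len(kept)))
    return kept, counts

# Hypothetical toy variants, one per filtering outcome.
toy = [
    {"id": "v1", "effect": "nonsynonymous"},
    {"id": "v2", "effect": "synonymous"},
    {"id": "v3", "effect": "nonsynonymous", "maf_1000g": 0.20},
    {"id": "v4", "effect": "stopgain", "maf_esp": 0.05},
    {"id": "v5", "effect": "nonsynonymous", "in_dbsnp": True},
    {"id": "v6", "effect": "splicing", "in_dbsnp": True, "clinically_associated": True},
]
kept, counts = reduce_variants(toy)
for label, n in counts:
    print(f"{label}: {n}")
```

Recording the count after every step mirrors the layout of table 2, where each row reports the number of variants remaining.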

Table 1.

Selected annotation tasks from the wANNOVAR server

| Type | Column | Description |
|---|---|---|
| Gene annotation | Variant function | Exonic, intronic, intergenic, UTR, etc |
| Gene annotation | Gene | Impacted gene or neighbouring gene (with distance) |
| Gene annotation | Exonic variant function | Non-synonymous, synonymous, stopgain, etc |
| Gene annotation | AAChange | mRNA and amino acid change for coding variants |
| Variation databases | ESP5400_ALL | Allele frequency in 5400 NHLBI-ESP exomes |
| Variation databases | 1000G_ALL | Allele frequency in 1000 Genomes Project (currently, version 2012 Feb) |
| Variation databases | dbSNP | dbSNP identifier (currently, version 135) |
| Functional prediction | AVSIFT | Base-level SIFT scores |
| Functional prediction | LJB_SIFT | 1-SIFT scores and predictions (D: damaging; T: tolerated) |
| Functional prediction | LJB_PolyPhen2 | PolyPhen 2 scores and predictions (D: probably damaging; P: possibly damaging; B: benign) |
| Functional prediction | LJB_LRT | LRT scores and predictions (D: deleterious; N: neutral; U: unknown) |
| Functional prediction | LJB_MutationTaster | MutationTaster scores and predictions (A: disease_causing_automatic; D: disease_causing; N: polymorphism; P: polymorphism_automatic) |
| Functional prediction | LJB_PhyloP | PhyloP conservation scores and predictions (C: conserved; N: non-conserved) |
| Functional prediction | GERP++ | GERP++ scores for exonic variants |
| Region annotation | Conserved | Region-level phastCons LOD scores |
| Region annotation | SegDup | Located in segmental duplication region and the sequence identity score |

Table 2.

Illustration of the “variants reduction” pipeline on the Ogden syndrome data set and the synthetic Miller syndrome data set

Data sets: Ogden (exome variants in hg19 coordinates) and Miller (genome variants in hg18 coordinates). Each cell gives the number of variants remaining after the step; an empty cell means the step was not part of that pipeline.

| Variants reduction strategy | Ogden default | Ogden custom | Miller default | Miller custom 1 | Miller custom 2 |
|---|---|---|---|---|---|
| Input variants | 1479 | 1479 | 4702187 | 4702187 | 4702187 |
| Identify missense, nonsense and splicing variants | 136 | 136 | 12410 | 12410 | 12410 |
| Identify variants from conserved regions | | | | 5395 | |
| Remove variants in segmental duplication regions | | | | 5135 | |
| Remove variants observed in user-supplied controls | | 16* | | | |
| Remove variants observed in the 1000 Genomes Project with MAF>1% | 19 | 3 | 2275 | 1116 | 2275 |
| Remove variants observed in the NHLBI-ESP 5400 exomes with MAF>1% | 14 | 3 | 1256 | 740 | 1256 |
| Remove variants in dbSNP (excluding clinically associated SNPs) | 1 | 1 | 516 | 313 | 516 |
| Remove variants with SIFT score >0.05 | | 1 | | | 395 |
| Remove variants with PolyPhen2 score <0.85 | | 1 | | | 351 |
| Final list of candidate genes based on disease model | 1 | 1 | 24 | 10 | 14 |
| Correct causal gene identified? | Yes | Yes | Yes | Yes | No |

RESULTS

Analysis of a real exome sequencing data set on Ogden syndrome

To demonstrate the utility of the wANNOVAR server, we analysed variant calls from a family segregating Ogden syndrome ([MIM: 300855]). Thirty years ago, Ogden syndrome was discovered as an X-linked lethal infantile disorder, and its genetic basis was recently solved by next-generation sequencing [13]. The disease is characterised by postnatal growth failure with severe delays and dysmorphic features, and is caused by a mutation in the NAA10 gene, leading to an N-terminal acetyltransferase deficiency. For the family with Ogden syndrome, exon-capture sequencing data were aligned by BWA [14] and genotypes were called by GATK [15] as VCF [3] files in hg19 coordinates. We submitted all chromosome X variants (1318 single nucleotide variants and 161 indels) in the proband to the wANNOVAR server, and tested the ‘variants reduction’ procedure using the default ‘rare recessive Mendelian disease’ pipeline and a custom pipeline (table 2). Compared with the default pipeline, the custom pipeline additionally filtered the variant set against two unaffected family members and identified deleterious variants using SIFT/PolyPhen scores. Both pipelines identified a hemizygous mutation (p.S37P) within a single candidate gene, NAA10, and this was precisely the known causal variant in this family [13]. Detailed examination of the ‘exome summary’ table demonstrated that this variant has a SIFT [5] score of 0 (prediction: damaging), PolyPhen [6] score of 0.96 (prediction: probably damaging), LRT [7] score of 1 (prediction: deleterious), MutationTaster [8] score of 1 (prediction: disease causing), PhyloP [9] score of 0.96 (prediction: conserved) and GERP++ [16] score of 3.55 (prediction: highly constrained). The variant is not observed in the 1000 Genomes Project [10], dbSNP [12] version 135 (after removing SNPs flagged as ‘clinically associated’) or the NHLBI-ESP 5400 exomes [11]. Therefore, converging bioinformatics evidence supports the conclusion that this variant may affect protein function.

Analysis of a synthetic whole-genome sequencing data set on Miller syndrome

We next evaluated wANNOVAR on millions of genetic variants from whole-genome sequencing. We used a synthetic data set of a male subject with ~4.2 million single nucleotide variants and ~0.5 million indels [17], supplemented with two variants (p.G152R and p.G202A) in DHODH known to cause Miller syndrome ([MIM: 263750]) [18]. This synthetic data set was previously used to illustrate the ‘variants reduction’ procedure [2]. With the default ‘rare recessive Mendelian disease’ pipeline (table 2), the large number of input variants was drastically reduced to 516, and 24 candidate genes were identified, including the causal gene DHODH. We also tested a custom pipeline that additionally identifies variants in conserved genomic regions [19] and outside of segmental duplication regions [20] (table 2). This custom pipeline identified ten candidate genes including DHODH, similar to what has been previously reported [2]. Finally, we tested a different custom pipeline that additionally removes variants with SIFT score >0.05 and PolyPhen2 score <0.85. This custom pipeline identified 14 candidate genes (table 2), but DHODH was not among them because one of the mutations (p.G202A) was predicted as tolerated by SIFT (score =0.18) and benign by PolyPhen (score =0.69). However, we note that the variant was correctly predicted as deleterious by LRT, MutationTaster, PhyloP and GERP++. We caution that these algorithms present predictions that help users prioritise variants/genes, but the true sensitivity/specificity will depend on many factors, and none of the algorithms constitutes proof that a variant is disease causal. In summary, this example confirms the utility of the wANNOVAR server in identifying a prioritised list of candidate disease causal genes, while underscoring the need for judicious use of function prediction scores.
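The loss of DHODH under the score-based custom pipeline can be traced with a small Python sketch. The score thresholds and the scores for p.S37P and p.G202A are taken from the text; the recessive rule (one homozygous variant or at least two heterozygous variants per gene), the heterozygous genotypes assigned to the two DHODH variants and the function names are assumptions made for illustration, not a description of the server's implementation.

```python
def passes_score_filter(sift, polyphen2, sift_max=0.05, polyphen2_min=0.85):
    """Keep a variant only if it survives both removals:
    SIFT score > sift_max is removed, PolyPhen2 score < polyphen2_min is removed."""
    return sift <= sift_max and polyphen2 >= polyphen2_min

def recessive_candidate(genotypes):
    """Assumed recessive rule: one homozygous hit or two or more heterozygous hits."""
    return "hom" in genotypes or sum(g == "het" for g in genotypes) >= 2

# Scores reported in the text.
s37p_kept = passes_score_filter(sift=0.0, polyphen2=0.96)    # NAA10 p.S37P
g202a_kept = passes_score_filter(sift=0.18, polyphen2=0.69)  # DHODH p.G202A

# Assumption: both DHODH variants are heterozygous in the synthetic genome.
dhodh_before = recessive_candidate(["het", "het"])
# p.G202A is removed by the score filter, so at most one variant remains,
# whatever the scores of p.G152R are.
dhodh_after = recessive_candidate(["het"])

print(s37p_kept, g202a_kept, dhodh_before, dhodh_after)
```

Under these assumptions, removing a single variant on score grounds is enough to drop a recessive gene from the candidate list, which is why hard score cut-offs deserve caution.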

DISCUSSION

In this manuscript, we presented a web server called wANNOVAR for performing web-based functional annotation of genetic variants from personal genomes. Below we compare the server with other competing approaches and discuss potential future extensions and development.

Several similar web servers exist, including SIFT [5], PolyPhen [6] and the SeattleSeq server [21]. The wANNOVAR server already incorporates SIFT and PolyPhen2 scores together with additional scoring systems (table 1) [4]. The wANNOVAR server differs from SeattleSeq in three respects: (1) it allows flexibility by permitting users to select gene definition systems, including RefSeq genes [22], ENSEMBL genes [23], UCSC genes [24] or GENCODE genes [25]. Compared with the manually compiled RefSeq gene definitions, ENSEMBL genes and UCSC genes are supplemented with computational predictions of transcripts and genes. The GENCODE genes are compiled by a combination of initial manual annotation and experimental validation by the GENCODE consortium, and a refinement of the annotation based on these experimental results. All four gene definition systems are widely used in human genomic studies; (2) wANNOVAR produces more annotation results, including predicted functional importance scores for non-synonymous variants; (3) wANNOVAR builds in a ‘variants reduction’ pipeline to facilitate identifying potential disease causal variants and genes from personal genomes.

The wANNOVAR server will be under constant development to improve its functionality. Our future plans include the following. First, we will explore the possibility of allowing FTP access for users with limited internet connection speed for uploading files. Second, we will add more annotation tasks for non-coding variants, splicing variants and UTR variants. Currently, the available annotations are strongly biased towards non-synonymous variants. With the accumulation of cell-type specific data on functional elements from large-scale genomics projects such as the ENCODE project [26], and the development of bioinformatics methods and databases [27-32], we will be able to provide more annotations for variants outside of coding regions. Third, we will test the use of a backend computing cluster rather than a frontend web server to perform the actual annotation tasks, in order to handle multiple simultaneous user queries. Fourth, we will explore the use of GALAXY [33] and design a plug-in based on ANNOVAR for better annotating, processing and visualising variants.

In summary, wANNOVAR is an easy-to-use online tool for batch annotation of genetic variants. Given the rapid generation and accumulation of whole-exome or whole-genome sequencing data in research and clinical settings, we expect that wANNOVAR will help biologists and clinicians take advantage of personal genome information in various medical genetics applications.

Acknowledgements

The authors thank James Knowles, Jalas Chaim, Mingyao Li, Gholson Lyon and the three anonymous reviewers for testing the server and providing valuable feedback.

Funding The study is supported by start-up funds from the Zilkha Neurogenetic Institute and grant number HG006465 from NIH/NHGRI (K.W.).

Footnotes

Contributors XC developed and implemented the tool and drafted the manuscript. KW supervised the implementation of the project and revised the manuscript. All the authors read and approved the manuscript.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

Data sharing statement The two data sets used in the manuscript are available to users in the “Example” section of the wANNOVAR server.

REFERENCES