VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment
Journal Article
1Program in Computational Biology and Bioinformatics, 2Department of Molecular Biophysics and Biochemistry, 3Department of Computer Science, Yale University, New Haven, CT 06520, USA, 4Department of Pathology and Laboratory Medicine, Weill Cornell Medical College, New York, NY 10065, 5Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY 10021, 6Department of Chemistry, Yale University, New Haven, CT 06520 and 7Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
*To whom correspondence should be addressed.
Received: 06 January 2012
Revision received: 25 May 2012
Lukas Habegger, Suganthi Balasubramanian, David Z. Chen, Ekta Khurana, Andrea Sboner, Arif Harmanci, Joel Rozowsky, Declan Clarke, Michael Snyder, Mark Gerstein, VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment, Bioinformatics, Volume 28, Issue 17, September 2012, Pages 2267–2269, https://doi.org/10.1093/bioinformatics/bts368
Abstract
Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment.
Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org.
Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Recent technological advances have significantly reduced the cost of DNA sequencing and have made it possible to sequence complete genomes on a large scale. Currently, a number of efforts aim to sequence and genotype large numbers of individual genomes (The 1000 Genomes Project Consortium, 2010). These studies have already revealed many novel single nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), small insertions and deletions (indels) and structural variants (SVs). To assess the functional impact of identified variants, a key objective is to determine whether those variants intersect with annotated elements. However, the intersection of variants with a gene annotation set is non-trivial (Balasubramanian et al., 2011). First, a variant may affect only a subset of the possible transcript isoforms of a given gene or it may have different effects on alternatively spliced transcripts. For example, a variant can affect the coding region of one transcript and overlap the canonical splice site of another. In addition, when neighboring SNPs (i.e. MNPs) lie within the same codon, both SNPs must be assessed simultaneously to evaluate the resultant codon change, as considering each independently could give rise to erroneous codon changes. Second, indels in coding regions can either preserve the frame or introduce frameshifts. They can also partially overlap coding exons, thereby impairing splice sites as well as coding regions. Assessing the functional impact in such cases is especially challenging. Lastly, large SVs can have drastic effects on the structure of a gene if exons are removed in whole or in part. As a result, it can be difficult to assess the overall functional impact of different types of variants on gene structures without visual representations (Supplementary Fig. 1).
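The codon and frame issues described above can be illustrated with a minimal Python sketch. It is not VAT code (VAT itself is implemented in C); the function names and the effect labels are hypothetical and chosen only for illustration. The sketch shows a case in which two SNPs in the same codon give a different result when evaluated jointly than when evaluated one at a time, and it classifies an indel by whether its net length change is a multiple of three.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def apply_snps(codon, snps):
    """Return the codon after applying every (offset, alt_base) substitution."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


def codon_effect(ref_codon, snps):
    """Classify the combined effect of one or more SNPs within a single codon.

    The labels returned here are illustrative, not VAT's own vocabulary.
    """
    alt_codon = apply_snps(ref_codon, snps)
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "premature_stop"
    elif ref_aa == "*":
        kind = "stop_lost"
    else:
        kind = "nonsynonymous"
    return kind, ref_aa, alt_aa


def indel_effect(ref_len, alt_len):
    """An indel preserves the frame only if its net length change is a multiple of 3."""
    return "frame-preserving" if (alt_len - ref_len) % 3 == 0 else "frameshift"


# Reference codon CAG (Gln) carrying two SNPs: C>T at offset 0 and G>C at offset 2.
print(codon_effect("CAG", [(0, "T")]))            # first SNP considered alone
print(codon_effect("CAG", [(2, "C")]))            # second SNP considered alone
print(codon_effect("CAG", [(0, "T"), (2, "C")]))  # both SNPs considered together
print(indel_effect(ref_len=1, alt_len=4))         # 3-bp insertion
```

In this example the first SNP alone would be reported as a premature stop (CAG to TAG), whereas the two SNPs together produce TAC, a single amino acid substitution. This is the kind of erroneous call that joint evaluation avoids.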
To address these challenges, we have developed the Variant Annotation Tool (VAT). Other tools have also been implemented to assess the functional impact of variants (Ng and Henikoff, 2003; Ramensky et al., 2002; Wang et al., 2010), but these tools are not cloud enabled.
Cloud computing provides immense storage capacity and scalable compute resources as well as the ability to share data and perform collaborative analyses. Given the increasing rate of data production, many foresee that sequencing reads will be stored on the cloud. Because software is most effective when it resides in the same space as the data on which it operates, the analysis pipelines processing these reads will need to migrate to the cloud as well. As VAT will constitute an integral part of such pipelines, it will need to reside on the cloud too.
We therefore provide VAT as a virtual machine (VM) that can be run within a cloud-computing environment (including that operated by Amazon) to take advantage of the scalability and storage capacity offered by this framework. The utility of VAT has been demonstrated by its extensive use in annotating the loss-of-function variants obtained as part of the 1000 Genomes Project (MacArthur et al., 2012).
2 FEATURES AND METHODS
VAT is implemented in C for efficiency, and consists of a number of modules to pre-process gene annotation sets, intersect variants from multiple individuals with both coding and non-coding genes, generate summary statistics across these individuals and at the single gene level and provide clear visualization summarizing the functional impact of the annotated variants. The overall workflow is depicted in Figure 1A.
Fig. 1. (A) VAT comprises a number of modules that relate variants to both protein-coding genes and non-coding elements. These modules use a set of variants and an annotation set as inputs to generate annotated VCFs. (B) Architecture of the VAT web application. The web application may be accessed through the browser or a JSON-based interface. The I/O layer of VAT takes advantage of the Amazon S3 service and stores all data in S3 buckets or, if S3 support is disabled, simply writes to a local disk. This architecture may also be easily scaled to use more sophisticated storage schemes, such as hashing across multiple input and output buckets. (C) The VAT EC2 cloud service is implemented in a service-oriented architecture consisting of a master node and a number of worker nodes. The master node hosts the user-facing interface and delegates tasks on behalf of the user to the worker nodes.
A number of modules in VAT relate variants to protein-coding genes (snpMapper, indelMapper and svMapper) and non-coding elements (genericMapper). These four core modules use an annotation set and a set of variants from multiple individuals as inputs. The variants are typically represented using the Variant Call Format (VCF; Danecek et al., 2011). A key feature of VAT is that the annotation is performed at the transcript level to determine whether all or only a subset of the transcript isoforms of a gene is affected. Therefore, the output explicitly shows which isoforms are affected by each variant and provides detailed information about the location of a given variant within a transcript as well as the variant's effect on its coding potential. In addition, a principal advantage of VAT lies in its ability to annotate MNPs. Moreover, VAT can be executed using gene annotation sets and genome builds beyond human, such as Arabidopsis thaliana.
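To make the idea of transcript-level annotation concrete, the following Python sketch classifies a single genomic position against each isoform of a gene separately. It is an illustration under stated assumptions, not the logic of snpMapper: the `Transcript` class, the function names, the coordinates and the two-base splice-site window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    exons: tuple  # sorted (start, end) pairs, 1-based inclusive coordinates


def classify_position(pos, transcript, splice_window=2):
    """Classify a 1-based genomic position relative to one isoform."""
    exons = transcript.exons
    for start, end in exons:
        if start <= pos <= end:
            return "exon"
    # Walk over the introns, i.e. the gaps between consecutive exons.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if prev_end < pos < next_start:
            # Assumption: the first and last `splice_window` intronic bases
            # are treated as the canonical splice sites.
            if pos - prev_end <= splice_window or next_start - pos <= splice_window:
                return "splice_site"
            return "intron"
    return "outside"


def annotate(pos, transcripts):
    """Report the effect of one variant position on every isoform."""
    return {t.name: classify_position(pos, t) for t in transcripts}


# Two hypothetical isoforms that differ in the start of their second exon.
isoform_1 = Transcript("isoform_1", ((100, 200), (300, 400)))
isoform_2 = Transcript("isoform_2", ((100, 200), (302, 400)))

print(annotate(301, [isoform_1, isoform_2]))
```

Here the same position falls inside an exon of one isoform and on a splice site of the other, which is why a per-gene answer would be ambiguous and a per-transcript answer is reported instead.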
VAT contains a number of utilities for the downstream analysis of annotated variants. For instance, an auxiliary module generates detailed summaries of annotated variants across multiple individuals as well as at the level of single genes. For variants intersecting protein-coding genes, VAT includes a module for generating an image for each gene to give a clear overview. This schematic representation displays the various transcript isoforms of a gene, which are superimposed with the annotated variants (Fig. 1A).
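The kind of summary described above can be sketched in a few lines of Python. This is only an illustration of tallying annotated variants per gene and per individual; the record layout, the gene and sample names and the effect labels are assumptions made for the example and do not reflect the output format of the VAT summary module.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Tally annotated variants by gene and by individual.

    Each record is a dict with hypothetical keys: 'gene', 'effect' and
    'genotypes' (sample name -> VCF-style genotype string such as '0/1').
    """
    per_gene = defaultdict(Counter)
    per_sample = defaultdict(Counter)
    for rec in records:
        per_gene[rec["gene"]][rec["effect"]] += 1
        for sample, genotype in rec["genotypes"].items():
            alleles = genotype.replace("|", "/").split("/")
            # A sample carries the variant if any called allele is non-reference.
            if any(allele not in ("0", ".") for allele in alleles):
                per_sample[sample][rec["effect"]] += 1
    return per_gene, per_sample


records = [
    {"gene": "GENE_A", "effect": "nonsynonymous",
     "genotypes": {"sample_1": "0/1", "sample_2": "0/0"}},
    {"gene": "GENE_A", "effect": "premature_stop",
     "genotypes": {"sample_1": "1|1", "sample_2": "./."}},
    {"gene": "GENE_B", "effect": "nonsynonymous",
     "genotypes": {"sample_1": "0/0", "sample_2": "0|1"}},
]

per_gene, per_sample = summarize(records)
for gene, counts in sorted(per_gene.items()):
    print(gene, dict(counts))
for sample, counts in sorted(per_sample.items()):
    print(sample, dict(counts))
```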
As shown in Figure 1B, VAT uses the Amazon Web Services cloud-computing platform. Each instance comprises a command-line executable of the VAT pipeline and a PHP web application, which serves as the user interface and driver for the pipeline. The VAT I/O abstraction layer may be customized using the configuration file to take advantage of Amazon's Simple Storage Service (S3). With S3 support enabled, VAT reads input from a bucket storing raw VCF files and stores output in another bucket. Otherwise, VAT reads and writes locally.
The VAT cloud service uses the Amazon Elastic Compute Cloud (EC2) platform, and it is implemented in a service-oriented architecture consisting of a master node and a number of worker nodes. Each node consists of a VAT installation running on an EC2 VM (Fig. 1C). The master node hosts user-facing web components and serves as a load balancer for the worker nodes. A user action is forwarded by the master node as a request to one of the worker nodes. Each worker node communicates with the S3 buckets and reports updates to the master node asynchronously. VAT also uses the Amazon EC2 API to allow the master node to dynamically create worker instances. Intensive batch requests may thus be parallelized and handled efficiently. The data accumulating in the S3 buckets are then available for further analyses.
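The master/worker pattern described above can be mimicked on a single machine with a thread pool. The Python sketch below is an analogy only: threads stand in for EC2 worker instances, a dictionary stands in for the S3 output bucket, and `annotate_file` is a hypothetical placeholder for running the VAT pipeline on one input. No Amazon API is called.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def annotate_file(vcf_name):
    """Hypothetical worker task standing in for one VAT pipeline run."""
    return vcf_name, vcf_name.replace(".vcf", ".annotated.vcf")


def run_batch(vcf_names, n_workers=3):
    """Master role: hand one task per input to the pool and collect the
    results asynchronously, in whatever order the workers finish."""
    output_bucket = {}  # stand-in for the S3 output bucket
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(annotate_file, name) for name in vcf_names]
        for future in as_completed(futures):
            source, result = future.result()
            output_bucket[source] = result
    return output_bucket


results = run_batch(["sample_1.vcf", "sample_2.vcf", "sample_3.vcf"])
for source in sorted(results):
    print(source, "->", results[source])
```

Because the master only submits tasks and gathers results, the number of workers can be changed without touching the task itself, which loosely mirrors the way the master node creates worker instances on demand.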
3 CONCLUSIONS
In summary, VAT offers a combination of unique advantages in variant annotation. First, VAT operates as a VM in a cloud-computing environment, which is likely to serve as the future framework for the collaborative analysis of rapidly growing datasets. Second, VAT provides a novel means of clearly visualizing the functional impact of variants across different transcript isoforms of a given gene. Third, VAT can be used to functionally annotate MNPs, which has often been challenging. Fourth, VAT provides output files in VCF format. This readily facilitates the post-processing of output files with tools that are already widely used by the community, such as those used for variant filtering. Given that VAT has been an integral part of many analyses conducted as part of the 1000 Genomes Project, we believe that it will be of broad utility in other research contexts.
† The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.
ACKNOWLEDGEMENT
We thank Raymond Auerbach for critical reading of this article.
Funding: National Institutes of Health.
Conflict of Interest: none declared.
REFERENCES
Balasubramanian et al. Gene inactivation and its implications for annotation in the era of personal genomics, Genes Dev., 2011, vol. 25 (pg. 1-10)
Danecek et al. The variant call format and VCFtools, Bioinformatics, 2011, vol. 27 (pg. 2156-2158)
MacArthur et al. A systematic survey of loss-of-function variants in human protein-coding genes, Science, 2012, vol. 335 (pg. 823-828)
Ng and Henikoff. SIFT: predicting amino acid changes that affect protein function, Nucleic Acids Res., 2003, vol. 31 (pg. 3812-3814)
Ramensky et al. Human non-synonymous SNPs: server and survey, Nucleic Acids Res., 2002, vol. 30 (pg. 3894-3900)
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing, Nature, 2010, vol. 467 (pg. 1061-1073)
Wang et al. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Res., 2010, vol. 38 pg. e164
Author notes
Associate Editor: Michael Brudno
© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.