HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants

1Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology and 2The Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA

*To whom correspondence should be addressed. Tel: +1 617 253 2419; Fax: +1 617 452 5034; Email: manoli@mit.edu

Revision received: 06 October 2011

Accepted: 08 October 2011

Published: 07 November 2011

Lucas D. Ward, Manolis Kellis, HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants, Nucleic Acids Research, Volume 40, Issue D1, 1 January 2012, Pages D930–D934, https://doi.org/10.1093/nar/gkr917

Abstract

The resolution of genome-wide association studies (GWAS) is limited by the linkage disequilibrium (LD) structure of the population being studied. Selecting the most likely causal variants within an LD block is relatively straightforward within coding sequence, but is more difficult when all variants are intergenic. Predicting functional non-coding sequence has been recently facilitated by the availability of conservation and epigenomic information. We present HaploReg, a tool for exploring annotations of the non-coding genome among the results of published GWAS or novel sets of variants. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state in nine cell types, conservation across mammals and their effect on regulatory motifs. Sets of SNPs, such as those resulting from GWAS, are analyzed for an enrichment of cell type-specific enhancers. HaploReg will be useful to researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. The HaploReg database is available at http://compbio.mit.edu/HaploReg.

INTRODUCTION

Genome-wide association studies (GWAS) are providing a flood of data associating genetic variants with common phenotypes (1). A confounding factor in such studies is linkage disequilibrium (LD), which allows many variants at the same locus to be associated with a phenotype even if only one of them is causal. Within genes, prioritizing the likely causal variant is relatively straightforward; variants are easily annotated as synonymous, missense or nonsense, changing the consensus sequence at splice sites, or residing in introns or UTRs. Often, however, GWAS associations lie far from known genes or transcribed regions, presumably in distal tissue-specific enhancers. One of the most striking examples of such a finding is the gene desert at 8q24, within which are regions specifically and independently linked to prostate, breast, ovarian, colorectal and bladder cancer. These variants have been shown to correspond to cell-type-specific distal enhancers for the MYC oncogene (2,3). Recent systematic comparisons of expression quantitative trait loci (eQTL) and GWAS suggest that the association of intergenic variants with complex phenotypes is a result of alteration of gene expression regulatory elements (4,5).

Ernst and colleagues (6) recently developed a map of chromatin states, including enhancers, promoters, insulators and heterochromatin, in nine human cell lines based on a variety of histone modifications. Using this map, it was demonstrated that these states can be used to prioritize SNPs within LD blocks associated with disease, and in some cases reveal biologically plausible enrichments for cell type-specific enhancers. Here we present a tool, HaploReg, to systematically mine these chromatin state data, along with conservation data and regulatory motif alterations.

A wide range of resources exists to predict the functional consequences of variants and to navigate groups of linked variants using LD information. PolyPhen (7), SIFT (8) and SNPs3D (9) all make predictions of the impact of missense SNPs. Algorithms such as is-rSNP (10) and RAVEN (11) use regulatory motif changes to predict SNPs that may influence transcriptional regulation. SNPinfo (12) combines missense predictions with TRANSFAC PWM disruption predictions and conservation information across 17 vertebrates for HapMap Phase III SNPs. SNAP (13) provides LD calculations using 1000 Genomes Project pilot data with information about neighboring genes and array membership for proxy/tag SNP selection, but does not currently include indels. HaploReg improves on SNAP by providing LD calculation of 1000 Genomes Project pilot indels associated with query SNPs. In addition, the features of SNPinfo are improved upon by incorporating evolutionary constraint based on two algorithms (involving the sequences of at least 29 mammals) and considering a much larger library of PWMs.

The UCSC Genome Browser (14) and ENSEMBL Genome Browser (15,16) both allow genomic regions to be annotated with the results of cutting-edge genomic data, including chromatin state segmentations, ENCODE data, 1000 Genomes variants, evolutionary constraint, LD calculations and NHGRI catalog variants. However, the output of these browsers can be overwhelming, especially when one is interested only in a limited subset of loci (such as the variants linked to a GWAS hit). To this end, HaploReg combines the focus on haplotype blocks provided by tools such as SNAP and SNPinfo with the breadth of genomic annotation provided by the full-featured genome browsers.

METHODS

HaploReg consists of a PHP interface to a MySQL database. The initial database table was populated using genomic coordinates and sequences for 16 151 841 biallelic SNPs and small indels from the pilot release of the 1000 Genomes Project (17). In some cases, such as novel indels, the variant call format (VCF) file from the pilot release did not have a RefSNP identifier (rsid); for the purpose of creating a unique identifier for this database, these variants were assigned the label of ‘chromosome:position’ in hg18 coordinates. To provide backward compatibility with obsolete rsids, dbSNP release 132 was checked for variants at the same position as 1000 Genomes pilot variants with multiple rsids (18). In addition, annotations of functional consequences were extracted from dbSNP.
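
The fallback identifier described above can be sketched in a few lines. This is a minimal illustration, not the code used to build the database: the function name `variant_label` is hypothetical, the exact string layout of the label (for example, whether the chromosome carries a `chr` prefix) is an assumption, and the only external convention relied on is that VCF files mark a missing ID with a single dot.

```python
def variant_label(chrom, pos, rsid=None):
    """Return the rsid when one exists, otherwise a 'chromosome:position' label.

    Hypothetical helper: the label layout is an assumption; the source only
    states that such variants are labelled 'chromosome:position' in hg18.
    """
    # VCF uses "." in the ID column when no identifier is available.
    if rsid and rsid != ".":
        return rsid
    return f"{chrom}:{pos}"


# The coordinates below are made up for illustration.
print(variant_label("chr6", 123456, "rs9271055"))
print(variant_label("chr6", 123456, "."))
```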

A variety of functional annotations were then intersected with the set of variants using the BEDTools package (19), including the chromatin state segmentation of Ernst et al. (6), and conserved regions by GERP (20) and SiPhy (21,22). To obtain gene annotations, RefSeq genes (23) were downloaded from the UCSC Genome Browser and GENCODE version 7 (24) was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene by either annotation, as well as the orientation (3′ or 5′) relative to the nearest end of the gene, based on the strand of the gene.
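
The strand-aware orientation can be illustrated with a short sketch. It does not reproduce the BEDTools commands, which are not given here; the function name `gene_proximity` and the 1-based inclusive coordinate convention are assumptions made for the example.

```python
def gene_proximity(pos, gene_start, gene_end, strand):
    """Distance from a variant to the nearest gene end, with 5'/3' orientation.

    Assumes 1-based inclusive coordinates with gene_start <= gene_end on the
    reference strand, and strand given as "+" or "-".
    """
    if gene_start <= pos <= gene_end:
        return 0, "within"
    if pos < gene_start:
        # Left of the gene on the reference: upstream of a "+" gene,
        # downstream of a "-" gene.
        return gene_start - pos, ("5'" if strand == "+" else "3'")
    # Right of the gene on the reference: the mirror image of the case above.
    return pos - gene_end, ("3'" if strand == "+" else "5'")


print(gene_proximity(900, 1000, 2000, "+"))
print(gene_proximity(900, 1000, 2000, "-"))
```

The same variant therefore receives opposite orientations for genes on opposite strands, which is why the strand has to be consulted rather than the reference coordinate alone.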

In order to annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and was scored on genomic sequences as described previously (6). Briefly, a set of PWMs was collected from TRANSFAC (25), JASPAR (26), and protein-binding microarray (PBM) experiments (27–29). The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg18 sequence obtained from the UCSC Genome Browser (30). PWMs were then scored for instances that passed either of two thresholds, a stringent threshold of P < 4⁻⁸ and a less-stringent threshold of P < 4⁻⁷ (31). Only instances where a motif in the sequence (i) passed the stringent threshold of a PWM in either the reference or the alternate genomic sequence, and (ii) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated. In cases where the weaker match did not pass the less-stringent threshold, an approximate minimum change of LOD score was reported, corresponding to the difference between the score of the stronger match and the score required to pass the less-stringent threshold. In cases where both allelic variants surpassed the less-stringent threshold, the exact difference in score was reported.
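
The reporting rule for exact versus approximate LOD changes can be sketched as follows. This is an illustration under several simplifying assumptions, not the pipeline itself: only SNPs on the forward strand are handled, the two P-value thresholds are replaced by hypothetical score cutoffs passed in as `stringent` and `lenient` (converting a P-value to a score cutoff requires the PWM score distribution, which is not computed here), and the toy matrix built by `toy_pwm` is invented rather than taken from TRANSFAC, JASPAR or PBM data.

```python
import math


def toy_pwm(consensus, match=0.85):
    """Build an invented log-odds PWM favouring `consensus` (uniform background)."""
    other = (1.0 - match) / 3
    return [
        {b: math.log2((match if b == c else other) / 0.25) for b in "ACGT"}
        for c in consensus
    ]


def score(pwm, seq):
    """Sum of per-position log-odds for a sequence as long as the PWM."""
    return sum(col[base] for col, base in zip(pwm, seq))


def motif_change(pwm, ref_seq, alt_seq, var_index, stringent, lenient):
    """Report LOD changes for motif instances overlapping a SNP.

    Returns (window_start, lod_change, is_exact) tuples. ref_seq and alt_seq
    must have equal length and differ only at var_index.
    """
    width = len(pwm)
    first = max(0, var_index - width + 1)
    last = min(var_index, len(ref_seq) - width)
    results = []
    for start in range(first, last + 1):
        ref = score(pwm, ref_seq[start:start + width])
        alt = score(pwm, alt_seq[start:start + width])
        strong, weak = max(ref, alt), min(ref, alt)
        if strong < stringent:
            continue  # neither allele is a confident motif instance
        if weak >= lenient:
            results.append((start, alt - ref, True))
        else:
            # Weaker allele falls below the lenient cutoff: report only the
            # minimum change, signed by which allele is stronger.
            sign = 1 if alt > ref else -1
            results.append((start, sign * (strong - lenient), False))
    return results


pwm = toy_pwm("GGAA")
print(motif_change(pwm, "TTGGACTT", "TTGGAATT", 5, stringent=6.0, lenient=4.0))
```

A positive change means the alternate allele strengthens the predicted match, and `is_exact` distinguishes an exact difference from a lower bound.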

GWAS results were obtained from the table curated by NHGRI (32) (accessed June 29, 2011). In cases where multiple studies were annotated as pertaining to the same phenotype, unique independent SNPs were consolidated into a single list.
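
As a rough illustration of this consolidation, the sketch below merges (phenotype, SNP) pairs from several studies into one de-duplicated list per phenotype. The function name `consolidate_by_trait` and the input layout are hypothetical, and the example only removes repeated identifiers; any additional criterion used to decide that SNPs are independent is not modelled.

```python
from collections import defaultdict


def consolidate_by_trait(rows):
    """Merge (trait, rsid) pairs into one list of unique rsids per trait.

    Order of first appearance is preserved. De-duplication by identifier is
    an assumption about what "unique" means for this illustration.
    """
    by_trait = defaultdict(list)
    for trait, rsid in rows:
        if rsid not in by_trait[trait]:
            by_trait[trait].append(rsid)
    return dict(by_trait)


# Made-up rows standing in for catalog entries from two studies.
rows = [
    ("trait A", "rs1"),
    ("trait A", "rs2"),
    ("trait A", "rs1"),
    ("trait B", "rs3"),
]
print(consolidate_by_trait(rows))
```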

LD was calculated using the phased genotype information accompanying the 1000 Genomes Project pilot release (17). VCFTools (33) was used to perform the calculation, using an LD threshold of r² = 0.80, and a maximum distance between variants of 200 kb. Results from VCFTools were then consolidated such that for every variant in our database, a list of linked variants is accessible for each of the three populations, along with an r² value.
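
For readers unfamiliar with the statistic, the sketch below computes r² between two biallelic variants from phased haplotypes using the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), with D = p_AB − p_A p_B, and then applies the distance and r² cutoffs stated above. It is an illustration of the quantity, not a substitute for VCFTools; the function names and the dictionary layout of a variant are hypothetical.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic sites from phased haplotypes (0/1 alleles)."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # at least one site is monomorphic in this sample
    d = p_ab - p_a * p_b
    return d * d / denom


def linked_variants(query, others, min_r2=0.8, max_distance=200_000):
    """Return (id, r2) for variants within max_distance bp and r2 >= min_r2."""
    linked = []
    for other in others:
        if abs(other["pos"] - query["pos"]) > max_distance:
            continue
        r2 = r_squared(query["haps"], other["haps"])
        if r2 is not None and r2 >= min_r2:
            linked.append((other["id"], r2))
    return linked


# Four made-up haplotypes per variant.
query = {"id": "q", "pos": 1000, "haps": [1, 1, 0, 0]}
others = [
    {"id": "a", "pos": 5000, "haps": [1, 1, 0, 0]},
    {"id": "b", "pos": 6000, "haps": [1, 0, 1, 0]},
    {"id": "c", "pos": 500000, "haps": [1, 1, 0, 0]},
]
print(linked_variants(query, others))
```

Running this once per population, on that population's haplotypes only, gives the per-population lists described above.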

To perform enhancer enrichment analysis on sets of variants, tables of common array designs were obtained from the UCSC Table Browser (34) and lists were constructed of 1000 Genomes SNPs segregating in each of the three pilot populations, as well as all SNPs in the database. Then, a background frequency of coverage was calculated for variants annotated as overlapping a strong enhancer state in each cell type. When a user submits a query list of variants, the coverage of strong enhancers in each cell type is calculated. If the coverage exceeds that of the background set selected by the user, a binomial test is performed, and enrichment is reported if it passes an uncorrected significance threshold of 0.05.
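
The enrichment decision can be sketched with the standard library alone. The example below assumes a one-sided upper-tail binomial test, which is one reasonable reading of the procedure but is not stated explicitly; the function names and the return layout are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enhancer_enrichment(n_query, n_in_enhancer, background_freq, alpha=0.05):
    """Return the P-value if the query set is enriched, otherwise None.

    The test is run only when the observed coverage exceeds the background
    frequency; alpha is applied without multiple-testing correction.
    """
    if n_query == 0 or n_in_enhancer / n_query <= background_freq:
        return None
    p_value = binomial_tail(n_in_enhancer, n_query, background_freq)
    return p_value if p_value < alpha else None


# Made-up numbers: 8 of 20 query variants in strong enhancers of one cell
# type, against a background coverage of 10%.
print(enhancer_enrichment(20, 8, 0.10))
```

Because one such test is run per cell type and the threshold is uncorrected, a reported enrichment is best treated as a prompt for follow-up rather than as a corrected significance call.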

USAGE

A user may submit queries in two formats: a comma-delimited list of rsids, or one of the GWAS or traits from the NHGRI catalog. To illustrate (Figure 1), we select the lupus study by Han et al. (35). Since the study was conducted in Han Chinese, we select ASN (CHB + JPT) as the population for LD calculation, and we select all SNPs in the ASN population as the background for enhancer enrichment analysis. As was reported by Ernst et al. (6), there is a strong enrichment for GM12878 (lymphoblastoid) enhancers. To demonstrate LD blocks, we select an LD threshold of r² = 0.95. In the LD block with lead SNP rs9271100, there is a SNP rs9271055 which affects an Ets-family binding site. Clicking on rs9271055 leads to a detail view (Figure 2) in which the complete chromatin state data are available. The positions in two literature motifs for Ets-family proteins can be seen, where the alternate T allele strengthens the predicted affinity relative to the reference G allele. In addition, links to NCBI RefSeq and ENSEMBL pages detailing the neighboring HLA-DRB1 gene are provided.

Figure 1. HaploReg view of the SNPs from the lupus GWAS by Han et al.

Figure 2. HaploReg detail view for the SNP rs9271055.

FUNDING

National Institutes of Health (R01-HG004037, RC1-HG005334); National Science Foundation (0644282). Funding for open access charge: National Institutes of Health.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Pouya Kheradpour for valuable assistance with PWM curation and scoring, and other members of the Kellis lab for helpful discussions.

REFERENCES

1. Genome-wide association studies for complex traits: consensus, uncertainty and challenges, Nat. Rev. Genet., 2008, vol. 9 (pg. 356-369)

2. et al. Multiple loci with different cancer specificities within the 8q24 gene desert, J. Natl Cancer Inst., 2008, vol. 100 (pg. 962-966)

3. An 8q24 gene desert variant associated with prostate cancer risk confers differential in vivo activity to a MYC enhancer, Genome Res., 2010, vol. 20 (pg. 1191-1197)

4. Chemotherapeutic drug susceptibility associated SNPs are enriched in expression quantitative trait loci, Proc. Natl Acad. Sci. USA, 2010, vol. 107 (pg. 9287-9292)

5. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS, PLoS Genet., 2010, vol. 6 pg. e1000888

6. et al. Mapping and analysis of chromatin state dynamics in nine human cell types, Nature, 2011, vol. 473 (pg. 43-49)

7. A method and server for predicting damaging missense mutations, Nat. Methods, 2010, vol. 7 (pg. 248-249)

8. Predicting deleterious amino acid substitutions, Genome Res., 2001, vol. 11 (pg. 863-874)

9. SNPs3D: candidate gene and SNP selection for association studies, BMC Bioinformatics, 2006, vol. 7 pg. 166

10. is-rSNP: a novel technique for in silico regulatory SNP detection, Bioinformatics, 2010, vol. 26 (pg. i524-i530)

11. In silico detection of sequence variations modifying transcriptional regulation, PLoS Comput. Biol., 2008, vol. 4 pg. e5

12. SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies, Nucleic Acids Res., 2009, vol. 37 (pg. W600-W605)

13. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap, Bioinformatics, 2008, vol. 24 (pg. 2938-2939)

14. The human genome browser at UCSC, Genome Res., 2002, vol. 12 (pg. 996-1006)

15. et al. Ensembl 2011, Nucleic Acids Res., 2011, vol. 39 (pg. D800-D806)

16. et al. Ensembl variation resources, BMC Genomics, 2010, vol. 11 pg. 293

17. The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing, Nature, 2010, vol. 467 (pg. 1061-1073)

18. dbSNP: the NCBI database of genetic variation, Nucleic Acids Res., 2001, vol. 29 (pg. 308-311)

19. BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, 2010, vol. 26 (pg. 841-842)

20. Identifying a high fraction of the human genome to be under selective constraint using GERP++, PLoS Comput. Biol., 2010, vol. 6 pg. e1001025

21. Identifying novel constrained elements by exploiting biased substitution patterns, Bioinformatics, 2009, vol. 25 (pg. i54-i62)

22. et al. A high-resolution map of human evolutionary constraint using 29 mammals, Nature, 2011 (epub ahead of print)

23. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins, Nucleic Acids Res., 2007, vol. 35 (pg. D61-D65)

24. et al. GENCODE: producing a reference annotation for ENCODE, Genome Biol., 2006, vol. 7 Suppl. 1 (pg. S4.1-9)

25. et al. TRANSFAC: transcriptional regulation, from patterns to profiles, Nucleic Acids Res., 2003, vol. 31 (pg. 374-378)

26. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles, Nucleic Acids Res., 2010, vol. 38 (pg. D105-D110)

27. et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences, Cell, 2008, vol. 133 (pg. 1266-1276)

28. et al. Diversity and complexity in DNA recognition by transcription factors, Science, 2009, vol. 324 (pg. 1720-1723)

29. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities, Nat. Biotechnol., 2006, vol. 24 (pg. 1429-1435)

30. et al. Initial sequencing and analysis of the human genome, Nature, 2001, vol. 409 (pg. 860-921)

31. Efficient and accurate P-value computation for position weight matrices, Algorithms Mol. Biol., 2007, vol. 2 pg. 15

32. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits, Proc. Natl Acad. Sci. USA, 2009, vol. 106 (pg. 9362-9367)

33. et al. The variant call format and VCFtools, Bioinformatics, 2011, vol. 27 (pg. 2156-2158)

34. The UCSC Table Browser data retrieval tool, Nucleic Acids Res., 2004, vol. 32 (pg. D493-D496)

35. et al. Genome-wide association study in a Chinese Han population identifies nine new susceptibility loci for systemic lupus erythematosus, Nat. Genet., 2009, vol. 41 (pg. 1234-1237)

© The Author(s) 2011. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
