Dissecting the regulatory architecture of gene expression QTLs

Daniel J Gaffney et al. Genome Biol. 2012.

Abstract

Background: Expression quantitative trait loci (eQTLs) are likely to play an important role in the genetics of complex traits; however, their functional basis remains poorly understood. Using the HapMap lymphoblastoid cell lines, we combine 1000 Genomes genotypes and an extensive catalogue of human functional elements to investigate the biological mechanisms that eQTLs perturb.

Results: We use a Bayesian hierarchical model to estimate the enrichment of eQTLs in a wide variety of regulatory annotations. We find that approximately 40% of eQTLs occur in open chromatin, and that they are particularly enriched in transcription factor binding sites, suggesting that many directly impact protein-DNA interactions. Analysis of core promoter regions shows that eQTLs also frequently disrupt some known core promoter motifs but, surprisingly, are not enriched in other well-known motifs such as the TATA box. We also show that information from regulatory annotations alone, when weighted by the hierarchical model, can provide a meaningful ranking of the SNPs that are most likely to drive gene expression variation.

Conclusions: Our study demonstrates how regulatory annotation and the association signal derived from eQTL-mapping can be combined into a single framework. We used this approach to further our understanding of the biology that drives human gene expression variation, and of the putatively causal SNPs that underlie it.

Figures

Figure 1

A schematic outline of the hierarchical model. (a) Two SNPs that are significantly associated with expression level at the adjacent gene (in our method, association is measured using Bayes factors). (b) SNP 1 is located in regulatory annotations I, II and III, while SNP 2 is located in regulatory annotation I only. The numbers at the ends of the annotation bars illustrate the fold enrichment of eQTNs in each annotation: these are the exponential of the λ_l parameters of the hierarchical model. In practice, enrichment levels are estimated using all the genes simultaneously via a hierarchical model. These are combined in a logistic model to estimate the prior probability that any given SNP is an eQTN. (c) The hierarchical model assigns a posterior probability that each SNP is an eQTN, combining information from (a, b). Thus, even though the level of association with gene expression is similar for SNPs 1 and 2, more of the posterior probability is assigned to SNP 1.
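The combination of association evidence and annotation-based priors described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exactly one causal SNP per gene, assumes the prior is obtained by normalising the exponentiated annotation scores across the SNPs of that gene, and uses made-up fold enrichments and Bayes factors. The function name `eqtn_posteriors` and all parameter names are hypothetical.

```python
import math

def eqtn_posteriors(bayes_factors, annotations, lambdas):
    """Prior and posterior probability that each SNP is the eQTN for one gene.

    bayes_factors: one Bayes factor per SNP (association evidence).
    annotations:   per SNP, the set of annotation names it overlaps.
    lambdas:       annotation name -> log fold enrichment (lambda_l).
    Assumption: exactly one causal SNP per gene (a simplification).
    """
    # Log-linear score: sum of lambda_l over the annotations covering the SNP.
    scores = [sum(lambdas[a] for a in annots) for annots in annotations]
    # Normalise across SNPs to get prior probabilities (numerically stable).
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    priors = [w / total for w in weights]
    # The posterior is proportional to prior times Bayes factor.
    unnorm = [p * bf for p, bf in zip(priors, bayes_factors)]
    norm = sum(unnorm)
    return priors, [u / norm for u in unnorm]

# Hypothetical fold enrichments of 2x, 3x and 1.5x for annotations I, II, III.
lambdas = {"I": math.log(2.0), "II": math.log(3.0), "III": math.log(1.5)}
priors, posteriors = eqtn_posteriors(
    bayes_factors=[100.0, 100.0],            # equal association evidence
    annotations=[{"I", "II", "III"}, {"I"}],  # SNP 1 and SNP 2
    lambdas=lambdas,
)
```

With equal Bayes factors, the posterior split follows the priors alone, so SNP 1 receives more of the posterior probability, as in panel (c).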

Figure 2

Estimated fold enrichment of eQTNs in putative regions of active chromatin. The plots show the enrichment of eQTNs within DNaseI hypersensitive peaks, or within regions marked by a number of histone modifications. (a) All locations within the cis region around each gene. Error bars show 95% confidence intervals. Arrows indicate that the confidence interval extends beyond the left end of the x-axis. (b) Open chromatin 5 to 100 kb upstream of the gene TSS. (c) Estimated probability that a random SNP is an eQTN as a function of distance from the TSS (grey bars) or the conditional probability of a random SNP being an eQTN, given that it lies outside or inside a DNaseI hypersensitive site, or within a DNaseI site overlapped by two or more histone marks.

Figure 3

eQTN enrichments in regulatory elements directly related to transcription factor binding. (a, b) eQTN enrichments in regulatory elements directly related to transcription factor binding as annotated by ChIP-seq (a) or DNase-seq footprinting (b). Of the 26 clusters of DNase-seq footprints tested, 15 had confidence intervals spanning the range (-∞, > 0) and are not shown (Figure S6 in Additional file 1). Error bars show 95% confidence intervals.

Figure 4

eQTN enrichment in regulatory elements of the core promoter. (a) The fold enrichments of eQTNs in a variety of predicted regulatory elements based on published methods, sequence motifs and evolutionary conservation. See main text for further details. Only SNPs occurring within 50 bp of the TSS were considered. The confidence intervals for the estimates of enrichment in other core motifs (TATA, SP1, Initiator (Inr) and the TFIIB recognition element (BRE)) were (-∞, > 0) and are not shown. (b) The QQ-plots of expected versus observed quantiles of the -log10(P-value) for SNPs located in several known core promoter motifs, including the TATA box, the SP1 binding site (or GC-box), the Inr element, the BRE and the downstream promoter element (DPE), as well as in 1,000 6-mer sequences that are highly overrepresented in core promoters.
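For readers unfamiliar with QQ-plots of the kind shown in panel (b), the following sketch computes the plotted points from a list of P-values. It assumes that the expected quantiles come from a uniform null distribution with plotting positions i/(n+1); the caption does not state which convention the authors used, and the function name `qq_points` is hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(P) for a QQ-plot.

    Assumption: under the null, P-values are uniform on (0, 1), so the
    i-th smallest of n P-values is expected to be about i / (n + 1).
    """
    n = len(p_values)
    # Smallest P-value gives the largest -log10 value, so sort descending.
    observed = sorted((-math.log10(p) for p in p_values), reverse=True)
    expected = [-math.log10((i + 1) / (n + 1)) for i in range(n)]
    return list(zip(expected, observed))

# Made-up P-values for three SNPs in one motif class.
points = qq_points([0.5, 0.01, 0.2])
```

Points lying above the diagonal (observed greater than expected) indicate an excess of small P-values relative to the null.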

Figure 5

eQTN enrichments in all functional annotations included in the combined model, ordered by annotation type. Error bars show 95% confidence intervals. Arrows indicate that the confidence interval extends beyond the left end of the x-axis.

Figure 6

Examples of two high posterior eQTNs, rs473407 and rs28362527, in two genes. The first row shows the Bayes factors for each SNP located within a 35-kb window either side of the gene. The second row shows the position, and marginal enrichment level, of some of the regulatory annotations we analyzed here. The positions of the highest posterior SNPs in rows one and two are marked by yellow boxes. Row three shows independent data, not used by the hierarchical model, on the level of NF-κB binding in a 600-bp window centered on each of the two SNPs marked above in yellow, where the ChIP-seq profiles are grouped according to the genotypes at those SNPs. Data on between-individual variation in NF-κB binding were from [63]. TES, transcription end site.

Figure 7

Prior rankings of SNPs for 100 genes where a single SNP is a clear best candidate for being the 'true' eQTN using the prior probability from the hierarchical model. The histogram shows the percentage of genes for which the putative causal site is ranked by the prior among the top 1 to 15 SNPs, 15 to 30 SNPs, and so on. Typically, the candidate region for each gene contains approximately 1,200 SNPs. The two prior models correspond to the distance model only (blue) and the distance model plus experimental annotations (red). The 100 genes analyzed here were excluded from all other analyses. BG, background.
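The ranking summarised in this histogram can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper names `rank_of_candidate` and `rank_histogram` are hypothetical, ties are resolved in favour of the candidate, and the bin width of 15 is taken from the caption.

```python
from collections import Counter

def rank_of_candidate(priors, candidate_index):
    """1-based rank of the candidate SNP when SNPs are sorted by decreasing prior.

    Ties are resolved optimistically: only strictly larger priors rank ahead.
    """
    target = priors[candidate_index]
    return 1 + sum(1 for p in priors if p > target)

def rank_histogram(ranks, bin_width=15):
    """Percentage of genes whose candidate rank falls in each rank bin."""
    counts = Counter((r - 1) // bin_width for r in ranks)
    return {
        (b * bin_width + 1, (b + 1) * bin_width): 100.0 * c / len(ranks)
        for b, c in sorted(counts.items())
    }

# Made-up priors for four SNPs; the candidate is the SNP at index 2.
example_rank = rank_of_candidate([0.1, 0.5, 0.2, 0.2], 2)
# Made-up candidate ranks for four genes.
example_hist = rank_histogram([1, 15, 16, 40])
```

A prior that is informative places the candidate SNP in the first bins far more often than would be expected by chance among roughly 1,200 SNPs per gene.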
