GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation (original) (raw)

Abstract

Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are biologically more relevant to bladder cancer. The R implementation of GPA is currently available at http://dongjunchung.github.io/GPA/.

Author Summary

In the past 10 years, many genome wide association studies (GWAS) have been conducted to identify the genetic bases of complex human traits. As of January, 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait/disease. On one hand, about 85% of identified risk variants are located in non-coding regions, which motivates a systematic understanding of the function of non-coding variants in regulatory elements in the human genome. On the other hand, complex diseases are often affected by many genetic variants with small or moderate effects. To address these issues, we propose a statistical approach, GPA, to integrating information from multiple GWAS datasets and functional annotation. Notably, our approach only requires marker-wise _p_-values as input, making it especially useful when only summary statistics, instead of the full genotype and phenotype data, are available. We applied GPA to analyze GWAS datasets of five psychiatric disorders and bladder cancer, where the central nervous system genes, eQTLs from the Genotype-Tissue Expression (GTEx), and the ENCODE DNase-seq data from 125 cell lines were used as functional annotation. The analysis results suggest that GPA is an effective method for integrative data analysis in the post-GWAS era.

Introduction

Hundreds of genome-wide association studies (GWAS) have been conducted to study the genetic bases of complex human traits. As of January, 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait (see the web resource of GWAS catalog [1] http://www.genome.gov/gwastudies/). Despite of these successes, these significantly associated SNPs can only explain a small portion of genetic contributions to complex traits/diseases [2]. For example, human height is a highly heritable trait whose heritability is estimated to be around 80%, i.e., 80% of variation in height within the same population can be attributed to genetic effects [3]. Based on large-scale GWAS, about 180 SNPs have been reported to be significantly associated with human height [4]. However, these loci together only explain about 5-10% of variation in height [2], [4], [5]. This phenomenon is referred to as the “missing heritability” [2], [6], [7].

Identifying the source of this missing heritability has drawn much attention from researchers, and progress has been made towards explaining the apparent discrepancy. The role of a much greater-than-expected set of common variants (minor allele frequency (MAF)Inline graphic0.01) has been shown to be critical in explaining the phenotypic variance [8]. Instead of only using genome-wide significant SNPs, Yang et al. [9] reported that, by using all genotyped common SNPs, 45% of the variance for human height can be explained. This result suggests that a large proportion of the heritability is not actually missing: given the limited sample size, many individual effects of genetic markers are too weak to pass the genome-wide significance, and thus those variants remain undiscovered. So far, people have found similar genetic architectures for many other complex traits [10], such as metabolic syndrome traits [11] and psychiatric disorders [12][14]. That is, the phenotype is affected by many genetic variants with small or modest effects effects that cannot be confirmed individually via statistical significance, which is usually referred to as “polygenicity”. The polygenicity of complex traits is further supported by recent GWAS with larger sample sizes, in which more associated common SNPs with moderate effects have been identified (e.g., [15]). Clearly, the emerging polygenic genetic architecture imposes a great challenge of identifying risk genetic variants: a larger sample size is required to identify genetic variants with smaller effect sizes. However, sample recruitment may be expensive and time-consuming. It would be desirable to find a way to increase power to detect variants that miss significance on standard GWAS without extensive additional subject recruitment requirements. Integrative analysis of genomic data could be a promising direction, including combining GWAS data of multiple genetically related phenotypes and incorporating relevant biological information.

The last few years have seen concrete demonstrations of “pleiotropy”, i.e. the sharing of genetic factors, between human complex traits. For example, a systematic analysis of the NHGRI catalog of published GWAS (http://www.genome.gov/gwastudies/) showed that 16.9% of the reported genes and 4.6% of the reported SNPs are associated with multiple traits [16]. Through a “pleiotropic enrichment” method, Andreassen et al showed that it is possible to improve the power to detect schizophrenia-associated genetic variants by using the pleiotropy between schizophrenia (SCZ) and cardiovascular-disease [17]. A more recent study identified four significant loci (_p_-value Inline graphic) to have pleiotropic effects by analyzing GWAS data of 33,332 cases and 27,888 controls for five psychiatric disorders [18]. Further analysis suggested a very significant genetic correlation between schizophrenia and bipolar disorder (Inline graphic s.e.) [12]. Pleiotropy has also been demonstrated among several other types of traits, for example, metabolic syndrome traits [11] and cancers [19].

An increasing number of studies also suggest that functionally annotated SNPs are generally more biologically important than those that are not annotated, and henceforth more likely to be associated with complex traits. To name a few, using GWAS data of different traits (e.g., Crohn's disease and SCZ), Schork et al. [20] demonstrated a consistent pattern of enrichment of GWAS signals among functionally annotated SNPs. Yang et al. [21] showed that SNPs in genic regions could explain more variance of height and body mass index (BMI) than SNPs in intergenic regions. Nicolae et al. [22] found that SNPs associated with complex traits were more likely to be expression quantitative trait loci (eQTL). In addition, public availability of a vast amount of functional annotation data also provides unprecedented opportunities to investigate the enrichment of GWAS signals among these various types of functional annotations. For example, the Encyclopedia of DNA Elements (ENCODE) Consortium has recently generated extensive experimental data on gene expression (RNA-seq), DNA methylation status (RRBS-seq), chromatin modifications (ChIP-seq), chromatin accessibility (DNase-seq and FAIRE-seq), transcription factor (TF) binding sites (ChIP-seq), and long-range chromatin interactions (ChIA-PET, Hi-C, and 5C). As of September 2012, more than 1,600 data sets from 147 cell lines had been produced to annotate the human genome, including 2.89 million unique, non-overlapping DNase I hypersensitivity sites (DHSs) in 125 cell lines using DNase-seq and 630K binding regions of 119 DNA-binding proteins in 72 cell lines using ChIP-seq, among many [23]. The ENCODE Consortium [23] examined 4,492 risk-associated SNPs from the NHGRI GWAS catalog and found that 12% of them overlap with TF binding regions and 34% overlap with DHSs.

The increasing evidence of pleiotropy between complex traits and the increasing functional annotation data call for novel statistical methods to effectively analyze multiple GWAS data sets and functional annotation data simultaneously. Statistical methods to investigate pleiotropy have been actively researched (reviewed in [24] and [25]), for example, using linear mixed models [26], [27] or the conditional False Discovery Rate (FDR) approach [17], [28]. However, these methods do not allow use of functional annotation data for prioritization of GWAS results. On the other hand, various statistical methods have been proposed to make use of functionally annotated SNPs in recent years (reviewed in [29], [30], and [31]). For example, GSEA [32] identifies potentially important pathways in which target genes of risk-associated SNPs are involved while RegulomeDB [33] allows nucleotide-level annotations of risk-associated SNPs, especially for those located in non-coding regions. Stratified FDR methods have been applied to incorporate annotation into GWAS data analysis [20]. However, each of these methods was designed for the analysis of a single phenotype and hence, they do not use functional annotation data fully efficiently for the genetic variants shared by multiple phenotypes. There is a need to develop a coherent statistical framework for the integration of functional annotation data for joint analysis of genetically correlated GWAS information.

In this article, we propose a unified statistical framework, named GPA (Genetic analysis incorporating Pleiotropy and Annotation), to prioritize GWAS results based on pleiotropy and annotation information. GPA also provides statistically rigorous and biologically interpretable inference tools for this purpose. The method can easily be used for other purposes as well, as we will discuss below. This article is organized as follows. First, we investigate the properties of GPA using extensive simulation studies and illustrate the versatility and utility of GPA with the analysis of real data. Specifically, we apply GPA to the five psychiatric disorder GWAS data from the Psychiatric Genomics Consortium with central nervous system gene expression data and show that GPA can accurately identify pleiotropy structure among these diseases. We further apply GPA to the bladder cancer GWAS data with the ENCODE DNase-seq data from 125 cell lines and show that GPA can detect cell lines that are biologically more relevant to bladder cancer. Lastly, we discuss many issues related to GPA. The details of our GPA model and its statistical inference procedures are provided in the Materials and Methods section.

Results

Simulation study

We conducted comprehensive simulation studies to evaluate GPA performance. The _p_-values for non-risk SNPs can be simulated easily from a uniform distribution Inline graphic. For risk-SNPs, we can simulate their _p_-values via different approaches. The most favorable simulation for our GPA model is to simulate them from the Beta distribution. To examine the robustness of our GPA model, we adopted an alternative simulation scheme under the framework of the linear mixed model and liability threshold model that has gained increasing interest recently (e.g., [9], [13]). The detailed procedures will be described later. But we emphasize that there is substantial discrepancy between the generative model used in simulation and the GPA model. The primary purpose of our simulation study is to investigate whether the GPA model can robustly improve the power to detect risk SNPs by integrating multiple GWAS data sets and annotation data despite this discrepancy.

To simulate case-control GWAS data for two genetically correlated diseases, we followed the classical liability threshold model [13]. For each disease, we first simulated a large cohort of individuals with genotypes of Inline graphic independent SNPs. The MAFs of these SNPs were drawn uniformly from [0.05, 0.5]. Then we randomly designated Inline graphic SNPs as risk SNPs. The per-minor-allele effect of each risk SNP was drawn from a normal distribution with zero-mean and variance of Inline graphic, where Inline graphic is the desired level of variance explained by all SNPs on the liability scale and Inline graphic is the MAF of the corresponding risk SNP. We also simulated the environmental effect on the liability scale for each individual from a standard normal distribution (zero mean and unit variance). The total liability for each individual was then obtained by adding up all the genetic effects and the environmental effect. Given a desired disease prevalence Inline graphic, individuals with liabilities greater than the Inline graphic quantile were classified as cases and others were classified as controls. Then equal numbers of cases and controls were drawn from the cohort as a GWAS data set. When simulating two diseases simultaneously, we simulated two disjoint cohorts with the same set of SNPs. To reflect the pleiotropy effects between the two diseases, Inline graphic risk SNPs (Inline graphic) were chosen to be shared by the two diseases. The annotation status of each risk and non-risk SNP was simulated from a Bernoulli distribution with probability of Inline graphic and Inline graphic, respectively.

In our simulation study, the total number of SNPs, Inline graphic, was set to be 20,000, and the sample size of each data set, Inline graphic, was set at 2000, 5000 or 10000, respectively. The number of risk SNPs Inline graphic was the same for the two diseases and was set at 500, 1000 or 2000, respectively. We varied the proportion of shared risk SNPs between the two diseases, Inline graphic, from Inline graphic to 1. Note that Inline graphic corresponds to the absence of pleiotropy. The disease prevalence, Inline graphic, was fixed at 0.1 and the variance explained by all the SNPs, Inline graphic, was fixed at 0.6 for each of the two diseases. Here Inline graphic and Inline graphic were fixed at 0.4 and 0.1, respectively.

We first evaluated SNP prioritization performance of GPA. Specifically, after the two data sets were simulated, we obtained the _p_-value for each SNP in each disease using a Inline graphic test with one degree of freedom. Then we analyzed the simulated data using our GPA method in the following four modes: 1. analyzing the two diseases separately without the annotation data; 2. analyzing the two diseases separately with the annotation data; 3. analyzing the two diseases jointly without the annotation data; and 4. analyzing the two diseases jointly with the annotation data. In each mode, we compared the order of the local FDR obtained using GPA against the actual risk status of the SNPs to calculate the area under the receiver operating characteristic curve (AUC) as a measure of risk SNP prioritization accuracy. The left panel of Figure 1 shows the AUCs from the four modes with Inline graphic and Inline graphic (results for other scenarios are shown in Figures S10-S20 in Text S1). Because all the simulation parameters were the same for the two diseases, only the results for the first disease are shown. We can see that incorporating either annotation information or pleiotropy between the two diseases improved the prioritization performance. In particular, as the proportion of shared risk SNPs increased, the prioritization performance also improved. Given the local FDR obtained using GPA, we controlled the global FDR at 0.2 and calculated the average power to identify the true risk SNPs. GPA performance measured by partial AUC and power for Inline graphic and Inline graphic are shown in the middle and right panels of Figure 1, respectively (results for other scenarios are shown in Figures S10–S20 in Text S1). We also evaluated the actual FDR and found that the FDR was indeed controlled at 0.2 with occasional slight conservativeness (Figure 2 and Figures S21–S31 in Text S1). In addition, we evaluated the actual FDR in the presence of linkage disequilibrium (LD) among SNPs. The details of the simulations and results are given in Figure S1 in Text S1. These results suggest that GPA can improve the power of identifying risk variants while controling FDR at the nominal level despite the mismatch between GPA and the generative model in our simulation study.

Figure 1. AUC (left), partial AUC (Middle) and power (right) of GPA for SNP prioritization with sample size Inline graphic = 5000 and number of risk SNPs Inline graphic = 1000.

Figure 1

The results are based on 200 simulations.

Figure 2. Global false discovery rates of GPA at sample size Inline graphic = 5000 and number of risk SNPs Inline graphic = 1000.

Figure 2

Upper panel: Global false discovery rates of GPA with annotation. Lower panel: Global false discovery rates of GPA without annotation. From left to right: FDR of first GWAS (joint analysis), FDR of second GWAS (joint analysis), FDR of first GWAS (separate analysis), FDR of second GWAS (separate analysis) and FDR of risk variants shared by both GWAS. For all scenarios, the global false discovery rates of GPA are controlled at the nominal level.

Regarding parameter estimation, we found that GPA provided a satisfactory estimate of Inline graphic, the probability of being annotated for a certain group of SNPs, as long as there are enough SNPs in that group (Figures S32–S44 in Text S1). But we note that the estimates of the proportion of risk and non-risk SNPs, Inline graphic, may be biased (Figures S45–S66 in Text S1). This is no surprise because the distribution of the _p_-values of the risk SNPs obtained from the generative model in the simulation study may differ from the Beta distribution assumed in GPA. If the _p_-values of the risk SNPs are indeed generated from the Beta distribution, our GPA model can give fairly accurate estimates of Inline graphic (Figures S67-S72 in Text S1).

As a comparison, we also used the “conditional FDR” approach proposed by Andreassen et al. [17] to prioritize SNPs in our simulations. The comparison results between GPA and the conditional FDR at Inline graphic and Inline graphic are shown in Figure 3 (other results are provided in Figures S77–S87 in Text S1). GPA significantly outperformed the conditional FDR approach in SNP prioritization. More importantly, in the absence of pleiotropy, the conditional FDR approach had worse accuracy than single-GWAS analysis using the standard FDR approach, whereas GPA achieved comparable accuracy with single-GWAS analysis in this scenario. This suggests that GPA was able to take advantage of pleiotropy when it exists while, in clear contrast to the conditional FDR approach, it does not sacrifice much statistical power than the conditional FDR approach when it is absent.

Figure 3. Comparisons of receiver operating characteristic curves measured by AUCs (Left) and partial AUCs (Right) between GPA and the conditional FDR approach at sample size Inline graphic = 5000 and number of risk SNPs Inline graphic = 1000.

Figure 3

The results are based on 200 simulations.

Next, we evaluated the type I error and power of GPA for hypothesis testing on the significance of annotation enrichment for risk SNPs. Gene Set Enrichment Analysis (GSEA) [34] is a popular method to accomplish a similar task. Although GSEA typically is used for gene expression data analysis, its input can be a list of _p_-values obtained from any source. Therefore we implemented the GSEA method to test the enrichment of the Inline graphic-values of a set of SNPs being annotated and compared it with GPA. We followed the previous simulation scheme and simulated one GWAS data set with Inline graphic, Inline graphic varying from 2000 to 10000, and Inline graphic varying from 500 to 2000. Here Inline graphic was fixed at 0.1 and Inline graphic was varied from 0.1 to 0.5. We set the statistical significance level at 0.05. Type I error rate was evaluated at Inline graphic and power was evaluated at Inline graphic. The results for Inline graphic are shown in Figure 4. In general, GPA provided much higher power than GSEA while both methods appropriately controlled the type I error rate.

Figure 4. The comparison between GPA and GSEA at number of risk SNPs Inline graphic = 1000.

Figure 4

Here we fixed Inline graphic and varied Inline graphic to evaluate the power for sample size Inline graphic = 2000 (Upper Left panel), 5000 (Upper Right panel), 10000 (Lower Left panel), respectively. We used Inline graphic to evaluate the type I errors (Lower Right panel). The results are based on 500 simulations.

We further evaluated the type I error rate and power of GPA for the test of pleiotropy in our simulations. The simulation parameters were the same as those in the previous simulations. Power was evaluated at Inline graphic, Inline graphic, Inline graphic, and Inline graphic. The type I error rate was evaluated at Inline graphic, corresponding to the expected shared proportion of risk SNPs in the absence of pleiotropy. As shown in Figure 5, power increased as Inline graphic decreased and as Inline graphic and Inline graphic increased, whereas the type I error rate was appropriately controlled in all cases. Please note that type I errors and power remained almost the same for hypothesis testing of pleiotropy in the presence of annotation (see Figures S2–S4 in Text S1).

Figure 5. The type I error rate and power of the pleiotropy test. Here we varied Inline graphic to evaluate the power for sample size Inline graphic = 500 (Upper Left panel), 1000 (Upper Right panel), and 2000 (Lower Left panel), respectively.

Figure 5

We used Inline graphic to evaluate the type I errors of the pleiotropy test (Lower Right panel). In each setting, we also varied sample size Inline graphic = 1000, 2000, and 10000. Note that type I error rate and power of the pleiotropy test remain almost the same in presence of annotation (see Figure S9 in Text S1).

Lastly, we performed additional simulations with moderate heritability (Inline graphic) and pleiotropy (Inline graphic). The results shown in Figures S73–S76 in Text S1 demonstrate that, with moderate heritability and pleiotropy, GPA can still effectively improve the power by leveraging pleiotropy between related traits. Therefore, GPA can serve as a more powerful tool for integrative analysis in the post-GWAS era.

Real data analysis

GWAS of five psychiatric disorders

We applied our GPA model to the analysis of five psychiatric disorders, as has been done in previous publications using different methodologies [12], [18]. Traits included were attention deficit-hyperactivity disorder (ADHD), autism spectrum disorder (ASD), bipolar disorder (BPD), major depressive disorder (MDD), and schizophrenia (SCZ). Detailed information about these data sets is provided in [12], [18]. Summary statistics of the five psychiatric disorders were downloaded from the section for cross-disorder analysis at the Psychiatric Genomics Consortium (PGC) website. The _p_-values were available for 1,230,535 SNPs in ADHD, 1,245,864 SNPs in ASD, 1,233,533 SNPs in BPD, 1,232,794 SNPs in MDD, and 1,237,959 SNPs in SCZ, respectively. We took the intersection of those SNPs, resulting in a _p_-value matrix Inline graphic for the five psychiatric disorders.

First, we performed single-GWAS data analysis using genes preferentially expressed in the central nervous system (CNS) [13], [34] as the annotation data. Specifically, we generated the annotation vector Inline graphic as follows: The entries in Inline graphic corresponding to SNPs within 50-kb of the genes from the CNS set were set to be 1. Among all the SNPs, 21.9% were thus annotated to be CNS related. The analysis results of these five psychiatric disorders are given in Table 1. The estimated fold enrichment Inline graphic of the CNS set was 1.749 (s.e. 0.447), 1.261 (s.e. 0.055), 1.467 (s.e. 0.033), 1.177 (s.e. 0.058) and 1.391 (s.e. 0.022) for ADHD, ASD, BPD, MDD and SCZ, respectively. PGC investigators also evaluated enrichment of the CNS gene set by variance component estimation using linear mixed models (LMM) [12], suggesting about 1.6 and 1.5 fold enrichments in BPD and SCZ, respectively. Two explanations may contribute to these minor differences in fold enrichment estimation between GPA and LMM: First, GPA only used summary statistics while LMM used both phenotype and genotype data. Second, GPA and LMM employed different mathematical definitions of fold enrichment: GPA used the ratio between Inline graphic and Inline graphic, while LMM used the ratio between the proportion of the variance explained by SNPs in the CNS set and the proportion of the CNS set in entire genome. Furthermore, we evaluated the significance of enrichment of the CNS set by hypothesis testing. As shown in Table 1, enrichment of the CNS gene set was strong in BPD and SCZ, moderate in ASD and MDD, and nonsignificant in ADHD.

Table 1. Single-GWAS analysis of five psychiatric disorders using the CNS gene set as the annotation data.

Next, we applied GPA to study pairwise pleiotropy of these five psychiatric disorders without using annotation data. As shown in Table 2, our results suggest that the pleiotropy effect was strong between BPD and SCZ (_p_-value is essentially zero), MDD and SCZ (_p_-value Inline graphic), BDP and MDD (_p_-value Inline graphic), ASD and SCZ (_p_-value Inline graphic), and ASD-BPD (_p_-value Inline graphic); moderate between ADHD and BPD (_p_-value Inline graphic), ADHD and SCZ (Inline graphic); and non-significant for all other pairs. Our results largely agree with those reported in [12] and the disagreements mainly came from those between ADHD and other disorders. The pleiotropy between ADHD and MDD was reported to be moderate in [12], while GPA did not detect this moderate effect. From single GWAS analysis of ADHD, given in Table 1, the estimated parameters (Inline graphic(s.e. 0.006), Inline graphic) indicate that its GWAS signals were very weak. For MDD, the estimated parameter Inline graphic also indicates the weak marginal signals for MDD. Consequently, the marginal GWAS signals of ADHD and MDD were too weak to allow GPA to detect the pleiotropic effect between them. Since the data analysis performed in [12] used genotype data, the bivariate LMM could still have enough power to detect a moderate genetical correlation between ADHD and MDD.

Table 2. Pleiotropy estimated among five psychiatric disorders.

We further applied GPA to study all pairs of disorders using the CNS gene set as the annotation data. The estimated Inline graphic (Inline graphic) are given in Table 3 and Inline graphic remained almost the same as those estimated without the annotation data. The _p_-values of hypothesis test (14) are provided in the last column of Table 3. The _p_-value should be interpreted with caution: as shown in Table 1, the CNS gene set is enriched in all these disorders except ADHD. Hence, the significant _p_-values listed in Table 3 may be simply due to the significant enrichments for individual traits. On the other hand, the ratio between Inline graphic and Inline graphic could be more interesting. Take BPD-SCZ as an example. The ratio between Inline graphic and Inline graphic is 1.503 (s.e. 0.025), which suggests that enrichment of the CNS set for the BDP-SCZ shared risk variants was even stronger than that for BPD-only (1.467 (s.e. 0.033)) or SCZ-only (1.391 (s.e. 0.022)), although the differences are not statistically significant.

Table 3. GPA results for all pairs of the five psychiatric disorders with the CNS set as annotation.

We also compared the results given by four different analysis approaches: single-GWAS analysis with or without annotation, and two-GWAS joint analysis with or without annotation data. The Manhattan plots are shown in Figures 6 and 7. For single-GWAS analysis without annotation, GPA identified 13 SNPs and 391 SNPs with local false discovery rate Inline graphic for BPD and SCZ, respectively. By using the CNS set as annotation, GPA was able to identify 14 and 409 SNPs for BPD and SCZ, respectively, with the same fdr control. For joint analysis without annotation, the number of identified SNPs increased to 383 and 821 for BPD and SCZ, respectively. By using the CNS set as annotation, the number of identified SNPs further increased to 385 and 837 for BPD and SCZ, respectively. We investigated the BPD results in detail to evaluate the power of GPA in identification of functionally important SNPs. For single-GWAS analysis of BPD, GPA was able to identify SNPs located in the ANK3 gene. By using annotation data, the CACNA1C gene, which encodes an alpha-1 subunit of a voltage-dependent calcium channel, was identified by GPA. After incorporating pleiotropy information between SCZ and BPD, additional functionally relevant genes, such as PBRM1, C6orf136, DPCR1, SYNE1, were identified by GPA. For instance, SYNE1 encodes the synaptic nuclear envelope protein 1, and codes the protein Syne-1 that is found in many tissues and is especially critical in the brain. The Syne-1 protein is active (expressed) in Purkinje cells, which are located in the cerebellum and are involved in signaling between neurons. Mutations in the SYNE1 gene have been found to cause autosomal recessive cerebellar ataxia type 1 (ARCA1) and SYNE1 has recently been implicated as a susceptibility gene for BPD in a large collaborative GWAS study [35]. Clearly, the present results indicate that the statistical power to identify associated SNPs increased by making use of pleiotropy and functional annotation (in this real data example, pleiotropy played a more important role than functional annotation).

Figure 6. Manhattan plots of BPD and SCZ.

Figure 6

Top left panel: separate analysis without annotation. Top right panel: separate analysis with CNS annotation. Bottom left panel: joint analysis without annotation. Bottom right panel: joint analysis with CNS annotation. The red and blue lines indicate local Inline graphic = 0.05 and 0.1, respectively.

Figure 7. Manhattan plots of local false discovery rates Inline graphic and Inline graphic (Equations (11) and (12)) for detecting BPD-SCZ-sharing SNPs.

Figure 7

Left panel: joint analysis without annotation. Right panel: joint analysis with annotation. The red and blue lines indicate local Inline graphic = 0.05 and 0.1, respectively.

We also applied GPA with multiple annotation datasets to improve its performance. Beside the CNS gene set, we considered eQTL annotation from GTEx [36] and transcription factor binding site (TFBS) annotation by ANNOVAR [37]. Specifically, we downloaded the available eQTL data from the GTEx website (http://www.ncbi.nlm.nih.gov/gtex/GTEX2/gtex.cgi) and took the intersection of these eQTL with the markers of the psychiatric disorders, resulting in 23,505 SNPs annotated as eQTL. To obtain our TFBS annotation, we used the key word “TFBS” to query the ANNOVAR database, resulting in 19,029 SNPs annotated as TFBS (More details can be found at http://www.openbioinformatics.org/annovar/annovar_faq.html#tfbs). We performed joint analysis of BPD and SCZ with these three annotations (CNS, eQTL and TFBS). The estimated parameters in the GPA model and their standard errors are shown in Table 4. Notice that Inline graphic hit the boundary of the parameter space in presence of the eQTL annotation. This made the EM algorithm converge in fewer iterations, resulting in the minor differences between the estimated parameters Inline graphic in Table 4 and those in Table 2. With the local FDR Inline graphic, GPA identified 724 and 977 risk SNPs for BPD and SCZ, respectively. Clearly, the enrichment of eQTL is high with fold Inline graphic (s.e. 0.152) and the enrichment of TFBS is moderate with Inline graphic (s.e. 0.063). These results demonstrate that GPA is able to incorporate multiple annotation datasets for prioritization of GWAS results with good effects.

Table 4. The estimated parameters and their standard errors for the joint analysis of BPD and SCZ, together with multiple annotation data: The CNS gene set, eQTL and TFBS.

Besides the evaluation of the statistical power of GPA on real data, we examined whether the Beta distribution of the GPA model fit the real data. We compared the _p_-values of real data and the _p_-values simulated from the fitted GPA model to examine the goodness of fit. The results are given in Figures S88–S97 in Text S1, suggesting that our GPA model fitted the real data well.

Regarding computational time, the GPA algorithm takes less than 20 minutes to analyze two traits with one annotation data file. The speed of convergence depends on the strength of the GWAS signals. For example, it took about 7 mins and 3 mins to analyze ADHD and SCZ, repectively, as SCZ has stronger GWAS signals than ADHD. For joint analysis of BPD and SCZ, it took about 20 mins. All timings were carried out on a desktop with 3.0 GHz CPU and 16G memory.

Bladder cancer GWAS and ENCODE annotation data

DNase I hypersensitive sites (DHSs) are regions where DNA degradation by enzymes like DNase I occur more frequently than elsewhere. As a result, DHSs can mark active transcription regions across genome and these patterns are known to be tissue or cell specific. The ENCODE project analyzed the DHSs in 125 human cell lines with the intention of cataloging human regulatory DNA [38]. In this section, we applied GPA to assess how bladder cancer [39] risk associated SNPs are enriched in DHSs region across these 125 human cell lines.

We downloaded genotype data for bladder cancer from dbGaP (NCI Cancer Genetic Markers of Susceptibility (CGEMS) project; accession number phs000346.v1.p1). We used samples genotyped from both Illumina 1 M chip and 610K chip for our analysis. For quality control, we removed SNPs with missing rates Inline graphic0.01. We checked Hardy-Weinberg Equilibrium and excluded SNPs with _p_-value Inline graphic0.001. SNPs with minor allele frequencies (MAF) Inline graphic were also removed. After quality control, 490,614 SNPs from 3,631 cases and 3,356 controls of European descent were used in the analysis. SNP-level association p-values were calculated for this bladder cancer data using logistic regression by assuming an additive genetic model. We also downloaded the uniform peak files for DHSs in 125 cell lines from the ENCODE project (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeAwgDnaseUniform). Note that the DHSs for these 125 cell lines were identified with a uniform analysis workflow by the ENCODE Consortium; this facilitates fair and unbiased comparison among cell lines as annotation for our GPA model.

We applied GPA to analyze the bladder cancer GWAS Inline graphic-values with one annotation dataset at a time, and performed hypothesis testing to assess the significance of enrichment. The results are shown in the left panel of Figure 8. Under significance level Inline graphic after Bonferroni correction, annotations from 19 cell lines were statistically significantly enriched for bladder cancer risk associated SNPs. Most of these cell lines were derived from lymphocytes from normal blood (e.g., T cells CD4+ Th0 adult, Monocytes CD14+ RO01746), while some cell lines came from cancer patients (e.g., Gliobla and HeLa-S3). The above results suggest that involvement of the immune system and carcinoma pathways in bladder cancer. These results also demonstrate that GPA may be an effective way to explore functional roles of GWAS hits by testing enrichment on phenotype-related annotations or user-specified annotations.

Figure 8. Enrichment of the DNase I hypersenstivity site annotation data from 125 cell lines for bladder cancer.

Figure 8

Left panel: Inline graphic of hypothesis testing (13) vs. fold enrichment Inline graphic. The vertical red line corresponds to the significance level (Inline graphic = 0.05) after Bonferroni correction. The horizontal red line corresponds to ratio = 1. Right panel: The normalized variance component Inline graphic (2) given by LMM v.s. Inline graphic given by GPA.

We also compared GPA with the LMM-based approach [21], [40] for this dataset. Specifically, we considered the following genome-partitioning LMM:

graphic file with name pgen.1004787.e143.jpg (1)

where Inline graphic are covariates (the first five principal components from genotype data), and Inline graphic and Inline graphic are sets of SNPs overlapping DHSs in each cell line and the remaining SNPs, respectively. We denote the numbers of SNPs in Inline graphic and Inline graphic as Inline graphic and Inline graphic, respectively. The median number of SNPs that overlap DHS in each cell line is about 60K and 90% of cell lines have the number of DHSs ranging between 40K and 80K. In order to take into account such variation in DHS number among cell lines, we define a scaled version of the proportion of phenotype variance explained by SNPs overlapping DHSs in each cell line as

graphic file with name pgen.1004787.e151.jpg (2)

where Inline graphic is the proportion of the explained variance and Inline graphic is the scaling factor. The right panel of Figure 8 shows that the (Inline graphic)-transformed _p_-value of the GPA annotation enrichment test is linearly related to Inline graphic. This indicates that our GPA model captures enrichment of annotation almost as accurately as LMM even without the original genotype data, implying its broader applicability than methods requiring individual genotype and phenotype data.

Discussion

Many GWAS have been conducted in the past 10 years that have led to the discoveries of thousands of genomic regions associated with many traits, and many more discoveries are expected from ongoing GWAS. As GWAS data accumulate, there is an urgent need to perform a systematic analysis of available GWAS datasets for a comprehensive understanding of the genetic architecture of complex traits, and provide new insights for functional roles of the implicated variants. To achieve this goal, there is a great interest in developing computational and statistical approaches to exploring genomic data in the post-GWAS era.

In the following, we briefly discuss the relationship between GPA and other related methods, such as LMM, conditional FDR and GSEA. LMM is an effective method for exploring genetic architecture of complex traits and it has been implemented in a popular software package, GCTA [41]. Compared with LMM, GPA has the following distinctive features:

To our best knowledge, the conditional FDR approach is the first approach that statistically addresses the issue of pleiotropy between two GWAS. and GSEA is presently the most popular approach to evaluating the enrichment of gene sets. In fact, GPA provides a unified framework for systematically integrating both sources of information, pleiotropy and annotation. Rigorous statistical inference of pleiotropic effects and annotation enrichment has been established in this framework. As demonstrated in our extensive simulation study, GPA has better performance of identifying disease-associated markers than the conditional FDR approach, and it shows greater power of evaluating annotation enrichment than GSEA as well. With real data analysis, we have demonstrated how to use GPA to incorporate pleiotropy information and multiple annotation data for prioritizing GWAS results.

Here we briefly discuss some key assumptions made in GPA.

The parameters in the GPA model should be interpreted with caution because parameter estimation is based on the model assumption as discussed above. For example, as we showed in simulation study, the estimated Inline graphic can be biased due to the mismatch of GPA model and the random-effects model.

There are also some limitations of the current GPA model. Although extensions to three or more GWAS are straightforward in principle (from four-groups model to Inline graphic-groups model, Inline graphic), the number of groups will increase exponentially as the number of GWAS increases. This makes many Inline graphic close to zero, resulting in poor parameter estimation (large variance) and thus reduced power. Currently, we suggest doing pairwise-GWAS analysis to explore the genetic relationship of different phenotypes. It is of great interest to develop statistical models to handle the multiple-GWAS case more effectively. Another limitation is that current GPA version can only handle binary annotation. Allowing more annotation structures (e.g., continuous annotation) in GPA is an important extension of current GPA model. We will investigate this issue in the future.

In summary, we have presented a statistical approach, named GPA, that can integrate information from multiple GWAS data sets and functional annotation data. Not only does GPA have better statistical power than related methods, it also provides interpretable model parameters offering insights to our understanding of the genetic architecture of complex traits. We have successfully applied GPA to analyze GWAS data of five psychiatric disorders from PGC, and showed that GPA is able to identify pleiotropic effects among psychiatric disorders and detect enrichment of the CNS gene set. We have also applied GPA to analyze a bladder cancer GWAS dataset with ENCODE data as annotation, where significant enrichments of immune system and carcinoma pathways were observed. Compared to LMM that requires individual genotype and phenotype data as input, GPA has similar results of enrichment analysis without requirements of the genotype data. This makes GPA an attractive and effective tool for the integrative analysis of multiple GWAS data with functional annotation data, when genotype data are not available.

Materials and Methods

GPA probabilistic model

Throughout this paper, we shall use Inline graphic to index SNPs, Inline graphic to index GWAS data sets, and Inline graphic to index the annotation data sets. We first consider the simplest case where we only have summary statistics (_p_-values) from just one GWAS data set, and then extend our model to handle multiple GWAS data sets and annotation data. Suppose we have performed hypothesis testing of genome-wide SNPs and obtained their _p_-values:

graphic file with name pgen.1004787.e168.jpg (3)

where Inline graphic is the number of SNPs. Consider the “two-groups model” [46], i.e., the obtained _p_-values are assumed to come from the mixture of null and non-null, with probability Inline graphic and Inline graphic, respectively. Let Inline graphic be the latent variables indicating whether the _j_-th SNP is null or non-null, where Inline graphic, Inline graphic, and Inline graphic, because a SNP can only be either null or non-null. Here Inline graphic means un-associated (null) and Inline graphic means associated (non-null). Then we have the following two-groups model:

graphic file with name pgen.1004787.e178.jpg (4)

where the _p_-values from the null group are from the Uniform distribution on [0,1], denoted as Inline graphic, and the _p_-values from the non-null group are from the Beta distribution with parameters (Inline graphic), where Inline graphic. We put the constraint Inline graphic to model that a smaller _p_-value is more likely than a larger _p_-value when it is from the non-null group [47].

To incorporate information from functional annotation data, we extend the basic model as follows. Suppose we have collected information from Inline graphic functional annotation sources in the annotation matrix: Inline graphic, where Inline graphic indicates whether the _j_-th SNP is annotated in the _d_-th functional annotation source. For example, when there are two annotation sources – eQTL data and DNase I hypersensitivity sites (DHS) data – then Inline graphic is an Inline graphic matrix. If the _j_-th SNP is an eQTL, then Inline graphic, otherwise Inline graphic; if it is located in a DHS, then Inline graphic, otherwise Inline graphic. Now we model the relationship between Inline graphic and Inline graphic as

graphic file with name pgen.1004787.e194.jpg (5)

Clearly, Inline graphic can be interpreted as the proportion of null SNPs being annotated in the _d_-th annotation, and Inline graphic corresponds to the proportion of non-null SNPs being annotated in the _d_-th annotation. Therefore, Inline graphic implies that there exists enrichment for the _d_-th annotation. The statistical inference about enrichment of annotation data will be discussed in details in Section “Hypothesis testing of annotation enrichment and pleiotropy”.

Now we extend the above model to handle multiple GWAS data sets. To keep the notation uncluttered, we present the model for the case of two GWAS data sets. Suppose we have _p_-values from two GWAS:

graphic file with name pgen.1004787.e198.jpg (6)

Let Inline graphic be the matrix collecting all the _p_-values, where Inline graphic denotes the _p_-value of the _j_-th SNP in the _k_-th GWAS. Similarly, we introduce latent variables Inline graphic indicating the association between the _j_-th SNP and the two phenotypes: Inline graphic means the _j_-th SNP is associated with neither of them, Inline graphic means it is only associated with the first one, Inline graphic means it is only associated with the second one, and Inline graphic means it is associated with both. The two-groups model (4) is extended to the following “four-groups model”:

graphic file with name pgen.1004787.e206.jpg (7)

where Inline graphic. When the genetic bases of the two phenotypes are independent of each other (i.e., no pleiotropy), then we have Inline graphic by expectation. Therefore, the difference between Inline graphic and Inline graphic can be used to characterize pleiotropy. Statistical inference on pleiotropy is given in Section “Hypothesis testing of annotation enrichment and pleiotropy”.

To incorporate annotation information into the multiple GWAS model (7), similarly, we model the relationship between Inline graphic and Inline graphic as

graphic file with name pgen.1004787.e213.jpg (8)

where Inline graphic is the probability of a null SNP being annotated, Inline graphic is the probability of the first phenotype associated-SNP being annotated, Inline graphic is the probability of the second phenotype associated-SNP being annotated, and Inline graphic is the probability of jointly associated-SNP being annotated. Assuming the independence of SNP markers, the joint distribution Inline graphic can be written as

graphic file with name pgen.1004787.e219.jpg (9)

where Inline graphic and Inline graphic are the _j_-th row of Inline graphic and Inline graphic; the second equation holds by assuming the independence between Inline graphic and Inline graphic, conditional on Inline graphic; and the third equation holds by further assuming the independence between Inline graphic and Inline graphic for Inline graphic, conditional on Inline graphic.

Parameters in the GPA model can be estimated using the Expectation-Maximization (EM) algorithm [48], which is computationally efficient because we have explicit solutions for estimation of all the parameters in the M-step. Standard errors for parameter estimates can be approximated using the empirical observed information matrix [49]. Note that in the GPA model, the sample size for estimating the empirical observed information matrix corresponds to the number of SNPs and as a result, we have a very large sample size (Inline graphic) to estimate standard errors accurately. More details of the EM algorithm and the estimation of standard errors are provided in Sections 1 and 3 in Text S1.

Statistical inference

False discovery rate

After we estimate the parameters in the GPA model, SNPs can be prioritized based on their local false discovery rates [50]. Note that separate analysis of single GWAS data based on their summary statistics is equivalent to the analysis of single GWAS data using GPA without any annotation data. Hence, separate analysis of single GWAS data can be considered as a special case of our GPA approach.

To present the local false discovery rate based on our GPA model, we begin with the simplest case: single GWAS without annotation data. In this case, there are only two groups: null and non-null. The false discovery rate is defined as the probability that the _j_-th SNP belongs to the null group given its _p_-value, i.e.,

graphic file with name pgen.1004787.e232.jpg (10)

For joint analysis of two GWAS data sets, we are interested in the local false discovery rate of the _j_-th SNP, if it is claimed to be associated with the first phenotype, the second one, and both, i.e.,

graphic file with name pgen.1004787.e233.jpg (11)

Similarly, when annotation data are available, the false discovery rates can be calculated as

graphic file with name pgen.1004787.e234.jpg (12)

Then, we use the direct posterior probability approach [51] to control global false discovery rates. More details for the estimation of false discovery rates are provided in Section 2 in Text S1.

Hypothesis testing of annotation enrichment and pleiotropy

Given an annotation file, we may be interested in whether there is statistical evidence for enrichment of GWAS signals in this annotation file. We propose to use the likelihood ratio test (LRT) to assess the statistical significance. To keep the notation simple, we drop the index of annotation files. Specifically, the significance of enrichment of an annotation for single-GWAS data analysis can be assessed by the following test:

graphic file with name pgen.1004787.e235.jpg (13)

The LRT statistic is:

graphic file with name pgen.1004787.e236.jpg

where Inline graphic collects the parameter estimates obtained under Inline graphic, and the superscript A indicates the Annotation enrichment test. Note that Inline graphic can be easily obtained by running the GPA algorithm without incorporating the annotation data. Under the null, the test statistic Inline graphic asymptotically follows the Inline graphic distribution with 1 degree of freedom [52]. We reject Inline graphic if Inline graphic, where Inline graphic is the Inline graphic-th quantile of Inline graphic distribution with Inline graphic.

For joint analysis of two GWAS with annotation data, test (13) becomes

graphic file with name pgen.1004787.e248.jpg (14)

Under the null, the test statistic asymptotically follows a Inline graphic distribution with Inline graphic. Similarly, the test of annotation enrichment can be extended to handle Inline graphic GWAS. In this case, the test statistic asymptotically follows Inline graphic distribution with Inline graphic under the null.

Now we consider testing pleiotropy between two GWAS. When there is no pleiotropy, i.e., the signals from the two GWAS are independent of each other, testing pleiotropy can be formulated by testing the following hypothesis:

graphic file with name pgen.1004787.e254.jpg (15)

where Inline graphic and Inline graphic. The LRT statistic is constructed as follows:

graphic file with name pgen.1004787.e257.jpg

where Inline graphic represents the parameter estimates obtained under Inline graphic, and the superscript P indicates the Pleiotropy test. The test statistic (Inline graphic) asymptotically follows Inline graphic distribution with Inline graphic under the null.

Supporting Information

Text S1

Supporting information for GPA.

(PDF)

Funding Statement

This study was supported by National Institutes of Health grants RC2 DA028909, R01 DA12690, R01 DA12849, R01 DA18432, R01 AA11330, R01 AA017535, P50 AA12870, R01 GM59507, R01 DA030976, UL1 RR024139, and the VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Text S1

Supporting information for GPA.

(PDF)