Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions

Soumya Raychaudhuri et al. PLoS Genet. 2009 Jun.

Abstract

Translating a set of disease regions into insight about pathogenic mechanisms requires the ability to identify not only the key disease genes within them, but also the biological relationships among those key genes. Here we describe a statistical method, Gene Relationships Among Implicated Loci (GRAIL), that takes a list of disease regions and automatically assesses the degree of relatedness of implicated genes using 250,000 PubMed abstracts. We first evaluated GRAIL by assessing its ability to identify subsets of highly related genes in common pathways from validated lipid and height SNP associations from recent genome-wide studies. We then tested GRAIL by assessing its ability to separate true disease regions from many false positive disease regions in two separate practical applications in human genetics. First, we took 74 nominally associated Crohn's disease SNPs and applied GRAIL to identify a subset of 13 SNPs with highly related genes. Of these, ten convincingly validated in follow-up genotyping; genotyping results for the remaining three were inconclusive. Next, we applied GRAIL to 165 rare deletion events seen in schizophrenia cases (less than one-third of which contribute to disease risk). We demonstrate that GRAIL is able to identify a subset of 16 deletions containing highly related genes; many of these genes are expressed in the central nervous system and play a role in neuronal synapses. GRAIL offers a statistically robust approach to identifying functionally related genes from across multiple disease regions, genes that likely represent key disease pathways. An online version of this method is available for public use (http://www.broad.mit.edu/mpg/grail/).

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Gene Relationships Among Implicated Loci (GRAIL) method consists of four steps.

(A) Identifying genes in disease regions. For each independent associated SNP or CNV from a GWA study, GRAIL defines a disease region; GRAIL then identifies the genes overlapping the region. In this example the region contains three genes. We use gene 1 (pink arrow) as an example.

(B) Assessing relatedness to other human genes. GRAIL scores each gene contained in a disease region for relatedness to all other human genes. GRAIL determines gene relatedness by looking at words in gene references; related genes are defined as those whose abstract references use similar words. Here gene 1 has word counts that are highly similar to those of gene A but not to those of gene B. All human genes are ranked according to text-based similarity (green bar), and the most similar genes are considered related.

(C) Counting regions with similar genes. For each gene in a disease region, GRAIL assesses whether other independent disease regions contain highly significant genes. GRAIL assigns a significance score to the count. In this illustration gene 1 is similar to genes in three of the regions (green arrows), including gene A.

(D) Assigning a significance score to a disease region. After all of the genes within a region are scored, GRAIL identifies the most significant gene as the likely candidate. GRAIL corrects that gene's significance score for multiple hypothesis testing (by adjusting for the number of genes in the region) to assign a significance score to the region.
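Steps (B) and (D) of the caption can be illustrated with a small sketch. It is not GRAIL's implementation: the caption does not specify the similarity measure or the exact correction, so cosine similarity over word counts and a Šidák-style adjustment are assumptions made here for illustration, and the gene names, word counts, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (assumed metric)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_related(query: str, gene_words: dict) -> list:
    """Step (B): rank all other genes by text similarity to the query gene."""
    q = gene_words[query]
    others = [g for g in gene_words if g != query]
    return sorted(
        others,
        key=lambda g: cosine_similarity(q, gene_words[g]),
        reverse=True,
    )


def region_p_value(gene_p_values: list) -> float:
    """Step (D): take the best gene score in a region and adjust it for the
    number of genes in that region (Sidak-style adjustment, an assumption)."""
    n = len(gene_p_values)
    best = min(gene_p_values)
    return 1.0 - (1.0 - best) ** n


# Hypothetical word counts drawn from each gene's abstracts.
gene_words = {
    "gene1": Counter({"lipid": 4, "cholesterol": 3, "transport": 1}),
    "geneA": Counter({"lipid": 3, "cholesterol": 2, "plasma": 1}),
    "geneB": Counter({"synapse": 5, "neuron": 2}),
}

print(rank_related("gene1", gene_words))   # geneA ranks ahead of geneB
print(region_p_value([0.01, 0.5, 0.9]))    # region score for a three-gene region
```

The adjustment makes a region with many genes pay a penalty: the more candidate genes a region offers, the more likely one of them looks related by chance.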

Figure 2. SNPs associated with lipid metabolism and height contain genes related to each other.

(A) 19 SNPs associated with lipid metabolism. The _y_-axis plots the ptext values on a log scale, with increasing significance at the top. The histogram on the left side of the graph illustrates values for matched SNP sets. 88.6% of those SNPs have ptext values that are >0.1. The scatter plot on the right illustrates ptext values for actual serum cholesterol-associated SNPs (blue dots). The black horizontal line marks the median ptext value. We assessed the same SNPs with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots).

(B) 42 SNPs associated with height. The same plot for the 42 height-associated SNPs. The histogram on the left of the graph illustrates ptext values for random SNP sets carefully matched to the height-associated SNP set. 86.5% of those SNPs have ptext values that are >0.1. The scatter plot on the right illustrates ptext values for actual SNPs associated with height (blue dots). The black horizontal line marks the median ptext value. We assessed the same SNPs with similarity metrics based on gene annotation (green dots) and gene expression correlation (purple dots). On the right we list, for each ptext threshold, the number of SNPs expected to fall below the threshold based on matched sets and the number of height-associated SNPs observed below it.
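The expected-versus-observed comparison listed in panel (B) can be sketched as follows. This is a minimal illustration under stated assumptions: the expected count is taken as the average count across matched SNP sets, which the caption does not spell out, and all ptext values and function names below are invented.

```python
def count_below(p_values, threshold):
    """Number of p-values strictly below the threshold."""
    return sum(1 for p in p_values if p < threshold)


def expected_below(matched_sets, threshold):
    """Average count below the threshold across matched SNP sets
    (assumed definition of the 'expected' number)."""
    counts = [count_below(s, threshold) for s in matched_sets]
    return sum(counts) / len(counts)


# Hypothetical ptext values for an associated SNP set and two matched sets.
observed = [0.001, 0.03, 0.2, 0.6]
matched_sets = [
    [0.05, 0.4, 0.7, 0.9],
    [0.3, 0.5, 0.8, 0.95],
]

for threshold in (0.1, 0.01):
    print(
        threshold,
        expected_below(matched_sets, threshold),
        count_below(observed, threshold),
    )
```

An excess of observed over expected counts at stringent thresholds is what the figure reads as evidence that the associated SNPs contain related genes.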

Figure 3. GRAIL predicts Crohn's disease SNPs.

(A) Validated versus failed SNPs. Prior to replication, GRAIL scored Crohn's SNPs that emerged from a meta-analysis study. Results from follow-up testing either validated Crohn's SNPs or identified those SNPs that failed. We produce a scatter plot of the significance of text-based similarity (ptext) for validated regions (green) versus regions that failed to replicate (red). Black horizontal lines mark the median ptext values. The distribution of scores for failed SNPs resembles a random distribution of _p_-values. The distribution of scores for validated SNPs is significantly different; almost half of these SNPs obtain ptext scores <0.1.

(B) Histogram of text-based scores for Crohn's disease candidate regions. Here we plot a histogram of ptext scores for 74 Crohn's disease SNPs. Validated SNPs (green) have ptext values that are enriched for significant values. Indeterminate SNPs (yellow) have a subset of ptext values that are significant. Failed SNPs (red) have all of their ptext scores >0.1.

Figure 4. GRAIL identifies a subset of highly connected genes within rare deletions found in schizophrenia cases.

(A) Case deletions versus control deletions. Here we plot the results of the separate GRAIL analyses conducted on the deletions observed in schizophrenia cases and controls. Case deletion ptext scores are displayed in red; control deletion ptext scores are displayed in green. The line in the middle of each box represents the median GRAIL ptext score. The box represents the 25–75% range. The bars represent the 5–95% range. Additional scores outside that range are plotted individually.

(B) Text-based GRAIL significance score tracks with CNS-specific expression. We partition case-only deletions by their GRAIL scores. For each range of GRAIL ptext scores, we assess the candidate genes selected by GRAIL for CNS expression. The upper portion of this plot illustrates the fraction of those candidate genes that demonstrate preferential CNS expression, along with 95% confidence intervals. The blue line represents the total fraction of genes that are preferentially CNS expressed. For the most compelling GRAIL scores, the candidate genes are significantly enriched for CNS expression compared to what would be expected from a random group of genes. The lower portion of the plot is a histogram.
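The fraction-with-confidence-interval shown in the upper portion of panel (B) can be sketched as below. The caption does not say how the 95% intervals were computed, so the Wilson score interval is an assumption chosen for illustration, and the counts and function name are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximately 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical bin: 8 of 16 candidate genes show preferential CNS expression.
cns_genes, candidates = 8, 16
low, high = wilson_interval(cns_genes, candidates)
print(cns_genes / candidates, low, high)
```

A bin would count as enriched in the sense of the caption when its interval lies above the background fraction marked by the blue line.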
