PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations
Related papers
Efficacy assessment of SNP sets for genome-wide disease association studies
Nucleic Acids Research, 2007
The power of a genome-wide disease association study depends critically upon the properties of the marker set used, particularly the number and physical spacing of markers, and the level of intermarker association due to linkage disequilibrium. Extending our previously devised theoretical framework for the entropy-based selection of genetic markers, we have developed a local measure of the efficacy of a marker set, relative to including a maximally polymorphic single nucleotide polymorphism (SNP) at the map position of interest. Using this quantitative criterion, we evaluated five currently available SNP sets, namely Affymetrix 100K and 500K, and Illumina 100K, 300K and 550K in the CEU, YRI and JPT + CHB HapMap populations. At 50% relative efficacy, the commercial marker sets cover between 19 and 68% of the human genome, depending upon the population under study. An optimal technology-independent 500K marker set constructed from HapMap for Caucasians, in contrast, would achieve 73% coverage at the same relative efficacy.
On Combining Data From Genome-Wide Association Studies to Discover Disease-Associated SNPs
Statistical Science, 2009
Combining data from several case-control genome-wide association (GWA) studies can yield greater efficiency for detecting associations of disease with single nucleotide polymorphisms (SNPs) than separate analyses of the component studies. We compared several procedures to combine GWA study data both in terms of the power to detect a disease-associated SNP while controlling the genome-wide significance level, and in terms of the detection probability (DP). The DP is the probability that a particular disease-associated SNP will be among the T most promising SNPs selected on the basis of low p-values. We studied both fixed effects and random effects models in which associations varied across studies. In settings of practical relevance, meta-analytic approaches that focus on a single degree of freedom had higher power and DP than global tests such as summing chi-square test-statistics across studies, Fisher's combination of p-values, and forming a combined list of the best SNPs from within each study.
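As a rough illustration of two of the strategies compared above, the sketch below contrasts Fisher's combination of p-values (a global test with 2k degrees of freedom for k studies) with an inverse-variance fixed-effects estimate (a single-degree-of-freedom test). The function names and any inputs passed to them are hypothetical; this is not the authors' code, and it does not reproduce their power or detection-probability calculations.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Closed-form chi-square survival function for even df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted pooled effect with a two-sided normal p-value."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, p_value
```

Fisher's method ignores the direction of each study's effect, whereas the fixed-effects estimate pools signed effects, which is one reason the two can rank the same SNP differently.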
Human-Disease Phenotype Map Derived from PheWAS across 38,682 Individuals
The American Journal of Human Genetics
Phenome-wide association studies (PheWASs) have been a useful tool for testing associations between genetic variations and multiple complex traits or diagnoses. Linking PheWAS-based associations between phenotypes and a variant or a genomic region into a network provides a new way to investigate cross-phenotype associations, and it might broaden the understanding of genetic architecture that exists between diagnoses, genes, and pleiotropy. We created a network of associations from one of the largest PheWASs on electronic health record (EHR)-derived phenotypes across 38,682 unrelated samples from the Geisinger's biobank; the samples were genotyped through the DiscovEHR project. We computed associations between 632,574 common variants and 541 diagnosis codes. Using these associations, we constructed a "disease-disease" network (DDN) wherein pairs of diseases were connected on the basis of shared associations with a given genetic variant. The DDN provides a landscape of intra-connections within the same disease classes, as well as interconnections across disease classes. We identified clusters of diseases with known biological connections, such as autoimmune disorders (type 1 diabetes, rheumatoid arthritis, and multiple sclerosis) and cardiovascular disorders. Previously unreported relationships between multiple diseases were identified on the basis of genetic associations as well. The network approach applied in this study can be used to uncover interactions between diseases as a result of their shared, potentially pleiotropic SNPs. Additionally, this approach might advance clinical research and even clinical practice by accelerating our understanding of disease mechanisms on the basis of similar underlying genetic associations.
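The network construction described above, in which two diseases are linked when they share an association with the same variant, can be sketched as follows. The input layout, the function name and the significance threshold are assumptions made for illustration; the abstract does not state the threshold or data format the authors used.

```python
from collections import defaultdict
from itertools import combinations

def build_disease_network(associations, p_threshold):
    """Build disease-disease edges from (variant, disease, p_value) tuples.

    Returns a dict mapping frozenset({disease_1, disease_2}) to the set of
    variants associated with both diseases at p < p_threshold.
    """
    diseases_by_variant = defaultdict(set)
    for variant, disease, p_value in associations:
        if p_value < p_threshold:
            diseases_by_variant[variant].add(disease)

    edges = defaultdict(set)
    for variant, diseases in diseases_by_variant.items():
        # Every pair of diseases sharing this variant gets an edge.
        for d1, d2 in combinations(sorted(diseases), 2):
            edges[frozenset((d1, d2))].add(variant)
    return dict(edges)
```

Storing the shared variants on each edge keeps the evidence for a link inspectable, and the number of shared variants could serve as a simple edge weight.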
SNPStats: a web tool for the analysis of association studies
Bioinformatics, 2006
A web-based application has been designed from a genetic-epidemiology point of view to analyze association studies. Main capabilities include: descriptive analysis, test for Hardy-Weinberg equilibrium and linkage disequilibrium. Analysis of association is based on linear or logistic regression according to the response variable (quantitative or binary disease status respectively). Analysis of single SNPs: multiple inheritance models (co-dominant, dominant, recessive, additive and super-dominant), and analysis of interactions (gene-gene or gene-environment). Analysis of multiple SNPs: haplotype frequency estimation, analysis of association of haplotypes with the response, including analysis of interactions. Availability: Main page: http://bioinfo.iconcologia.net/SNPstats. Source code for local installation is available under GNU license. Contact: v.moreno@ico.scs.es Supplementary information: Figures with a sample run are available on Bioinformatics online. A detailed online tutorial is available within the application.
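For readers unfamiliar with the Hardy-Weinberg check mentioned above, here is a minimal sketch of the textbook one-degree-of-freedom chi-square test for a biallelic SNP. The function name is hypothetical, and SNPStats' own implementation is not described in the abstract and may differ (for example, by using an exact test).

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("monomorphic marker: the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The chi-square approximation becomes unreliable when expected genotype counts are small, which is the usual reason to prefer an exact test for rare alleles.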
Application of Clinical Text Data for Phenome-Wide Association Studies (PheWASs)
Bioinformatics, 2015
Motivation: Genome-wide association studies (GWASs) are effective for describing genetic complexities of common diseases. Phenome-wide association studies (PheWASs) offer an alternative and complementary approach to GWAS using data embedded in the electronic health record (EHR) to define the phenome. International Classification of Disease version 9 (ICD9) codes are used frequently to define the phenome, but using ICD9 codes alone misses other clinically relevant information from the EHR that can be used for PheWAS analyses and discovery. Results: As an alternative to ICD9 coding, a text-based phenome was defined by 23 384 clinically relevant terms extracted from Marshfield Clinic's EHR. Five single nucleotide polymorphisms (SNPs) with known phenotypic associations were genotyped in 4235 individuals and associated across the text-based phenome. All five SNPs genotyped were associated with expected terms (P < 0.02), most at or near the top of their respective PheWAS ranking. Raw association results indicate that text data performed equivalently to ICD9 coding and demonstrate the utility of information beyond ICD9 coding for application in PheWAS.
PLoS Genetics, 2013
Using a phenome-wide association study (PheWAS) approach, we comprehensively tested genetic variants for association with phenotypes available for 70,061 study participants in the Population Architecture using Genomics and Epidemiology (PAGE) network. Our aim was to better characterize the genetic architecture of complex traits and identify novel pleiotropic relationships. This PheWAS drew on five population-based studies representing four major racial/ethnic groups (European Americans (EA), African Americans (AA), Hispanics/Mexican-Americans, and Asian/Pacific Islanders) in PAGE, each site with measurements for multiple traits, associated laboratory measures, and intermediate biomarkers. A total of 83 single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) were genotyped across two or more PAGE study sites. Comprehensive tests of association, stratified by race/ethnicity, were performed, encompassing 4,706 phenotypes mapped to 105 phenotype-classes, and association results were compared across study sites. A total of 111 PheWAS results had significant associations for two or more PAGE study sites with consistent direction of effect with a significance threshold of p < 0.01 for the same racial/ethnic group, SNP, and phenotype-class. Among results identified for SNPs previously associated with phenotypes such as lipid traits, type 2 diabetes, and body mass index, 52 replicated previously published genotype-phenotype associations, 26 represented phenotypes closely related to previously known genotype-phenotype associations, and 33 represented potentially novel genotype-phenotype associations with pleiotropic effects. The majority of the potentially novel results were for single PheWAS phenotype-classes, for example, for CDKN2A/B rs1333049 (previously associated with type 2 diabetes in EA) a PheWAS association was identified for hemoglobin levels in AA.
Of note, however, GALNT2 rs2144300 (previously associated with high-density lipoprotein cholesterol levels in EA) had multiple potentially novel PheWAS associations, with hypertension related phenotypes in AA and with serum calcium levels and coronary artery disease phenotypes in EA. PheWAS identifies associations for hypothesis generation and exploration of the genetic architecture of complex traits.
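The cross-site criterion stated in this abstract (significance in two or more study sites, with a consistent direction of effect, for the same racial/ethnic group, SNP and phenotype-class) could be applied along the following lines. The record layout, key names and function name are hypothetical, and this is only a sketch of the filtering logic, not the PAGE analysis pipeline.

```python
from collections import defaultdict

def replicated_results(results, p_threshold=0.01):
    """Keep (group, snp, phenotype_class) keys that are significant in at
    least two sites with the same sign of effect.

    Each result is a dict with keys: site, group, snp, phenotype_class,
    beta, p (an assumed layout).
    """
    hits_by_key = defaultdict(list)
    for r in results:
        if r["p"] < p_threshold:
            key = (r["group"], r["snp"], r["phenotype_class"])
            hits_by_key[key].append(r)

    kept = {}
    for key, hits in hits_by_key.items():
        sites = {h["site"] for h in hits}
        directions = {h["beta"] > 0 for h in hits}
        # Require two or more distinct sites and one shared direction.
        if len(sites) >= 2 and len(directions) == 1:
            kept[key] = hits
    return kept
```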
Genetic Epidemiology, 2007
Nonparametric approaches have been developed that are able to analyze large numbers of single nucleotide polymorphisms (SNPs) in modest sample sizes. These approaches have different selection features and may not provide similar results when applied to the same dataset. Therefore, we compared the results of three approaches (set association, random forests and multifactor dimensionality reduction [MDR]) to select from a total of 93 candidate SNPs a subset of SNPs that are important in determining high-density lipoprotein (HDL)-cholesterol levels. The study population consisted of a random sample from a Dutch monitoring project for cardiovascular disease risk factors and was dichotomized into cases (low HDL-cholesterol, n = 533) and non-cases (high HDL-cholesterol, n = 545) based on gender-specific median values for HDL cholesterol. Clearly, all three approaches prioritized three SNPs as important (CETP Taq1B, CETP -629 C/A and LPL Ser447X). Two SNPs with weaker main effects were additionally prioritized by random forests (APOC3 3175 G/C and CCR2 Val62Ile), whereas MTHFR 677 C/T was selected in combination with CETP Taq1B as best model by MDR. Obtained p-values for the selected models were significant for the set association approach (p = .0019), random forests (p < .01) and MDR (p < .02). In conclusion, the application of a combination of multi-locus methods is a useful approach in genetic association studies to select a well-defined set of important SNPs for further statistical and epidemiological interpretation, providing increased confidence and more information compared with the application of only one method. Genet. Epidemiol. 31:910-921, 2007.
© 2007. The supplemental materials described in this article can be found at http://www.interscience.wiley.com/jpages/0741-0395/suppmat Abbreviations: APO, apolipoprotein; CCR2, chemokine receptor; CETP, cholesteryl ester transfer protein; HDL, high-density lipoprotein; IL, interleukin; IL5RA, interleukin 5 receptor alpha; LPA, apolipoprotein(a); LPL, lipoprotein lipase; MTHFR, methylenetetrahydrofolate reductase; NOS2A, nitric oxide synthase; PON1, paraoxonase.
Interpretation of genetic association studies in complex disease
Pharmacogenomics Journal, 2002
Recent successful discoveries of potentially causal single nucleotide polymorphisms (SNPs) for complex diseases hold great promise, and commercialization of genomics in personalized medicine has already begun. The hope is that genetic testing will benefit patients and their families, and encourage positive lifestyle changes and guide clinical decisions. However, for many complex diseases, it is arguable whether the era of genomics in personalized medicine is here yet. We focus on the clinical validity of genetic testing with an emphasis on two popular statistical methods for evaluating markers. The two methods, logistic regression and receiver operating characteristic (ROC) curve analysis, are applied to our age-related macular degeneration dataset. By using an additive model of the CFH, LOC387715, and C2 variants, the odds ratios are 2.9, 3.4, and 0.4, with p-values of 10^-13, 10^-13, and 10^-3, respectively. The area under the ROC curve (AUC) is 0.79, but assuming prevalences of 15%, 5.5%, and 1.5% (which are realistic for age groups 80 y, 65 y, and 40 y and older, respectively), only 30%, 12%, and 3% of the group classified as high risk are cases. Additionally, we present examples for four other diseases for which strongly associated variants have been discovered. In type 2 diabetes, our classification model of 12 SNPs has an AUC of only 0.64, and two SNPs achieve an AUC of only 0.56 for prostate cancer. Nine SNPs were not sufficient to improve the discrimination power over that of nongenetic predictors for risk of cardiovascular events. Finally, in Crohn's disease, a model of five SNPs, one with a quite low odds ratio of 0.26, has an AUC of only 0.66. Our analyses and examples show that strong association, although very valuable for establishing etiological hypotheses, does not guarantee effective discrimination between cases and controls.
The scientific community should be cautious to avoid overstating the value of association findings in terms of personalized medicine before their time.
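The way the fraction of true cases in the high-risk group shrinks with prevalence follows from Bayes' rule, and a small sketch makes the dependence explicit. The function name is hypothetical, and the sensitivity and specificity in the usage loop are placeholder values chosen for illustration; the abstract does not report the operating point the authors used, so the output should not be read as a reproduction of their figures.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of individuals classified as high risk who are true cases."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Prevalences quoted in the abstract; sensitivity and specificity are
# hypothetical placeholders, not values from the paper.
for prevalence in (0.15, 0.055, 0.015):
    ppv = positive_predictive_value(0.5, 0.8, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}")
```

Holding the classifier fixed, the positive predictive value falls as the disease becomes rarer, because false positives from the much larger unaffected group dominate the high-risk set.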