Winner's Curse Correction and Variable Thresholding Improve Performance of Polygenic Risk Modeling Based on Genome-Wide Association Study Summary-Level Data
PLoS Genet. 2016 Dec 30;12(12):e1006493.
doi: 10.1371/journal.pgen.1006493. eCollection 2016 Dec.
Ju-Hyun Park 2, Jubao Duan 3, Sonja T Berndt 1, Winton Moy 4, Kai Yu 1, Lei Song 1, William Wheeler 5, Xing Hua 1, Debra Silverman 1, Montserrat Garcia-Closas 1, Chao Agnes Hsiung 6, Jonine D Figueroa 1 7, Victoria K Cortessis 8 9, Núria Malats 10, Margaret R Karagas 11, Paolo Vineis 12 13, I-Shou Chang 14, Dongxin Lin 15 16, Baosen Zhou 17, Adeline Seow 18, Keitaro Matsuo 19, Yun-Chul Hong 20, Neil E Caporaso 1, Brian Wolpin 21 22, Eric Jacobs 23, Gloria M Petersen 24, Alison P Klein 25 26, Donghui Li 27, Harvey Risch 28, Alan R Sanders 3, Li Hsu 29, Robert E Schoen 30, Hermann Brenner 31 32 33; MGS (Molecular Genetics of Schizophrenia) GWAS Consortium; GECCO (The Genetics and Epidemiology of Colorectal Cancer Consortium); GAME-ON/TRICL (Transdisciplinary Research in Cancer of the Lung) GWAS Consortium; PRACTICAL (PRostate cancer AssoCiation group To Investigate Cancer Associated aLterations) Consortium; PanScan Consortium; GAME-ON/ELLIPSE Consortium; Rachael Stolzenberg-Solomon 1, Pablo Gejman 3, Qing Lan 1, Nathaniel Rothman 1, Laufey T Amundadottir 1, Maria Teresa Landi 1, Douglas F Levinson 34, Stephen J Chanock 1, Nilanjan Chatterjee 1 35 36
- PMID: 28036406
- PMCID: PMC5201242
- DOI: 10.1371/journal.pgen.1006493
Jianxin Shi et al. PLoS Genet. 2016.
Abstract
Recent heritability analyses have indicated that genome-wide association studies (GWAS) have the potential to improve genetic risk prediction for complex diseases based on polygenic risk score (PRS), a simple modelling technique that can be implemented using summary-level data from the discovery samples. We herein propose modifications to improve the performance of PRS. We introduce threshold-dependent winner's-curse adjustments for marginal association coefficients that are used to weight the single-nucleotide polymorphisms (SNPs) in PRS. Further, as a way to incorporate external functional/annotation knowledge that could identify subsets of SNPs highly enriched for associations, we propose variable thresholds for SNP selection. We applied our methods to GWAS summary-level data of 14 complex diseases. Across all diseases, a simple winner's curse correction uniformly improved the performance of the models, whereas incorporation of functional SNPs was beneficial only for selected diseases. Compared to the standard PRS algorithm, the proposed methods in combination led to a notable gain in efficiency (25-50% increase in the prediction R²) for 5 of 14 diseases. As an example, for GWAS of type 2 diabetes, winner's curse correction improved prediction R² from 2.29% based on the standard PRS to 3.10% (P = 0.0017), and incorporating functional annotation data further improved R² to 3.53% (P = 2×10⁻⁵). Our simulation studies illustrate why differential treatment of certain categories of functional SNPs, even when shown to be highly enriched for GWAS-heritability, does not lead to proportionate improvement in genetic risk prediction because of non-uniform linkage disequilibrium structure.
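To make the idea of a threshold-dependent winner's-curse adjustment concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the formula, so the sketch assumes one plausible lasso-type form: each selected coefficient is soft-thresholded by the z-value of the P-value cutoff times its standard error. The function names, the (beta, se) input layout, and the per-person dosage list are all hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def two_sided_p(beta, se):
    """Two-sided P-value for a marginal estimate, assuming a normal z-statistic."""
    z = abs(beta / se)
    return 2 * (1 - _STD_NORMAL.cdf(z))

def lasso_type_weight(beta, se, p_threshold):
    """Soft-threshold beta toward zero (assumed form of a lasso-type correction).

    The shrinkage amount grows as the selection threshold gets stricter,
    which is what makes the adjustment threshold-dependent.
    """
    z_cut = _STD_NORMAL.inv_cdf(1 - p_threshold / 2)
    magnitude = max(abs(beta) - z_cut * se, 0.0)
    return magnitude if beta >= 0 else -magnitude

def prs(dosages, snps, p_threshold, correct=True):
    """Score one person: sum of weight * allele dosage over selected SNPs.

    dosages: allele counts per SNP (hypothetical input layout)
    snps:    (beta, se) pairs from discovery summary statistics
    """
    score = 0.0
    for dosage, (beta, se) in zip(dosages, snps):
        if two_sided_p(beta, se) > p_threshold:
            continue  # SNP does not pass the selection threshold
        weight = lasso_type_weight(beta, se, p_threshold) if correct else beta
        score += weight * dosage
    return score
```

Calling `prs(dosages, snps, 0.05, correct=False)` gives the uncorrected score under these assumptions, so the two variants can be compared on the same validation samples.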
Conflict of interest statement
The authors have declared that no competing interests exist.
Figures
Fig 1. Theoretic investigation of prediction performance and optimal thresholds for SNP selection in 2D PRS.
The theoretic calculation assumes M = 53,163 independent SNPs, of which 5,000 are causal for a binary trait, similar to simulation studies. The high-prior (HP) SNP set has 5,000 SNPs and the low-prior (LP) SNP set has 48,163 SNPs. Δ is the enrichment fold of HP SNPs in the causal SNP set. (A) The prediction AUC for 1D PRS and 2D PRS. (B) The optimal P-value thresholds for including HP and LP SNPs in 2D PRS. For both plots, the x-coordinate is the discovery sample size, assuming equal numbers of cases and controls.
Fig 2. Simulation results for comparing polygenic risk prediction methods and different high priority SNP sets.
Quantitative traits were simulated conditioning on the genotypes of LD-pruned SNPs in a lung cancer GWAS with 10,000 discovery samples and 1,924 validation samples. For each simulation, we used 5,000 causal SNPs and 9,940 high priority (HP) SNPs (either randomly selected or the SNPs related to conserved regions). Δ denotes the enrichment fold change of the HP SNPs. On the x-axis, “1D” denotes 1D PRS without winner’s curse correction; “1D-LASSO(MLE)” denotes 1D PRS with lasso-type (MLE) correction; “2D-random” indicates 2D PRS with HP SNP sets randomly selected from the LD-pruned SNPs in the genome; “2D-CR” indicates 2D PRS using SNPs in conserved regions as HP SNPs.
Fig 3. Genetic risk prediction for type-2 diabetes.
PRS models were built from the summary statistics of a meta-analysis of DIAGRAM consortium and GERA data (17,802 cases and 105,109 controls in total) and validated in an independent set of 1,500 cases and 1,500 controls in GERA. (A) Prediction R² (observational scale) for 1D PRS with or without winner’s curse correction. “NO”: no winner’s curse correction for association coefficients; “Lasso”: regression coefficients were modified by a lasso-type correction; “MLE”: association coefficients were modified by maximizing a likelihood function conditioning on selection. (B) Quantile-quantile plot of −log10(P) for high priority (HP) SNPs vs. low priority (LP) SNPs. SNPs were pruned to have pairwise r² ≤ 0.1. Here, the HP SNPs were eSNPs/meSNPs in adipose tissue or SNPs related to the H3K4me3 mark in a pancreatic islet cell line, with data downloaded from the ROADMAP project. The HP SNPs were strongly enriched in the discovery data. (C) Prediction R² for 2D PRS with lasso-type winner’s curse correction. The SNP set was the same as in (B). The best prediction (R² = 3.53%) was achieved when we included HP SNPs with P ≤ 0.03 and LP SNPs with P ≤ 0.005. (D) The prediction R², the area under the curve (AUC), and the significance levels for testing whether an alternative PRS was better than the standard 1D PRS.
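The variable-threshold (2D) selection described in panel (C) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the SNP record layout (`id`, `p`, `high_prior`), the function names, and the `evaluate` callback are hypothetical. The callback stands in for whatever validation metric is used to compare threshold pairs, such as prediction R² in held-out samples.

```python
from itertools import product

def select_2d(snps, hp_threshold, lp_threshold):
    """Keep a SNP if its P-value passes the cutoff for its own category.

    snps: list of dicts with keys 'id', 'p', 'high_prior' (hypothetical layout)
    """
    return [
        s["id"]
        for s in snps
        if s["p"] <= (hp_threshold if s["high_prior"] else lp_threshold)
    ]

def best_threshold_pair(snps, grid, evaluate):
    """Try every (HP, LP) cutoff pair on the grid and keep the best-scoring one.

    evaluate: caller-supplied function mapping selected SNP ids to a score
              (assumed here to be higher-is-better).
    Returns (score, hp_threshold, lp_threshold).
    """
    best = None
    for hp_t, lp_t in product(grid, repeat=2):
        score = evaluate(select_2d(snps, hp_t, lp_t))
        if best is None or score > best[0]:
            best = (score, hp_t, lp_t)
    return best
```

With the cutoffs reported in the caption, the selection step would be `select_2d(snps, 0.03, 0.005)`: HP SNPs are admitted at the more lenient threshold and LP SNPs at the stricter one.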
Fig 4. Comparison of polygenic risk prediction methods for 13 complex diseases.
For all panels, the y-coordinate is the prediction R² on the observational scale. “1D” denotes 1D PRS; “2D, blood eSNPs” denotes 2D PRS using blood eSNPs as the high-prior SNP set. On the x-axis, “NO” denotes PRS without winner’s curse correction; “LASSO” and “MLE” denote lasso-type and MLE winner’s curse correction, respectively. (A) Prediction R² values for six diseases in WTCCC data, estimated based on five-fold cross-validation. (B) Prediction R² values for three GWAS of cancers, estimated based on ten-fold cross-validation. (C) Prediction R² values for four complex diseases estimated based on independent validation samples.
Grants and funding
- P01 CA087969/CA/NCI NIH HHS/United States
- HHSN268201100046C/HL/NHLBI NIH HHS/United States
- HHSN268201100003C/WH/WHI NIH HHS/United States
- U01 CA164930/CA/NCI NIH HHS/United States
- R01 CA137178/CA/NCI NIH HHS/United States
- U01 CA167551/CA/NCI NIH HHS/United States
- R01 MH106575/MH/NIMH NIH HHS/United States
- HHSN271201100004C/AG/NIA NIH HHS/United States
- HHSN268201100002C/WH/WHI NIH HHS/United States
- R01 CA042182/CA/NCI NIH HHS/United States
- R01 CA060987/CA/NCI NIH HHS/United States
- UM1 CA167552/CA/NCI NIH HHS/United States
- R01 CA154823/CA/NCI NIH HHS/United States
- P50 CA127003/CA/NCI NIH HHS/United States
- HHSN268201100004C/WH/WHI NIH HHS/United States
- R01 CA057494/CA/NCI NIH HHS/United States
- R01 CA059045/CA/NCI NIH HHS/United States
- HHSN268201100001I/HL/NHLBI NIH HHS/United States
- U19 CA148127/CA/NCI NIH HHS/United States
- R01 CA076366/CA/NCI NIH HHS/United States
- R01 CA195789/CA/NCI NIH HHS/United States
- P50 CA062924/CA/NCI NIH HHS/United States
- HHSN268201100004I/HL/NHLBI NIH HHS/United States
- UL1 TR001863/TR/NCATS NIH HHS/United States
- P01 CA053996/CA/NCI NIH HHS/United States
- K05 CA154337/CA/NCI NIH HHS/United States
- P01 CA055075/CA/NCI NIH HHS/United States
- R01 CA151993/CA/NCI NIH HHS/United States
- R01 CA048998/CA/NCI NIH HHS/United States
- U01 CA137088/CA/NCI NIH HHS/United States
- R01 CA063464/CA/NCI NIH HHS/United States
- P01 CA033619/CA/NCI NIH HHS/United States
- UM1 CA186107/CA/NCI NIH HHS/United States
- WT_/Wellcome Trust/United Kingdom
- HHSN268201100003I/HL/NHLBI NIH HHS/United States
- HHSN268201100002I/HL/NHLBI NIH HHS/United States
- U01 CA074783/CA/NCI NIH HHS/United States
- R37 CA054281/CA/NCI NIH HHS/United States
- HHSN268201100001C/WH/WHI NIH HHS/United States