Winner's Curse Correction and Variable Thresholding Improve Performance of Polygenic Risk Modeling Based on Genome-Wide Association Study Summary-Level Data

PLoS Genet. 2016 Dec 30;12(12):e1006493.

doi: 10.1371/journal.pgen.1006493. eCollection 2016 Dec.

Ju-Hyun Park 2, Jubao Duan 3, Sonja T Berndt 1, Winton Moy 4, Kai Yu 1, Lei Song 1, William Wheeler 5, Xing Hua 1, Debra Silverman 1, Montserrat Garcia-Closas 1, Chao Agnes Hsiung 6, Jonine D Figueroa 1 7, Victoria K Cortessis 8 9, Núria Malats 10, Margaret R Karagas 11, Paolo Vineis 12 13, I-Shou Chang 14, Dongxin Lin 15 16, Baosen Zhou 17, Adeline Seow 18, Keitaro Matsuo 19, Yun-Chul Hong 20, Neil E Caporaso 1, Brian Wolpin 21 22, Eric Jacobs 23, Gloria M Petersen 24, Alison P Klein 25 26, Donghui Li 27, Harvey Risch 28, Alan R Sanders 3, Li Hsu 29, Robert E Schoen 30, Hermann Brenner 31 32 33; MGS (Molecular Genetics of Schizophrenia) GWAS Consortium; GECCO (The Genetics and Epidemiology of Colorectal Cancer Consortium); GAME-ON/TRICL (Transdisciplinary Research in Cancer of the Lung) GWAS Consortium; PRACTICAL (PRostate cancer AssoCiation group To Investigate Cancer Associated aLterations) Consortium; PanScan Consortium; GAME-ON/ELLIPSE Consortium; Rachael Stolzenberg-Solomon 1, Pablo Gejman 3, Qing Lan 1, Nathaniel Rothman 1, Laufey T Amundadottir 1, Maria Teresa Landi 1, Douglas F Levinson 34, Stephen J Chanock 1, Nilanjan Chatterjee 1 35 36

PMID: 28036406
PMCID: PMC5201242
DOI: 10.1371/journal.pgen.1006493

Jianxin Shi et al. PLoS Genet. 2016.

Abstract

Recent heritability analyses have indicated that genome-wide association studies (GWAS) have the potential to improve genetic risk prediction for complex diseases based on the polygenic risk score (PRS), a simple modelling technique that can be implemented using summary-level data from the discovery samples. We herein propose modifications to improve the performance of PRS. We introduce threshold-dependent winner's-curse adjustments for the marginal association coefficients that are used to weight the single-nucleotide polymorphisms (SNPs) in PRS. Further, as a way to incorporate external functional/annotation knowledge that could identify subsets of SNPs highly enriched for associations, we propose variable thresholds for SNP selection. We applied our methods to GWAS summary-level data of 14 complex diseases. Across all diseases, a simple winner's curse correction uniformly improved the performance of the models, whereas incorporation of functional SNPs was beneficial only for selected diseases. Compared to the standard PRS algorithm, the proposed methods in combination led to a notable gain in efficiency (25-50% increase in the prediction R²) for 5 of 14 diseases. As an example, for GWAS of type 2 diabetes, winner's curse correction improved prediction R² from 2.29% based on the standard PRS to 3.10% (P = 0.0017), and incorporating functional annotation data further improved R² to 3.53% (P = 2×10⁻⁵). Our simulation studies illustrate why differential treatment of certain categories of functional SNPs, even when shown to be highly enriched for GWAS-heritability, does not lead to proportionate improvement in genetic risk-prediction because of non-uniform linkage disequilibrium structure.
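
The threshold-dependent adjustment described in the abstract can be illustrated with a minimal sketch. It assumes the "lasso-type" correction is a soft-threshold that shrinks each marginal estimate toward zero by the smallest effect size that would pass the P-value cutoff; that form, and all function and parameter names below, are illustrative assumptions and are not taken from the abstract.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for z-tests

def two_sided_p(beta_hat, se):
    """Two-sided P-value of a marginal estimate under a normal approximation."""
    z = abs(beta_hat) / se
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z))

def lasso_corrected_weight(beta_hat, se, p_threshold):
    """Soft-threshold beta_hat at the selection cutoff (assumed form)."""
    # lam is the smallest |beta_hat| that reaches significance at p_threshold
    lam = _STD_NORMAL.inv_cdf(1.0 - p_threshold / 2.0) * se
    shrunk = max(abs(beta_hat) - lam, 0.0)
    return shrunk if beta_hat >= 0 else -shrunk

def prs(genotypes, betas, ses, p_threshold, correct=True):
    """Score one individual: sum of weight * allele count over selected SNPs."""
    score = 0.0
    for g, b, s in zip(genotypes, betas, ses):
        if two_sided_p(b, s) > p_threshold:
            continue  # SNP fails the selection threshold
        w = lasso_corrected_weight(b, s, p_threshold) if correct else b
        score += w * g
    return score

# Toy data (hypothetical): allele counts and summary statistics for 3 SNPs
genotypes = [2, 1, 0]
betas = [0.05, 0.001, -0.04]
ses = [0.01, 0.01, 0.01]
standard_score = prs(genotypes, betas, ses, p_threshold=0.05, correct=False)
corrected_score = prs(genotypes, betas, ses, p_threshold=0.05, correct=True)
```

Because the shrinkage amount depends on the cutoff, a stricter threshold removes more of each selected estimate, which is what makes the adjustment threshold-dependent in this sketch.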

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Theoretic investigation of prediction performance and optimal thresholds for SNP selection in 2D PRS.

The theoretic calculation assumes M = 53,163 independent SNPs, of which 5,000 are causal for a binary trait, similar to the simulation studies. The high-prior (HP) SNP set has 5,000 SNPs and the low-prior (LP) SNP set has 48,163 SNPs. Δ is the enrichment fold of HP SNPs in the causal SNP set. (A) The prediction AUC for 1D PRS and 2D PRS. (B) The optimal P-value thresholds for including HP and LP SNPs in 2D PRS. For both plots, the x-coordinate is the discovery sample size, assuming equal numbers of cases and controls.
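
To make the set sizes in this caption concrete, the sketch below splits the 5,000 causal SNPs between the HP and LP sets for a given Δ. It assumes one particular reading of Δ, namely that the causal fraction inside the HP set is Δ times the genome-wide causal fraction; the caption does not spell out the definition, so this reading and the function name are assumptions.

```python
def causal_split(m_total, n_causal, n_hp, delta):
    """Split causal SNPs between HP and LP sets under an assumed meaning of delta.

    Assumption: (causal fraction in HP) = delta * (genome-wide causal fraction).
    Returns (hp_fraction, lp_fraction, causal_in_hp, causal_in_lp).
    """
    base_fraction = n_causal / m_total
    hp_fraction = delta * base_fraction
    causal_in_hp = hp_fraction * n_hp
    if hp_fraction > 1.0 or causal_in_hp > n_causal:
        raise ValueError("delta is too large for these set sizes")
    causal_in_lp = n_causal - causal_in_hp
    lp_fraction = causal_in_lp / (m_total - n_hp)
    return hp_fraction, lp_fraction, causal_in_hp, causal_in_lp

# Set sizes from the caption; delta = 3 is an arbitrary illustrative value
hp_frac, lp_frac, n_hp_causal, n_lp_causal = causal_split(53163, 5000, 5000, 3.0)
```

With Δ = 1 the two fractions coincide and the 2D setting carries no extra information; larger Δ concentrates causal SNPs in the HP set and thins them out in the LP set.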

Fig 2. Simulation results for comparing polygenic risk prediction methods and different high priority SNP sets.

Quantitative traits were simulated conditional on the genotypes of LD-pruned SNPs in a lung cancer GWAS with 10,000 discovery samples and 1,924 validation samples. For each simulation, we used 5,000 causal SNPs and 9,940 high priority (HP) SNPs (either randomly selected or SNPs related to conserved regions). Δ denotes the enrichment fold change of the HP SNPs. On the x-axis, “1D” denotes 1D PRS without winner’s curse correction; “1D-LASSO(MLE)” denotes 1D PRS with lasso-type (MLE) correction; “2D-random” indicates 2D PRS with HP SNP sets randomly selected from the LD-pruned SNPs in the genome; “2D-CR” indicates 2D PRS using SNPs in conserved regions as HP SNPs.

Fig 3. Genetic risk prediction for type-2 diabetes.

PRS models were built based on the summary statistics from a meta-analysis of DIAGRAM consortium and GERA data (17,802 cases and 105,109 controls in total) and validated in an independent set of 1,500 cases and 1,500 controls in GERA. (A) Prediction R² (observational scale) for 1D PRS with or without winner’s curse correction. “NO”: no winner’s curse correction for association coefficients; “Lasso”: regression coefficients were modified by a lasso-type correction; “MLE”: association coefficients were modified by maximizing a likelihood function conditioning on selection. (B) Quantile-quantile plot of −log₁₀(P) for high priority (HP) SNPs vs. low priority (LP) SNPs. SNPs were pruned to have pairwise r² ≤ 0.1. Here, the HP SNPs were eSNPs/meSNPs in adipose tissue or SNPs related to the H3K4me3 mark in a pancreatic islet cell line, with data downloaded from the ROADMAP project. The HP SNPs were strongly enriched for associations in the discovery data. (C) Prediction R² for 2D PRS with lasso-type winner’s curse correction. The SNP set was the same as in (B). The best prediction (R² = 3.53%) was achieved when we included HP SNPs using criterion P ≤ 0.03 and LP SNPs with P ≤ 0.005. (D) The prediction R², the area under the curve (AUC) and the significance levels for testing whether an alternative PRS was better than the standard 1D PRS.
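
The variable thresholding in panel (C) can be sketched as follows, using the reported cutoffs (P ≤ 0.03 for HP SNPs, P ≤ 0.005 for LP SNPs). The sketch assumes that the lasso-type correction is a soft-threshold applied with each SNP's own category cutoff; that form, the SNP identifiers, and the toy summary statistics are all hypothetical and serve only as illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def two_sided_p(beta_hat, se):
    """Two-sided P-value of a marginal estimate under a normal approximation."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(beta_hat) / se))

def soft_threshold(beta_hat, se, p_threshold):
    """Shrink beta_hat by the effect size that just reaches p_threshold (assumed form)."""
    lam = _STD_NORMAL.inv_cdf(1.0 - p_threshold / 2.0) * se
    shrunk = max(abs(beta_hat) - lam, 0.0)
    return shrunk if beta_hat >= 0 else -shrunk

def weights_2d(snps, p_hp, p_lp):
    """Select SNPs with a category-specific cutoff and return corrected weights.

    snps: iterable of (snp_id, beta_hat, se, is_high_priority) tuples.
    """
    weights = {}
    for snp_id, beta_hat, se, is_hp in snps:
        cutoff = p_hp if is_hp else p_lp
        if two_sided_p(beta_hat, se) <= cutoff:
            weights[snp_id] = soft_threshold(beta_hat, se, cutoff)
    return weights

# Hypothetical SNPs: rsA and rsB carry identical evidence but differ in category
toy_snps = [
    ("rsA", 0.030, 0.012, True),    # z = 2.5, HP
    ("rsB", 0.030, 0.012, False),   # z = 2.5, LP
    ("rsC", -0.045, 0.012, False),  # z = -3.75, LP
    ("rsD", 0.010, 0.012, True),    # z is about 0.83, HP
]
selected = weights_2d(toy_snps, p_hp=0.03, p_lp=0.005)
```

In this toy input, rsA is kept and rsB is dropped even though their summary statistics are identical, which is the intended effect of giving HP SNPs a more liberal threshold.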

Fig 4. Comparison of polygenic risk prediction methods for 13 complex diseases.

For all panels, the y-coordinate is the prediction R² on the observational scale. “1D” denotes 1D PRS; “2D, blood eSNPs” denotes 2D PRS using blood eSNPs as the high-prior SNP set. On the x-axis, “NO” denotes PRS without winner’s curse correction; “LASSO” and “MLE” denote lasso-type and MLE winner’s curse correction, respectively. (A) Prediction R² values for six diseases in WTCCC data, estimated based on five-fold cross-validation. (B) Prediction R² values for three GWAS of cancers, estimated based on ten-fold cross-validation. (C) Prediction R² values for four complex diseases estimated based on independent validation samples.
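
As a rough illustration of the y-axis quantity, the sketch below computes a prediction R² as the squared Pearson correlation between PRS values and 0/1 case-control labels in a validation set. Treating the observational-scale R² this way is an assumption made for illustration; the captions do not give the exact estimator, and the function name is hypothetical.

```python
def r2_observed_scale(scores, labels):
    """Squared Pearson correlation between PRS values and 0/1 disease labels."""
    n = len(scores)
    if n != len(labels) or n < 2:
        raise ValueError("scores and labels must have equal length >= 2")
    mean_s = sum(scores) / n
    mean_y = sum(labels) / n
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_y = sum((y - mean_y) ** 2 for y in labels)
    if var_s == 0 or var_y == 0:
        raise ValueError("scores and labels must both vary")
    return cov * cov / (var_s * var_y)

# Toy validation set (hypothetical): four individuals, two cases
toy_r2 = r2_observed_scale([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1])
```

In a k-fold cross-validation setting such as panels (A) and (B), one would compute this quantity on each held-out fold and average across folds.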
