Evaluating empirical bounds on complex disease genetic architecture - PubMed

Nat Genet. 2013 Dec;45(12):1418-27.

doi: 10.1038/ng.2804. Epub 2013 Oct 20.

Vineeta Agarwala et al. Nat Genet. 2013 Dec.

Abstract

The genetic architecture of human diseases governs the success of genetic mapping and the future of personalized medicine. Although numerous studies have queried the genetic basis of common disease, contradictory hypotheses have been advocated about features of genetic architecture (for example, the contribution of rare versus common variants). We developed an integrated simulation framework, calibrated to empirical data, to enable the systematic evaluation of such hypotheses. For type 2 diabetes (T2D), two simple parameters, (i) the target size for causal mutation and (ii) the coupling between selection and phenotypic effect, define a broad space of architectures. Whereas extreme models are excluded by the combination of epidemiology, linkage and genome-wide association studies, many models remain consistent, including those where rare variants explain either little (<25%) or most (>80%) of T2D heritability. Ongoing sequencing and genotyping studies will further constrain the space of possible architectures, but very large samples (for example, >250,000 unselected individuals) will be required to localize most of the heritability underlying T2D and other traits characterized by these models.

Figures

Figure 1. Framework for specification and evaluation of disease models

Figure 2. Patterns of genetic variation: forward simulated vs. empirically observed

a) Number of singleton, rare (MAF<1%), intermediate-frequency (1%<MAF<5%), and common (MAF>5%) synonymous sites per Mb of mutational target in empirical data from the GoT2D Consortium, n=1322 European samples. b) Number of simulated neutrally evolving sites per Mb under different human demographic histories: A = history chosen in this study (µ=2e-8, _Na_=8.1K, _Nb_=2K, _te_=370 generations, _re_=1.3%, _Ne_=228K), B = Gravel et al (µ=2.4e-8, _Na_=7.3K->14.4K, _Nb_=1.8K->1.0K, _te_=920 generations, _re_=0.4%, _Ne_=35.9K), C = Kryukov et al (µ=1.8e-8, _Na_=8.1K, _Nb_=7.9K, _te_=370 generations, _re_=1.3%, _Ne_=900K), D = Schaffner et al (µ=1.5e-8, _Na_=12.5K, _Nb_=7.7K->540, _te_=350 generations, _re_=0.7%, _Ne_=100K), E = Fixed 10K population (_Na_=_Nb_=_Ne_=10K). c) Number of non-synonymous sites (under purifying selection) per Mb in empirical data (dark blue) and in forward-simulated data (light blue) using the chosen demographic history and distribution of selection coefficients (inset). d) Full site frequency spectrum (n = 1322 samples) of simulated synonymous (green) and non-synonymous (light blue) sites compared to those in empirical data (black, dark blue). e) Average pairwise LD (measured by r²) as a function of physical distance between frequency-matched common variants (MAF > 5%) in simulated (green) and empirical (black) data. Inset: linkage structure at a representative 200kb forward-simulated locus, as generated in Haploview.
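The frequency classes in panel (a) can be illustrated with a minimal sketch that bins sites by minor allele count and scales the tallies to sites per Mb. The function names are hypothetical, and two choices are assumptions not stated in the caption: singletons are counted separately from the rare class, and a site exactly on a boundary (MAF of 1% or 5%) goes to the higher class.

```python
from collections import Counter

CLASSES = ("singleton", "rare", "intermediate", "common")

def frequency_class(minor_allele_count, n_samples):
    """Assign a site to a frequency class from its minor allele count.

    Assumes diploid samples, so there are 2 * n_samples chromosomes.
    """
    n_chrom = 2 * n_samples
    if not 0 < minor_allele_count <= n_chrom // 2:
        raise ValueError("minor allele count must be in 1..n_samples")
    if minor_allele_count == 1:
        return "singleton"          # assumption: kept apart from 'rare'
    maf = minor_allele_count / n_chrom
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "intermediate"       # assumption: boundary sites go upward
    return "common"

def sites_per_mb(minor_allele_counts, n_samples, target_bp):
    """Count sites in each class, scaled to sites per Mb of target."""
    counts = Counter(frequency_class(c, n_samples) for c in minor_allele_counts)
    scale = 1e6 / target_bp
    return {name: counts.get(name, 0) * scale for name in CLASSES}
```

With n=1322 samples there are 2644 chromosomes, so a minor allele count of 26 falls just under the 1% cutoff and a count of 27 falls just over it.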

Figure 3. Sensitivity of genetic architecture to parameters of disease models

a) Density of odds ratios (as measured in a sampled cohort of 10K individuals) for common (MAF > 2%) causal variants under disease models with varying target sizes; for all three models shown here there is no coupling to selection (τ = 0). b) Cumulative portion of population genetic variance explained by causal variants as a function of their minor allele frequency under disease models with different degrees of coupling to selection; for all three models shown here the target size is fixed at N = 500 functional loci (T = 1.25Mb). c) Heat maps showing the distribution of population genetic variance in the two-dimensional space of minor allele frequency (x-axis) and effect size (y-axis) of causal variants; models shown are for τ = 1.0, 0.5, and 0 and T = 75kb, 250kb, and 1.25Mb (N = 30, 100, and 500 causal loci).
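The quantity plotted in panel (b) can be sketched with the standard additive-model result that a biallelic variant with allele frequency p and per-allele effect β contributes 2p(1-p)β² to genetic variance under Hardy-Weinberg equilibrium. This is an illustration under that textbook assumption, not the simulation framework's own code; the function names and the example variants are hypothetical.

```python
def additive_variance(maf, beta):
    """Variance contributed by one variant under an additive model (HWE)."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2

def cumulative_variance_fraction(variants, maf_thresholds):
    """Fraction of total genetic variance from variants with MAF <= threshold.

    variants: iterable of (maf, beta) pairs for causal variants.
    maf_thresholds: MAF cutoffs at which to evaluate the cumulative curve.
    """
    variants = list(variants)
    total = sum(additive_variance(p, b) for p, b in variants)
    fractions = []
    for cutoff in maf_thresholds:
        partial = sum(additive_variance(p, b) for p, b in variants if p <= cutoff)
        fractions.append(partial / total)
    return fractions

# Hypothetical example: one rare large-effect and one common small-effect variant.
example = [(0.001, 2.0), (0.2, 0.1)]
curve = cumulative_variance_fraction(example, [0.01, 0.5])
```

A curve that rises steeply at low MAF corresponds to an architecture dominated by rare variants, while a curve that rises mainly above MAF of 5% corresponds to a common-variant architecture.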

Figure 4. Genetic study results for type 2 diabetes under different disease models

a) Space of disease models tested, each varying in target size (vertical axis) and selection coupling (horizontal axis). All models have fixed prevalence (8%) and heritability (45%), matching values observed for T2D. Each model produces results that are either inconsistent (red) or consistent (green) with empirical data for T2D. Inside red models, arrows indicate whether simulated results were too high or too low relative to empirical data (see Supp. Figure 17 for further detail). Dots in GWAS boxes indicate that the model is excluded by an excess of findings in large-scale (N~85K) GWAS (though results in 10K samples are consistent). b–c) Sensitivity of study results under models with N fixed at 300 loci and τ varying (b) or τ fixed at 0.3 and N varying (c). In each box, simulated data are shown (clockwise) for sibling relative risk, best genome-wide LOD score in an affected sibling pair (ASP) study of 4200 ASPs, number of genome-wide significant (p-value < 5×10^-8) loci detected in a GWAS of ~10K samples, and Nagelkerke’s R² value in a polygene score logistic regression in 5K samples, using common variants with a discovery p-value < 0.01 (PT = 0.01). Green zones are centered (vertically) on empirically observed values for T2D, and represent the simulated values deemed consistent with empirical data (see Methods).
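To show how prevalence and heritability relate to sibling relative risk, here is a minimal sketch under a standard liability threshold model. It is not taken from the paper's Methods. It assumes purely additive genetic effects and no shared sibling environment, so the sibling liability correlation is half the heritability. The function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal liability scale

def sibling_relative_risk(prevalence, heritability, steps=2000):
    """Sibling relative risk under a liability threshold model.

    Assumption: sibling liability correlation = heritability / 2.
    steps must be even (Simpson's rule).
    """
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    rho = heritability / 2.0
    cond_sd = math.sqrt(1.0 - rho ** 2)

    def integrand(x):
        # Density of the proband's liability times the probability that
        # the sibling's liability also exceeds the threshold.
        p_sib = 1.0 - STD_NORMAL.cdf((threshold - rho * x) / cond_sd)
        return STD_NORMAL.pdf(x) * p_sib

    # Simpson's rule over [threshold, threshold + 10]; the tail beyond
    # ten standard deviations is negligible.
    upper = threshold + 10.0
    h = (upper - threshold) / steps
    total = integrand(threshold) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(threshold + i * h)
    both_affected = total * h / 3.0
    return both_affected / prevalence ** 2

# Values fixed in the models above: prevalence 8%, heritability 45%.
risk = sibling_relative_risk(0.08, 0.45)
```

With zero heritability the siblings are independent and the function returns a relative risk of 1, which is a quick sanity check on the integration.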

Figure 5. Simulated study results under representative disease models and comparison to T2D empirical data

At left (a) are empirical genetic study results for type 2 diabetes (black outline, see Methods for detail). At right are simulated genetic study results for four different disease models. b) T = 250kb (N = 100 loci), τ = 1 (tight coupling to selection); an ‘extreme’ rare variant model. c) T = 1.25Mb (N = 500 loci), τ = 0.5 (moderate coupling to selection); an intermediate model. d) T = 1.25Mb (N = 500 loci), τ = 0 (no coupling to selection); a ‘common polygenic’ model. e) T = 3.75Mb (N = 1500 loci), τ = 0.1 (weak coupling to selection); a highly polygenic hybrid model. Red crosses indicate inconsistency with empirical data for T2D; green checks indicate consistency with empirical data. ‘GWS loci’ refers to the number of unique loci at which a variant is associated with disease at genome-wide significance levels (p<5e-8).

Figure 6. Prediction of ongoing sequencing and large-scale genotyping studies for type 2 diabetes under different disease models consistent with empirical data

Predictions under the two consistent disease models from Figure 5 are shown here: (a) a model with ‘moderate’ coupling to selection and a target size of _T_=1.25Mb (_N_=500 causal loci), and (b) a ‘weakly coupled’ model with a target size of _T_=3.75Mb (_N_=1500 causal loci). Top charts show cumulative fraction of disease loci discovered by each study design: A = Discovery GWAS in 10K samples, followed by B = Replication genotyping of top signals in 55K independent samples (as in Zeggini et al 2008); C = large-scale GWAS with discovery in an effective sample size of ~30K, followed by genotyping of all independent signals with p<0.005 to yield a total effective sample size of ~85K (as done via the Metabochip in Morris et al 2012); D = high coverage genome sequencing in 3K samples; E = high coverage genome sequencing in 10K samples; F = genotyping in 20K cases and 35K controls of all rare variants seen >= 2× in 5K controls (similar to ExomeChip); G = high coverage genome sequencing in 20K cases and 230K controls (a 250K unselected population cohort with T2D prevalence 8%). Labels above bars indicate predicted number of novel loci (i.e. not found in the previous studies) discovered at each step (Methods). Bottom charts show cumulative fraction of population genetic variance (heritability) explained by loci uncovered in each study. Solid line indicates true variance explained by those loci; dotted line represents fraction estimated using frequencies and odds ratios (estimated in the study) of the most associated single variants at each locus.
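The case and control counts for design G follow directly from the cohort size and prevalence. A small worked check, using a hypothetical helper name:

```python
def case_control_split(cohort_size, prevalence):
    """Expected cases and controls in an unselected population cohort."""
    cases = round(cohort_size * prevalence)
    return cases, cohort_size - cases

# Design G: 250K unselected individuals at 8% T2D prevalence.
cases, controls = case_control_split(250_000, 0.08)
```

This reproduces the 20K cases and 230K controls quoted for design G.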
