A comparison of approaches to account for uncertainty in analysis of imputed genotypes
Comparative Study
Jin Zheng et al. Genet Epidemiol. 2011 Feb.
Abstract
The availability of extensively genotyped reference samples, such as "The HapMap" and 1,000 Genomes Project reference panels, together with advances in statistical methodology, has allowed for the imputation of genotypes at single nucleotide polymorphism (SNP) markers that are untyped in a cohort or case-control study. These imputation procedures facilitate the interpretation and meta-analyses of genome-wide association studies. A natural question when implementing these procedures concerns how best to take into account uncertainty in imputed genotypes. Here we compare the performance of the following three strategies: least-squares regression on the "best-guess" imputed genotype; regression on the expected genotype score or "dosage"; and mixture regression models that more fully incorporate posterior probabilities of genotypes at untyped SNPs. Using simulation, we considered a range of sample sizes, minor allele frequencies, and imputation accuracies to compare the performance of the different methods under various genetic models. The mixture models performed the best in the setting of a large genetic effect and low imputation accuracies. However, for most realistic settings, we find that regressing the phenotype on the estimated allelic or genotypic dosage provides an attractive compromise between accuracy and computational tractability.
© 2011 Wiley-Liss, Inc.
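The first two strategies named in the abstract can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' simulation design: the Dirichlet-style posteriors, the effect size `beta`, the sample size, and all function names are hypothetical, and the mixture regression model is not implemented here.

```python
import random


def ols_slope(x, y):
    """Slope from simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def simulate(n=5000, beta=0.5, concentration=0.3, seed=1):
    """Toy data: genotype posteriors and an additive phenotype.

    Assumption (not from the paper): each posterior is a Dirichlet draw,
    and the true genotype is then sampled from that posterior, so the
    posteriors are calibrated by construction.
    """
    rng = random.Random(seed)
    posteriors, phenotypes = [], []
    for _ in range(n):
        raw = [rng.gammavariate(concentration, 1.0) for _ in range(3)]
        total = sum(raw)
        post = [r / total for r in raw]  # P(G=0), P(G=1), P(G=2)
        g = rng.choices([0, 1, 2], weights=post)[0]  # true allele count
        posteriors.append(post)
        phenotypes.append(beta * g + rng.gauss(0.0, 1.0))
    return posteriors, phenotypes


def best_guess(post):
    """Most probable genotype (allele count with the largest posterior)."""
    return max(range(3), key=lambda g: post[g])


def dosage(post):
    """Expected allele count under the posterior."""
    return post[1] + 2.0 * post[2]


posteriors, phenotypes = simulate()
slope_best = ols_slope([best_guess(p) for p in posteriors], phenotypes)
slope_dosage = ols_slope([dosage(p) for p in posteriors], phenotypes)
print(f"best-guess slope: {slope_best:.3f}")
print(f"dosage slope:     {slope_dosage:.3f}")
```

Both regressions use the same phenotype vector and differ only in how the posterior is collapsed into a single predictor, which is the contrast the abstract draws between the "best-guess" and "dosage" approaches.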
Figures
Fig. A1
Summary of effect sizes for phenotype simulations. Values for the effect size (a) are plotted against allele frequencies of the recessive allele (allele “A” in Table I). Values of d are given as in Tables II and III, i.e. 0 (Additive), (1/2)a (Partially dominant), a (Dominant), and (6/5)a (Overdominant).
Fig. 1
Example of posterior probability summaries. Here we present a didactic illustration of the three summaries of the full posterior probabilities for imputed genotypes. From the set of Reference Haplotypes, the missing genotype (denoted with two ? symbols) in the sample genotypes can be inferred. Based on the reference, the first sample haplotype would carry a C at the missing position, since all three similar haplotypes in the reference set have a C here. For the second sample haplotype, three-fourths of the similar haplotypes in the reference set carry a C and one carries a T at that position. The “expected” dosage of C is therefore 1 + 0.75 = 1.75, and the only “possible” genotypes, based entirely on the reference, are C/C and C/T, with the probabilities given.
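The arithmetic in this caption can be reproduced with a short sketch. It assumes the two sample haplotypes are imputed independently, with P(C) = 1.0 on the first and P(C) = 0.75 on the second as described above; the function names are illustrative only.

```python
GENOTYPE_LABELS = ["T/T", "C/T", "C/C"]  # indexed by number of C alleles


def genotype_posteriors(p_c_hap1, p_c_hap2):
    """Probabilities of carrying 0, 1, or 2 copies of C.

    Assumes the two haplotypes are independent given the reference panel.
    """
    p0 = (1.0 - p_c_hap1) * (1.0 - p_c_hap2)
    p2 = p_c_hap1 * p_c_hap2
    p1 = 1.0 - p0 - p2
    return [p0, p1, p2]


def expected_dosage(post):
    """Expected number of C alleles under the posterior."""
    return post[1] + 2.0 * post[2]


def best_guess_label(post):
    """Label of the most probable genotype."""
    return GENOTYPE_LABELS[max(range(3), key=lambda g: post[g])]


post = genotype_posteriors(1.0, 0.75)
print(dict(zip(GENOTYPE_LABELS, post)))  # full posterior: C/T 0.25, C/C 0.75
print(expected_dosage(post))             # 1.75
print(best_guess_label(post))            # C/C
```

The three printed values correspond to the three summaries in the caption: the full posterior, the expected dosage, and the best-guess genotype.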
Fig. 2
Power vs. accuracy and allele frequency for large sample size and small effects. For each summary and the true genotypes, both an additive (solid line) and dominant (dotted line) model were analyzed. (A) and (C) are based on data simulated with an additive effect; (B) and (D) are based on data simulated under a model of complete dominance. Power was computed at a fixed type-I error rate (α) of 5 × 10⁻⁵. The sample size was 1,000. TOP: Power is plotted against R², a measure of imputation accuracy. BOTTOM: Power is plotted against allele frequency. (A) Power vs. R² with an additive effect; (B) power vs. R² under complete dominance; (C) power vs. frequency of minor allele with an additive effect; and (D) power vs. frequency of dominant allele under complete dominance.
Fig. 3
Power vs. accuracy and allele frequency for small sample size and large effects. Power was computed at a fixed type-I error rate (α) of 5 × 10⁻⁵. The sample size was 50. For each summary and the true genotypes, both an additive (solid line) and dominant (dotted line) model were analyzed. (A) and (C) are based on data simulated with an additive effect; (B) and (D) are based on data simulated under a model of complete dominance. TOP: Power is plotted against R², a measure of imputation accuracy. BOTTOM: Power is plotted against allele frequency. (A) Power vs. R² with an additive effect; (B) power vs. R² under complete dominance; (C) power vs. frequency of minor allele with an additive effect; and (D) power vs. frequency of dominant allele under complete dominance.