Genotype imputation with thousands of genomes

Bryan Howie et al. G3 (Bethesda). 2011 Nov.

Abstract

Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package.
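The idea of choosing a custom reference panel for each study haplotype can be sketched in a few lines. This is a minimal illustration, not the IMPUTE2 implementation: it assumes haplotypes are coded as 0/1 lists over the typed SNPs in one genomic window, and it assumes "local sequence similarity" is measured by Hamming distance. The function names `hamming` and `select_custom_panel` and the toy data are hypothetical; `k_hap` plays the role of the khap parameter described in the figure captions.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(x != y for x, y in zip(a, b))


def select_custom_panel(
    study_hap: Sequence[int],
    reference_haps: Sequence[Sequence[int]],
    k_hap: int,
) -> List[int]:
    """Return the indices of the k_hap reference haplotypes closest to study_hap.

    Assumption: similarity is Hamming distance within the window;
    ties are broken by reference index so the result is deterministic.
    """
    order = sorted(
        range(len(reference_haps)),
        key=lambda i: (hamming(study_hap, reference_haps[i]), i),
    )
    return order[:k_hap]


# Toy window of five typed SNPs and four reference haplotypes (made-up data).
reference = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0],
]
study = [0, 0, 1, 0, 0]

panel = select_custom_panel(study, reference, k_hap=2)
print(panel)  # indices of the two closest reference haplotypes
```

Because only `k_hap` haplotypes enter the downstream model for each study haplotype, the cost of that step depends on `k_hap` rather than on the total size of the reference set, which is the property the abstract relies on when it recommends using all available reference haplotypes.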

Keywords: GWAS; haplotype; human; linkage disequilibrium; reference panel.

Figures

Figure 1

Imputation accuracy at low-frequency SNPs in HapMap 3 cross-validations in ASW and TSI, as a function of reference panel composition and khap value. These plots show the imputation accuracy of IMPUTE2 in (A) the ASW panel and (B) the TSI panel. The accuracy of each experiment is plotted on the y-axis as the mean R² across all SNPs with MAF < 5% in the cross-validation panel (identified by the gray box in each plot). The x-axis shows the khap parameter, which scales linearly with the computational burden of imputation updates in IMPUTE2. Each curve represents a different reference panel, with panels added cumulatively in the order shown in the legends, reading from bottom to top. Similar plots for other HapMap 3 target panels can be found in File S1.
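The accuracy measure plotted in these figures can be sketched as follows. This is an illustrative calculation under stated assumptions, not the authors' evaluation code: it assumes R² is the squared Pearson correlation between the masked true genotypes (coded 0/1/2) and the imputed allele dosages at each SNP, and that SNPs with no variation in either vector are skipped. The function names and toy data are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Sequence


def maf(genotypes: Sequence[float]) -> float:
    """Minor allele frequency from per-individual allele counts (0, 1, or 2)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def r_squared(truth: Sequence[float], imputed: Sequence[float]) -> Optional[float]:
    """Squared Pearson correlation; None if either vector has zero variance."""
    mt, mi = mean(truth), mean(imputed)
    cov = sum((t - mt) * (d - mi) for t, d in zip(truth, imputed))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_i = sum((d - mi) ** 2 for d in imputed)
    if var_t == 0 or var_i == 0:
        return None
    return cov * cov / (var_t * var_i)


def mean_r2_low_freq(
    true_by_snp: Sequence[Sequence[float]],
    imputed_by_snp: Sequence[Sequence[float]],
    maf_cutoff: float = 0.05,
) -> Optional[float]:
    """Mean R^2 over SNPs whose MAF in the validation panel is below maf_cutoff."""
    values: List[float] = []
    for truth, imputed in zip(true_by_snp, imputed_by_snp):
        if maf(truth) < maf_cutoff:
            r2 = r_squared(truth, imputed)
            if r2 is not None:
                values.append(r2)
    return mean(values) if values else None


# Toy data: one rare SNP (MAF = 0.025) and one common SNP (MAF = 0.5).
rare_truth = [1] + [0] * 19
rare_imputed = [0.9] + [0.0] * 19      # dosages track the truth exactly
common_truth = [0, 1, 2, 1] * 5
common_imputed = [1.0, 1.0, 0.5, 1.5] * 5

result = mean_r2_low_freq(
    [rare_truth, common_truth], [rare_imputed, common_imputed]
)
print(result)  # only the rare SNP contributes to the mean
```

Filtering on the MAF observed in the cross-validation panel, rather than in the reference panel, mirrors the wording of the caption.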

Figure 2

Imputation accuracy at low-frequency SNPs in HapMap 3 cross-validations, as a function of target panel, reference panel composition, khap value, and imputation method. These plots show the imputation accuracy of IMPUTE2 and Beagle in various cross-validation experiments. The accuracy of each experiment is plotted on the y-axis as the mean R² across all SNPs with MAF < 5% in the cross-validation panel (identified by the gray box in each plot). The x-axis shows the khap parameter, which scales linearly with the computational burden of imputation updates in IMPUTE2. The solid black curves show how R² varies with khap when using IMPUTE2 with a reference panel containing the full set of 2020 HapMap 3 haplotypes; the dashed black lines show the accuracy of Beagle with this reference panel. IMPUTE2 was also applied to subpanels of the full HapMap 3 panel, with results shown as orange curves. Similar plots for other observed SNP sets and imputed SNP MAFs can be found in File S3.

Figure 3

Imputation accuracy in Gambian validation set as a function of reference panel composition and minor allele frequency. These plots show the accuracy obtained when imputing masked SNPs in 1216 Gambian individuals from the MalariaGEN dataset using IMPUTE2 with khap = 500. Each reference panel is represented by a different color, and the results are shown for (A) all SNPs and (B) SNPs with MAF < 10% in the Gambian validation set. The results are binned by MAF, with 5% bins in (A) and 1% bins in (B). Each point on a curve is located in the middle of the corresponding MAF bin. The following reference panel codes are used in the legend: GMB (Gambia, 200 haplotypes); GHN (Ghana, 200 haplotypes); and HM3 (HapMap 3, 2022 haplotypes).
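The MAF binning described in this caption can be sketched as below. This is an assumption-laden illustration rather than the plotting code used for the figure: it assumes each SNP contributes one accuracy value, that bins have equal width starting at MAF = 0, and that the plotted x-coordinate is the bin midpoint. The function name `bin_accuracy_by_maf` and the sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def bin_accuracy_by_maf(
    mafs: Sequence[float],
    accuracies: Sequence[float],
    width: float = 0.05,
    max_maf: float = 0.5,
) -> List[Tuple[int, float, float]]:
    """Group per-SNP accuracies into MAF bins of the given width.

    Returns (bin_index, bin_midpoint, mean_accuracy) for each non-empty bin,
    ordered by bin index. A MAF equal to max_maf is placed in the last bin.
    """
    n_bins = round(max_maf / width)
    grouped: Dict[int, List[float]] = {}
    for m, acc in zip(mafs, accuracies):
        index = min(int(m / width), n_bins - 1)
        grouped.setdefault(index, []).append(acc)
    return [
        (index, (index + 0.5) * width, mean(values))
        for index, values in sorted(grouped.items())
    ]


# Made-up per-SNP values, binned at 5% as in panel (A).
snp_mafs = [0.02, 0.04, 0.07, 0.48]
snp_accuracy = [0.5, 0.7, 0.8, 0.9]

curve = bin_accuracy_by_maf(snp_mafs, snp_accuracy, width=0.05)
for index, midpoint, value in curve:
    print(index, round(midpoint, 3), round(value, 3))
```

Calling the same function with `width=0.01` and `max_maf=0.10` would give the finer 1% bins used for panel (B).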

Figure 4

Comparison of imputation accuracy between IMPUTE2 and Beagle in Gambian validation set. This plot shows the accuracy obtained when imputing masked SNPs in 1216 Gambian individuals from the MalariaGEN dataset using either IMPUTE2 with khap = 500 (solid lines) or Beagle on default settings (dashed lines). Imputation was performed with a reference panel of Gambian haplotypes (blue) and a reference panel of Gambian, Ghanaian, and HapMap 3 African ancestry haplotypes (gray). The results are grouped into 5% MAF bins, and each point on a curve is located in the middle of the corresponding MAF bin. The following reference panel codes are used in the legend: GMB (Gambia, 200 haplotypes); GHN (Ghana, 200 haplotypes); and HM3.afr (HapMap 3 African ancestry, 822 haplotypes).
