A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals

Brian L Browning et al. Am J Hum Genet. 2009 Feb.

Abstract

We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R^2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.

Figures

Figure 1

Calibration of Posterior Genotype Probabilities. Genotypes for chromosome 1 markers on the Illumina 550K chip, but not the Affymetrix 500K chip, were imputed with a phased reference panel of 60 individuals (HapMap CEU panel) in a sample of 1088 individuals genotyped on the Affymetrix 500K chip. Imputed genotypes are divided into bins according to their posterior genotype probability. The proportion of imputed genotypes that are consistent with the Illumina genotype is given for each bin. The line is the set of points with equal posterior genotype probability and accuracy rate.
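
The binning described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `calibration_table`, the choice of ten equal-width bins, and the input layout (one posterior probability and one correct/incorrect flag per imputed genotype) are assumptions made for illustration.

```python
from collections import defaultdict

def calibration_table(posteriors, correct, n_bins=10):
    """Bin imputed genotypes by posterior probability and report accuracy per bin.

    posteriors: posterior probability of each imputed genotype (0..1)
    correct:    True where the imputed genotype matches the comparison genotype
    n_bins:     number of equal-width bins (an assumption, not from the source)
    Returns a list of (bin_lower, bin_upper, n_genotypes, proportion_correct).
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    for p, ok in zip(posteriors, correct):
        # Clamp so that a posterior of exactly 1.0 falls in the top bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        hits[b] += bool(ok)
    # Only bins that contain at least one genotype are reported.
    return [
        (b / n_bins, (b + 1) / n_bins, counts[b], hits[b] / counts[b])
        for b in sorted(counts)
    ]

table = calibration_table(
    [0.95, 0.92, 0.55, 0.51, 1.0],
    [True, True, True, False, True],
)
```

For well-calibrated probabilities, the proportion correct in each bin should lie close to the posterior probabilities that define the bin, which is what the diagonal line in the figure represents.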

Figure 2

Imputation Accuracy and Reference Panel Sample Size. Genotypes for markers on the Illumina 550K chip, but not the Affymetrix 500K chip, were imputed in a sample of 188 individuals with five different reference panels: 60 phased individuals and 100, 300, 600, and 1200 unphased individuals. For each reference panel, the proportion of imputed markers whose allelic R^2 (see Material and Methods) exceeds each threshold is given. The allelic R^2 for each imputed marker is calculated with the assumption that the Illumina genotypes are the true genotypes.
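
The per-marker summary in this caption can be sketched in Python as follows. The sketch assumes that allelic R^2 is the squared Pearson correlation between the imputed allele dosage and the allele dosage of the genotype taken as true; the exact definition is in the paper's Material and Methods, which is not reproduced here. The function names and the handling of monomorphic markers are hypothetical.

```python
def allelic_r2(imputed_dosage, true_dosage):
    """Squared Pearson correlation between imputed and true allele dosages."""
    n = len(imputed_dosage)
    mx = sum(imputed_dosage) / n
    my = sum(true_dosage) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(imputed_dosage, true_dosage))
    sxx = sum((x - mx) ** 2 for x in imputed_dosage)
    syy = sum((y - my) ** 2 for y in true_dosage)
    if sxx == 0 or syy == 0:
        # Assumption: a marker with no dosage variation is scored as 0.
        return 0.0
    return sxy * sxy / (sxx * syy)

def proportion_exceeding(r2_values, thresholds):
    """For each threshold, the fraction of markers whose R^2 exceeds it."""
    n = len(r2_values)
    return {t: sum(r > t for r in r2_values) / n for t in thresholds}

# Toy data: dosages are counts of one allele (0, 1, or 2) per individual.
perfect = allelic_r2([0, 1, 2, 1], [0, 1, 2, 1])
curve = proportion_exceeding([0.5, 0.8, 0.95, 1.0], [0.7, 0.9])
```

Evaluating `proportion_exceeding` over a grid of thresholds, once per reference panel, gives one curve per panel of the kind the figure compares.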

Figure 3

Median Allelic R^2 and Minor-Allele Frequency. Genotypes for markers on the Illumina 550K chip, but not the Affymetrix 500K chip, were imputed with two different reference panels in a sample of 1088 individuals genotyped on the Affymetrix 500K chip. For each minor-allele frequency, x = 0.01, 0.02, …, 0.5, the median allelic R^2 for imputed markers with minor-allele frequency between x − 0.01 and x + 0.01 is plotted.
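
The sliding-window median described in this caption can be sketched as below. Whether the window endpoints are inclusive is not stated in the caption, so inclusive bounds are an assumption here, as are the function name `windowed_median_r2` and the decision to skip empty windows.

```python
from statistics import median

def windowed_median_r2(maf, r2, half_width=0.01, step=0.01, x_max=0.5):
    """Median R^2 of markers whose MAF lies within half_width of each grid point x.

    maf: minor-allele frequency of each imputed marker
    r2:  allelic R^2 of each imputed marker (same order as maf)
    Returns a list of (x, median_r2) for grid points with at least one marker.
    """
    points = []
    n_steps = round(x_max / step)
    for k in range(1, n_steps + 1):
        x = k * step
        # Assumption: window bounds are inclusive on both sides.
        window = [r for f, r in zip(maf, r2)
                  if x - half_width <= f <= x + half_width]
        if window:
            points.append((round(x, 2), median(window)))
    return points

pts = windowed_median_r2([0.012, 0.018, 0.305], [0.4, 0.6, 0.9])
```

Because adjacent windows overlap, each marker contributes to up to two or three neighbouring grid points, which smooths the plotted curve.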

Figure 4

Allelic Test p values for SNPs Associated with Disease in the WTCCC Study. Allelic test p values were computed from data for approximately 2000 cases and approximately 2500 controls genotyped with the Affymetrix 500K chip. Two reference panels were used for imputing data: 300 unphased individuals genotyped on both the Affymetrix 500K and Illumina 550K chips and 60 phased individuals from the HapMap CEU panel. For each marker of interest, p values were calculated for the original genotype data and for the imputed data obtained from each reference panel. The imputed data for each marker of interest were obtained after masking the genotype data for the marker in the sample. The allelic test was a two-sample t test of the estimated allele dosage in each individual. Left panel: p values for 15 markers (outside the MHC) that have minor-allele frequency >0.10 in controls, that show the strongest association (p < 5 × 10^-7) on an allelic or genotypic test in the WTCCC study, and that have evidence of association in replication studies. Right panel: p values for nine markers with minor-allele frequency between 0.06 and 0.10 in controls that were reported to show moderate or strong association (p < 10^-5) on an allelic or genotypic test in the WTCCC study. One marker (rs6679677) that is associated with two diseases (rheumatoid arthritis and type 1 diabetes) is repeated.
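
The allelic test named in this caption can be illustrated with a short Python sketch. Two points are assumptions rather than statements from the source: the estimated dosage is taken to be the posterior expected allele count, and the t statistic uses a pooled (equal-variance) estimate, since the caption does not say which variant of the two-sample t test was used. The function names are hypothetical, and the sketch stops at the t statistic and degrees of freedom; a p value would come from a Student t distribution in a statistics library.

```python
from math import sqrt

def expected_dosage(p_het, p_hom_alt):
    """Expected allele count from posterior genotype probabilities.

    p_het:     posterior probability of the heterozygous genotype
    p_hom_alt: posterior probability of the genotype with two copies of the allele
    """
    return p_het + 2.0 * p_hom_alt

def two_sample_t(cases, controls):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    n1, n2 = len(cases), len(controls)
    m1 = sum(cases) / n1
    m2 = sum(controls) / n2
    ss1 = sum((x - m1) ** 2 for x in cases)
    ss2 = sum((x - m2) ** 2 for x in controls)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (m1 - m2) / sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t, df

# Toy dosages for three cases and three controls.
t_stat, df = two_sample_t([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

For genotyped (not imputed) data the dosage is simply the observed allele count, so the same test applies to both the original and the imputed data being compared.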

Figure 5

Building the BEAGLE HMM. (A) Building level l+1 from level l. The first step is merging. In this example, states S1, S2, and S4 are merged, and states S3 and S5 are merged. After merging, haplotype clusters are split on the basis of the allele at marker l+1. All haplotypes in states S7 and S10 have allele 1 at this marker, whereas all haplotypes in states S8, S9, and S11 have allele 2. (B) Transition probabilities between the states at the two levels. All transitions with nonzero probabilities are shown. Transitions with the same probability have the same pattern on the arrow shaft.
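
The merge-then-split step in panel (A) can be sketched in Python as below. This is a toy illustration, not BEAGLE's implementation: the caption does not say how the software decides which states to merge, so the merge groups are passed in as a hypothetical input, and the toy data do not reproduce the figure's state numbering. All names (`build_next_level`, `clusters`, `merge_groups`, `alleles_next`) are assumptions.

```python
def build_next_level(clusters, merge_groups, alleles_next):
    """Build level l+1 haplotype clusters from level l by merging, then splitting.

    clusters:     dict mapping a level-l state name to its haplotype indices
    merge_groups: lists of state names to merge (hypothetical input; the
                  merging criterion is not given in the caption)
    alleles_next: alleles_next[h] is the allele of haplotype h at marker l+1
    Returns a list of (allele, haplotype_indices), one entry per level l+1 state.
    """
    next_states = []
    for group in merge_groups:
        # Merge: pool the haplotypes of every state in the group.
        merged = sorted(h for name in group for h in clusters[name])
        # Split: partition the pooled haplotypes by their allele at marker l+1.
        by_allele = {}
        for h in merged:
            by_allele.setdefault(alleles_next[h], []).append(h)
        for allele in sorted(by_allele):
            next_states.append((allele, by_allele[allele]))
    return next_states

level_l = {"S1": [0, 1], "S2": [2], "S3": [3, 4], "S4": [5], "S5": [6]}
groups = [["S1", "S2", "S4"], ["S3", "S5"]]
next_level = build_next_level(level_l, groups, [1, 2, 1, 2, 2, 2, 1])
```

Each resulting state contains haplotypes that share one allele at marker l+1, which matches the caption's description that every haplotype in a given level l+1 state carries the same allele.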

Cited by

GW3, encoding a member of the P450 subfamily, controls grain width by regulating the GA4 content in spikelets of rice (Oryza sativa L.).
Dang X, Xu Q, Li Y, Song S, Hu C, Jing C, Zhang Y, Wang D, Hong D, Jiang J. Theor Appl Genet. 2024 Oct 19;137(11):251. doi: 10.1007/s00122-024-04751-5. PMID: 39425772
Plant sperm cell sequencing for genome phasing and determination of meiotic crossover points.
Zhang W, Tariq A, Jia X, Yan J, Fernie AR, Usadel B, Wen W. Nat Protoc. 2024 Oct 2. doi: 10.1038/s41596-024-01063-2. Online ahead of print. PMID: 39358597 Review.
Genetic and genomic analysis of reproduction traits in holstein cattle using SNP chip data and imputed sequence level genotypes.
Schwarz L, Križanac AM, Schneider H, Falker-Gieske C, Heise J, Liu Z, Bennewitz J, Thaller G, Tetens J. BMC Genomics. 2024 Sep 19;25(1):880. doi: 10.1186/s12864-024-10782-5. PMID: 39300329 Free PMC article.
GWAS Enhances Genomic Prediction Accuracy of Caviar Yield, Caviar Color and Body Weight Traits in Sturgeons Using Whole-Genome Sequencing Data.
Song H, Dong T, Wang W, Yan X, Geng C, Bai S, Hu H. Int J Mol Sci. 2024 Sep 9;25(17):9756. doi: 10.3390/ijms25179756. PMID: 39273703 Free PMC article.
Mitochondrial sequence variants: testing imputation accuracy and their association with dairy cattle milk traits.
Dorji J, Chamberlain AJ, Reich CM, VanderJagt CJ, Nguyen TV, Daetwyler HD, MacLeod IM. Genet Sel Evol. 2024 Sep 12;56(1):62. doi: 10.1186/s12711-024-00931-5. PMID: 39266998 Free PMC article.
