MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes
Yun Li et al. Genet Epidemiol. 2010 Dec.
Abstract
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes thereof as proxies. We have previously implemented a computationally efficient Markov chain framework for genotype imputation and haplotyping in the freely available MaCH software package. The approach describes sampled chromosomes as mosaics of each other and uses available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes, together with useful measures of the quality of these estimates. Our approach is already widely used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here, we use simulations and experimental genotypes to evaluate its accuracy and utility, considering choices of genotyping panels, reference panel configurations, and designs where genotyping is replaced with shotgun sequencing. Importantly, we show that genotype imputation not only facilitates cross-study analyses but also increases the power of genetic association studies. We show that genotype imputation of common variants using HapMap haplotypes as a reference is very accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we illustrate how association analyses of unobserved variants will benefit from ongoing advances such as larger HapMap reference panels and whole-genome shotgun sequencing technologies.
© 2010 Wiley-Liss, Inc.
Figures
Fig. 1
ROC curve comparing two measures of data quality. For imputed SNPs on chromosome 14, where both imputed and actual genotypes were available, we evaluated the ability of two different measures of data quality (the estimated concordance between imputed and true genotypes and the estimated r² between imputed and true genotypes) to discriminate between poorly imputed and well-imputed SNPs. Both estimates of imputation quality are calculated without using the actual observed genotypes.
Fig. 2
Imputation improves quality of LD estimates. For imputed SNPs on chromosome 14, the figure compares estimates of LD obtained by genotyping both SNPs (“Results from Actual Genotyping,” X axis) with estimates of LD obtained by imputing genotypes for both SNPs using markers on the 317K marker chip (“Results from Imputed Data,” Y axis, Top left), obtained by imputing genotypes for one of the SNPs (“Results from Imputed Data,” Y axis, Bottom Left) or obtained from the HapMap CEU panel (“Results from HapMap CEU,” Y axis, Top and Bottom Right).
Fig. 3
Evaluation of imputation accuracy across HGDP panels. For each of 52 populations in the Human Genome Diversity Panel (HGDP) a set of 872 SNPs distributed evenly across 32 regions, each ~330 kb in length, was used to impute 992 other SNPs. The 992 imputed SNPs were located near the middle of each imputed region. Imputation was done using either the HapMap YRI, CEU, CHB+JPT, or a combination of three HapMap panels (first four panels, best panel is shaded in gray) or using the remaining HGDP samples as a reference. In each case, the proportion of correctly imputed alleles is tabulated. The figure is based on a re-analysis of data of Conrad et al. [2006].
Fig. 4
Evaluation of imputation accuracy across HGDP panels. Genotypes for a set of 992 SNPs were imputed in the HGDP and then compared with actual genotypes. For each pair of true and imputed genotypes an r² coefficient was calculated and averaged for each population. The best set of HapMap reference individuals for each population is shaded. The coverage obtained by using the best available tag SNP (rather than imputed genotypes) is overlaid in pink. See Figure 3 legend for further details.
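The two accuracy summaries used in the Figure 3 and Figure 4 legends (the proportion of correctly imputed alleles, and the r² between true and imputed genotypes) can be illustrated with a minimal sketch. This is not MaCH code. The function names, the 0/1/2 allele-count coding, and the toy data are assumptions made for illustration. Both functions are empirical measures that need the true genotypes, unlike the estimated quality measures in Figure 1, which are computed without them.

```python
import numpy as np


def allelic_concordance(true_genotypes, imputed_genotypes):
    """Proportion of correctly imputed alleles for one SNP.

    Assumes genotypes are coded as 0, 1, or 2 copies of a chosen allele
    (an illustrative convention, not taken from the paper).
    """
    true = np.asarray(true_genotypes)
    imputed = np.asarray(imputed_genotypes)
    # Each genotype carries two alleles; the absolute difference in allele
    # counts is the number of mismatched alleles for that individual.
    mismatched = np.abs(true - imputed).sum()
    return 1.0 - mismatched / (2 * true.size)


def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele counts."""
    true = np.asarray(true_genotypes, dtype=float)
    imputed = np.asarray(imputed_dosages, dtype=float)
    # The correlation is undefined when either vector has no variation.
    if true.std() == 0 or imputed.std() == 0:
        return float("nan")
    r = np.corrcoef(true, imputed)[0, 1]
    return r * r


# Toy data (hypothetical): six individuals, one imputation error.
true_g = [0, 1, 2, 1, 0, 2]
imputed_g = [0, 1, 2, 2, 0, 2]
print(allelic_concordance(true_g, imputed_g))
print(squared_correlation(true_g, imputed_g))
```

Averaging `squared_correlation` over the SNPs typed in a population would give a per-population summary of the kind the Figure 4 legend describes. One reason to look at r² as well as concordance is that, for a rare allele, always predicting the common homozygote can score high concordance while r² stays low or undefined.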
Grants and funding
- R01 CA082659/CA/NCI NIH HHS/United States
- R01 MH084698/MH/NIMH NIH HHS/United States
- U01 HL084729/HL/NHLBI NIH HHS/United States
- R01 HG002651/HG/NHGRI NIH HHS/United States
- RC2 HG005552/HG/NHGRI NIH HHS/United States
- K99 HL094535/HL/NHLBI NIH HHS/United States
- RC2 HG005552-02/HG/NHGRI NIH HHS/United States
- R01 HG002651-05/HG/NHGRI NIH HHS/United States
- U01 HG005214-02/HG/NHGRI NIH HHS/United States
- K99 HL094535-02/HL/NHLBI NIH HHS/United States
- U01 HG005214/HG/NHGRI NIH HHS/United States
- U01 HL084729-03/HL/NHLBI NIH HHS/United States
- R01 MH084698-03/MH/NIMH NIH HHS/United States