Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering

Am J Hum Genet. 2007 Nov;81(5):1084-97.

doi: 10.1086/521987. Epub 2007 Sep 21.

Sharon R Browning et al. Am J Hum Genet. 2007 Nov.

Abstract

Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
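The "masked alleles imputed correctly" figure can be read as a simple proportion: hide some known alleles, impute them, and count how many imputed values match the hidden truth. The sketch below illustrates only that bookkeeping. The function name, the flat list representation, and the toy values are assumptions for illustration and are not taken from the Beagle software or from the study's data.

```python
from typing import Sequence


def masked_allele_accuracy(
    true_alleles: Sequence[int],
    imputed_alleles: Sequence[int],
    masked_positions: Sequence[int],
) -> float:
    """Fraction of masked positions whose imputed allele equals the true allele.

    All three arguments are hypothetical inputs: alleles are coded as integers,
    and masked_positions holds the indices that were hidden before imputation.
    """
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    correct = sum(
        1 for i in masked_positions if true_alleles[i] == imputed_alleles[i]
    )
    return correct / len(masked_positions)


# Toy example: three alleles were masked, two of them were recovered.
truth = [1, 2, 1, 1, 2]
imputed = [1, 2, 2, 1, 2]
accuracy = masked_allele_accuracy(truth, imputed, masked_positions=[1, 2, 4])
```

Only the masked positions enter the denominator, so alleles that were never hidden do not inflate the accuracy.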

Figures

Figure 1.

Example of a directed acyclic graph representing the localized haplotype-cluster model for four markers, with the haplotype counts given in table 1. For each marker, allele 1 is represented by a solid line, and allele 2 by a dashed line. The bold-line edges from the root node to the terminal node represent the haplotype 2112. The node marked by an asterisk (*) is the parent node for edge e_F.
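To make the caption concrete, here is a minimal sketch of how a haplotype traces a path through a leveled directed acyclic graph whose edges are labeled by alleles. The node names and edges are a made-up toy graph, not the graph drawn in Figure 1, and the dictionary representation and the `trace_path` helper are assumptions for illustration rather than Beagle's internal data structure.

```python
from typing import Dict, List, Sequence, Tuple

# Map (parent node, allele) -> child node. One level of edges per marker.
Edges = Dict[Tuple[str, int], str]

# Hypothetical four-marker graph. Nodes with several incoming edges (b2, c1)
# are where distinct haplotype prefixes fall into the same local cluster.
TOY_EDGES: Edges = {
    ("root", 1): "a1", ("root", 2): "a2",
    ("a1", 1): "b1", ("a1", 2): "b2",
    ("a2", 1): "b2", ("a2", 2): "b1",
    ("b1", 1): "c1", ("b1", 2): "c2",
    ("b2", 1): "c1",
    ("c1", 1): "end", ("c1", 2): "end",
    ("c2", 1): "end",
}


def trace_path(edges: Edges, haplotype: Sequence[int], root: str = "root") -> List[str]:
    """Return the nodes visited when following one allele-labeled edge per marker."""
    node = root
    path = [node]
    for marker_index, allele in enumerate(haplotype):
        key = (node, allele)
        if key not in edges:
            raise KeyError(
                f"no edge for allele {allele} at marker {marker_index} from node {node}"
            )
        node = edges[key]
        path.append(node)
    return path


# Haplotype 2112 follows one edge per marker from the root to the terminal node.
path_2112 = trace_path(TOY_EDGES, [2, 1, 1, 2])
```

In this toy graph the haplotypes 2112 and 1211 start on different edges but pass through the same nodes from the second marker onward, which is the sense in which the graph groups haplotypes locally.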

Figure 2.

Error rates for selected haplotype-phasing methods. Three classes of data were considered: low-density data with ∼1 SNP per 10 kb (left column), high-density data with ∼1 SNP per 3 kb (middle column), and Affymetrix 500K data for the WTCCC controls (right column). Within each plot, three sample sizes (n) are shown. Each row of graphs gives a different measure of accuracy (_Y_-axis). The relative error graphs show differences in error rate between each method and a reference method, which is Beagle with _R_=25 samples per individual. All estimates are averaged across the data sets, with error bars showing ±2 SEs.
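The relative-error summary described in the caption can be sketched as a per-data-set difference followed by a mean and a ±2 SE interval. The version below assumes that the differences are paired by data set and that the standard error is the sample standard deviation divided by the square root of the number of data sets. The caption does not state how the SEs were computed, so both choices, along with the function names and the toy numbers, are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_with_two_se(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, 2 * standard error) for at least two values."""
    if len(values) < 2:
        raise ValueError("need at least two data sets to estimate a standard error")
    standard_error = stdev(values) / math.sqrt(len(values))
    return mean(values), 2.0 * standard_error


def relative_error(
    method_errors: Sequence[float], reference_errors: Sequence[float]
) -> Tuple[float, float]:
    """Mean per-data-set difference (method minus reference) and its 2 SE half-width."""
    if len(method_errors) != len(reference_errors):
        raise ValueError("error-rate lists must cover the same data sets")
    differences = [m - r for m, r in zip(method_errors, reference_errors)]
    return mean_with_two_se(differences)


# Toy error rates for three data sets (illustrative values, not from the paper).
center, half_width = relative_error([0.05, 0.06, 0.07], [0.04, 0.04, 0.04])
```

A positive `center` means the method made more errors than the reference on average, and an interval `center ± half_width` that excludes zero suggests the difference is not explained by variation between data sets.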

References

Web Resources

    1. Beagle genetic analysis software package, http://www.stat.auckland.ac.nz/~browning/beagle/beagle.html
    2. WTCCC, http://www.wtccc.org.uk/
