Haplotype phasing: existing methods and new developments

Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. & Schork, N. J. The importance of phase information for human genomics. Nature Rev. Genet. 12, 215–223 (2011).
Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genet. 39, 906–913 (2007).
Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).
Kang, H., Qin, Z. S., Niu, T. & Liu, J. S. Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. Am. J. Hum. Genet. 74, 495–510 (2004).
Browning, B. L. & Yu, Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 85, 847–861 (2009).
Yu, Z., Garner, C., Ziogas, A., Anton-Culver, H. & Schaid, D. J. Genotype determination for polymorphisms in linkage disequilibrium. BMC Bioinformatics 10, 63 (2009).
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Le, S. Q. & Durbin, R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 21, 952–960 (2011).
Li, Y., Sidore, C., Kang, H. M., Boehnke, M. & Abecasis, G. R. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 21, 940–951 (2011).
Scheet, P. & Stephens, M. Linkage disequilibrium-based quality control for large-scale genetic studies. PLoS Genet. 4, e1000147 (2008).
Tishkoff, S. A. et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380–1387 (1996).
Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genet. 40, 1068–1075 (2008). This paper describes the use of an IBD-based phasing method called 'long-range phasing' in a large sample from the Icelandic population.
Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).
Tao, H., Cox, D. R. & Frazer, K. A. Allele-specific KRT1 expression is a complex trait. PLoS Genet. 2, e93 (2006).
Gusfield, D. Haplotype inference by pure parsimony. Lect. Notes Comp. Sci. 2676, 144–155 (2003).
Wang, L. & Xu, Y. Haplotype inference by maximum parsimony. Bioinformatics 19, 1773–1780 (2003).
Weale, M. E. A survey of current software for haplotype phase inference. Hum. Genomics 1, 141–144 (2004).
Clark, A. G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7, 111–122 (1990). This paper describes the first computational phasing method for more than two markers.
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38 (1977).
Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927 (1995). This was one of the earliest papers describing the use of the EM algorithm for statistical phasing of unrelated individuals.
Hawley, M. E. & Kidd, K. K. HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86, 409–411 (1995).
Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56, 799–810 (1995).
Qin, Z. S., Niu, T. & Liu, J. S. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 71, 1242–1247 (2002).
Stephens, M., Smith, N. J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001).
Excoffier, L. & Lischer, H. E. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10, 564–567 (2010).
Drysdale, C. M. et al. Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl Acad. Sci. USA 97, 10483–10488 (2000).
Rosenberg, N. et al. The frequent 5,10-methylenetetrahydrofolate reductase C677T polymorphism is associated with a common haplotype in whites, Japanese, and Africans. Am. J. Hum. Genet. 70, 758–762 (2002).
McVean, G. A. & Cardin, N. J. Approximating the coalescent with recombination. Phil. Trans. R. Soc. B 360, 1387–1393 (2005).
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003). This paper describes the approximate coalescent model used by the MACH and IMPUTE statistical phasing methods. The model is similar to that used by PHASE.
Stephens, M. & Donnelly, P. Inference in molecular population genetics. J. R. Statist. Soc. B 62, 605–655 (2000).
Fearnhead, P. & Donnelly, P. Estimating recombination rates from population genetic data. Genetics 159, 1299–1318 (2001).
Stephens, M. & Scheet, P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76, 449–462 (2005). This paper describes PHASE, which has been considered as a gold standard for computational phasing accuracy, although it is too computationally intensive to be applied to large data sets.
Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006). This paper describes fastPHASE, which was one of the first computational phasing methods suitable for genome-wide SNP data.
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
Celeux, G. & Diebolt, J. The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comp. Statist. Quart. 2, 73–82 (1985).
Tregouet, D. A., Escolano, S., Tiret, L., Mallet, A. & Golmard, J. L. A new algorithm for haplotype-based association analysis: the stochastic-EM algorithm. Ann. Hum. Genet. 68, 165–177 (2004).
Marchini, J. et al. A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 78, 437–450 (2006).
Delaneau, O., Coulonges, C. & Zagury, J. F. Shape-IT: new rapid and accurate algorithm for haplotype inference. BMC Bioinformatics 9, 540 (2008).
Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007). This paper describes the BEAGLE method for statistical phasing in samples of unrelated individuals.
Auton, A. et al. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 19, 795–803 (2009).
Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007).
The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Kenny, E. E. et al. Systematic haplotype analysis resolves a complex plasma plant sterol locus on the Micronesian Island of Kosrae. Proc. Natl Acad. Sci. USA 106, 13886–13891 (2009).
Browning, S. R. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum. Genet. 124, 439–450 (2008).
Tregouet, D. A. et al. Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nature Genet. 41, 283–285 (2009).
Browning, B. L. & Browning, S. R. A fast, powerful method for detecting identity by descent. Am. J. Hum. Genet. 88, 173–182 (2011).
Browning, S. R. & Browning, B. L. High-resolution detection of identity by descent in unrelated individuals. Am. J. Hum. Genet. 86, 526–539 (2010).
Hickey, J. M. et al. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet. Sel. Evol. 43, 12 (2011).
Daetwyler, H. D., Wiggans, G. R., Hayes, B. J., Woolliams, J. A. & Goddard, M. E. Imputation of missing genotypes from sparse to high density using long-range phasing. Genetics 24 Jun 2011 (doi:10.1534/genetics.111.128082).
Kong, A. et al. Parental origin of sequence variants associated with complex diseases. Nature 462, 868–874 (2009).
Holm, H. et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nature Genet. 43, 316–320 (2011).
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P. & Lander, E. S. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347–1363 (1996).
Schaid, D. J., McDonnell, S. K., Wang, L., Cunningham, J. M. & Thibodeau, S. N. Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Am. J. Hum. Genet. 71, 992–995 (2002).
Roach, J. C. et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636–639 (2010).
Rohde, K. & Fuerst, R. Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum. Mutat. 17, 289–295 (2001).
Zhang, K., Sun, F. & Zhao, H. HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 21, 90–103 (2005).
Abecasis, G. R. & Wigginton, J. E. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am. J. Hum. Genet. 77, 754–767 (2005).
Zhang, F. & Deng, H. W. Confounding from cryptic relatedness in haplotype-based association studies. Genetica 138, 945–950 (2010).
Nielsen, R., Paul, J. S., Albrechtsen, A. & Song, Y. S. Genotype and SNP calling from next-generation sequencing data. Nature Rev. Genet. 12, 443–451 (2011).
Andres, A. M. et al. Understanding the accuracy of statistical haplotype inference with sequence data of known phase. Genet. Epidemiol. 31, 659–671 (2007).
Huang, L. et al. Genotype-imputation accuracy across worldwide human populations. Am. J. Hum. Genet. 84, 235–250 (2009).
Jostins, L., Morley, K. I. & Barrett, J. C. Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets. Eur. J. Hum. Genet. 19, 662–666 (2011).
Geraci, F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26, 2217–2225 (2010).
He, D., Choi, A., Pipatsrisawat, K., Darwiche, A. & Eskin, E. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, i183–i190 (2010).
Long, Q., MacArthur, D., Ning, Z. & Tyler-Smith, C. HI: haplotype improver using paired-end short reads. Bioinformatics 25, 2436–2437 (2009).
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Kitzman, J. O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nature Biotech. 29, 59–63 (2011). This paper describes the use of an experimental phasing method that was applied to the sequence of an individual and the population-genetic inferences that were made using the phased haplotypes.
Suk, E.-K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 3 Aug 2011 (doi:10.1101/gr.125047.111).
Duitama, J., Huebsch, T., McEwen, G., Suk, E.-K. & Hoehe, M. R. in Proc. 1st ACM Int. Conf. Bioinf. Comp. Biol. 160–169 (Association for Computing Machinery, Niagara Falls, New York, 2010).
Bansal, V. & Bafna, V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153–i159 (2008).
Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nature Biotech. 29, 51–57 (2011).
Ma, L. et al. Direct determination of molecular haplotypes by chromosome microdissection. Nature Methods 7, 299–301 (2010).
Hert, D. G., Fredlake, C. P. & Barron, A. E. Advantages and limitations of next-generation sequencing technologies: a comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29, 4618–4626 (2008).
Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).
Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Su, S. Y. et al. Inferring combined CNV/SNP haplotypes from genotype data. Bioinformatics 26, 1437–1445 (2010).
Li, Z. et al. A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn). Cell Res. 19, 519–523 (2009).
Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature Rev. Genet. 11, 415–425 (2010).
Yang, H., Chen, X. & Wong, W. H. Completely phased genome sequencing through chromosome sorting. Proc. Natl Acad. Sci. USA 108, 12–17 (2011).
The UK IBD Genetics Consortium & The Wellcome Trust Case Control Consortium 2. Genome-wide association study of ulcerative colitis identifies three new susceptibility loci, including the HNF4A region. Nature Genet. 41, 1330–1334 (2009).
The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).