Haplotype phasing: existing methods and new developments
Tewhey, R., Bansal, V., Torkamani, A., Topol, E. J. & Schork, N. J. The importance of phase information for human genomics. Nature Rev. Genet. 12, 215–223 (2011).
Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genet. 39, 906–913 (2007).
Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009).
Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010).
Kang, H., Qin, Z. S., Niu, T. & Liu, J. S. Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. Am. J. Hum. Genet. 74, 495–510 (2004).
Browning, B. L. & Yu, Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am. J. Hum. Genet. 85, 847–861 (2009).
Yu, Z., Garner, C., Ziogas, A., Anton-Culver, H. & Schaid, D. J. Genotype determination for polymorphisms in linkage disequilibrium. BMC Bioinformatics 10, 63 (2009).
The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Le, S. Q. & Durbin, R. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 21, 952–960 (2011).
Li, Y., Sidore, C., Kang, H. M., Boehnke, M. & Abecasis, G. R. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 21, 940–951 (2011).
Scheet, P. & Stephens, M. Linkage disequilibrium-based quality control for large-scale genetic studies. PLoS Genet. 4, e1000147 (2008).
Tishkoff, S. A. et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380–1387 (1996).
Kong, A. et al. Detection of sharing by descent, long-range phasing and haplotype imputation. Nature Genet. 40, 1068–1075 (2008). This paper describes the use of an IBD-based phasing method called 'long-range phasing' in a large sample from the Icelandic population.
Sabeti, P. C. et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 419, 832–837 (2002).
Clark, A. G. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7, 111–122 (1990). This paper describes the first computational phasing method for more than two markers.
Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38 (1977).
Excoffier, L. & Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921–927 (1995). This was one of the earliest papers describing the use of the EM algorithm for statistical phasing of unrelated individuals.
Hawley, M. E. & Kidd, K. K. HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86, 409–411 (1995).
Long, J. C., Williams, R. C. & Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56, 799–810 (1995).
Qin, Z. S., Niu, T. & Liu, J. S. Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 71, 1242–1247 (2002).
Stephens, M., Smith, N. J. & Donnelly, P. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001).
Excoffier, L. & Lischer, H. E. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol. Ecol. Resour. 10, 564–567 (2010).
Drysdale, C. M. et al. Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl Acad. Sci. USA 97, 10483–10488 (2000).
Rosenberg, N. et al. The frequent 5,10-methylenetetrahydrofolate reductase C677T polymorphism is associated with a common haplotype in whites, Japanese, and Africans. Am. J. Hum. Genet. 70, 758–762 (2002).
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003). This paper describes the approximate coalescent model used by the MACH and IMPUTE statistical phasing methods. The model is similar to that used by PHASE.
Stephens, M. & Donnelly, P. Inference in molecular population genetics. J. R. Statist. Soc. B 62, 605–655 (2000).
Fearnhead, P. & Donnelly, P. Estimating recombination rates from population genetic data. Genetics 159, 1299–1318 (2001).
Stephens, M. & Scheet, P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76, 449–462 (2005). This paper describes PHASE, which has been considered as a gold standard for computational phasing accuracy, although it is too computationally intensive to be applied to large data sets.
Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006). This paper describes fastPHASE, which was one of the first computational phasing methods suitable for genome-wide SNP data.
Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).
Celeux, G. & Diebolt, J. The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comp. Statist. Quart. 2, 73–82 (1985).
Tregouet, D. A., Escolano, S., Tiret, L., Mallet, A. & Golmard, J. L. A new algorithm for haplotype-based association analysis: the stochastic-EM algorithm. Ann. Hum. Genet. 68, 165–177 (2004).
Delaneau, O., Coulonges, C. & Zagury, J. F. Shape-IT: new rapid and accurate algorithm for haplotype inference. BMC Bioinformatics 9, 540 (2008).
Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007). This paper describes the BEAGLE method for statistical phasing in samples of unrelated individuals.
Auton, A. et al. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res. 19, 795–803 (2009).
Sabeti, P. C. et al. Genome-wide detection and characterization of positive selection in human populations. Nature 449, 913–918 (2007).
Frazer, K. A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007).
The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
Kenny, E. E. et al. Systematic haplotype analysis resolves a complex plasma plant sterol locus on the Micronesian Island of Kosrae. Proc. Natl Acad. Sci. USA 106, 13886–13891 (2009).
Browning, S. R. Missing data imputation and haplotype phase inference for genome-wide association studies. Hum. Genet. 124, 439–450 (2008).
Tregouet, D. A. et al. Genome-wide haplotype association study identifies the SLC22A3-LPAL2-LPA gene cluster as a risk locus for coronary artery disease. Nature Genet. 41, 283–285 (2009).
Browning, S. R. & Browning, B. L. High-resolution detection of identity by descent in unrelated individuals. Am. J. Hum. Genet. 86, 526–539 (2010).
Hickey, J. M. et al. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet. Sel. Evol. 43, 12 (2011).
Daetwyler, H. D., Wiggans, G. R., Hayes, B. J., Woolliams, J. A. & Goddard, M. E. Imputation of missing genotypes from sparse to high density using long-range phasing. Genetics 24 Jun 2011 (doi:10.1534/genetics.111.128082).
Holm, H. et al. A rare variant in MYH6 is associated with high risk of sick sinus syndrome. Nature Genet. 43, 316–320 (2011).
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P. & Lander, E. S. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347–1363 (1996).
Schaid, D. J., McDonnell, S. K., Wang, L., Cunningham, J. M. & Thibodeau, S. N. Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Am. J. Hum. Genet. 71, 992–995 (2002).
Rohde, K. & Fuerst, R. Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum. Mutat. 17, 289–295 (2001).
Zhang, K., Sun, F. & Zhao, H. HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 21, 90–103 (2005).
Abecasis, G. R. & Wigginton, J. E. Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. Am. J. Hum. Genet. 77, 754–767 (2005).
Zhang, F. & Deng, H. W. Confounding from cryptic relatedness in haplotype-based association studies. Genetica 138, 945–950 (2010).
Nielsen, R., Paul, J. S., Albrechtsen, A. & Song, Y. S. Genotype and SNP calling from next-generation sequencing data. Nature Rev. Genet. 12, 443–451 (2011).
Andres, A. M. et al. Understanding the accuracy of statistical haplotype inference with sequence data of known phase. Genet. Epidemiol. 31, 659–671 (2007).
Jostins, L., Morley, K. I. & Barrett, J. C. Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets. Eur. J. Hum. Genet. 19, 662–666 (2011).
Geraci, F. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26, 2217–2225 (2010).
He, D., Choi, A., Pipatsrisawat, K., Darwiche, A. & Eskin, E. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, i183–i190 (2010).
Long, Q., MacArthur, D., Ning, Z. & Tyler-Smith, C. HI: haplotype improver using paired-end short reads. Bioinformatics 25, 2436–2437 (2009).
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Kitzman, J. O. et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nature Biotech. 29, 59–63 (2011). This paper describes the use of an experimental phasing method that was applied to the sequence of an individual and the population-genetic inferences that were made using the phased haplotypes.
Suk, E.-K. K. et al. A comprehensively molecular haplotype-resolved genome of a European individual. Genome Res. 3 Aug 2011 (doi:10.1101/gr.125047.111).
Duitama, J., Huebsch, T., McEwen, G., Suk, E.-K. & Hoehe, M. R. in Proc. 1st ACM Int. Conf. Bioinf. Comp. Biol. 160–169 (Association for Computing Machinery, Niagara Falls, New York, 2010).
Bansal, V. & Bafna, V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153–i159 (2008).
Fan, H. C., Wang, J., Potanina, A. & Quake, S. R. Whole-genome molecular haplotyping of single cells. Nature Biotech. 29, 51–57 (2011).
Hert, D. G., Fredlake, C. P. & Barron, A. E. Advantages and limitations of next-generation sequencing technologies: a comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29, 4618–4626 (2008).
Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).
Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Li, Z. et al. A partition-ligation-combination-subdivision EM algorithm for haplotype inference with multiallelic markers: update of the SHEsis (http://analysis.bio-x.cn). Cell Res. 19, 519–523 (2009).
Cirulli, E. T. & Goldstein, D. B. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature Rev. Genet. 11, 415–425 (2010).
Yang, H., Chen, X. & Wong, W. H. Completely phased genome sequencing through chromosome sorting. Proc. Natl Acad. Sci. USA 108, 12–17 (2011).
The UK IBD Genetics Consortium & The Wellcome Trust Case Control Consortium 2. Genome-wide association study of ulcerative colitis identifies three new susceptibility loci, including the HNF4A region. Nature Genet. 41, 1330–1334 (2009).
The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).