Review

Haplotype phasing: existing methods and new developments

Sharon R Browning et al. Nat Rev Genet. 2011.

Abstract

Determination of haplotype phase is becoming increasingly important as we enter the era of large-scale sequencing because many of its applications, such as imputing low-frequency variants and characterizing the relationship between genetic variation and disease susceptibility, are particularly relevant to sequence data. Haplotype phase can be generated through laboratory-based experimental methods, or it can be estimated using computational approaches. We assess the haplotype phasing methods that are available, focusing in particular on statistical methods, and we discuss the practical aspects of their application. We also describe recent developments that may transform this field, particularly the use of identity-by-descent for computational phasing.

Figures

Figure 1. Statistical phasing of unrelated individuals using haplotype frequencies

Consider one individual with heterozygous genotype at each of three SNPs in a region. There are four possible haplotype configurations consistent with the genotype data (A–D). Suppose haplotype frequencies are available from other individuals in the population at these sites (provided below each phasing pattern). These frequencies may have been estimated from population data without additional modeling (with the a priori assumption that all haplotype frequency configurations are equally likely) or with a model that accounts for the biological processes of recombination and mutation (such as the Li and Stephens model). The population frequency of a haplotype pair is obtained using the Hardy-Weinberg principle (independence of the two haplotypes within an individual); the factor of two in the frequency of the haplotype pairs accounts for both possible assignments of maternal and paternal origin to the two haplotypes. The posterior probabilities of the phased data are obtained from the population frequencies of the possible haplotype pairs. In this example, the posterior probability of phasing B (93%) is much greater than that of phasing C (7%).
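The calculation described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: alleles are coded 0/1, the function names are made up for the example, and the haplotype frequencies in `HYPOTHETICAL_FREQS` are invented values (the figure's actual frequencies are not reproduced in the text, so the resulting posteriors will not match the 93% and 7% quoted above).

```python
from itertools import product

# Invented population haplotype frequencies for three biallelic SNPs
# (alleles coded 0/1). These are NOT the values shown in the figure.
HYPOTHETICAL_FREQS = {
    "000": 0.30, "111": 0.00,
    "001": 0.25, "110": 0.20,
    "010": 0.05, "101": 0.15,
    "011": 0.05, "100": 0.00,
}

def enumerate_phasings(n_het_sites):
    """Yield every unordered haplotype pair consistent with a genotype
    that is heterozygous at all n_het_sites sites."""
    # Fixing the first allele of h1 to "0" counts each unordered pair once.
    for rest in product("01", repeat=n_het_sites - 1):
        h1 = "0" + "".join(rest)
        h2 = "".join("1" if a == "0" else "0" for a in h1)  # complement
        yield h1, h2

def phasing_posteriors(n_het_sites, hap_freq):
    """Posterior probability of each phasing under Hardy-Weinberg:
    pair frequency = 2 * f(h1) * f(h2), normalised over all phasings."""
    weights = {}
    for h1, h2 in enumerate_phasings(n_het_sites):
        weights[(h1, h2)] = 2.0 * hap_freq.get(h1, 0.0) * hap_freq.get(h2, 0.0)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("no phasing has non-zero population frequency")
    return {pair: w / total for pair, w in weights.items()}

if __name__ == "__main__":
    for pair, prob in phasing_posteriors(3, HYPOTHETICAL_FREQS).items():
        print(pair, round(prob, 3))
```

Because the factor of two is common to every pair of distinct haplotypes, it cancels in the normalisation; it is kept here only to mirror the caption.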

Figure 2. Comparison of recent statistical haplotype phasing methods

We compared phasing accuracy and computation time for BEAGLE 3.3.1, IMPUTE 2.1.2 and MACH 1.0.16. The sample comprised up to 5200 controls from the Wellcome Trust Case Control Consortium 2 and 44 offspring from the HapMap3 CEU trios (Utah residents with Northern and Western European ancestry), all genotyped on Illumina Human1M SNP arrays. We evaluated accuracy for markers on chromosome 20 (21,166 markers after quality control filters). Phasing accuracy was measured in the HapMap trio offspring using the markers whose phase is determined by parental genotypes. Accuracy is represented by switch error rate (see Box 1). BEAGLE was run with default settings and the low-memory option (the low-memory option does not affect accuracy but reduces memory usage at the cost of a 30–60% increase in computing time). To obtain results in a reasonable amount of time for MACH and to follow recommended practice for IMPUTE2, the data for MACH and IMPUTE2 were split into eleven 5.1 Mb chunks and one 6.3 Mb chunk, with a 500 kb overlap between adjacent chunks. The two haplotypes for each individual were aligned across chunks using the phase of heterozygous genotypes near the center of the overlap region, and the chunks were merged to yield a chromosome-wide phasing. Computing times are for the whole chromosome; for MACH and IMPUTE2 they are the sum of the computing times for each chunk.

A and B) This comparison used parameter settings based on the current documentation for each program. Parameter settings for IMPUTE2 followed a prototype phasing script downloaded from the IMPUTE2 website: “-phase -include_buffer_in_output -stage_one -k 80 -iter 30 -burnin 10 -Ne 11500”. MACH options were “--round 50 --states 200 --phase”, as suggested in the MACH documentation.

C and D) As above, but with increased model complexity or run time for each method to obtain improved accuracy. BEAGLE was run 15 times and the results were combined by phasing successive heterozygotes using a majority vote from the 15 runs. MACH was run with 450 states (compared to 200 for the standard settings) and IMPUTE2 was run with 400 states (compared to 80 for the standard settings).
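As an illustration of the accuracy measure used in this comparison, the following Python sketch computes a switch error rate for one individual. It assumes the commonly used definition (the proportion of successive heterozygous sites whose estimated phase, relative to the previous heterozygous site, disagrees with the true phase); Box 1 is not reproduced here, so the exact definition used in the article may differ in detail. The function name and input encoding are hypothetical.

```python
import math

def switch_error_rate(true_h1, true_h2, est_h1, est_h2):
    """Switch error rate for one individual.

    Each argument is a sequence of alleles (one per marker) for one
    haplotype. Only markers that are heterozygous in the true phasing
    carry phase information; homozygous markers are skipped.
    """
    if not (len(true_h1) == len(true_h2) == len(est_h1) == len(est_h2)):
        raise ValueError("all haplotypes must cover the same markers")

    # For each heterozygous site, record whether est_h1 carries the
    # allele of true_h1 (True) or of true_h2 (False).
    orientation = []
    for t1, t2, e1, e2 in zip(true_h1, true_h2, est_h1, est_h2):
        if t1 == t2:
            continue
        if {e1, e2} != {t1, t2}:
            raise ValueError("estimated and true genotypes disagree")
        orientation.append(e1 == t1)

    if len(orientation) < 2:
        return math.nan  # fewer than two heterozygous sites: undefined

    # A switch occurs whenever the orientation changes between
    # successive heterozygous sites.
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)
```

Swapping the labels of the two estimated haplotypes leaves the result unchanged, since only changes in orientation between successive sites are counted.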

Figure 3. Use of IBD to determine haplotype phase

Determining phase using IBD alone. When two individuals are known to be identical by descent (for example, if they are a parent–offspring pair), they share an allele at each marker, and this allele is determined by the genotype data whenever at least one of the two individuals is homozygous. In this example, the two individuals, whose unphased genotypes are given in the left-most columns, are identical by descent. SNP 1 is heterozygous in both individuals and thus cannot be phased using IBD, although it may be possible to phase it using population haplotype frequencies (see below). SNP 2 is homozygous in individual 2, so the shared haplotype must carry the C allele. Analogously, SNPs 3 and 4 are homozygous in individual 1, so the shared alleles are T and G, respectively. SNP 5 is homozygous in both individuals, so phasing is trivial. The inferred shared haplotype is shaded green. IBD phasing alone gives the phasing shown in the IBD-phased haplotype columns, in which the phase of SNP 1 is unknown.

Determining phase using IBD and haplotype frequencies. Consider the same two identical-by-descent individuals as above. Phase is determined by IBD at SNPs 2–5, but not at SNP 1, which is heterozygous in both individuals. Only haplotype phasings that satisfy the IBD-phasing constraints need to be considered. Here the two individuals are phased jointly, so the joint phase at SNP 1 must be consistent with the IBD, and the identical-by-descent haplotype is included only once in the probability of the haplotype configuration. The inferred identical-by-descent haplotype is shaded. Haplotype phasing A is much more probable (94%) than phasing B (6%).
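The IBD-only rule in the first part of this caption can be sketched as follows. This is a minimal illustration under stated assumptions: the two individuals share exactly one haplotype identical by descent across all listed SNPs, genotypes contain no errors, and the function name `ibd_phase` is made up for the example. The genotypes in the usage example are hypothetical; only the shared alleles at SNPs 2–4 (C, T and G) come from the caption.

```python
def ibd_phase(genotypes_1, genotypes_2):
    """Phase two individuals who share one haplotype identical by descent.

    Each genotype list holds one unordered allele pair per SNP.
    Returns (shared, other_1, other_2): the shared haplotype and each
    individual's remaining haplotype, with None where IBD alone
    cannot determine the phase.
    """
    shared, other_1, other_2 = [], [], []
    for g1, g2 in zip(genotypes_1, genotypes_2):
        common = set(g1) & set(g2)
        if not common:
            raise ValueError("genotypes share no allele: inconsistent with IBD")
        if len(common) > 1:
            # Both individuals are heterozygous for the same two alleles,
            # so either allele could lie on the shared haplotype.
            shared.append(None)
            other_1.append(None)
            other_2.append(None)
            continue
        s = next(iter(common))  # the unique allele both individuals carry
        shared.append(s)
        # The non-shared haplotype carries whichever allele is left over.
        other_1.append(g1[1] if g1[0] == s else g1[0])
        other_2.append(g2[1] if g2[0] == s else g2[0])
    return shared, other_1, other_2

# Hypothetical genotypes arranged to match the caption's description.
individual_1 = [("A", "G"), ("C", "T"), ("T", "T"), ("G", "G"), ("A", "A")]
individual_2 = [("A", "G"), ("C", "C"), ("C", "T"), ("A", "G"), ("A", "A")]
shared, other_1, other_2 = ibd_phase(individual_1, individual_2)
```

Sites returned as `None` are the ones that would then be resolved with population haplotype frequencies, as in the second part of the caption.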

Figure 4. Accuracy of statistical phasing of cryptic relatives when relationship is not explicitly accounted for

The same sets of individuals were phased as in Figure 2, with the addition of one parent of each HapMap CEU child (“Cryptic pairs” results) or both parents (“Cryptic trios” results). Phasing was performed with BEAGLE assuming that all samples are unrelated. The “Unrelated” results are identical to those for BEAGLE in Figure 2A and do not include any of the parents. Adding relatives to the phase estimation greatly improves phase accuracy even when the individuals are treated as unrelated. Phase accuracy would improve substantially further if the known relationships were used during phase estimation.
