HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies - PubMed

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Peter Edge et al. Genome Res. 2017 May.

Abstract

Many tools have been developed for haplotype assembly, the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies has been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types, namely dilution pool sequencing, linked-read sequencing, single-molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing, we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C-specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.

Figures

Figure 1.

Comparison of runtime (top panel) and switch + mismatch error rate (bottom panel) for HapCUT2 with four methods for haplotype assembly (HapCUT, RefHap, ProbHap, and FastHare) on simulated read data as a function of (A) mean coverage per variant (variants per read fixed at four); (B) mean variants per read (mean coverage per variant fixed at five); and (C) mean number of paired-end reads crossing a variant (mean coverage per variant fixed at five, read length 150 bp, random insert size up to a variable maximum value). Lines represent the mean of 10 replicate simulations. FastHare is not visible on C (bottom) due to significantly higher error rates.

Figure 2.

Accuracy of HapCUT2 compared to four other methods for haplotype assembly on diverse whole-genome sequence data sets for NA12878. (A) Fosmid dilution pool data (Duitama et al. 2012). (B) PacBio SMRT data (11× and 44× coverage). (C) 10X Genomics linked reads. (D) Whole-genome Hi-C data (40× and 90× coverage, created with MboI enzyme). Switch and mismatch error rates were calculated across all chromosomes using the subset of variants that were phased by all methods. For each data set, only methods that produced results within 20 CPU-h per chromosome are shown.
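The switch and mismatch error rates reported in the figures can be illustrated with a small sketch. It assumes a common convention that this page does not spell out: a switch error is a point where the assembled phase flips relative to the truth between two consecutive variants, and two flips in a row (a single isolated wrong variant) are counted as one mismatch error instead of two switches. The function name `switch_and_mismatch_errors` and the string encoding of haplotypes are hypothetical and are not taken from HapCUT2.

```python
def switch_and_mismatch_errors(truth, assembled):
    """Count (switches, mismatches) for one haplotype block.

    Both arguments are equal-length strings of '0'/'1', one character per
    heterozygous variant, in genomic order. This is an illustrative
    convention, not the HapCUT2 evaluation code.
    """
    if len(truth) != len(assembled):
        raise ValueError("haplotypes must cover the same variants")

    # agree[i] is True when the assembled allele matches the truth at variant i.
    agree = [t == a for t, a in zip(truth, assembled)]
    # flips[k] is True when the agreement changes between variants k and k+1.
    flips = [agree[i] != agree[i - 1] for i in range(1, len(agree))]

    switches = mismatches = 0
    i = 0
    while i < len(flips):
        if flips[i]:
            if i + 1 < len(flips) and flips[i + 1]:
                # Two consecutive flips: one isolated wrong variant.
                mismatches += 1
                i += 2
                continue
            switches += 1
        i += 1
    return switches, mismatches


print(switch_and_mismatch_errors("000000", "000111"))  # one long switch
print(switch_and_mismatch_errors("000000", "001000"))  # one isolated mismatch
```

Because only changes in agreement are counted, an assembled haplotype that is the exact complement of the truth scores zero errors, which matches the fact that the two haplotypes of a block are interchangeable labels. Dividing each count by the number of positions where such an error could occur would give the corresponding rate.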

Figure 3.

Haplotype completeness and accuracy compared between Hi-C (MboI enzyme, 90× and 40× coverage) and PacBio SMRT (44× and 11× coverage). (A) Cumulative measure of the fraction of variants phased within a given number of the largest haplotype blocks. (B) Fraction of correctly phased variant pairs as a function of distance.
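As a rough illustration of the pairwise measure in panel B, the sketch below bins every pair of variants in a block by genomic distance and reports the fraction of pairs whose relative phase agrees with the truth. The function name `pairwise_accuracy_by_distance`, the input encoding, and the fixed-width binning are assumptions made for this example; the binning used for the published figure is not described on this page.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_accuracy_by_distance(positions, truth, assembled, bin_size):
    """Fraction of correctly phased variant pairs per distance bin.

    positions: genomic coordinates of the variants in one block.
    truth, assembled: '0'/'1' strings, one character per variant.
    Returns {bin_start: fraction_correct}. Illustrative only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, j in combinations(range(len(positions)), 2):
        bin_index = abs(positions[j] - positions[i]) // bin_size
        total[bin_index] += 1
        # A pair is correct when the two variants are on the same haplotype
        # in the assembly exactly when they are in the truth.
        if (truth[i] == truth[j]) == (assembled[i] == assembled[j]):
            correct[bin_index] += 1
    return {b * bin_size: correct[b] / total[b] for b in sorted(total)}


result = pairwise_accuracy_by_distance([100, 200, 1500], "000", "001", 1000)
print(result)
```

Enumerating all pairs is quadratic in the number of variants, so for chromosome-scale blocks one would likely sample pairs or restrict the maximum distance; the exhaustive loop is kept here only for clarity.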

Figure 4.

Improvements in the (A) completeness and (B) accuracy (switch + mismatch error rates) of the largest haplotype block with increasing Hi-C sequencing coverage for two different restriction enzymes: MboI and HindIII. Results are presented using data for Chromosome 1 with coverage ranging from 18× to 200×.

References

1. Aguiar D, Istrail S. 2012. HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J Comput Biol 19: 577–590.
2. Amini S, Pushkarev D, Christiansen L, Kostem E, Royce T, Turk C, Pignatelli N, Adey A, Kitzman JO, Vijayan K, et al. 2014. Haplotype-resolved whole-genome sequencing by contiguity-preserving transposition and combinatorial indexing. Nat Genet 46: 1343–1349.
3. Bansal V, Bafna V. 2008. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24: i153–i159.
4. Bansal V, Halpern AL, Axelrod N, Bafna V. 2008. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res 18: 1336–1346.
5. Belton JM, McCord RP, Gibcus JH, Naumova N, Zhan Y, Dekker J. 2012. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58: 268–276.
