De novo assembly of a haplotype-resolved human genome

Accession codes

Primary accessions

Gene Expression Omnibus

Sequence Read Archive

Referenced accessions

Gene Expression Omnibus

Sequence Read Archive

Acknowledgements

We wish to recognize BGI-Shenzhen's sequencing platform for generating the data in this study. We thank the faculty and staff at BGI-Shenzhen who contributed to this project, especially H. Jiang and B. Wang, who helped with the experimental part; D. Cao, L. Lin and M. Li, who helped with the validation; and L. Goodman, W. Chen and H. Jia, who provided critical comments on the manuscript. This work was supported by the State Key Development Program for Basic Research of China-973 Program (no. 2011CB809202); the Shenzhen Municipal Government of China (no. JC201005260191A); the International Science and Technology Cooperation Program of Shenzhen (GJHZ20130417140835564); and Tianjin Municipal Science and Technology special funds for Enterprise Development (no. 14ZXLJSY00320). G.W. was also supported by Alberta Innovates Technology Futures.

Author information

Author notes

  1. Hongzhi Cao, Honglong Wu, Ruibang Luo, Shujia Huang and Yuhui Sun: These authors contributed equally to this work.

Authors and Affiliations

  1. BGI-Shenzhen, Shenzhen, China
    Hongzhi Cao, Honglong Wu, Ruibang Luo, Shujia Huang, Yuhui Sun, Xin Tong, Yinlong Xie, Binghang Liu, Hailong Yang, Hancheng Zheng, Jian Li, Bo Li, Yu Wang, Fang Yang, Peng Sun, Siyang Liu, Peng Gao, Haodong Huang, Jing Sun, Dan Chen, Guangzhu He, Weihua Huang, Zheng Huang, Yue Li, Laurent C A M Tellier, Xiao Liu, Qiang Feng, Xun Xu, Xiuqing Zhang, Lars Bolund, Karsten Kristiansen, Rasmus Nielsen, Songgang Li, Jian Wang, Huanming Yang, Yingrui Li, Gane Ka-Shu Wong & Jun Wang
  2. BGI-Tianjin, Tianjin, China
    Hongzhi Cao & Honglong Wu
  3. Department of Biology, University of Copenhagen, Copenhagen, Denmark
    Hongzhi Cao, Hancheng Zheng, Jian Li, Siyang Liu, Laurent C A M Tellier, Xiao Liu, Qiang Feng, Anders Krogh, Karsten Kristiansen & Jun Wang
  4. HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory, Hong Kong, China
    Ruibang Luo, Yinlong Xie & Binghang Liu
  5. School of Bioscience and Bioengineering, South China University of Technology, Guangzhou, China
    Shujia Huang, Yuhui Sun, Yinlong Xie, Yu Wang & Haodong Huang
  6. Institute of Biomedicine, University of Aarhus, Aarhus, Denmark
    Lars Bolund
  7. Danish Center for Translational Breast Cancer Research, Copenhagen, Denmark
    Lars Bolund
  8. Complete Genomics Inc., Mountain View, California, USA
    Radoje Drmanac & Snezana Drmanac
  9. Department of Integrative Biology, University of California, Berkeley, California, USA
    Rasmus Nielsen
  10. Department of Statistics, University of California, Berkeley, California, USA
    Rasmus Nielsen
  11. James D. Watson Institute of Genome Sciences, Hangzhou, China
    Jian Wang & Huanming Yang
  12. Princess Al Jawhara Albrahim Center of Excellence in the Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
    Huanming Yang & Jun Wang
  13. Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
    Yingrui Li
  14. Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada
    Gane Ka-Shu Wong
  15. Department of Medicine, University of Alberta, Edmonton, Alberta, Canada
    Gane Ka-Shu Wong
  16. Macau University of Science and Technology, Taipa, Macau, China
    Jun Wang
  17. Department of Medicine and State Key Laboratory of Pharmaceutical Biotechnology, University of Hong Kong, Hong Kong, China
    Jun Wang

Authors

  1. Hongzhi Cao
  2. Honglong Wu
  3. Ruibang Luo
  4. Shujia Huang
  5. Yuhui Sun
  6. Xin Tong
  7. Yinlong Xie
  8. Binghang Liu
  9. Hailong Yang
  10. Hancheng Zheng
  11. Jian Li
  12. Bo Li
  13. Yu Wang
  14. Fang Yang
  15. Peng Sun
  16. Siyang Liu
  17. Peng Gao
  18. Haodong Huang
  19. Jing Sun
  20. Dan Chen
  21. Guangzhu He
  22. Weihua Huang
  23. Zheng Huang
  24. Yue Li
  25. Laurent C A M Tellier
  26. Xiao Liu
  27. Qiang Feng
  28. Xun Xu
  29. Xiuqing Zhang
  30. Lars Bolund
  31. Anders Krogh
  32. Karsten Kristiansen
  33. Radoje Drmanac
  34. Snezana Drmanac
  35. Rasmus Nielsen
  36. Songgang Li
  37. Jian Wang
  38. Huanming Yang
  39. Yingrui Li
  40. Gane Ka-Shu Wong
  41. Jun Wang

Contributions

Ju.W., G.W., Yi.L. and H.C. managed the project. Ju.W., G.K.-S.W., Yi.L., H.C., H.W., R.L. and S.H. designed the analyses. Ha.Y. constructed the fosmid library. R.L., Y.S., Y.X., Bi.L., G.H., W.H., Bo.L., Z.H. and Yu.L. conducted the assembly and evaluation. H.C., H.W., S.H., Y.S., X.T., H.Z., J.L., Y.W., F.Y., P.S., Si.L., P.G., H.H., J.S., D.C., L.C.A.M.T., X.L., Q.F., X.X., X.Z., L.B., K.K., R.D., S.D., R.N., So.L., Ji.W. and Hu.Y. performed the data analyses. H.C., H.W. and S.H. did most of the writing with contributions from all authors.

Corresponding authors

Correspondence to Yingrui Li, Gane Ka-Shu Wong or Jun Wang.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Number and length distributions of the inferred fosmids.

Fosmids were inferred by mapping each pool’s sequence reads to hg19. We retained only those reads that mapped to within a 3–50 kb region. (a) Number of fosmids in each pool. On average, there were ~32 fosmids per pool. (b) Fosmid size. The average length was 36.8 kb.
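
The inference step described in this caption can be sketched as interval merging followed by a length filter. This is a minimal illustration, not the authors' pipeline: the function name `infer_fosmids` and the `max_gap` clustering threshold are assumptions introduced here, and only the 3–50 kb length window comes from the caption. Reads are assumed to be given as (start, end) positions on a single chromosome for a single pool.

```python
def infer_fosmids(read_intervals, max_gap=1_000, min_len=3_000, max_len=50_000):
    """Merge mapped read intervals into candidate fosmid regions.

    read_intervals: iterable of (start, end) positions on one chromosome.
    max_gap: hypothetical threshold; reads closer than this are treated
             as belonging to the same clone (not taken from the paper).
    min_len, max_len: the 3-50 kb window mentioned in the caption.
    """
    regions = []
    for start, end in sorted(read_intervals):
        if regions and start - regions[-1][1] <= max_gap:
            # Read is close to the current region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            # Read starts a new candidate region.
            regions.append([start, end])
    # Keep only regions whose span is plausible for a fosmid insert.
    return [(s, e) for s, e in regions if min_len <= e - s <= max_len]


# Reads tiling ~35.6 kb form one candidate; a 300 bp cluster is discarded.
tiled = [(i, i + 100) for i in range(0, 36_000, 500)]
stray = [(100_000, 100_100), (100_200, 100_300)]
fosmids = infer_fosmids(tiled + stray)
```

With about 32 clones per pool spread over a 3 Gb genome, overlaps between two clones in the same pool should be rare, which is why a simple positional clustering is a plausible first approximation.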

Supplementary Figure 2 Fosmids physical coverage distribution.

The blue curve denotes the theoretical coverage distribution at an average coverage of 8x, and the red curve denotes the actual coverage. The average fosmid coverage was 8x, with a median of 7x. About 7% of YHref was not covered by fosmids, which may be due to a bias in fosmid library construction and/or sequencing.
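
To see why an uncovered fraction of about 7% points to bias, it helps to compare it with a simple model. The caption does not say which model produced the theoretical curve; the sketch below assumes a Poisson distribution with mean 8, a common choice for random clone coverage, and the function name `poisson_pmf` is introduced here for illustration.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a position is covered by exactly k clones,
    assuming coverage follows a Poisson distribution (an assumption)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean_coverage = 8.0  # average fosmid coverage reported in the caption

# Fraction of the genome expected to have zero coverage under this model.
expected_uncovered = poisson_pmf(0, mean_coverage)

# Sanity check: the probabilities should sum to (almost exactly) 1.
total_probability = sum(poisson_pmf(k, mean_coverage) for k in range(60))
```

Under this assumption the expected uncovered fraction is exp(-8), roughly 0.03%, far below the observed ~7%, which is consistent with the caption's suggestion of library or sequencing bias.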

Supplementary Figure 3 Completeness of assembled sequence in each fosmid pool.

The horizontal axis represents the percentage of the fosmid sequence that was assembled in each pool. The vertical axis represents the proportion of fosmid pools at that given percentage. In total, 88.5% of the assembled pools contained at least 80% of the fosmid sequence, and 53.2% of the assembled pools contained at least 95% of the fosmid sequence.

Supplementary Figure 4 Contiguity of assembled sequence for individual fosmids.

The horizontal axis represents the ratio of the longest assembled sequence to the inferred length of each defined fosmid. The vertical axis represents the proportion of fosmids at the given ratio. For 54.7% of fosmids, the longest assembled sequence was at least half of the fosmid length. About 18% of the fosmids were completely assembled.

Supplementary Figure 5 Construction of the haplotype-resolved sequence.

The top (orange) bar represents the non-phased YHref sequence and the bottom (multi-color) bar represents the haplotype-resolved output. The middle (blue) bars represent the fosmid assembled haploid (FAH) sequences belonging to the same haplotype.

Supplementary Figure 6 Theoretical N50 length of haplotype phasing and long homozygous region.

a. Long homozygous regions (≥20 kb) for different populations in the 1000 Genomes Project. Asians have more long homozygous regions than other populations. This might be why YH had a shorter haplotype N50 than other individuals sequenced at a comparable fosmid depth. b. The theoretical N50 length distribution of haplotype phasing using the method of the current study, in four different individuals. Heterozygous marker numbers are shown at the top left. The haplotype N50 of YH is expected to be 510 kb with a fosmid coverage of 4x per haplotype (or 8x for a 3 Gb genome).
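
For readers unfamiliar with the N50 statistic used here, the following sketch computes it from a list of phased block lengths. It uses the conventional definition (the length L such that blocks of length at least L contain at least half of the total); the caption does not spell out the authors' exact definition, so treat this as an assumption. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    Blocks are visited from longest to shortest; the N50 is the length
    of the block at which the running total first reaches half of the
    summed length. Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy example: lengths sum to 20; 6 + 5 = 11 reaches half, so N50 is 5.
example_n50 = n50([2, 3, 4, 5, 6])
```

Long homozygous stretches contain no heterozygous markers to link adjacent clones, so they split phased blocks and lower this statistic, which matches the explanation offered in panel a.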

Supplementary Figure 7 HDG coverage on hg19 and RefSeq genes.

Our HDG sequence was aligned to the hg19 genome using Lastz. Coverage of the chromosomes and gene regions was calculated. “Both” means covered by the two assembled haplotypes (blue); “Single” means covered by just one assembled haplotype (red). a. Coverage information for each chromosome. b. Proportion of RefSeq genes at a given coverage.

Supplementary Figure 8 Length distributions of insertions and deletions.

a. Length distribution of short indels (<10 bp). Peaks at multiples of 3 bp in the exon distribution are expected because such indels do not disturb the reading frame. b. Length distribution of long indels (100 bp–1 kb). The peak at ~300 bp is due to an enrichment for Alu element insertions and deletions. Note that there is no bias between insertions and deletions, an improvement over previous studies. c. Distribution of long indels (100 bp–1 kb) in unique versus repeat regions. As expected, there are more indels in the repeat regions and the peak at ~300 bp is more pronounced. d. Length distribution of homozygous and heterozygous long indels (100 bp–1 kb).

Supplementary Figure 9 SNP detection and intersection from different methods/platforms.

A total of ~4.0 M SNPs were detected by three different methods/platforms. The majority (68.2%) of these were consistent among all three datasets. However, there were still tens of thousands of method/platform-specific calls.

Supplementary Figure 10 Indel detection and intersection from different platforms.

We show the number of small indels detected by each method/platform and their intersection, at a flank size of 50 bp. For the ~1 M indels detected, there was only 27.6% concordance.

Supplementary Figure 11 Example of a heterozygous deletion located inside a gene.

This heterozygous deletion was detected by the ASV method but was difficult to find by either WGS resequencing method. The yellow block in the reference is the region that was missing from hap2. Below are the WGS reads aligned to this region. This 151 bp deletion covered the 5′ UTR and part of exon 1 of the gene PSMD1.

Supplementary Figure 12 Example of a heterozygous insertion located inside a gene.

This heterozygous insertion was detected by the ASV method but was difficult to find by either WGS resequencing method. The yellow block in hap1 is the region that was missing from the reference. Below are the WGS reads aligned to this region. Near the breakpoint there were very few reads, perhaps because the inserted sequence interfered with the alignment. This 54 bp insertion covered exon 3 of the gene LATS2.

Supplementary Figure 13 Variation rate for YH vs hg19 and heterozygosity between the two haplotypes of YH.

The curves at the top and the right summarize the distribution of heterozygosity rates for the two haplotypes of YH and the variation between YH and hg19, respectively. The black line indicates the 99% cutoff for each distribution.

Supplementary Figure 14 The classification of novel gene sequences.

a. Classification of different types of novel and gap covered sequences. i) novel insertion; ii-iv), novel haplotypes; v-vii), gap covered sequences; viii), orphan scaffolds. b. Distribution of novel sequences based on their length and number, in 100 bp bins. Novel sequences of length >1000 bp accounted for 93% of the total length. The longest was 123 kb. c. Distribution of breakpoints for novel sequences. Most of the novel sequences were in non-coding (intron, repeat and intergenic) regions. Only 0.8% were in CDS regions. These distributions are subdivided by the length of the sequence, represented by the color bars. d. Repeat content based on RepeatMasker.

Supplementary Figure 15 Examples of cis- and trans-acting genes.

a. Cis-acting gene DSPP on 4q22.1 encoding dentin sialophosphoprotein. Mutations in DSPP are associated with Dentinogenesis imperfecta, Shields type II, and deafness. b. Trans-acting gene CA9 on 9p13.3. Diseases associated with mutations in CA9 include horseshoe kidney and renal cell carcinoma. GO annotations include carbonate dehydratase activity.

Supplementary Figure 16 Allele specific methylation and expression.

Venn diagram showing the relationship between allele-specific methylation (ASM) and allele-specific expression (ASE). The numbers refer to the gene count. The red/brown circle inside the larger ASM circle represents genes where ASM was detected in the promoter region.

Supplementary Figure 17 Construction of the fosmid libraries.

Approximately 30 fosmid clones were cultured together to form a single fosmid pool. Then, 3 μg of DNA from each pool was digested, and fragments with insert sizes ranging from 180 to 800 bp were selected. Adapters containing the 11 bp barcode were ligated to these selected fragments to form a single pooled-fosmid library. Barcoded fragments from 60–320 single pooled-fosmid libraries were pooled again (evenly) to create a Stage I barcode library. DNA fragments of between 180 bp and 650 bp (lengths exclude the barcode) from each Stage I barcode library were used to construct two independent libraries (one with small insert sizes and one with intermediate insert sizes). Each library was then PCR amplified with index primers, each of which contained an 8 bp barcode, to form a Stage II barcode library.
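
The two-level barcoding described above implies a two-step demultiplexing of reads: first by the 8 bp Stage II index, then by the 11 bp pool barcode. The sketch below illustrates that idea only. The read layout is an assumption (the index is taken to arrive as a separate string and the pool barcode as the first 11 bases of the read), the barcode sequences are made up, and the function name `demultiplex` is hypothetical.

```python
def demultiplex(reads, index_to_library, barcode_to_pool, barcode_len=11):
    """Assign reads to (Stage II library, fosmid pool) bins.

    reads: iterable of (index_seq, read_seq) pairs. The layout is an
           assumption: an 8 bp index read plus a read whose first
           barcode_len bases carry the pool barcode.
    index_to_library / barcode_to_pool: lookup tables for exact matches.
    Reads with an unknown index or barcode are skipped.
    """
    assigned = {}
    for index_seq, read_seq in reads:
        library = index_to_library.get(index_seq)
        pool = barcode_to_pool.get(read_seq[:barcode_len])
        if library is None or pool is None:
            continue
        # Store the read with the pool barcode trimmed off.
        assigned.setdefault((library, pool), []).append(read_seq[barcode_len:])
    return assigned


# Made-up sequences for illustration only.
libraries = {"ACGTACGT": "lib1"}
pools = {"AAAAACCCCCG": "pool7"}
reads = [
    ("ACGTACGT", "AAAAACCCCCG" + "TTGCA"),  # known index and barcode
    ("ACGTACGT", "GGGGGGGGGGG" + "TTGCA"),  # unknown pool barcode
    ("TTTTTTTT", "AAAAACCCCCG" + "TTGCA"),  # unknown index
]
bins = demultiplex(reads, libraries, pools)
```

A production demultiplexer would typically tolerate sequencing errors in the barcodes; exact matching is used here only to keep the sketch short.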

Supplementary Figure 18 Indel positional concordance as a function of flank size for the different methods of detection.

To determine the best flank size for use in indel detection, we plotted the concordance between the ASV and resequencing based analyses. The results stabilize at above 50 bp.
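
Positional concordance as a function of flank size can be sketched as follows: a call in one set counts as concordant if the other set contains a call within the given flank on either side. This is an illustrative reading of the caption, not the authors' implementation; the function name `concordance` and the single-chromosome, position-only representation of indels are assumptions.

```python
import bisect


def concordance(calls_a, calls_b, flank):
    """Fraction of positions in calls_a with a call in calls_b within
    +/- flank bp. Positions are assumed to lie on one chromosome."""
    if not calls_a:
        return 0.0
    sorted_b = sorted(calls_b)
    hits = 0
    for pos in calls_a:
        # First call in B at or after the left edge of the window.
        i = bisect.bisect_left(sorted_b, pos - flank)
        if i < len(sorted_b) and sorted_b[i] <= pos + flank:
            hits += 1
    return hits / len(calls_a)


# Toy call sets: concordance rises as the flank widens.
asv_calls = [100, 500, 1000]
reseq_calls = [110, 560, 2000]
by_flank = {f: concordance(asv_calls, reseq_calls, f) for f in (10, 50, 60)}
```

Sweeping the flank over a range of values and looking for the point where the fraction levels off mirrors the way the caption describes choosing the flank size.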

Supplementary Figure 19 Length distributions for method-specific short indels.

Short indels (1–50 bp) detected by only one method/platform were selected and plotted by length. The top-right panel shows indels between 10 and 50 bp in length.

Supplementary Figure 20 Example of ASV-specific indel supported by fosmid aligned reads.

This was a 3 bp heterozygous deletion in a region covered by fosmids from eight independent pools, two of which supported the deletion.

Cite this article

Cao, H., Wu, H., Luo, R. et al. De novo assembly of a haplotype-resolved human genome. Nat Biotechnol 33, 617–622 (2015). https://doi.org/10.1038/nbt.3200
