Phased diploid genome assembly with single-molecule real-time sequencing
Accession codes
Accessions
Sequence Read Archive
Acknowledgements
The sequencing of the Cabernet Sauvignon genome was supported in part by a gift from J. Lohr Vineyards and Wines to D.C. We would also like to thank F. Neto for providing an early-release BUSCO plant data set. Clavicorona pyxidata DNA was provided by L. Nagy (Institute of Biochemistry, Biological Research Centre of the Hungarian Academy of Sciences). We thank J. Puglisi, F. Jupe, A. Copeland, and A. Wenger for reading and critiquing the manuscript. The project was supported in part by a National Institutes of Health award (R01-HG006677 to M.C.S.) and by National Science Foundation awards (DBI-1350041 and IOS-1237880 to M.C.S.; MCB 0929402 and MCB 1122246 to J.R.E.). J.R.E. is an investigator of the Howard Hughes Medical Institute and the Gordon and Betty Moore Foundation (GBMF 3034).
Author information
Author notes
- Chen-Shan Chin and Paul Peluso: These authors contributed equally to this work.
Authors and Affiliations
- Pacific Biosciences, Menlo Park, California, USA
Chen-Shan Chin, Paul Peluso, Gregory T Concepcion, Christopher Dunn & David R Rank
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
Fritz J Sedlazeck & Michael C Schatz
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA
Maria Nattestad & Michael C Schatz
- DOE Joint Genome Institute, Walnut Creek, California, USA
Alicia Clum
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, California, USA
Ronan O'Malley, Chongyuan Luo & Joseph R Ecker
- Department of Viticulture and Enology, University of California Davis, Davis, California, USA
Rosa Figueroa-Balderas, Abraham Morales-Cruz & Dario Cantu
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada, USA
Grant R Cramer
- Dipartimento di Biotecnologie, Università degli Studi di Verona, Verona, Italy
Massimo Delledonne
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
Michael C Schatz
Authors
- Chen-Shan Chin
- Paul Peluso
- Fritz J Sedlazeck
- Maria Nattestad
- Gregory T Concepcion
- Alicia Clum
- Christopher Dunn
- Ronan O'Malley
- Rosa Figueroa-Balderas
- Abraham Morales-Cruz
- Grant R Cramer
- Massimo Delledonne
- Chongyuan Luo
- Joseph R Ecker
- Dario Cantu
- David R Rank
- Michael C Schatz
Contributions
C.-S.C., P.P., A.C., D.R.R., and M.C.S. conceived the idea of the FALCON–FALCON-Unzip assembler. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.R.R., D.C., and M.C.S. designed the experiments and performed the analysis. P.P., D.C., D.R.R., and M.C.S. collected the sequencing data. R.O'M., C.L., and J.R.E. constructed the Col-0-Cvi-1. A.C., R.O'M., R.F.-B., A.M.-C., G.R.C., M.D., C.L., J.R.E., and D.C. collected the samples and prepared DNA for sequencing. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.C., D.R.R., and M.C.S. wrote the manuscript. C.-S.C. and C.D. implemented the computer code.
Corresponding authors
Correspondence to Chen-Shan Chin or Michael C Schatz.
Ethics declarations
Competing interests
C.-S.C., P.P., G.T.C., C.D., and D.R.R. are employees and shareholders of Pacific Biosciences, a company commercializing DNA sequencing technology.
Integrated supplementary information
Supplementary Figure 1 Schematics of the software and data process modules and the FALCON-Unzip assembly graph process for resolving haplotypes.
(a) Data dependence flow and software modules inside FALCON and FALCON-Unzip.
(b) Left: Initial assembly graph of a contig in the Arabidopsis F1 hybrid assembly. The different colors represent different haplotype blocks and phases. Right: The assembly graph after “unzipping”. Conceptually, the unzipping step identifies the heterozygous SNPs and uses them to remove overlaps between reads from different haplotypes. After such overlaps are removed, nodes from different haplotypes in the assembly graph no longer have edges between them. This allows FALCON-Unzip to identify long haplotype-specific paths and construct haplotigs from them. The dashed circle indicates haplotype blocks that can be extended through a bubble region.
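The overlap-removal idea in panel (b) can be illustrated with a minimal sketch. It is not the FALCON-Unzip implementation: the function name, the `(block, haplotype)` phase labels, and the toy read names are assumptions made only for illustration. Reads without a phase label (for example, reads from homozygous regions) keep all of their overlaps.

```python
def filter_cross_haplotype_overlaps(overlaps, read_phase):
    """Drop overlaps between reads assigned to different haplotypes.

    overlaps   : list of (read_a, read_b) pairs (hypothetical input format)
    read_phase : dict mapping read -> (block_id, haplotype); reads that are
                 absent from the dict are treated as unphased
    """
    kept = []
    for a, b in overlaps:
        pa, pb = read_phase.get(a), read_phase.get(b)
        if pa is not None and pb is not None:
            same_block = pa[0] == pb[0]
            # Same haplotype block but opposite phase: remove this edge.
            if same_block and pa[1] != pb[1]:
                continue
        kept.append((a, b))
    return kept


# Toy data: r1 is unphased; r2, r3 are haplotype 0 and r4, r5 haplotype 1.
read_phase = {"r2": (0, 0), "r3": (0, 0), "r4": (0, 1), "r5": (0, 1)}
overlaps = [("r1", "r2"), ("r1", "r4"), ("r2", "r3"), ("r4", "r5"),
            ("r2", "r4"), ("r3", "r5"), ("r2", "r5")]
kept_overlaps = filter_cross_haplotype_overlaps(overlaps, read_phase)
print(kept_overlaps)
```

In this toy case the two haplotypes remain connected only through the unphased read r1, which mirrors the picture of haplotype-specific paths branching from shared sequence.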
Supplementary Figure 2 Reverse accumulative read length distribution of the three diploid genome datasets
Supplementary Figure 3 SOAPdenovo assembly sizes and N50 and NG50 sizes of the three genomes for different values of k, using raw reads and reads corrected by Lighter.
Supplementary Figure 4 Assemblytic analysis comparison of the Arabidopsis F1 assemblies from FALCON-Unzip, Platanus, and SOAPdenovo.
(a) Cumulative sequence length of three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo compared to the TAIR10 reference. (b) Variants called using Assemblytics from three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo.
Supplementary Figure 5 Variation comparison between the inbred line assemblies and the F1 hybrid for all Arabidopsis chromosomes along with the TAIR10 reference.
Supplementary Figure 6 Homopolymer length and frequency in the TAIR10 Assembly.
Supplementary Figure 7 Assembly comparison: FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus V. vinifera reference genome
(a) MUMmerplot of the FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus the V. vinifera reference genome. For clarity, only alignments >= 10,000 bp long to the primary chromosomes are displayed. (b) The synteny between PN40024 Chr1 from the 5’ telomere to the centromere (green line) and the longest contig 000000F (black line) and its associated haplotigs (blue lines). The vertical green and blue lines indicate homologous coding sequences between the sequences. The cyan lines at the bottom indicate the synteny between the primary contig and other primary contigs. (c) Synteny alignment between two primary contigs, 000334F vs. 000000F. (d) Synteny alignment between two primary contigs, 000057F vs. 000075F.
Supplementary Figure 8 Comparison of the distribution of the het-SNP site density of the three genomes
(a) The distribution of the number of het-SNPs observed in the reads used for phasing the longest contig of each genome, shown as a semi-log plot. (b) Fits of the distributions to an exponential function (density ~ c * exp(-a * het-SNP count)). We use het-SNP count ranges of 10 to 200 for Arabidopsis, 50 to 200 for Vitis, and 10 to 100 for Clavicorona to capture the exponentially decaying part. The fitted parameter a = 0.0222, 0.0216, and 0.0412 for Arabidopsis, Vitis, and Clavicorona, respectively. The fastest decay rate, for Clavicorona, indicates that it has the least variation between haplotypes among the three genomes. From this fit, we expect to see about 45 (Arabidopsis), 46 (Vitis), and 24 (Clavicorona) het-SNPs per 10 kb in the regions of interest.
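As a rough illustration of the fit described in panel (b), the sketch below performs a least-squares line fit on the semi-log data and converts the decay rate into an expected count. The helper names and the synthetic, noise-free data are assumptions, not the authors' code; the rates are taken as positive magnitudes in density ~ c * exp(-a * count). The expected counts quoted in the caption are consistent with taking the mean of an exponential distribution, 1/a.

```python
import math


def fit_exponential_decay(counts, densities):
    """Fit log(density) = log(c) - a * count by least squares; return (a, c)."""
    xs = list(counts)
    ys = [math.log(d) for d in densities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = -slope                        # decay rate is the negated slope
    c = math.exp(mean_y + a * mean_x)  # intercept back-transformed
    return a, c


def expected_count(a):
    """Mean of an exponential distribution with rate a."""
    return 1.0 / a


# Synthetic data generated with the rate quoted for Arabidopsis (assumed c = 0.5).
synthetic_counts = range(10, 201, 10)
synthetic_density = [0.5 * math.exp(-0.0222 * n) for n in synthetic_counts]
a_hat, c_hat = fit_exponential_decay(synthetic_counts, synthetic_density)

rates = {"Arabidopsis": 0.0222, "Vitis": 0.0216, "Clavicorona": 0.0412}
expected = {name: round(expected_count(a)) for name, a in rates.items()}
print(expected)
```

Restricting the fit to a count range, as the caption does, amounts to passing only the counts and densities inside that range to the fitting function.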
Supplementary Figure 9 Example of a low heterozygosity region observed in Clavicorona genome.
The het-SNPs are called with FreeBayes on the alignments of the short read data to only the primary contigs. The contig 00003F has a low heterozygosity region from ~1.2Mb to ~2.7Mb.
Supplementary Figure 10 General schematic about how different levels of heterozygosity can affect the contig layout.
Supplementary Figure 11 Candidates for differentially expressed alleles from RNA-seq data.
(a)(b) We mapped both genomic reads (middle panel) and cDNA reads (lower panel) to the primary contigs from our Clavicorona pyxidata assembly. We also show curated CDS sequences mapped to the contig (top panel). The genomic reads show both alleles mapped, while we observe only one major allele in the transcript reads.
Supplementary Figure 12 An example of how the FALCON-sense algorithm generates consensus sequence.
Supplementary Figure 13 (a) Summary of the graph reduction from sequence overlaps to contigs. (b) Example on constructing haplotigs in the Clavicorona pyxidata assembly.
Supplementary Figure 14 Summary of the graph reduction from sequence overlaps to contigs.
Supplementary Figure 15 Summary of the greedy SNP phasing algorithm.
(a) All pairs of het-SNPs that are covered by multiple reads are evaluated. A “coupling score” is calculated from the number of reads that support the current haplotype assignment of the het-SNPs. (b)(c) We linearly scan through the het-SNP positions. If the total score is improved by flipping the haplotype assigned at one location, then we flip the assignment. (d) An example showing the “coupling score” before the flipping process (un-phased het-SNP assignment) and afterward (phased het-SNP assignment).
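The scan-and-flip procedure in this caption can be sketched as follows. This is a simplified illustration, not the FALCON-Unzip code: the read representation (a dict from het-SNP index to observed allele 0/1), the scoring rule (supporting read pairs minus conflicting ones), and the `max_passes` limit are all assumptions chosen for clarity.

```python
from itertools import combinations


def total_score(reads, phase):
    """Sum of pairwise coupling scores under the current phase assignment.

    For each read and each pair of het-SNPs it covers, add 1 if the two
    alleles fall on the same haplotype under `phase`, otherwise subtract 1.
    """
    score = 0
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            same_hap = (read[i] ^ phase[i]) == (read[j] ^ phase[j])
            score += 1 if same_hap else -1
    return score


def greedy_phase(reads, n_snps, max_passes=10):
    """Linearly scan het-SNPs, keeping a flip only if the total score improves."""
    phase = [0] * n_snps
    best = total_score(reads, phase)
    for _ in range(max_passes):
        improved = False
        for pos in range(n_snps):
            phase[pos] ^= 1              # tentatively flip this het-SNP
            trial = total_score(reads, phase)
            if trial > best:
                best, improved = trial, True
            else:
                phase[pos] ^= 1          # no gain: undo the flip
        if not improved:
            break
    return phase, best


# Toy reads drawn from haplotypes [0, 1, 1, 0] and [1, 0, 0, 1].
reads = [{0: 0, 1: 1}, {1: 1, 2: 1}, {2: 1, 3: 0},
         {0: 1, 1: 0}, {1: 0, 2: 0}, {2: 0, 3: 1}]
phase, best = greedy_phase(reads, n_snps=4)
print(phase, best)
```

Recomputing the full score after every flip keeps the sketch short but is inefficient; updating only the pairs that involve the flipped position would give the same decisions. As with any greedy scan, the result can depend on the scan order and may stop at a local optimum.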
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15, Supplementary Tables 1–10 and Supplementary Note 1 (PDF 4833 kb)
Supplementary Data 1
SNPs identified by nucmer between the FALCON Col-0 assembly and the TAIR10 reference (TXT 1920 kb)
Supplementary Data 2
List of syntenic regions identified between different primary contigs of the FALCON-Unzip Arabidopsis thaliana Col-0 x Cvi-1 assembly. (CSV 21668 kb)
Supplementary Data 3
List of syntenic regions identified between different primary contigs of the FALCON-Unzip Vitis vinifera assembly. (CSV 2183 kb)
Supplementary Data 4
Example of starting an AWS instance to run FALCON/FALCON-Unzip for Clavicorona pyxidata assembly (PDF 2523 kb)
About this article
Cite this article
Chin, CS., Peluso, P., Sedlazeck, F. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13, 1050–1054 (2016). https://doi.org/10.1038/nmeth.4035
- Received: 06 June 2016
- Accepted: 25 August 2016
- Published: 17 October 2016
- Issue Date: December 2016
- DOI: https://doi.org/10.1038/nmeth.4035