Highly accurate long-read HiFi sequencing data for five complex genomes

doi: 10.1038/s41597-020-00743-4.

Kristin Mars, Greg Young, Yu-Chih Tsai, Joseph W Karalius, Jane M Landolin, Nicholas Maurer, David Kudrna, Michael A Hardigan, Cynthia C Steiner, Steven J Knapp, Doreen Ware, Beth Shapiro, Paul Peluso, David R Rank

Ting Hon et al. Sci Data. 2020.

Abstract

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10-25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample datasets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools, including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.

Conflict of interest statement

T.H., K.M., G.Y., Y-C. T., J.W.K., P.S.P. and D.R.R. are employees of Pacific Biosciences of California Inc., a company commercializing DNA sequencing technology. J.M.L. is an employee of Ravel Biotechnology Inc., a company commercializing disease detection from cell-free DNA. All other authors declare no competing interests.

Figures

Fig. 1

Flowchart of HiFi sequence read generation and downstream applications.

Fig. 2

Read length and quality distributions for the three sequenced samples with high quality finished sequence references. M. musculus read length (a) and accuracy (b), Z. mays read length (c) and accuracy (d), and mock metagenome community ATCC MSA-1003 read length (e) and accuracy (f). All data were mapped to the genomic references (Table 1 and Supplementary Table 1) using minimap2. Accuracies are reported in Phred read quality space: Q = −10 × log10(P), where P is the measured error rate.
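The Phred relation in the caption can be illustrated with a minimal Python sketch. The function names `phred_quality` and `error_rate_from_phred` are hypothetical and are not taken from the paper or from any PacBio tool; the sketch only evaluates Q = −10 × log10(P) and its inverse.

```python
import math


def phred_quality(error_rate: float) -> float:
    """Convert a measured error rate P into a Phred Q value.

    Hypothetical helper for illustration: Q = -10 * log10(P).
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_phred(q_value: float) -> float:
    """Invert the Phred relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q_value / 10.0)


if __name__ == "__main__":
    # An accuracy of 99.5% corresponds to an error rate of 0.005.
    print(round(phred_quality(0.005), 2))
    # A Q value of 30 corresponds to one error per 1,000 bases.
    print(error_rate_from_phred(30.0))
```

Under this relation, the 99.5% accuracy threshold quoted in the abstract corresponds to roughly Q23, and each additional 10 Q units corresponds to a tenfold lower error rate.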

Fig. 3

K-mer (length 21) distribution for all HiFi reads for each sequencing dataset. (a) M. musculus (b) Z. mays (c) F. × ananassa (d) R. muscosa (e) Mock metagenome community ATCC MSA-1003.
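As a rough illustration of what a k-mer distribution summarizes, the sketch below counts k-mers across a set of reads and then tabulates how many distinct k-mers occur at each multiplicity. It is an assumption-laden toy, not the method used for the figure: the function names are hypothetical, only the forward strand is counted (no canonical k-mers), k-mers containing characters other than A, C, G, T are skipped, and in-memory counting like this would not scale to the datasets described here.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int = 21) -> Counter:
    """Count forward-strand k-mers in the reads (hypothetical helper)."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers with ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts: Counter) -> dict:
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    # Tiny example with k=4 so the result is easy to check by hand.
    toy_counts = count_kmers(["ACGTACGTAC"], k=4)
    print(kmer_spectrum(toy_counts))
```

In the toy read, ACGT, CGTA, and GTAC each occur twice and TACG occurs once, so the spectrum has three distinct k-mers at multiplicity 2 and one at multiplicity 1.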
