In vitro, long-range sequence information for de novo genome assembly via transposase contiguity
- Andrew Adey1,3,
- Jacob O. Kitzman1,4,
- Joshua N. Burton1,
- Riza Daza1,
- Akash Kumar1,
- Lena Christiansen2,
- Mostafa Ronaghi2,
- Sasan Amini2,
- Kevin L. Gunderson2,
- Frank J. Steemers2 and
- Jay Shendure1
- 1Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA;
- 2Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Corresponding author: shendure{at}uw.edu
- Present addresses: 3Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon 97239, USA;
- 4Department of Human Genetics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Abstract
We describe a method that exploits contiguity-preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprising 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are “subhaploid,” in that the lengths of the fragments contained in each pool sum to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data are mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof of concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
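The pool co-occurrence idea in the abstract can be illustrated with a toy sketch. This is not the fragScaff implementation: the abstract does not specify how "disproportionate" co-occurrence is scored, so the Jaccard similarity, the `min_score` cutoff, the function names, and the toy pool assignments below are all assumptions made purely for illustration. Contig ends are nodes, candidate joins are edges weighted by 1 minus the shared-pool score, and the graph is reduced with Kruskal's minimum spanning tree algorithm.

```python
from itertools import combinations

def jaccard(a, b):
    """Fraction of indexed pools shared by two contig ends (assumed score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_joins(end_pools, min_score=0.3):
    """List (distance, end_a, end_b) for ends of different contigs that co-occur in pools."""
    joins = []
    for (ea, pa), (eb, pb) in combinations(sorted(end_pools.items()), 2):
        if ea[0] == eb[0]:
            continue  # two ends of the same contig are not a join
        score = jaccard(pa, pb)
        if score >= min_score:  # hypothetical cutoff, not from the paper
            joins.append((1.0 - score, ea, eb))
    return joins

def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm with a small union-find; returns the kept edges."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    for dist, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra != rb:  # edge connects two separate components
            parent[ra] = rb
            tree.append((a, b, dist))
    return tree

# Toy input: contig end -> set of pool indices with reads mapped near that end.
# Keys are (contig name, "L" or "R"); the true order is assumed to be A-B-C.
end_pools = {
    ("A", "L"): {20, 21},
    ("A", "R"): {1, 2, 3, 4},
    ("B", "L"): {2, 3, 4, 5},
    ("B", "R"): {7, 8, 9},
    ("C", "L"): {7, 8, 9, 10},
    ("C", "R"): {30, 31},
}

tree = minimum_spanning_tree(end_pools.keys(), candidate_joins(end_pools))
for a, b, dist in tree:
    print(a, "--", b, f"distance={dist:.2f}")
```

In this toy case, the only retained joins are A's right end to B's left end and B's right end to C's left end, which recovers the assumed order. With 9216 pools, a real analysis would need a statistical test for co-occurrence rather than a fixed similarity cutoff.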
Footnotes
[Supplemental material is available for this article.]
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.178319.114.
Received May 13, 2014.
Accepted August 4, 2014.
© 2014 Adey et al.; Published by Cold Spring Harbor Laboratory Press
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.