High-throughput genome scaffolding from in vivo DNA interaction frequency

Nat Biotechnol. 2013 Dec;31(12):1143-7.

doi: 10.1038/nbt.2768. Epub 2013 Nov 24.

Noam Kaplan et al. Nat Biotechnol. 2013 Dec.

Abstract

Despite advances in DNA sequencing technology, assembly of complex genomes remains a major challenge, particularly for genomes sequenced using short reads, which yield highly fragmented assemblies. Here we show that genome-wide in vivo chromatin interaction frequency data, which are measurable with chromosome conformation capture-based experiments, can be used as genomic distance proxies to accurately position individual contigs without requiring any sequence overlap. We also use these data to construct approximate genome scaffolds de novo. Applying our approach to incomplete regions of the human genome, we predict the positions of 65 previously unplaced contigs, in agreement with alternative methods in 26/31 cases attempted in common. Our approach can theoretically bridge any gap size and should be applicable to any species for which global chromatin interaction data can be generated.

Figures

Figure 1

Interaction frequency accurately predicts chromosome and locus for scaffold augmentation. (a) Average interaction frequency strongly separates interchromosomal from intrachromosomal interactions. For each 100 kb contig in chromosome 1, we calculate its average interaction frequency with each chromosome. We exclude interaction data from the contig’s 1 Mb regions on each side, where the strongest interaction frequencies are typically found. The box plot shows the distribution of average interaction frequencies of all contigs over all chromosomes and demonstrates that the distribution of interchromosomal interaction frequencies is separated from that of intrachromosomal interaction frequencies. Whiskers represent minimal and maximal points within 1.5 times the interquartile range. (b) Naïve Bayes predictive performance at various gap sizes. We trained a Naïve Bayes classifier and predicted the chromosome of each contig, leaving out a 1/2/5/10 Mb flanking region on each side of the contig. The accuracy of all cross-validated predictions and of the confident predictions is shown by the left y-axis and the blue and red lines, respectively. The fraction of total predictions that are confident is shown by the right y-axis and the black line. (c) Genome-wide view of Naïve Bayes predictive performance. The prediction for each contig is marked by a short vertical line, colored according to its true chromosome. Predictions shown were performed leaving out a 1 Mb flanking region on each side of the contig. Predictions that did not pass the confidence threshold are marked as “NC”. (d) Interaction frequencies accurately predict chromosomal locus. For every contig, we exclude interaction data from the contig’s 1 Mb flanking regions on each side and then predict its location in cross-validation. The inset shows the cumulative distribution of the absolute prediction error. All statistics are genome-wide.
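The per-chromosome averaging described in panel (a) can be illustrated with a minimal sketch. It is not the authors' code: the function name, the flat list layout of the interaction row, and the choice to treat the flank boundary as inclusive are all assumptions made for illustration.

```python
from collections import defaultdict


def mean_frequency_by_chromosome(row, bin_chrom, bin_pos,
                                 contig_chrom, contig_pos,
                                 flank=1_000_000):
    """Average one contig's interaction frequencies per chromosome.

    row[i] is the interaction frequency between the contig and bin i;
    bin_chrom[i] and bin_pos[i] give that bin's chromosome and position
    in base pairs. Bins on the contig's own chromosome that lie within
    `flank` bp of the contig (assumed inclusive) are skipped, because
    the strongest frequencies are typically found there.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for freq, chrom, pos in zip(row, bin_chrom, bin_pos):
        if chrom == contig_chrom and abs(pos - contig_pos) <= flank:
            continue  # inside the excluded flanking region
        totals[chrom] += freq
        counts[chrom] += 1
    return {chrom: totals[chrom] / counts[chrom] for chrom in totals}
```

Under this sketch, a contig whose highest average falls on one chromosome would be a candidate for that chromosome. The classifier in panel (b) is a Naïve Bayes model rather than this simple comparison of averages.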

Figure 2

Scaffold augmentation of the human genome. (a) Interaction frequency data of an unplaced contig with its predicted chromosome. The green bar marks the predicted contig position. (b) Predicted positions of unplaced contigs. Vertical lines indicate contigs. Green and red indicate agreement and disagreement, respectively, with previous predictions. Black indicates newly placed contigs with no previous predictions.

Figure 3

De novo karyotyping (chromosome assignment). We retained every tenth 100 kb contig in the genome, leaving 0.9 Mb gaps between contigs. We then transformed the interaction frequencies into approximate distances and applied standard average linkage hierarchical clustering to the approximate distance matrix, without using any prior knowledge regarding the positions of the contigs. The cluster assignment for each contig is marked by a short vertical line, colored according to its true chromosome.
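The two steps in this caption, converting frequencies to approximate distances and then clustering with average linkage, can be sketched as follows. The caption does not state the transform used, so the reciprocal in `frequency_to_distance` is purely an assumption, and the naive clustering loop is written for readability rather than as the authors' implementation.

```python
def frequency_to_distance(freq, eps=1e-9):
    """Assumed transform: a higher interaction frequency maps to a
    smaller approximate distance. `eps` avoids division by zero."""
    return 1.0 / (freq + eps)


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage.

    dist is a symmetric list-of-lists distance matrix. Clusters are
    merged until `n_clusters` remain. Returns a list of clusters,
    each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

In this sketch the number of clusters is passed in explicitly; for karyotyping it would correspond to the expected number of chromosomes.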

Figure 4

Accurate de novo chromosome scaffolding with interaction frequencies. (a, b) We retained every tenth 100 kb contig in the genome, leaving 0.9 Mb gaps between contigs. We then estimated the positions of all contigs, without using any prior knowledge regarding their positions. We arbitrarily scaled the predicted positions to the interval [0,1]. Note that the slope, which reflects scaling and orientation, is arbitrary. (a) Scaled predicted contig positions versus actual contig positions on chromosome 4. (b) Ranks of predicted contig positions versus ranks of actual contig positions. (c, d) De novo scaffolding applied to a real set of contigs from chromosome 14 (see Methods). (c) Scaled predicted contig positions versus actual contig positions. (d) Ranks of predicted contig positions versus ranks of actual contig positions.
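Because the scale and orientation of the predicted positions are arbitrary, a comparison against true positions has to ignore both. The sketch below shows one way to do that: min-max scaling to [0,1] and the absolute value of a Spearman rank correlation. The function names are hypothetical, ties are not handled, and the caption does not say that the authors computed this particular statistic.

```python
def scale_unit(values):
    """Linearly rescale values to the interval [0, 1].
    Assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def ranks(values):
    """Return the 0-based rank of each value (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def rank_agreement(predicted, actual):
    """Absolute Spearman rank correlation between two orderings.

    The absolute value makes the score insensitive to orientation:
    a perfectly reversed ordering scores 1.0, the same as a perfectly
    matching one. Assumes at least two items and no ties.
    """
    rp, ra = ranks(predicted), ranks(actual)
    n = len(rp)
    d2 = sum((p - a) ** 2 for p, a in zip(rp, ra))
    rho = 1 - 6 * d2 / (n * (n * n - 1))
    return abs(rho)
```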
